Operation of Distributed Battery Considering Demand Response Using Deep Reinforcement Learning in Grid Edge Control

Abstract: Battery energy storage systems (BESSs) can facilitate the economical operation of the grid through demand response (DR) and are regarded as the most significant DR resource. Among them, distributed BESSs integrating home photovoltaics (PV) have developed rapidly and account for nearly 40% of newly installed capacity. However, the use scenarios and use efficiency of distributed BESSs are far from sufficient to utilize the potential loads and overcome the uncertainties caused by disorderly operation. In this paper, the low-voltage transformer-powered area (LVTPA) is first defined, and a DR grid edge controller is then implemented based on deep reinforcement learning to maximize the total DR benefits and promote three-phase balance in the LVTPA. The proposed DR problem is formulated as a Markov decision process (MDP), and the deep deterministic policy gradient (DDPG) algorithm is applied to train the controller to learn the optimal DR strategy. Additionally, a life cycle cost model of the BESS is established and implemented in the DR scheme to measure the income. The numerical results, compared to deep Q-learning and model-based methods, demonstrate the effectiveness and validity of the proposed method.


Introduction
In the past ten years, incentive-based DR has developed rapidly in the form of load shedding and load transfer, which can greatly improve the flexibility of the grid. According to the National Energy Administration (NEA), BESS capacity will exceed 100 GW·h by 2030 [1]. On the one hand, residential and commercial BESSs are among the best DR resources and are increasing sharply with the prosperity of distributed PV, accounting for more than 50% of the newly installed capacity of PV [2]. On the other hand, the daily operation of a BESS is designed only for storing photovoltaic power and saving on electricity costs, without consideration of its idle use, such as in DR programs. Thus, it is necessary to investigate an optimal operation method for BESSs that takes DR into consideration.
However, demand response mechanisms at home and abroad are not yet mature, and demand response resources are still scarce compared to peak loads. Grid edge control technology provides the potential for exciting transformations in the power industry, creating more choices, higher efficiency, more comprehensive and efficient decarbonization for customers, and better economic benefits for stakeholders in the value chain [3]. Grid edge control is at the critical point of the adoption curve: residents, industry, and regulators are preparing to connect to digitally distributed resources. Therefore, grid edge control technology could be a feasible way to expand resources in a manner that corresponds to demand.
By reducing peak load demand and transferring load demand to low-price and off-peak times, price-based DR is conducive to reducing customer bills [4]. Recent studies have mainly focused on arranging home power consumption using price-based DR. In [5], Nilsson et al. controlled smart home appliances using a home energy management system (HEMS) and tested the energy-saving potential of smart homes in Sweden. The results showed that the impact of different households on energy consumption varies greatly, indicating that households have a high degree of independence in responding to demand. In incentive-based demand response, each customer receives compensation for participating in power demand according to a unified price determined by the utility company [6]. For example, in [7], the authors aimed to maximize the comprehensive benefits of demand response service providers and users, improved the demand response model for power users, and solved the strategy using a deep Q network. In fact, due to the different price sensitivities of consumers, a unified price is not able to stimulate all demand response resources. Therefore, ref. [8] established a Stackelberg game model in which a single electricity retailer and multiple users responded based on incentive demand. In this model, when the spot market electricity price was higher than its selling price, the electricity retailer developed a demand response subsidy strategy to reduce the loss of electricity sales. In the corresponding period, the user determined the response power according to the subsidy price set by the electricity retailer to obtain additional profits. Incentive-based DR was thus proposed as an important supplement to price-based DR. However, these papers did not integrate the two DR programs. In addition, in the existing literature, DR has always been regarded as an independent procedure, making decisions without information about the operation of the low-voltage distribution network. In fact, load reduction or load transfer inevitably changes the variables of the station power system, such as active power, increasing its uncertainty. Therefore, it is necessary to integrate information regarding the power operation of the station area in order to reduce the uncertainty in demand response decision making. For example, in an LVTPA, changes in load cause changes in the three-phase unbalance, which affects the power loss and power quality of the transformer [9].
However, applying conventional algorithms to schedule the appliances of several consumers with mixed-integer variables is difficult due to the increased computational complexity and high-dimensional data. For instance, [10] scheduled large-scale residential appliances for DR using a computational algorithm with the cutting-plane method, but did not consider the convergence or calculation time. Recently, a massive number of studies on residential DR have illustrated that reinforcement learning (RL) can tackle such mixed-integer programming [11]. Deep reinforcement learning (DRL) is an active area of research, as it is able to find the best implementation strategy by means of "trial and error" in the external environment. Refs. [12,13] implemented batch RL to optimize the operation strategy of heating, ventilation, and air conditioning, and found that an open-loop schedule was required to participate in the day-ahead market. Batch RL can be used to collect empirical data to train the RL agent, which is popular and critical in areas where it is difficult to build a simulator or where simulation costs are very high. In addition, the SARSA algorithm is an on-policy RL approach, and was applied to thermal comfort control for vehicle cabins based on an MDP in [14]. In contrast, Q-learning is an off-policy RL approach that saves the learned information in the Q matrix. In [15], the authors focused on the DR of residential and small commercial buildings and studied the optimal energy consumption and storage using Q-learning, the most popular RL method. Despite its superior convergence and robustness, Q-learning still suffers from limitations with continuous states and actions [16,17]. Furthermore, building on deep Q-learning networks (DQNs), which combine deep networks and Q-learning by means of the value function [18], the deep deterministic policy gradient (DDPG) algorithm, which is based primarily on the actor-critic method [19], is able to overcome the weakness of DQN in a huge action space. In [20], DDPG was implemented in the state of charge (SOC) management of multiple electrical energy storage systems, enabling continuous actions and smoother SOC control. However, when the application scenario includes many variables, DDPG performs worse than the multi-agent deep deterministic policy gradient (MADDPG), where each agent represents a demand unit.
In conclusion, the gaps not covered by the current research can be listed as follows.
(1) The BESS, which is regarded as the best load for DR, is not fully used in existing DR schemes. (2) DR mechanisms and sources are not fully developed, thus limiting the performance of DR programs. (3) Current DR models do not consider the relationship between DR and the LVTPA. (4) There is still a lack of algorithms to solve large-scale DR problems efficiently. Therefore, inspired by the previous works, this paper proposes a novel DRL-based grid edge controller for BESSs, aiming to increase the operation benefits of the BESS and reduce the three-phase unbalance (TU). The main contributions are summarized as follows: (1) Compared to current DR management, a novel autonomous DR method that considers the BESS within the LVTPA is proposed, significantly enhancing the capability of DR. (2) The proposed method takes the idle utilization of the BESS into consideration, rather than only its normal utilization. (3) The proposed method enables the LVTPA to access information about the local BESS in order to perform global optimization, significantly improving the efficiency of power consumption. (4) The proposed DRL-based grid edge controller combines incentive-based edge control with price-based edge control and achieves a better performance compared to conventional methods.
The remainder of this paper is organized as follows. Section 2 details the general structure of the DRL-based grid edge controller and describes the LVTPA. Section 3 formulates the operation problem as a Markov Game, and the optimal strategy is learned by applying the multi-agent deep deterministic policy gradient (MADDPG) algorithm. The numerical study is given in Section 4. Section 5 concludes the paper.

Typical Structure of LVTPA for Implementing BESS Demand Response
This section presents the structure of the proposed DRL-based DR controller for residential loads in the LVTPA. As shown in Figure 1, a typical LVTPA mainly comprises one low-voltage transformer, one grid edge controller, and several home energy management systems. In a three-phase power system, phases A, B, and C carry different loads; especially where the load distribution is unbalanced, this easily leads to unbalanced three-phase operation in the LVTPA.

BESS States
The modeling of BESS is shown in Figure 2 and is mathematically expressed by Expressions (1) and (2) [16].
where P_rated is the maximum charging and discharging active power of the BESS; P_BESS is the current BESS active power; and S_soc,max and S_soc,min are the maximum and minimum values of the SOC, respectively. SOC(t_n) is the state of the battery at t_n, and P_BESS(t_n) is the charging or discharging power at t_n. Please note that the charging and discharging operations cannot be performed at the same time. η is the charging and discharging efficiency; B_r is the rated capacity of the battery; and σ(t_n) represents the state at t_n, where 1 represents the charging state, 0 represents disconnection from the power grid, and −1 represents the discharging state. In Expression (2), h(t_n) is a decision variable indicating the state of the electric vehicle, where 1 represents the charging state, −1 represents the discharging state, and 0 represents the disconnected state. h is the set of decision variables whose elements are h(t_n). λ_dw is the SOC threshold for the discharging operation. λ_emg is an SOC threshold indicating whether a BESS needs to be charged: if SOC(t_n) ≤ λ_emg, the charging operation must be carried out immediately, and the operational strategy of the battery should ensure that the SOC of the BESS is always greater than this threshold. On the basis of the user's daily night travelling data, the frequency of a user's personal car usage is classified into high-frequency (λ_emg of 0.9-1.0), intermediate-frequency (λ_emg of 0.6-0.9), and low-frequency (λ_emg of 0.25-0.6).
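To make the battery model concrete, the following Python sketch implements the SOC dynamics as they are described above. Since the bodies of Expressions (1) and (2) are not reproduced in this text, the exact update rule, the placement of the efficiency η, and all numeric defaults are assumptions for illustration only.

```python
# Minimal sketch of the BESS state model described above. The bodies of
# Expressions (1) and (2) are not reproduced in the text, so the update rule,
# the placement of the efficiency eta, and the numeric defaults are assumptions.

def soc_update(soc, p_bess, sigma, eta=0.95, b_r=10.0, dt_h=0.25,
               soc_min=0.1, soc_max=0.9, p_rated=3.0):
    """One SOC step; sigma = 1 (charging), 0 (disconnected), -1 (discharging)."""
    assert 0.0 <= p_bess <= p_rated, "power bound implied by Expression (1)"
    if sigma == 1:        # charging: only eta of the drawn energy is stored
        soc_next = soc + eta * p_bess * dt_h / b_r
    elif sigma == -1:     # discharging: more is drawn from the cells than delivered
        soc_next = soc - p_bess * dt_h / (eta * b_r)
    else:                 # disconnected from the power grid
        soc_next = soc
    # SOC bound implied by Expression (1)
    return min(max(soc_next, soc_min), soc_max)
```

For example, soc_update(0.5, 3.0, 1) would advance a half-charged 10 kW·h battery by one 15 min charging step under these assumed defaults.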

Energy Storage Life Cycle Costs and Benefits
The life cycle costs of energy storage on the user side mainly include the one-time fixed investment cost C_inv of the energy storage and the total operation and maintenance cost C_ope; the benefits include the recovery value B_rec at the end of the energy storage life cycle and the peak-valley arbitrage B_TOC of the installed energy storage over the full life cycle. The total demand response revenue is B_DR. F is the full life cycle benefit of the energy storage.
Equation (3) gives the total operation and maintenance cost of the energy storage.
In Equations (4) and (5), c_e is the unit capacity cost; c_p is the unit power cost; c_om is the annual operation and maintenance cost coefficient per unit capacity; E_max is the rated maximum capacity of the energy storage; and P_max is the rated power of the energy storage.
The benefits of the entire life cycle of energy storage include energy storage recovery value, full cycle peak and valley arbitrage of energy storage, and demand response benefits, as shown in Equations (6)-(8), respectively.
In Equations (6)-(8), θ is the recovery rate of the energy storage; c_i is the electricity price at time i; p_c,i,t and p_d,i,t are the charging and discharging power of BESS i at time t, respectively, sampled at an interval Δt_i of 15 min; T is the total number of days in the full life cycle of the energy storage; p_DR,i is the reported demand response power of the energy storage of BESS i; g is the response speed coefficient; and l_DR is the total number of times the energy storage participates in demand response during its life cycle.
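The cost and benefit terms above combine into the full life cycle benefit F. The sketch below assembles them in Python; because the bodies of Equations (3)-(8) are not shown in this text, each component formula is an assumption consistent with the stated variable definitions.

```python
# Sketch of the life cycle accounting implied above: F = B_rec + B_TOC + B_DR
# - C_inv - C_ope. Equations (3)-(8) are not reproduced in the text, so each
# component formula below is an assumption based on the stated definitions.

def life_cycle_benefit(c_e, c_p, c_om, e_max, p_max, n_years, theta,
                       prices, p_c, p_d, dt_h, p_dr, g, subsidy, l_dr):
    c_inv = c_e * e_max + c_p * p_max          # one-time fixed investment
    c_ope = c_om * e_max * n_years             # O&M over the life cycle (Eq. 3)
    b_rec = theta * c_inv                      # recovery value (Eq. 6)
    # peak-valley arbitrage summed over all 15 min intervals (Eq. 7)
    b_toc = sum(price * (pd - pc) * dt_h
                for price, pc, pd in zip(prices, p_c, p_d))
    b_dr = g * subsidy * p_dr * l_dr           # DR revenue over l_DR events (Eq. 8)
    return b_rec + b_toc + b_dr - c_inv - c_ope
```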

Energy Storage Operation Constraints
The life loss of energy storage batteries is closely related to throughput, and reducing throughput can prolong their service life. In order to make more reasonable use of energy storage, ref. [21] compared actual user load data with the peak and valley electricity prices, and found that limiting the daily throughput of the energy storage batteries not only reduces the throughput of the energy storage but also greatly limits the number of charge-discharge state transitions within a day.
where m is the equivalent number of charge-discharge cycles of the energy storage; E_max is the rated maximum capacity of the energy storage; and SOC_max and SOC_min are the maximum and minimum values of the state of charge of the energy storage, respectively, with values of 0.9 and 0.1 in this article; n_day is the number of days.
The BESS operates in order to obtain benefits. As shown in Figure 3, these benefits exist in three respects. Firstly, by storing electricity at low-price times and discharging at high-price times, consumers receive the benefit of saving electricity costs. Secondly, by storing electricity at low-price times and discharging to the power grid at high-price times, consumers receive the benefit of the price spread. Thirdly, by storing electricity when photovoltaic power is sufficient and discharging when there is no photovoltaic power, consumers receive the benefit of saving electricity costs.
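A possible reading of the throughput limit is sketched below; the equivalent-cycle formula (one full cycle moves 2·E_max·(SOC_max − SOC_min) of energy through the battery) and the daily cap m_limit are assumptions, since the corresponding equation is not reproduced in this text.

```python
# Possible reading of the daily throughput limit: one equivalent cycle moves
# 2 * E_max * (SOC_max - SOC_min) of energy through the battery. This formula
# and the cap m_limit are assumptions, as the equation is not shown in the text.

def equivalent_cycles(p_c, p_d, dt_h, e_max, soc_max=0.9, soc_min=0.1):
    throughput = sum(pc + pd for pc, pd in zip(p_c, p_d)) * dt_h
    return throughput / (2.0 * e_max * (soc_max - soc_min))

def within_daily_limit(p_c, p_d, dt_h, e_max, m_limit=2.0):
    # m_limit caps the equivalent charge-discharge cycles per day (hypothetical)
    return equivalent_cycles(p_c, p_d, dt_h, e_max) <= m_limit
```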

Physical Constraints of Energy Storage Batteries
SOC_i is the state of charge of the energy storage at time i; η_ch and η_dis are the charging and discharging efficiencies of the energy storage, respectively.
0 ≤ sw_c,i + sw_d,i ≤ 1 (12)
0 ≤ p_c,i ≤ sw_c,i · P_rated (13)
0 ≤ p_d,i ≤ sw_d,i · P_rated (14)
sw_c,i and sw_d,i are 0-1 variables that indicate the charging and discharging states of the energy storage; P_rated is the rated power of the energy storage. Equations (12)-(14) ensure that the energy storage is not in the charging and discharging states at the same time, and that the charging and discharging power do not exceed the rated power.
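The mutual-exclusion logic of Equations (12)-(14) can be checked directly, as in the following illustrative Python function (the helper itself is not part of the original model):

```python
# Direct feasibility check of Equations (12)-(14); the helper itself is
# illustrative and not part of the original model.

def charge_discharge_feasible(sw_c, sw_d, p_c, p_d, p_rated):
    for sc, sd, pc, pd in zip(sw_c, sw_d, p_c, p_d):
        if sc + sd > 1:                    # Eq. (12): at most one state active
            return False
        if not (0 <= pc <= sc * p_rated):  # Eq. (13): charging power bound
            return False
        if not (0 <= pd <= sd * p_rated):  # Eq. (14): discharging power bound
            return False
    return True
```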

Demand Response Constraints
If the energy storage participates in demand response on day t, it must meet the conditions for an effective demand response. Equations (15)-(17) restrict the load situation after the energy storage has participated in demand response.
In Formulas (15)-(17), k is the response time on the demand response day; j is the corresponding time of the baseline; p_c,k and p_c,j are the charging power of the energy storage in the corresponding periods; p_d,k and p_d,j are the discharging power of the energy storage in the corresponding periods; Load_k is the load participating in the demand response period; Load_j is the load at the corresponding time 5 days before the response date; p_DSM is the optimal response power reported by the user; and Load_max^pre is the maximum peak load of the user in the previous year. Equation (15) indicates that the maximum load during the response period does not exceed the baseline maximum load, Equation (17) is the average load constraint during the response period, and Equation (16) restricts the range of the agreed response power.
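The following sketch mirrors the verbal description of Equations (15)-(17) as feasibility checks; since the exact inequalities are not reproduced here, the forms of the average-load and response-power conditions are assumptions.

```python
# Feasibility checks mirroring the verbal description of Equations (15)-(17).
# The exact inequalities are not reproduced in the text, so the forms of the
# average-load and response-power conditions below are assumptions.

def dr_effective(load_resp, load_base, p_dsm, load_pre_max):
    """load_resp / load_base: response-period and baseline loads, same length."""
    if max(load_resp) > max(load_base):    # Eq. (15): peak below baseline peak
        return False
    avg_cut = (sum(load_base) - sum(load_resp)) / len(load_resp)
    if avg_cut < p_dsm:                    # Eq. (17): average-load condition (assumed form)
        return False
    if not (0.0 < p_dsm <= load_pre_max):  # Eq. (16): response power range (assumed bounds)
        return False
    return True
```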

Power Consumption Constraints of Consumers
In order to satisfy the daily demand of consumers using the stored power, the operation of the BESS must consider the following sequence constraints.
where SOC_con is the power consumption demand without photovoltaic generation, and SOC_str is the power storage demand under photovoltaic generation.

Methodology
In this section, a Markov Game is designed for DR and is played by the grid edge controller. The MADDPG algorithm is applied to train the grid edge controller to learn the optimal control strategy. Detailed descriptions are provided in Figure 4.

Markov Decision Process for DR
A discrete-time MDP consisting of three elements, combined with the model presented in this paper, is described below.

State
In this paper, two types of states are defined for the grid edge controller: day-ahead scheduling and real-time load reduction. For day-ahead scheduling, the state is defined as follows. For price-based DR, the state s_i,j is as shown in Formula (20), where T represents the number of daily time slots and p_i,j is the charging or discharging power of user i at time slot j. The target state is the satisfaction of all power consumption demands within one day.
For incentive-based DR, the state s_i,j is as shown in Formula (21), where κ_m represents the m-th consumer and l represents the time slot. The target state is the achievement of the load management value during one DR period.

Action
The grid edge controller also has two types of actions, corresponding to the two states mentioned above. For day-ahead scheduling, the action space is defined as follows. For price-based DR, the action space is defined as shown in Formulas (8) and (22), where p_m,j represents the active power of agent m.
For incentive-based DR, the action space is defined analogously. From the above definition of Formula (20), we can derive the following: for any positive integer z with 0 ≤ t_0 < t_1 < · · · < t_z, the increments of the scheduled load are mutually independent. Hence, the day-ahead load scheduling is an independent increment process, which satisfies the definition of an MDP [22]. In the same way, the definitions of the state and action of incentive-based DR can be proved.
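As an illustration of the day-ahead state and action just defined, the sketch below builds them in Python; the slot count T = 96 and all shapes and names are assumptions, since Formulas (20)-(22) are not reproduced in this text.

```python
# Illustrative construction of the day-ahead state and a continuous action.
# Formulas (20)-(22) are not reproduced in the text, so the slot count T = 96
# (15 min slots) and all shapes and names are assumptions.
import numpy as np

T = 96  # daily time slots, assuming a 15 min resolution

def price_dr_state(p_profile):
    """State s_i,j: charge/discharge power of user i over the T daily slots."""
    return np.asarray(p_profile, dtype=float).reshape(T)

def sample_action(p_rated):
    """Continuous action: BESS active power for one slot (exploration draw)."""
    return float(np.random.uniform(-p_rated, p_rated))
```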

Reward
To guide the grid edge controller to learn the optimal DR strategy, the reward functions for day-ahead scheduling and real-time load reduction are defined as follows.
For price-based DR, the reward r_i,j(t) is defined as shown in Formula (26), where ρ(t) is the electricity price in time slot t, t_m,i represents the working time of the shiftable appliance, and δ is the parameter of the TU.
For incentive-based DR, the reward r_r,k is defined as shown in Formula (29), where κ_m is the m-th consumer and ς is an adjustable parameter that represents the importance of the TU.
r_r,k = p_m,j κ_m + (δ_{s_{k+1}},TU − δ_{s_k},TU) ς (29)
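Read this way, Formula (29) can be evaluated as below; treating κ_m as a per-kW payment coefficient for consumer m is an assumption, since the text only identifies κ_m with the m-th consumer.

```python
# Evaluation of Formula (29) as reconstructed above. Treating kappa_m as a
# per-kW payment coefficient for consumer m is an assumption; the text only
# identifies kappa_m with the m-th consumer.

def incentive_reward(p_mj, kappa_m, tu_next, tu_now, varsigma):
    """r_r,k = p_m,j * kappa_m + (delta_TU(s_k+1) - delta_TU(s_k)) * varsigma."""
    return p_mj * kappa_m + (tu_next - tu_now) * varsigma
```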

Objective Function
This paper presents two objectives: day-ahead scheduling and real-time load reduction. For price-based DR, the objective function of the controller is defined as shown in Formula (30), maximizing the energy benefits and reducing the TU of the LVTPA.
For incentive-based DR, the objective function of the controller is defined as shown in Formula (18), maximizing the DR benefits and reducing the TU of the LVTPA.

Multi-Agent Deep Deterministic Policy Gradients
In the context of DRL, a computerized agent learns to take actions at a discrete time step t to maximize a numerical reward from the environment, as shown in Figure 5.
In the intelligent DR system designed in this paper, with the station area as the core, each energy management system is a single agent that manages the electricity consumption arrangements of its BESS and communicates with the others to obtain information regarding their actions. The improved features of the MADDPG algorithm are as follows: (1) Each agent has its own goals and behavior constraints. (2) Each agent can interact with the environment and change the state of the environment. (3) Each agent can optimize itself when information is incomplete. The calculation of the whole system is asynchronous and concurrent.
As shown in Figure 6, MADDPG is able to use centralized training and distributed execution to fully improve the optimization efficiency of multiple agents. The detailed pseudocode is given in Algorithm 1.
Algorithm 1. MADDPG for grid edge control.
1. Randomly initialize the parameters of the critic networks Q_i,j(s, a_i,j | Φ_i,j) and the actor networks µ_i,j(o_i,j | θ_i,j) with weights Φ_i,j and θ_i,j for each agent (i, j).
2. Initialize each agent's target networks with the parameters of the critic and actor networks.
3. Clear the replay buffer D.
4. For episode = 1 to N do
5. Randomly initialize a random process χ for every exploration episode.
6. Observe the initial state s.
7. For each time step do
8. Select action a_i,j = µ_i,j(o_i,j | θ_i,j) + χ for each agent (i, j).
9. Execute the actions a_i,j and calculate the rewards r_i,j and the next state s′.
10. Store the transition (s, a_i,j, r_i,j, s′) in the replay buffer D.
11. Update the state observation s ← s′.
12. For agent (i, j), i = 1 to I, j = 1 to J do
13. Randomly sample a minibatch of K samples (s^k, a^k_i,j, r^k_i,j, s′^k) from D.
14. Set the calculated target value y^k.
15. Update the critic network by minimizing the loss.
16. Update the actor network using the sampled and calculated policy gradient.
17. End for
18. Softly update the target network parameters for each agent (i, j).
19. End for
20. End for
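Steps 14 and 18 of Algorithm 1 are the standard actor-critic target and soft update; a minimal Python sketch is given below. The exact loss and policy gradient expressions of steps 15-16 are not reproduced in the text, so only these generic forms are shown.

```python
# Generic forms of step 14 (critic target) and step 18 (soft target update) of
# Algorithm 1. The exact loss and policy-gradient expressions of steps 15-16
# are not reproduced in the text, so only these standard pieces are sketched.

def critic_target(r_k, q_next, gamma=0.99, done=False):
    """y^k = r^k + gamma * Q'(s', a'), the usual bootstrapped target."""
    return r_k + (0.0 if done else gamma * q_next)

def soft_update(target_weights, online_weights, tau=0.01):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [tau * w + (1.0 - tau) * wt
            for wt, w in zip(target_weights, online_weights)]
```

A small tau keeps the target networks slowly tracking the online networks, which is what stabilizes the bootstrapped critic target during training.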

Simulation Environment
In this paper, a simulation was carried out on the basis of two years of operation data in a three-phase power system including six households that take part in the DR program. The agents were programmed on a personal computer with a six-core Intel i7-8700 CPU @ 3.20 GHz, 12 threads, and 16 GB memory. The maximum number of training episodes was set to 20,000. Python 3.6.2 and TensorFlow 2.1 were adopted to program the MADDPG algorithm. In addition, the Adam optimizer was applied to update the parameters of the critic network. The other parameters adopted in the MADDPG algorithm are shown in Table 1. The time-of-use prices are given in Table 2. The incentive-based DR subsidy is 8 CNY/kW·h, and the PV feed-in tariff is 0.6 CNY/kW·h. Table 3 shows the other scenario data and benefits.

Numerical Results Discussion
The following section analyzes the case from three perspectives: benefits, three-phase unbalance, and algorithm performance.

Analysis of Cost
This paper compares three types of BESS in Table 4: the lithium-ion battery is the cheapest, while also having the fewest charge-discharge cycles; as a result, its unit energy call cost is the highest. Conversely, lithium iron titanate batteries have the longest service life and the lowest unit energy call cost. However, in this work, the lithium iron phosphate battery was tested in the case study due to its popularity.

Analysis of Benefits
Due to space limitations, three of the six consumers are compared in this paper. The scheduling of consumer I is shown in Figure 7. It is noticeable that the BESS charges in the valley time at 3 kW; thus, the EV can save CNY 2.46 per hour. The BESS charges at a low price during 23:00-03:00 and discharges during the periods 07:00-08:00 and 20:00-23:00. During peak times, the BESS is fully charged by the PV in the 08:00-12:00 period. When the electricity price increases during the 20:00-23:00 period, the BESS is arranged to supply the household power consumption.
In this LVTPA, the power consumption and PV power values of the consumers were different. Among them, consumer I did not participate in incentive-based DR. As shown in Figure 8, consumer I had a relatively small photovoltaic capacity compared to its power consumption; therefore, it had no capacity for incentive-based DR. Without optimization by the MADDPG-based grid controller, consumers I, II, and III only considered storing the remaining electrical energy in the BESS and using it in the absence of photovoltaic power generation. With optimization using the method proposed in this paper, the BESS was arranged to charge at 00:00-02:00 during the valley price and to discharge at 08:00-12:00 and 20:00-23:00 during the peak price. The PV still fed power to the grid when photovoltaic power was adequate, and the BESS still stored the remaining electricity. Similarly, the BESSs of consumers II and III, as shown in Figures 9 and 10, charged at the low price and discharged at the high price at 2 kW and 3 kW, respectively, and were able to obtain a subsidy during the incentive-based DR period (12:00-14:00).
As shown in Figure 11, every consumer only acquires a small benefit from electricity cost savings and PV electricity earnings. Using MADDPG in grid edge control, the electricity benefit increases significantly; for example, Figure 11 shows that, for consumer I, the earnings increased from CNY −4.32 to CNY 2.23. Moreover, participation in incentive-based DR is really valuable for BESS owners and can bring at least CNY 48 per day.

Analysis of Three-Phase Unbalance
As shown in Figure 12, the MADDPG-based grid controller can adjust the power of the BESS to reduce the TU: the degree of TU is decreased by 10.3%, and the power loss is reduced by 15.1%.

Analysis of Algorithm Performance
As shown in Figure 13, the average reward of each agent keeps increasing throughout the training. After 20,000 rounds of training, the rewards that each agent receives concentrate around a single convergence value, which means that the agents have learned to hold the power at an arbitrarily predefined value by adjusting the output of the BESS under randomly generated scenarios. This proves that the proposed method has good convergence stability.
Figure 14 shows the performance comparison of the three main reinforcement learning algorithms, i.e., MADDPG (adopted in this paper), DDPG, and DQN. MADDPG outperforms the other two reinforcement learning algorithms, with faster and more precise convergence.
Figure 14. Convergence of the cumulative mean rewards.
Figures 15 and 16 show a performance comparison with the model-based control method in strong light and low light. The model-based method applies convex optimization techniques to calculate the strategy. Both methods successfully maintain good power scheduling. The performance gap is small, but the average computational time of the proposed method is much shorter than that of the conventional model-based method; especially in large-scale power systems, the proposed method is more efficient.

Conclusions
This work proposed a multi-agent grid edge controller based on MADDPG for the DR management of distributed BESSs in an LVTPA. The controller was integrated into an automatic grid edge control system designed to improve the utilization of BESSs and increase the DR benefits of consumers. The case study demonstrated that the proposed method is able to reasonably arrange power consumption, thus increasing the total electricity benefits by 40% and reducing the TU by 10.3%. In addition, compared to other RL algorithms, the proposed method has better convergence performance. Compared to the model-based method, MADDPG possesses the same accuracy and displays a superior calculation time under the two different light conditions. These results demonstrate that the proposed DR program has great potential to motivate residential consumers to participate in DR. In future work, the proposed DR program can be extended to multiple appliances, including air conditioners.