Multi-Agent Reinforcement Learning Approach for Residential Microgrid Energy Scheduling

Abstract: The residential microgrid is widely considered a new paradigm of the home energy management system. The complexity of Microgrid Energy Scheduling (MES) is increasing with the integration of Electric Vehicles (EVs) and Renewable Generations (RGs). Moreover, it is challenging to determine optimal scheduling strategies that guarantee the efficiency of the microgrid market and balance all market participants' benefits. In this paper, a Multi-Agent Reinforcement Learning (MARL) approach for residential MES is proposed to promote the autonomy and fairness of microgrid market operation. First, a multi-agent based residential microgrid model including Vehicle-to-Grid (V2G) and RGs is constructed and an auction-based microgrid market is built. Then, in contrast to Single-Agent Reinforcement Learning (SARL), MARL achieves distributed autonomous learning for each agent and realizes an equilibrium among all agents' benefits; accordingly, we formulate an equilibrium-based MARL framework according to each participant's market orientation. Finally, to guarantee the fairness and privacy of the MARL process, we propose an improved optimal Equilibrium Selection-MARL (ES-MARL) algorithm based on two mechanisms, private negotiation and maximum average reward. Simulation results demonstrate that the overall performance and efficiency of the proposed MARL are superior to those of SARL. It is also verified that the improved ES-MARL achieves a higher average profit and balances all agents.


Motivation
In recent years, the microgrid-based family energy framework has attracted increasing attention. This emerging residential energy system contains distributed Renewable Generations (RGs), household load appliances, and Energy Storage Units (ESUs). The application of the residential microgrid reduces the user's dependence on the main grid and improves the autonomy and flexibility of the family power system [1]. In practice, resident users, RGs, and ESUs constitute a small, independent microgrid market [2,3]. It is essential to formulate an intelligent and effective residential Microgrid Energy Scheduling (MES) mechanism that coordinates and balances the benefits of all members while guaranteeing the members' self-decision ability and information security.
Besides, Electric Vehicles (EVs) are becoming more and more widespread in resident life due to their various advantages. EVs can effectively mitigate the pollution problems of traditional energy; in addition, the energy consumption cost of EVs is lower than that of gasoline vehicles [4,5]. Moreover, the Vehicle-to-Grid (V2G) mode allows EVs to discharge energy to the power grid, participating in market scheduling and earning profit [6,7].
Considering the integration of EVs and RGs, residential MES becomes more complicated due to high uncertainties from both, such as RG output and EVs' usage attributes [8,9]. Therefore, several challenges remain in residential MES, as reflected in the previous studies reviewed in Section 1.2. First, it is difficult to build a precise scheduling model that attends to all the uncertain parameters. Second, common centralized dispatch methods need a completely open market environment and can be problematic with a large system and multiple constraints. Third, most works are myopic solutions that consider only the current objectives, instead of a long-term optimization.
As a solution to these challenges, a model-free machine learning method, Reinforcement Learning (RL), has proven effective on complicated decision-making problems. However, most RL methods adopt Single-Agent RL (SARL) to obtain the optimal policy, and defects exist in SARL whether the algorithm mode is centralized or decentralized. In centralized SARL, first, the microgrid control center gathers all participants' information and implements RL tasks centrally; each participant passively executes the learning results from the control center rather than learning autonomously. Second, participants need to upload necessary information to the control center, so part of their private information is at risk of leaking. Third, the penetration of EVs and RGs increases the computational complexity of SARL and may even lead to the curse of dimensionality. On the other hand, in decentralized SARL, each participant can learn and make decisions according to individual information and environment, but SARL is a kind of selfish learning that maximizes self-reward instead of considering the overall profit and supply-demand balance of the microgrid [10]. Besides, each participant's learning process is based only on its own available information; agents cannot obtain other agents' confidential information, such as providers' quotes and users' demand response behaviors. Therefore, the agents' learning processes are imprecise and there are no interactions among agents.
To address all the above issues, this paper presents a Multi-Agent RL (MARL) approach for residential MES. MARL is a distributed RL in a multi-agent environment that can be seen as a combination of RL and game theory [11]. Although the MARL framework is applicable to residential MES for constructing a distributed microgrid RL architecture, some limitations restrict the application of existing MARL methods in the microgrid field. First, most MARL algorithms require agents to replicate other agents' value functions and to calculate the equilibria for all joint-actions, which is computationally expensive. Then, if agents' information is not fully shared (an incomplete-information game), it is difficult to obtain a definite equilibrium solution [12]. Finally, the solved equilibrium solutions may not be unique; how to select the optimal equilibrium to balance all agents' rewards and to ensure the convergence of MARL is noteworthy [11,13]. In this paper, we present an Equilibrium Selection-based MARL (ES-MARL) method, in which an optimal Equilibrium Selection (ES) is performed according to two mechanisms: private negotiation with each agent and the maximum average benefit method.

Related Works
Several studies on residential MES considering EVs or RGs have been published using various approaches [14-20]. For example, in [14], a virtual power plant based on linear programming was used as a combination of wind generators and EVs to schedule the market. In [15], a dynamic programming-based economic dispatch for a community microgrid was formulated. The authors in [16] proposed a hierarchical control method to achieve coordinated scheduling integrating EVs and wind power. In [17], an EV coordination management algorithm was presented to minimize the load variance and to enhance the distribution network's efficiency. In [18], a game theory-based retail microgrid market was built and a Nikaido-Isoda Relaxation approach was adopted to obtain the optimal solution. In [19], a hierarchical control framework for a microgrid energy management system with RGs and an ESU was proposed. Ref. [20] studies the day-ahead scheduling of an integrated home-type microgrid and adopts a mixed-integer linear programming algorithm to achieve optimal energy management.
These previous studies cannot take all the challenges in MES into account; therefore, an RL-based method is adopted as a solution. The RL approach learns optimal policies through a trial-and-error mechanism that does not depend on prior knowledge of the system model, and RL has been widely used in the energy scheduling of smart grids and EVs [12,21-23]. For instance, ref. [12] focused on a smart grid market based on double auction, where an adaptive RL was used to find the Nash Equilibrium (NE) of an energy trading game with incomplete information. The authors in [21] proposed an RL-based dynamic pricing and energy consumption scheduling algorithm for the microgrid system. In [22], a batch RL approach was adopted in residential demand response to make a day-ahead consumption plan. In [23], the authors proposed RL-based real-time power management to solve the power allocation for a hybrid energy storage system in a plug-in hybrid EV.
Moreover, in this paper, a MARL method is used for sequential decision making in a multi-agent environment where traditional SARL is difficult to apply. MARL has been adopted in several fields, such as the vehicle routing problem [24] and the modeling of thermostatically controlled loads [25]. The most universal MARL is equilibrium-based MARL, whose framework accords with Markov games: the evaluation of the learning process is based on all agents' joint behaviors, and the equilibrium concept from game theory is introduced to denote the optimal joint action [26-30]. The earliest equilibrium-based MARL was the Minimax-Q algorithm, which considered a two-agent zero-sum game in which the two agents respectively maximize and minimize one reward function [26]. The authors in [27] proposed the Nash-Q algorithm for non-cooperative general-sum Markov games, where the NE solution is adopted to define the value function. In [28], a friend-or-foe Q-learning algorithm was presented for obtaining different solutions based on the agents' relationships. In [29], Correlated-Q learning was proposed based on the correlated equilibrium solution, which allows for the possibility of dependencies in the agents' optimization. In [30], the authors introduced a "Win or Learn Fast" (WoLF) mechanism to form a variable-learning-rate based MARL method.
These papers have made contributions to the domain of microgrid energy scheduling or RL algorithms. Through the analysis and improvement of these studies, in this paper, a MARL approach is adopted for the management and decision-making of a distributed microgrid market.

Contributions
To sum up, the principal contributions of this paper are summarized as follows.

• A framework for residential MES with a V2G system is built. All participants in the microgrid and the auction-based microgrid market mechanism are modeled.
• The MARL algorithm is introduced for the first time into a residential microgrid. The RL model (states, actions, and immediate reward) of each agent is formulated.
• An improved ES-MARL is proposed, in which the Equilibrium Selection Agent (ESA) calculates the corresponding equilibrium solutions by negotiating with agents and selects the optimal solution based on the maximum average reward.

Organization
The structure of this paper is as follows. In Section 2, we present the microgrid model and market mechanism. In Section 3, we summarize the MARL theory and the microgrid agents' design of MARL. In Section 4, we propose the ES-based MARL method. Section 5 presents the simulation results and analyses. We conclude the paper in Section 6.

Residential Microgrid Description
With the development of RGs and EVs, the residential microgrid has been increasingly used. For an urban residential area, the fluctuations of daily load curves are high and the distributions of residents, RGs, and EV connections are stochastic; moreover, participants have higher requests for autonomous energy management from the perspective of economy and privacy protection; besides, the residential microgrid should consider more about environmental concerns and power safety. In this paper, a distributed framework for the residential microgrid, which is more applicable to meet the above requirements, is adopted. As depicted in Figure 1, the 9-node residential microgrid is built based on a multi-agent system [31,32]; all participants are modeled as profit-aware intelligent agents with abilities of autonomous learning and decision-making. Agents in the market should comply with the microgrid market mechanism and follow market scheduling based on a global optimization goal. Based on their roles in the market, microgrid agents are divided into different clusters: RGs are independent supplier agents; users are consumer agents and can join in microgrid scheduling through demand response; a unified management for EVs acts as a supplier agent or a consumer agent according to the overall charging/discharging action; besides, manager agents (e.g., market operator and equilibrium selector) are set to maintain market operation. The primary objective for the microgrid in grid-connected mode is to achieve maximum autonomy, that is, to provide the necessary load demand with the minimum dependency on the Utility Grid (UG). Besides, agents' benefits should be considered to realize a unanimously acceptable balance. All agents' concrete models are described as follows.

Electric Vehicles
EVs are both power consumers and power suppliers in the microgrid market. To evaluate the net profit of V2G, its negative effects, namely the extra battery degradation cost and the impact on subsequent travel, should be taken into account. Distinct from stationary ESUs, EVs' charging/discharging actions are affected by the owners' traveling habits and the stochastic behaviors of EV usage. Besides, the specific operational and technical constraints of EVs should be noticed.
(1) EV Travel Behaviors and Constraints: EV charging/discharging scheduling should take travel demand as its premise. Customary travel habits (e.g., arrival time, departure time, and travel distance) follow a similar pattern based on the owner's intentions; nevertheless, the random characteristics of travel behaviors are considerable [33].
The State-Of-Charge (SOC) of EV i at slot t is defined as soc_t^i; n_e is the current number of EVs, i ∈ {1, 2, 3, ..., n_e}. The constraint on the EV SOC is

soc_min ≤ soc_t^i ≤ soc_max,

where soc_min and soc_max are the minimum and maximum limits of the EV battery SOC, respectively. The charging/discharging limitations are

−v_max^id ≤ v_t^i ≤ v_max^ic,

where v_t^i is EV i's charging/discharging power at slot t; v_t^i is positive when the EV is charging and negative when it is discharging. v_max^id = r_d^i t and v_max^ic = r_c^i t, where r_c^i and r_d^i are EV i's battery charging and discharging rates. Then, we have

soc_{t+1}^i = soc_t^i + v_t^i / b_m^i,

where b_m^i is EV i's battery capacity. If EV i will leave at slot t, the SOC should satisfy the travel demand:

soc_t^i ≥ soc_dis^i,

where soc_dis^i is EV i's minimum SOC for departure. Remarkably, only EVs adopting slow charging can participate in the microgrid market; EVs that need urgent travel choose fast charging. The EV owner's travel habits are drawn from data statistics: arrival time and departure time follow normal distributions, and travel distance follows a log-normal distribution.
(2) EV Battery Degradation Cost: The life of an EV battery declines with repeated charging/discharging cycles. For the lithium iron phosphate battery adopted in this paper, low temperatures weaken performance and high temperatures curtail battery life; use of the battery in moist conditions should be avoided. Besides, a prolonged period of overvoltage during long travel may stress the battery [34]. In sum, we consider that EV battery degradation depends on the number of cycles. According to [35], the battery degradation cost function can be approximately expressed as

C(v_t^i) = k C_b^i |v_t^i| / b_m^i,

where C(v_t^i) is the battery degradation cost of v_t^i, C_b^i represents the battery cost, and k is the slope of the linear approximation of battery life as a function of the number of cycles.
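The SOC constraints, the SOC update, and the linear degradation cost above can be sketched in a few lines of Python. All numerical parameter values here (capacity, rate limits, SOC bounds, the cost slope k) are illustrative assumptions, not the paper's data:

```python
def clip_power(v, soc, b_m, v_ic_max, v_id_max, soc_min=0.1, soc_max=0.9):
    """Clip a requested charging(+)/discharging(-) power v so that both the
    per-slot rate limits and the SOC limits of the EV battery hold."""
    v = max(-v_id_max, min(v_ic_max, v))   # -v_id_max <= v <= v_ic_max
    v = min(v, (soc_max - soc) * b_m)      # cannot push SOC above soc_max
    v = max(v, (soc_min - soc) * b_m)      # cannot pull SOC below soc_min
    return v

def next_soc(soc, v, b_m):
    """SOC update over one slot: soc' = soc + v / b_m."""
    return soc + v / b_m

def degradation_cost(v, c_b, b_m, k=3e-4):
    """Degradation cost, linear in the cycled energy |v| (k is an assumed slope)."""
    return k * c_b * abs(v) / b_m
```

Note how the SOC clipping implicitly enforces the travel-demand rule as well: a departure requirement can be imposed by raising `soc_min` toward `soc_dis` for that EV.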
(3) EV Aggregator and EV Secondary Scheduling: In our model, an EV Aggregator (EVA) is adopted to manage all EVs' participation in the market. The introduction of the EVA facilitates model extendibility and adjustment when the number of EVs changes; besides, reducing the number of agents improves the convergence speed of the learning algorithm.
Based on [33], all local EVs participating in the microgrid market (the primary market) form a secondary scheduling system of EVs, and EVA is the manager of this secondary system. At the beginning of each slot, EVA arranges each EV's charging/discharging amount based on the optimal global charging/discharging action from the primary market; besides, the following rules should be obeyed.

• EVs' travel demand should be satisfied first of all: EVA considers EVs' travel demand two hours ahead of departure and guarantees that the SOC is more than soc_dis^i.
• EVA arranges the EV charging/discharging sequence according to SOC level; if soc_t^i < soc_cha_min, this EV cannot discharge and should be arranged to charge, where soc_cha_min is the charge warning limit.
• The total charge/discharge amount from the primary market is the scheduling objective of the secondary system; the sum of the EVs' charges/discharges should equal this total.
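The secondary-scheduling rules above can be sketched as a simple allocation routine. The priority ordering and the two-hour look-ahead follow the text; the EV attribute names and all values are illustrative assumptions:

```python
def allocate(total, evs, soc_dis=0.6, soc_cha_min=0.2):
    """Split the primary-market total charge(+)/discharge(-) power among EVs.
    Each ev is a dict with keys 'soc', 'cap', 'vmax', 'hours_to_departure'."""
    plan = [0.0] * len(evs)
    # Rule 1: guarantee travel demand two hours ahead of departure.
    for i, ev in enumerate(evs):
        if ev['hours_to_departure'] <= 2 and ev['soc'] < soc_dis:
            plan[i] = min(ev['vmax'], (soc_dis - ev['soc']) * ev['cap'])
            total -= plan[i]
    # Rules 2-3: order by SOC and distribute the remaining total exactly.
    charging = total >= 0
    order = sorted(range(len(evs)), key=lambda i: evs[i]['soc'],
                   reverse=not charging)  # charge low-SOC first, discharge high-SOC first
    for i in order:
        if abs(total) < 1e-12:
            break
        if not charging and evs[i]['soc'] < soc_cha_min:
            continue  # below the charge warning limit: this EV may not discharge
        step = min(evs[i]['vmax'] - abs(plan[i]), abs(total))
        plan[i] += step if charging else -step
        total -= step if charging else -step
    return plan
```

The returned per-EV plan sums to the primary-market total whenever the fleet's aggregate limits permit, matching the third rule.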

Users' Loads
The residential appliances keep approximately steady in identical periods during the same season; therefore, users' load demand can be predicted accurately. User i's load demand at slot t is written as d_t^i, with d_{t,min}^i ≤ d_t^i ≤ d_{t,max}^i, where d_{t,min}^i and d_{t,max}^i are the minimal essential demand and the maximal available demand, respectively. n_u is the number of users, i ∈ {1, 2, 3, ..., n_u}. According to operating profiles, household loads can be categorized into three types as follows [36].
(1) Fixed loads: this kind of load demand cannot be changed, so that the devices stay in working order; examples include web servers and medical instruments. User i's fixed (critical) load profile is denoted as d_t^{i,f}. (2) Curtailable loads: the demand can be cut down to reduce consumption, such as for heating or cooling devices; the curtailable load profile is denoted as d_t^{i,c}. (3) Shiftable loads: the demand can be shifted to other time slots; the shiftable load profile is denoted as d_t^{i,s}.

Users can regulate the schedules of curtailable loads and shiftable loads to control the load consumption; meanwhile, load curtailment and shifting reduce the user's consumption satisfaction. In this paper, a utility function U(l_t^i) representing the user's satisfaction is adopted as

U(l_t^i) = ω l_t^i − β (l_t^i)^2, if l_t^i ≤ ω/(2β); U(l_t^i) = ω^2/(4β), otherwise,

where ω reflects the user's valuation of consumption (a larger ω implies a larger utility) and β denotes the utility saturation. The utility function is non-decreasing and concave. Similarly, in the learning process of MARL, a User Aggregator (UA) is introduced to centrally arrange the demand responses of all users.
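A minimal sketch of such a satisfaction function follows; the quadratic-with-saturation form and the default parameter values are assumptions consistent with the stated properties (non-decreasing, concave, saturating), not a quotation of the paper's exact formula:

```python
def utility(l, omega=2.0, beta=0.5):
    """Non-decreasing, concave quadratic utility with saturation at omega/(2*beta)."""
    sat = omega / (2.0 * beta)  # consumption level beyond which utility is flat
    l = min(l, sat)
    return omega * l - beta * l * l
```

The concavity means each additional unit of consumption yields less satisfaction, which is what makes curtailing load during price peaks rational for the user.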

Renewable Generations
RGs' generation capacity can be derived from accurate short-term forecasts via historical data and environmental data. The random characteristics of RGs' generation are represented as stochastic normal distribution based on forecast results.
From [37], the PV generation g_t^pv is closely affected by the weather factors and is determined as

g_t^pv = η S^pv I_t [1 − 0.005(T − 25)],

where η is the conversion rate of the PV array (%); S^pv is the area of the PV array (m^2); I_t is the solar irradiance, drawn from a Beta probability density function (kW/m^2) [1]; and T is the average temperature during t (°C).
The main element influencing the output of the WT is the wind speed V_t. The WT generation g_t^wt is a piecewise function of V_t, denoted as

g_t^wt = 0, if V_t < V_in or V_t ≥ V_off;
g_t^wt = P_r^wt (V_t − V_in)/(V_r − V_in), if V_in ≤ V_t < V_r;
g_t^wt = P_r^wt, if V_r ≤ V_t < V_off,

where V_in and V_off are the cut-in and cut-off wind speeds, respectively; V_r is the rated wind speed; and P_r^wt is the rated power of the WT.
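The two generation models can be sketched directly. The piecewise wind-power curve follows the text; the 0.5%/°C PV temperature-derating coefficient and all numeric defaults are common assumptions, not values from the paper:

```python
def pv_output(eta, s_pv, irradiance, temp_c):
    """PV output: conversion rate * array area * irradiance, derated by
    temperature (assumed 0.5% per degree C above 25 C)."""
    return eta * s_pv * irradiance * (1.0 - 0.005 * (temp_c - 25.0))

def wt_output(v, v_in=3.0, v_r=12.0, v_off=25.0, p_r=10.0):
    """Piecewise wind-turbine power curve: zero outside [v_in, v_off), a linear
    ramp between cut-in and rated speed, rated power up to cut-off."""
    if v < v_in or v >= v_off:
        return 0.0
    if v < v_r:
        return p_r * (v - v_in) / (v_r - v_in)
    return p_r
```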

Transaction with Utility Grid
The microgrid keeps supply-demand balance through energy trade with the UG; the transaction prices between the UG and the microgrid are fixed by the UG. Here, the UG adopts a real-time price mode in which the price varies at each time slot. The trade price between the UG and the microgrid market is summarized as a tuple p_t^u = (p_t^{u,p}, p_t^{u,s}), where p_t^{u,p} and p_t^{u,s} are the prices for purchasing from and selling to the UG, respectively.

Microgrid Market Mechanism
As suggested by [33,40], a real-time microgrid market is constructed based on a one-side dynamic bidding model. In the microgrid market, supplier agents provide their quotes with bid price and supply amount; consumer agents submit their energy demand, which can be optimized by adjusting their consumption behaviors. The information from sellers and consumers aggregates at a non-profit Market Operator Agent (MOA), whose duty is to set the market clearing price p_t^m and to calculate each agent's energy sell/purchase amount. The microgrid market operates per time slot t, abiding by the following principles.

• In the market clearing process, MOA sorts sellers' quotes in increasing order of price; the demand is then matched according to this ranking and the respective supplies. The bid price of the last adopted quote is defined as the marginal price, that is, the market clearing price p_t^m.
• If the energy supply is insufficient to cover the load demand, MOA purchases energy from the UG. The purchase price from the UG is higher than the sellers' bids, and the additional expenditure is shared equally among all purchasers.
• If the supply exceeds the demand, the surplus energy is sold to the UG. If more than one seller offers the market clearing price, the energy they sell to the microgrid and the UG is allocated in proportion to their supplies.
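The first two clearing principles can be sketched as a small merit-order matching routine; the data shapes and all numbers are illustrative assumptions:

```python
def clear_market(quotes, demand):
    """quotes: list of (bid_price, supply) pairs. Match demand against quotes
    in increasing price order; the last accepted bid sets the clearing price.
    Returns (clearing_price, accepted amount per quote, shortfall bought from UG)."""
    order = sorted(range(len(quotes)), key=lambda i: quotes[i][0])
    accepted = [0.0] * len(quotes)
    price = None
    remaining = demand
    for i in order:
        if remaining <= 0:
            break
        bid, supply = quotes[i]
        take = min(supply, remaining)
        accepted[i] = take
        remaining -= take
        price = bid  # marginal price of the last accepted quote
    return price, accepted, max(remaining, 0.0)
```

Any returned shortfall would be bought from the UG at its (higher) price, with the extra expenditure spread over the purchasers as described above.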

MARL Method for Microgrid Market
For a multi-agent based microgrid system, the most important task is to generate the agents' distributed strategies to schedule their behaviors in the market; moreover, it is important to keep all agents' benefits balanced. In this section, a MARL method is introduced to generate and coordinate the autonomous strategies of all agents.

Overview of MARL
The RL algorithm is a machine learning method for sequential decision problems. In SARL, an agent interacts with the environment by executing actions, and the environment feeds back an immediate reward to evaluate the selected action. Moving to MARL, relationships of both cooperation and competition exist among agents, and agents' rewards are influenced by other agents' states and actions. SARL is built on the framework of the Markov Decision Process (MDP); in MARL, the framework is the Markov game, which is the combination of MDP and game theory [41,42].

Definition 1. Markov Game. An n-agent (n ≥ 2) Markov game is a tuple ⟨S, A_1, ..., A_n, r_1, ..., r_n, P⟩, where n is the number of autonomous agents; S is the state space of the system and environment; A_i is the action space of agent i, i ∈ {1, 2, ..., n}, so the joint-action space is defined as A = A_1 × ... × A_n and a joint-action is a = (a_1, a_2, ..., a_n); r_i is the immediate reward function of agent i; and P is the transition function, which denotes the probability distribution P(s'|s, a) that the current state s transfers to a new state s' when joint-action a is executed.

As shown in Figure 2, compared with SARL, the main distinction of MARL is that the reward functions and state transitions are based on the joint-action a. In a pure-strategy game, when the joint-action a = (a_1, a_2, ..., a_n) is applied and the state transfers from s to s', an action-evaluation Q-function for agent i, Q_i(s, a), is defined as

Q_i(s, a) = r_i(s, a) + γ Σ_{s'} P(s'|s, a) V_i(s'),

where V_i(s') is the maximum discounted cumulative future reward starting from state s', and 0 ≤ γ < 1 is the discount factor, which indicates the weight of future rewards. In SARL, agent i's goal is to find an optimal action policy a_i*(s) that maximizes its Q-function; the goal in MARL is to find an optimal joint-action a*(s) that coordinates all the Q-functions {Q_i(s, a)}_{i=1}^n to a global equilibrium. The concept of an equilibrium solution from game theory is therefore introduced into MARL; the generally used equilibrium solution is the NE [27].

Definition 2. Nash Equilibrium. In a pure-strategy Markov game, an NE solution is a joint-action a* = (a_1*, ..., a_i*, ..., a_n*) satisfying, for all n agents and all a_i ∈ A_i,

Q_i(s, a_1*, ..., a_i*, ..., a_n*) ≥ Q_i(s, a_1*, ..., a_i, ..., a_n*).

An intuitive comprehension of NE is that, for every agent i, if the other agents do not change their actions a_{−i}, agent i cannot improve its utility Q_i(a) by changing its own action a_i, where a_{−i} is the joint-action of all agents except agent i.
At time slot t, the iterative update formula of Nash-Q MARL is expressed as

Q_{t+1}^i(s_t, a_t) = (1 − α) Q_t^i(s_t, a_t) + α [r_t^i + γ NashQ_t^i(s_{t+1})],

where α is the learning rate and NashQ_t^i(s_{t+1}) is agent i's payoff at the NE of the stage game in state s_{t+1}. In general, the major structures of other equilibrium-based MARL algorithms are similar to Nash-Q MARL; the main difference lies in the equilibrium selected during the learning process.
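As a toy illustration of Definition 2 and the Nash-Q update for two agents with small discrete action sets (all payoff values below are made up, not from the paper):

```python
import numpy as np

def pure_nash(Q1, Q2):
    """All pure-strategy NE (row, col) of a two-agent stage game given the
    agents' Q-matrices: neither agent can gain by deviating unilaterally."""
    eq = []
    for i in range(Q1.shape[0]):
        for j in range(Q1.shape[1]):
            if Q1[i, j] >= Q1[:, j].max() and Q2[i, j] >= Q2[i, :].max():
                eq.append((i, j))
    return eq

def nash_q_update(q_sa, r, nash_value, alpha=0.1, gamma=0.9):
    """One Nash-Q step: (1-alpha)*Q(s,a) + alpha*(r + gamma*NashQ(s'))."""
    return (1 - alpha) * q_sa + alpha * (r + gamma * nash_value)
```

On a prisoner's-dilemma-style stage game, `pure_nash` returns the single mutual-defection cell even though both agents would prefer mutual cooperation, which is exactly why equilibrium *selection* matters later in the paper.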

Agents Design for Microgrid MARL
Equilibrium-based MARL is applicable in a mixed-task (competition coexisting with cooperation) multi-agent system. In the residential microgrid market, there are both competitive relations (e.g., suppliers' quotes) and cooperative relations (e.g., sellers and purchasers collaborating to achieve supply-demand balance and microgrid autonomy). Detailed formulations of MARL for residential MES are introduced in this section.
First of all, all agents' MDP models are designed as follows.

EV Aggregator Agent
In the primary microgrid market, EVA agent participates in the market as a centralized agent of all EVs. EVA decides the total charging/discharging in the market based on MARL results. In EVA's learning process, the current local EVs number, SOC and travel demand of EVs should be considered.
State: The state-base of the EVA agent at slot t is defined as

s_t^eva = (t, soc_t, nev_t),

where soc_t is the average SOC of the local EVs and nev_t is the number of local EVs connected to the microgrid. According to t, the UG price p_t^u can be obtained.

Action: EVA's actions include the total charging/discharging power v_t and EVA's quote p_t^eva. When v_t > 0, EVA acts as a purchaser; when v_t < 0, EVA serves as a seller; v_t = 0 means that there is no energy trade. Only when v_t < 0 does p_t^eva exist. The action-base of the EVA agent is denoted as

a_t^eva = (v_t, p_t^eva).

Considering that an EV can execute charge/discharge actions only while connected to the microgrid (non-movement state) [43], the action of EVA is constrained by the average SOC soc_t, the local EVs' number nev_t, and the travel demand (denoted by soc_dis). The total charging/discharging power v_t is confined by the connected EVs' aggregate limits as

−Σ_{i=1}^{nev_t} v_max^id ≤ v_t ≤ Σ_{i=1}^{nev_t} v_max^ic.

Besides, when soc_t ≤ soc_min + min{v_max^id / b_m^i}_{i=1}^{n_e}, EVA can only select energy-purchase actions; when soc_t ≥ soc_max − max{v_max^ic / b_m^i}_{i=1}^{n_e}, EVA can only select energy-sell actions.

Reward: According to statistics, more than 90% of EV users charge the SOC up to 60% before leaving, and the user's anxiety about exhausting energy on the road aggravates increasingly with the decline of SOC. Therefore, the immediate reward function r_t^eva of EVA combines economy (trading profit), battery degradation cost, and the user's anxiety.

User Aggregator Agent
The UA agent decides the total demand response action based on the results of MARL; the demand response task is then assigned equally to all users in the secondary scheduling.
State: The state-base of the UA agent at slot t is defined as

s_t^ua = (t, d_t),

where t is the current time slot and d_t is the cumulative load demand of all users without demand response.

Action: The action-base of the UA agent at slot t is defined as

a_t^ua = (l_t^c, l_t^s),

where l_t^c is the total ratio of load demand curtailment and l_t^s is the cumulative shiftable load demand.
Here, d_t^f, d_t^c, and d_t^s denote the total demands of the fixed, curtailable, and shiftable loads, respectively.
Reward: Similar to Equation (7), the actual cumulative load demand l_t is obtained from d_t after applying the curtailment ratio l_t^c and the shifted demand l_t^s. The UA agent's immediate reward function r_t^ua is defined as the difference between the users' total utility and their energy purchase expenses:

r_t^ua = U(l_t) − p_t^m l_t^m − p_t^{u,p} l_t^u,

where l_t^m + l_t^u = l_t; l_t^m and l_t^u are the energy purchased from the microgrid market and from the UG, respectively; p_t^m is the market clearing price; and p_t^{u,p} is the purchase price from the UG.

RG Agents
There are two kinds of RGs, Photo-Voltaic (PV) and Wind Turbine (WT), the output of RGs is based on the short-term forecasts, and random distributions with forecast values will be adopted to embody the uncertainties of RGs' generation. All of the RGs' generation will be put into the market at the current time slot.
State: The state of RG agents is current time slot t.
Action: The actions of the RG agents are their quote prices, denoted as

a_t^rg = p_t^rg, rg ∈ {pv, wt},

where p_t^pv and p_t^wt are the quote prices of the PV agent and the WT agent, respectively.

Reward: The RG agents' immediate reward functions are profit functions:

r_t^rg = p_t^m g_t^{rg,m} + p_t^{u,s} g_t^{rg,u} − C_rg(g_t^rg), rg ∈ {pv, wt},

where g_t^{rg,m} is the portion sold to the microgrid; g_t^{rg,u} is the portion sold to the UG at the UG sell price p_t^{u,s}; and g_t^{rg,m} + g_t^{rg,u} = g_t^rg. C_rg is the generation cost function, which is considered as a quadratic function:

C_rg(g_t^rg) = c_1 (g_t^rg)^2 + c_2 g_t^rg + c_3,

where c_1, c_2, c_3 ≥ 0 are pre-determined parameters that differ between PV and WT.
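The RG profit structure (revenue from both markets minus a quadratic generation cost) can be sketched as follows; the default cost coefficients mirror the PV example used in the simulation section, while the price arguments are illustrative:

```python
def rg_profit(g_m, g_u, p_m, p_u_sell, c1=0.1, c2=1.0, c3=0.0):
    """RG reward: microgrid revenue + UG revenue minus the quadratic
    generation cost C(g) = c1*g^2 + c2*g + c3 on the total output g."""
    g = g_m + g_u
    return p_m * g_m + p_u_sell * g_u - (c1 * g * g + c2 * g + c3)
```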

MARL Method for Residential MES
Based on the agents' MDP model designs, equilibrium-based MARL is adopted. The general framework of the agents' learning process for microgrid MARL is shown in Algorithm 1.
At line 3 of Algorithm 1, the ε-greedy policy denotes that the agent selects a random action with probability 1−ε and selects a joint action that makes the system achieve equilibrium with probability ε. Φ_t^i(s_{t+1}^i) is the expected value of the equilibrium in state s_{t+1}^i.

Input:
Agent set N; state space S^i; action space A^i; learning rate α^i; discount factor γ^i; joint-action space A.
1: Initialize Q^i(s^i, a) ← 0; initialize state s_t^i ∈ S^i, and t = 0;
2: repeat
3: For each time slot t, each agent selects a_t^i ∈ A^i with the ε-greedy policy to form a joint-action a_t ∈ A;
4: MOA calculates the market clearing price p_t^m and the energy to be traded by each agent;
5: Each agent obtains the experience (s_t^i, a_t, r_t^i, s_{t+1}^i);
6: Each agent updates the Q-matrix: Q_{t+1}^i(s_t^i, a_t) = (1 − α^i) Q_t^i(s_t^i, a_t) + α^i [r_t^i + γ^i Φ_t^i(s_{t+1}^i)];
7: t ← t + 1;
8: until the Q-functions converge

Output:
The optimal Q-matrix Q i (s i , a) for each agent.
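The learning loop of Algorithm 1 can be sketched as follows. The environment, the agent, and the equilibrium computation are stubbed out (all class and method names here are illustrative, not from the paper), leaving only the ε-greedy selection and the Q-update structure:

```python
import random

class StubMarketEnv:
    """Stand-in for the microgrid market; rewards and transitions are trivial
    so that only the control flow of Algorithm 1 is exercised."""
    def reset(self):
        return [0]                   # per-agent initial states
    def equilibrium_joint_action(self, states):
        return (0,)                  # stubbed equilibrium joint-action
    def step(self, joint_action):
        return [1.0], [0], True      # rewards, next states, done

class TabularAgent:
    actions = [0]
    def __init__(self):
        self.Q = {}                  # (state, joint_action) -> value
    def equilibrium_value(self, state):
        return 0.0                   # Phi^i(s'), stubbed

def train(agents, env, episodes=10, alpha=0.1, gamma=0.9, eps=0.8):
    """Skeleton of Algorithm 1's learning loop."""
    for _ in range(episodes):
        states, done = env.reset(), False
        while not done:
            if random.random() < eps:   # exploit the equilibrium joint-action
                joint = env.equilibrium_joint_action(states)
            else:                       # explore a random joint-action
                joint = tuple(random.choice(ag.actions) for ag in agents)
            rewards, nxt, done = env.step(joint)
            for i, ag in enumerate(agents):
                phi = ag.equilibrium_value(nxt[i])
                key = (states[i], joint)
                ag.Q[key] = (1 - alpha) * ag.Q.get(key, 0.0) \
                            + alpha * (rewards[i] + gamma * phi)
            states = nxt
```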

Improved Equilibrium-Selection-Based MARL
As mentioned in Section 1.1, the limitations of common MARL algorithms are summarized as follows.

• In the learning process of common MARL, each agent updates and saves the other agents' value functions at every iteration step, which causes a huge computational burden even in a small-scale environment with two or three agents.
• To work out the equilibrium solution, MARL needs agents to share their states, actions, and value functions, including some private information, which is unrealistic in some situations.
• In each learning iteration there may be more than one equilibrium solution; different equilibria bring different updates of the value functions, which may lead to non-convergence of the algorithm. Besides, it is hard to ensure fairness when selecting an equitable equilibrium because agents' rewards differ across equilibria.
Therefore, we present an ES-MARL algorithm to address these issues. We set up an Equilibrium Selection Agent (ESA), whose function is to negotiate with all agents separately to obtain the equilibrium solution set and to select the optimal equilibrium based on the maximum average benefit method.

Negotiation for Equilibrium Solution Set
In an incomplete-information game, each agent's reward information is not fully public, so agents cannot obtain other agents' value functions to compute the equilibria. To solve this problem, according to [11], ESA is adopted as a neutral negotiator that communicates with each agent privately to obtain the potential equilibrium set, following these steps:

1. At the beginning of slot t, agent i finds its potential NE set Z_ne^i and sends it to ESA (the concrete steps for finding the potential NE set are shown in Algorithm 2).
2. ESA selects every joint-action a_j ∈ A that meets the criterion a_j ∈ Z_ne^i for all i into the final equilibrium set Z_e; such an a_j is a pure-strategy NE solution of the game (Q_1(s_t), ..., Q_n(s_t)).
3. If no joint-action satisfies the NE criterion for all agents, ESA selects the a_j whose count k_j of agents satisfying a_j ∈ Z_ne^i is largest, provided k_j > 0.5n (more than half of the agents reach equilibrium), and adds it into Z_e.

Now the equilibrium set Z_e is the candidate set obtained by negotiation; the number of elements of Z_e may be more than one.

Input:
Agent set N; current state s^i ∈ S^i; joint-action space A; agent number n; weight factor β.
1: for each agent i ∈ N do
2: Z_ne^i ← ∅;
3: for each joint-action a_j ∈ A do
4: if a_j satisfies the NE criterion of Definition 2 for agent i then
5: Z_ne^i ← Z_ne^i ∪ {a_j};
6: end if
7: end for
8: Agent i sends Z_ne^i to ESA;
9: end for
10: for each a_j ∈ A do
11: if a_j ∈ Z_ne^i for all i ∈ N then Z_e ← Z_e ∪ {a_j};
12: else calculate k_j, which is the number of agents with a_j ∈ Z_ne^i;
13: end if
14: end for
15: if Z_e ≠ ∅ then
16: Z_e is the final pure-strategy NE set;
17: else
18: if max_j k_j > 0.5n then
19: a_j ← arg max_{a_j} k_j; Z_e ← Z_e ∪ {a_j}; Z_e is the final suboptimal set;
20: else
21: Z_e ← A;
22: end if
23: end if
24: for each a_j ∈ Z_e do
25: Calculate the value of the joint benefit function J(a_j);
26: end for
27: a* ← arg max_{a_j} J(a_j);

Output:
The optimal equilibrium a * .

Equilibrium Selection Based on Maximum Average Reward
If Z_e contains more than one element, the Q-function update shown in Equation (13) will take different values depending on the selected equilibrium. In this paper, a maximum average reward method is adopted to help ESA select the optimal equilibrium, guaranteeing algorithmic efficiency and fairness.
Here, we introduce an average reward function J that denotes all agents' average reward:

J(a_j) = (1/n) Σ_{i=1}^n Q^i(s^i, a_j).

ESA selects the optimal equilibrium (joint-action a*) whose corresponding value of J is maximum. The selected optimal equilibrium is then used in MARL to update the Q-functions. The ES process is shown in Algorithm 2.
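As a small sketch of this selection rule (averaging the agents' Q-values is an assumed concrete form of the average reward J; the data shapes are illustrative):

```python
def select_equilibrium(z_e, q_tables, states):
    """Pick a* = argmax over Z_e of J(a) = (1/n) * sum_i Q^i(s^i, a).
    q_tables[i] maps (state, joint_action) pairs to values."""
    n = len(q_tables)
    return max(z_e, key=lambda a: sum(q[(s, a)]
                                      for q, s in zip(q_tables, states)) / n)
```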
Based on the optimal equilibrium joint action a*, the improved ES-MARL algorithm is shown in Algorithm 3. Private negotiation between the ESA and the other agents avoids redundant updates of each agent's Q-function and protects private information; meanwhile, the optimal ES process combines all agents' benefits and promotes global welfare. The fairness, safety, and efficiency of the microgrid market are thus guaranteed by ES-MARL.
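The negotiation and selection steps above can be sketched in code. This is a minimal Python illustration (the paper's experiments use Matlab): it assumes the ESA has already collected, through private negotiation, each agent's potential NE set and each agent's reward for every joint action; all names (`select_equilibrium`, `Z_ne`, `rewards`) are ours, not the paper's.

```python
# Sketch of the ES procedure: intersect the agents' potential NE sets,
# fall back to the joint action accepted by more than half the agents,
# then pick the candidate maximizing the average reward J.

def select_equilibrium(Z_ne, rewards, joint_actions, n):
    """Return the joint action a* maximizing the average reward J."""
    # Steps 1-2: pure-strategy NE = joint actions in every agent's NE set.
    Z_e = [a for a in joint_actions if all(a in Z_ne[i] for i in range(n))]
    if not Z_e:
        # Step 3: joint action accepted by more than half the agents.
        counts = {a: sum(a in Z_ne[i] for i in range(n)) for a in joint_actions}
        best = max(joint_actions, key=lambda a: counts[a])
        Z_e = [best] if counts[best] > 0.5 * n else list(joint_actions)
    # Maximum average reward selection: J(a) = (1/n) * sum_i r_i(a).
    def J(a):
        return sum(rewards[i][a] for i in range(n)) / n
    return max(Z_e, key=J)
```

For a two-agent toy market with joint actions `("buy", "sell")` and `("hold", "sell")`, the function returns the intersection element when one exists, and otherwise falls back to the highest-average-reward candidate.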

Input:
Agent set N; state space S_i; action space A_i; learning rate α_i; discount factor γ_i; joint-action space A.
1: Initialize Q_i(s_i, a) ← 0; initialize state s_i,t ∈ S_i, and t = 0;
2: repeat
...

Output:
The optimal Q-matrix Q_i(s_i, a) for each agent.
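Inside the repeat loop, each agent's Q-function is updated toward the value of the equilibrium selected by the ESA (the update the text refers to as Equation (13)). Since the equation itself is not reproduced in this excerpt, the following Python sketch shows only the generic equilibrium-based temporal-difference form such updates take; the function name and dictionary-backed Q-table are our illustrative assumptions.

```python
# Sketch of one equilibrium-based Q-update for agent i, in the generic
# form Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*Q(s', a'*)),
# where a'* is the equilibrium joint action the ESA selects at s'.
# The dict-backed Q-table and default parameters are illustrative.

def q_update(Q, state, joint_action, reward, next_state, a_star_next,
             alpha=0.1, gamma=0.9):
    """Move Q(state, joint_action) toward the selected equilibrium value."""
    target = reward + gamma * Q.get((next_state, a_star_next), 0.0)
    old = Q.get((state, joint_action), 0.0)
    Q[(state, joint_action)] = (1 - alpha) * old + alpha * target
    return Q[(state, joint_action)]
```

Because the bootstrap term uses the ESA's selected equilibrium a'* rather than each agent's selfish maximum, all agents' Q-functions converge toward a shared equilibrium strategy.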

Overall Process of Proposed ES-MARL Approach
To sum up, Figure 3 presents a flowchart of the proposed ES-MARL approach for microgrid energy scheduling, covering the MARL training process and the MARL application process. As Figure 3 shows, in the training process of the ES-MARL algorithm, the ES procedure (shown in Section 4, Algorithm 2) connects all agents and selects the optimal joint action for them; the learning process (shown in Section 3) then performs each agent's Q-function iteration based on the agents' MDP models and the microgrid model, and the learning result is each agent's optimal Q-function. When the learning results are adopted in practical microgrid market operation (the market model is shown in Section 2), each agent selects its current optimal action to participate in the market based on its own current state information and optimal Q-function.

Simulation Results and Analysis
In this section, three groups of simulations are conducted to evaluate the proposed MARL algorithm for residential MES. First, the performances of MARL and SARL for MES are compared; then, the effect of the proposed ES-MARL is verified, with the Nash-Q algorithm used as a comparison; finally, the secondary scheduling system of EVs is simulated. In our microgrid model, an urban residential district is considered. The RGs in the microgrid include one PV and one WT. The RGs' daily forecast outputs are extracted from the historical data of a certain area. In the microgrid market operation simulations, stochastic models of RG generation are used: the RGs' actual generation values are drawn from probability distributions built around the forecast outputs, with the distributions of PV and WT based on [44]; the generation cost functions for PV and WT are C_t^pv = 0.1g_t^2 + g_t and C_t^wt = 0.1g_t^2 + 0.5g_t. Figure 4a shows the real-time energy purchase price from the UG, the fiducial forecast values of PV and WT output, and the users' total daily load demand. The number of users is 3; the user utility function is U(d_t^i), where ω lies in the interval [1,4], taking high values in load peak periods and low values in load trough periods, and β = 0.5.
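The two quadratic generation cost functions quoted above translate directly into code, with g the RG's output in the slot; only the function names are ours.

```python
# Generation cost functions for PV and WT as given in the text,
# C_pv = 0.1*g^2 + g and C_wt = 0.1*g^2 + 0.5*g, with g the slot output.

def cost_pv(g):
    return 0.1 * g ** 2 + g

def cost_wt(g):
    return 0.1 * g ** 2 + 0.5 * g
```

At equal output the WT's cost is lower by 0.5g, since the two functions share the same quadratic term and differ only in the linear coefficient.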
In our microgrid, there are 10 EVs, whose parameters are shown in Table 1. Besides, following [45], the numbers of arriving and departing EVs at slot t follow the normal distributions N(n_t^arr, 1^2) and N(n_t^dep, 1^2), whose standard values are shown in Figure 4b. The EVs' battery SOC is bounded between 0.2 and 0.9; the SOC of arrivals is sampled from N(0.5, 0.177^2); the travel distance D of an EV follows a log-normal distribution, ln D ∼ N(1.79, 1.09^2). All simulations are conducted using Matlab 2018a on a personal computer with an Intel Core i7-6700 CPU @ 3.40 GHz. The ε-greedy strategy is adopted as the action selection strategy in MARL, with ε = 1/ln t; the other RL learning parameters are shown in Table 2.
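The stochastic elements above can be sketched as a sampler. This Python sketch (illustrative; the paper's simulations are in Matlab) draws a slot's arrival count from N(n_t, 1^2), an arriving EV's SOC from N(0.5, 0.177^2) clipped to the [0.2, 0.9] SOC bounds, and the travel distance from ln D ∼ N(1.79, 1.09^2), and includes the ε = 1/ln t exploration schedule; the function names and the clipping choice are our assumptions.

```python
import math
import random

def sample_ev_slot(n_std, rng=random):
    """Sample one slot's EV arrival count and an arriving EV's SOC and distance."""
    n_arr = max(0, round(rng.gauss(n_std, 1.0)))      # arrivals ~ N(n_t, 1^2)
    soc = min(0.9, max(0.2, rng.gauss(0.5, 0.177)))   # SOC clipped to bounds
    distance = math.exp(rng.gauss(1.79, 1.09))        # ln D ~ N(1.79, 1.09^2)
    return n_arr, soc, distance

def epsilon(t):
    """Exploration rate for epsilon-greedy action selection, eps = 1/ln t."""
    return 1.0 / math.log(t)
```

With this schedule, exploration decays slowly (ε ≈ 0.43 at t = 10 and ≈ 0.14 at t = 1000), which suits the long training horizons reported in Figures 12 and 13.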

Each Agent's Benefit in Different RL Methods
To verify that the MARL structure is more applicable than the SARL structure in the microgrid market, we simulate the operation of the adopted microgrid model based on the learning results of the proposed ES-MARL and various SARL configurations. In SARL, agents can only use public information and their own information to learn optimal strategies, estimating other agents' private information from experience; moreover, each agent's learning objective in SARL is to maximize its individual benefit. As contrasts, five SARL configurations are used: in the first four, only one kind of agent has learning ability (SARL-PV only, SARL-WT only, SARL-UA only and SARL-EVA only), the microgrid market operates with the optimal strategy of the learning agent, and the other agents adopt fixed actions based on the current time slot; in the last configuration, all agents have individual learning ability (SARL-all agents) and the market works with the selfish optimal strategies from each agent's SARL. We separately evaluate the four agents' daily profits in the different configurations; the results of stochastic microgrid operation over 30 days are shown in Figures 5-8. In Figures 5 and 6, the daily profit of an RG is Pro^rg = Σ_{t=0}^{24} r_t^rg; in Figure 7, the daily welfare of the UA is W^ua = Σ_{t=0}^{24} Σ_{i=1}^{n_u} r_t^ua; in Figure 8, the daily total expense of the EVA agent is the difference between the total electricity purchase cost and the total sale income. From Figures 5-8, we can see that if only one agent has RL ability, that agent's profit is always the highest (or its expense the lowest): the agent with RL ability can make optimal decisions based on the current market state, while the other agents' fixed actions are not optimal for increasing their profits. Moreover, the proposed ES-MARL method ranks second for all four agents. Considering the benefit balance of all agents, it is reasonable that the MARL result is not as good as selfish SARL for the single learning agent.
However, the global performance of MARL across all agents is optimal. Besides, there are some other notable results. The profits of all agents in the SARL-all agents case are the worst; the reason is that in this case all agents act on selfish learning results that only consider self-benefit, which unbalances the market and reduces global benefits. We can also find that the curves of SARL-PV only, SARL-WT only and SARL-UA only are more stable (EVA has no learning ability in these cases); a reasonable explanation is that the randomness of EVs far exceeds that of the other members, so with RL ability EVA performs actions that track the random states of EVs and the market scheduling results fluctuate.

Overall Performance in Different Microgrid Configurations
For an MES, market fairness and microgrid independence are the two most important indicators. Fairness guarantees that all participants' benefits reach equilibrium; microgrid independence aims to realize the supply-demand balance and reduce dependency on the UG. Therefore, two global indexes, agents' average profit and daily energy purchase from the UG, are introduced to evaluate overall performance. Agents' average profit indicates the overall benefit of microgrid operation; daily energy purchase from the UG indicates the microgrid's dependence on the UG. Moreover, to verify the algorithm's validity in general cases, two different microgrid configurations are adopted for operation with the different RL methods. Microgrid configuration 1: one PV, one WT, 3 users and 10 EVs; microgrid configuration 2: 3 PVs, 2 WTs, 8 users and 20 EVs. The results are shown in Figures 9-11.
As depicted in Figure 9, agents' average profit is highest with ES-MARL in both configurations, staying between 30 and 40 for microgrid 1 and between 50 and 70 for microgrid 2. This result indicates that ES-MARL performs best at maximizing global benefit compared with the SARL methods. In addition, the average profits in SARL-PV only, SARL-WT only and SARL-UA only are almost zero; the reason is that the EVs' charging demand exceeds the RGs' supply, so when EVA has no learning ability to make optimal decisions, the charging cost of EVs is expensive and the average benefit is offset by the EVs' charging expense in these three cases. Figures 10 and 11 illustrate that the microgrid has the best independence with ES-MARL: the energy purchases from the UG in the two configurations are both minimal. The learning process of MARL is based on joint learning; sellers and buyers can reach an appropriate equilibrium to balance the market, agents learn from each other to decrease the supply-demand gap, and the microgrid does not need to buy more expensive energy from the UG. The lower halves of Figures 10 and 11 (cases in which energy purchases from the UG are higher) also show the importance of the EVA agent's learning ability. Combining Figures 10 and 11, the ES-MARL method has the best performance in energy trade with the UG; besides, different microgrid models do not affect the final performance of our approach.

Performance of Improved ES-Based MARL
In this section, several simulations are conducted to evaluate our improved ES-based MARL algorithm. Here, the Nash-Q algorithm, the most commonly used MARL algorithm, serves as a comparison. The main difference between Nash-Q and the proposed ES-MARL lies in the equilibrium selection during the value function update. We study algorithm performance from two aspects: performance in the learning process and application effect in residential MES. The RL parameters of the two algorithms are set the same as in Table 2.
First, the four agents' learning performance over the iterative process is shown in Figures 12 and 13. In these figures, the x-axis label is episode, denoting one state transition period from initial state to terminal state; in this paper, one episode equals one day (0:00-23:00). The y-axis label is Q-value, the updated value of the Q-function in the current slot. From the four agents' Q-values, we can see that the Q-values of ES-MARL are higher than those of Nash-Q throughout the learning process. A bigger Q-value means the agent's current reward is higher, so ES-MARL gains a better reward-increasing strategy than Nash-Q. Besides, the curves of the four agents show a clear gap in convergence rate between the two algorithms. The state spaces of the four agents differ (PV's and WT's are the smallest, EVA's is the biggest), so the convergence speeds differ accordingly. For PV and WT, ES-MARL reaches a stable value at episode = 1.5×10^4, whereas Nash-Q reaches one at episode = 2×10^4. For UA and EVA, the convergence episodes of ES-MARL are about 2×10^4 and 3×10^4, while the Nash-Q curves only stabilize after episode > 3.5×10^4. These results show that ES-MARL converges faster. The results shown in Figures 14-17 then illustrate the application performance of the two algorithms when their learning results are adopted in microgrid market operation. In Figures 14-16, we simulate one day of microgrid operation with ES-MARL and Nash-Q. Each data point is the average of 100 Monte Carlo experiments. Figure 14 shows the hourly profits of PV and WT; the profits with ES-MARL are superior to those with Nash-Q in most hours, and the total profits are higher for ES-MARL.
Figure 15 depicts the results for the UA agent, including two indicators: total energy purchase expense and total welfare (the difference between the users' total utility and total expense). The results show that with ES-MARL, the expense is lower and the welfare is higher for the UA agent. EVs are both sellers and purchasers in the market; the total charging expense and discharging profit of the EVA agent are shown in Figure 16. ES-MARL outperforms Nash-Q with lower expense and higher profit. To summarize, the application performance of ES-MARL is overall better than that of Nash-Q for all agents. Finally, turning to the energy trade between the microgrid and the UG: since the trade in a given hour differs across days due to random parameters, we simulate microgrid operation for 30 days, as shown in Figure 17. The energy purchased from the UG under ES-MARL is less than under Nash-Q, so ES-MARL reduces the microgrid's dependency on the UG. Meanwhile, the microgrid sells back more energy to the UG after adopting ES-MARL.

A Simulation Case of EVs Secondary Scheduling System
EVs are important components of a residential microgrid, and the V2G system plays a crucial role in keeping the microgrid market balanced and enhancing microgrid independence. In the primary microgrid market, the EVA agent represents all EVs, and a secondary scheduling system is set to manage each EV. In this section, we simulate a random secondary scheduling process of the EVs for one day. Given the specific EVs' parameters, the primary microgrid market operates to obtain the optimal action for the EVA agent in each slot, and the EVA then arranges each EV's charging/discharging action. The ten EVs are of different types, as shown in Table 1, assigned as follows: Nissan Leaf, EV1-EV4; Buick Velite 6, EV5-EV7; BYD Yuan, EV8-EV10. Table 3 shows the EVs' presence states and departure plans. The label "in" denotes that the EV is at home and connected to the microgrid at the current time; the label "out" means the EV will depart in the next hour (1: yes; 0: no). The initial time of this case is 0:00. The simulation result of the EVs' secondary scheduling is shown in Table 4. "Total charge/discharge of EVA" is the optimal decision result from the primary microgrid market; soc is the EV's SOC at the beginning of the hour; v_t is the charge/discharge amount in kWh, with v_t > 0 meaning charge and v_t < 0 meaning discharge. The minimum SOC for departure is 0.8 for morning departures and 0.6 for afternoon departures, and the charge warning limit is soc_min^cha = 0.3. From Tables 3 and 4, the following conclusions can be drawn. First, at each hour, the sum of the EVs' v_t equals the EVA's total charge/discharge amount, which means the secondary scheduling conforms to the optimal action from the primary microgrid market. Second, the charging/discharging sequence follows the SOC level: when soc < 0.3, the EV is charged immediately.
Besides, when an EV is about to depart, its soc is almost always higher than the minimum SOC for departure; for example, EV2 leaves at 7:00, so EV2's soc at 7:00 reaches 0.846 (a v_t entry of "-" denotes that the EV departs at that hour); moreover, EV2 is arranged to charge deeply at 5:00 (two hours ahead). These results verify the efficiency of the EVs' secondary scheduling system.
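The SOC-priority dispatch just described can be sketched as a greedy split of the EVA's market decision among the EVs at home: lowest-SOC EVs charge first, highest-SOC EVs discharge first, and the per-EV amounts v_t sum to the EVA total. The battery capacity, per-slot step limit, and tie-breaking below are our illustrative assumptions, not values from Tables 1-4.

```python
# Sketch of SOC-priority secondary scheduling: split the EVA's total kWh
# (> 0 charge, < 0 discharge) across the connected EVs, respecting the
# SOC bounds. Capacity (kWh) and per-slot step limit are illustrative.

def secondary_schedule(total, socs, capacity=40.0, step=7.0,
                       soc_min=0.2, soc_max=0.9):
    """Return per-EV kWh amounts v_t whose sum equals `total` (given room)."""
    v = [0.0] * len(socs)
    remaining = abs(total)
    charging = total > 0
    # Lowest SOC first when charging; highest SOC first when discharging.
    order = sorted(range(len(socs)), key=lambda i: socs[i], reverse=not charging)
    for i in order:
        if remaining <= 0:
            break
        room = ((soc_max - socs[i]) if charging else (socs[i] - soc_min)) * capacity
        amount = min(step, room, remaining)
        v[i] = amount if charging else -amount
        remaining -= amount
    return v
```

For instance, with a 10 kWh charging decision and SOCs [0.3, 0.8, 0.5], the lowest-SOC EV is filled up to the step limit first and the remainder goes to the next-lowest, so the allocations sum to the EVA total as in Table 4.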

Conclusions
In this paper, we concentrate on the energy scheduling of a residential microgrid. The integrated residential microgrid system, including RGs, power users and EVs with V2G, is constructed on a multi-agent structure, and an auction-based microgrid market mechanism is built to meet the participants' demands for distributed management and independent decision-making.
In order to generate the optimal market strategy for each participant and to guarantee the balance of all participants' benefits and of microgrid supply and demand, we introduce a model-free MARL approach for each agent. Through MARL, agents can consider both farsighted self-interest and the actions of other agents when making decisions in a dynamic and stochastic market environment. Moreover, we present a novel ES-MARL algorithm to improve the privacy security, fairness, and efficiency of MARL. There are two cardinal mechanisms in ES-MARL: one is private negotiation between the ESA and each agent, which protects private information and reduces computational complexity; the other is the maximum average reward method, which selects a globally optimal equilibrium solution.
Three groups of simulations have been carried out: (1) the comparison between MARL and SARL verifies that MARL is more appropriate for distributed microgrid scheduling, ensuring both agents' individual benefits and the overall operation objective; (2) simulations with the proposed ES-MARL and the classic Nash-Q MARL show that our approach achieves better performance in both the learning process and the microgrid application; (3) a case study of 10 EVs' charging/discharging scheduling demonstrates the effectiveness of the secondary EV scheduling system.
In conclusion, this work adopts an improved MARL approach for residential microgrid market scheduling. The learning results enable agents to autonomously select strategies that promote their benefits; meanwhile, the microgrid system can coordinate all participants' demands and achieve high autonomy under the equilibrium-based learning process.