Optimal Scheduling of Microgrid Based on Deep Deterministic Policy Gradient and Transfer Learning

Microgrids have a flexible composition, a complex operation mechanism, and a large amount of operating data. However, current optimization methods for microgrid scheduling do not effectively accumulate and utilize scheduling knowledge. This paper puts forward a microgrid optimal scheduling method based on Deep Deterministic Policy Gradient (DDPG) and Transfer Learning (TL). The method uses Reinforcement Learning (RL) to learn the scheduling strategy and accumulate the corresponding scheduling knowledge. The DDPG model is introduced to extend the action of the microgrid scheduling strategy from a discrete action space to a continuous action space. On this basis, a TL algorithm for microgrid optimal scheduling based on the similarity of actual supply and demand is proposed in order to make effective use of existing scheduling knowledge. The simulation results indicate that the proposed method can flexibly and efficiently provide an optimal scheduling strategy for microgrids with complex operation mechanisms, through the effective accumulation of scheduling knowledge and its utilization through TL.


Introduction
Microgrid is a small-scale power grid, composed of distributed power generation, load, energy storage devices, and energy conversion devices, which can effectively improve the stability and power quality of a large number of distributed power sources connected to the main grid, and realize the flexible application of distributed power generation [1]. However, the intermittence and instability of distributed generation make energy management more difficult. How to manage the energy of microgrid efficiently is a challenge for microgrid operation and scheduling.
Classical mathematical methods and heuristic algorithms are frequently used to solve the optimal scheduling problem of microgrids. Classical mathematical methods have advantages in solving speed and convergence [2], but they easily fall into local optima or even fail when dealing with complex nonlinear, discontinuous objective functions and constraints [3,4]. In contrast, heuristic algorithms depend less on the mathematical model and handle nonlinear problems more easily, so they have been widely used in different optimization problems of power systems [5]. Their parameters, however, are set rather arbitrarily, and the results are strongly affected by these settings. A microgrid has a flexible composition, a complex operation mechanism, and a large amount of operating data, yet the above methods do not effectively accumulate and utilize scheduling knowledge.
Transfer Learning (TL), as an effective means of reusing knowledge, has shown excellent performance in image recognition, text classification, emotion classification, and so forth [6]. However, its application in the field of power systems is still at the exploratory stage. At present, scholars have made progress in supply and demand interactive real-time scheduling of power systems [7], decentralized carbon energy composite flow optimization of power systems [8], economic risk scheduling [9], and so forth. In this research, TL is frequently combined with Reinforcement Learning (RL) to accumulate and update knowledge, and the knowledge reused through TL gives RL strong support.
As an important branch of machine learning, RL has strong self-learning and memory abilities: its agent interacts with the environment, uses the resulting feedback to guide action selection, and thereby learns the best strategy and accumulates experience and knowledge. RL has been studied in power system security and stability control [10], automatic generation control [11], voltage and reactive-power control optimization [12], optimal power flow control [13], interaction of supply and demand [14], power market [15], power information network [16], and so on. In the microgrid scheduling problem, Liu et al. [17] studied the application of RL to the cooperation of wind power and energy storage. Their study shows that RL adapts well to the uncertainty and complex constraints of the problem. However, the state and action spaces are discretized in that study, which leads to errors in the optimization results. Wang et al. [18] and Zhang et al. [19] proposed RL-based economic scheduling models for grid-connected operation and island operation of a microgrid, respectively. They used a deep neural network to approximate the continuous state space, which mitigated both the error caused by discretizing the state space and the "Curse of Dimensionality" caused by an excessively large state space. The action space was still discrete, however, so the optimal scheduling strategy could not be obtained.
In this paper, we study a microgrid optimal scheduling method based on the deep deterministic policy gradient and transfer learning. An optimal scheduling model is proposed that takes the minimum microgrid operating cost as the objective function. The study includes three parts: (1) the framework and learning process of the deep deterministic policy gradient, (2) the knowledge transfer rules in transfer learning, and (3) the combination of the deep deterministic policy gradient and transfer learning. Finally, the feasibility and correctness of the method are verified by simulation. The Deep Deterministic Policy Gradient (DDPG) extends traditional RL from a discrete action space to a continuous action space, which effectively reduces the error caused by discretization in traditional RL, while TL based on the similarity of actual supply and demand makes effective use of the scheduling knowledge.

(1) Solar Power Generation
The solar photovoltaic panel output is given by:
$$P^{pv}_t = \eta_{PV} A_s R_s(t)$$
where $P^{pv}_t$ is the output of solar power generation at time step $t$; $\eta_{PV}$ is the conversion efficiency of the solar photovoltaic panel; $A_s$ is the area of the solar photovoltaic panel array; $R_s(t)$ is the solar radiation intensity on the photovoltaic panel at time step $t$.
(2) Wind Power Generation
The wind power generation output can be approximately expressed by the following expression [20]:
where $V_s$ is the wind speed through the wind turbines at time step $t$; $V_{ci}$ is the start-up wind speed; $V_r$ is the rated wind speed; $V_{co}$ is the cut-out wind speed; $P_r$ is the rated output of wind power generation.
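The two renewable output models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PV output is taken as the product of efficiency, array area and radiation intensity, and the wind curve is assumed to ramp linearly between the start-up and rated wind speeds. The exact shape of the ramp in [20] may differ, and the function and parameter names are hypothetical.

```python
def pv_output(eta_pv: float, area: float, radiation: float) -> float:
    """PV output as conversion efficiency x array area x radiation intensity."""
    return eta_pv * area * radiation


def wind_output(v_s: float, v_ci: float, v_r: float, v_co: float, p_r: float) -> float:
    """Piecewise wind power curve.

    ASSUMPTION: the output rises linearly from 0 at the start-up speed v_ci
    to the rated output p_r at the rated speed v_r.
    """
    if v_s < v_ci or v_s > v_co:
        return 0.0  # below start-up speed or above cut-out speed
    if v_s < v_r:
        return p_r * (v_s - v_ci) / (v_r - v_ci)  # assumed linear ramp
    return p_r  # rated output between rated and cut-out speed
```

A cubic ramp is an equally common modelling choice; only the middle branch would change.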
(3) Diesel Generator
As a controllable component, the diesel generator can provide electricity when the power supply of the uncontrollable components is insufficient, and it reduces the dependence of the microgrid on electricity from the main grid. The fuel cost model of the diesel generator can be approximately expressed by the following expression:
where $P^{die}_t$ is the diesel generator output at time step $t$; $a$, $b$ and $c$ are the cost factors of the diesel generator.
(4) Battery
The state of charge (SOC) of the battery at each time step is determined by the SOC at the previous time step and the exchange power of the battery. It can be expressed by the following expression:
where $SOC_t$ is the SOC of the battery at time step $t$; $SOC_{t-1}$ is the SOC of the battery at time step $t-1$; $P^{ess}_{t-1}$ is the exchange power at time step $t-1$; $P^{ess}_{t-1} > 0$ and $P^{ess}_{t-1} < 0$ mean that the battery charges and discharges, respectively, and $P^{ess}_{t-1} = 0$ means that the battery does not act; $\eta$ and $\xi$ are the charge and discharge efficiencies of the battery, respectively; $\Delta t$ is the length of each time step over which the battery acts; $S_{ess}$ is the battery capacity.
In order to ensure the normal operation of the battery and extend its lifetime, the exchange power and SOC are constrained:
(a) Exchange power constraint
$$0 < P^{ess}_{t-1} < P_{ch.max}, \quad P^{ess}_{t-1} > 0$$
$$0 < |P^{ess}_{t-1}| < P_{dis.max}, \quad P^{ess}_{t-1} < 0$$
where $P_{ch.max}$ and $P_{dis.max}$ are the maximum charge power and discharge power, respectively.
(b) SOC constraint
Overcharging or over-discharging shortens the lifetime of the battery, so the SOC must be kept within the physical limits of the battery. Let $SOC_{min}$ and $SOC_{max}$ be the minimum and maximum permitted SOC of the battery. The limits on the SOC at time step $t$ are given by:
$$SOC_{min} \le SOC_t \le SOC_{max}$$
(5) Load
The load is the total electric power consumed by all kinds of electrical equipment at a given time; the trend of the load curve is related to user behavior habits. At time step $t$, the load is denoted $P^{load}_t$.
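The battery model and its two constraints can be illustrated with a short Python sketch. The SOC update used here is an assumption, not necessarily the paper's exact expression: charging is taken to add `eta * P * dt / S_ess` and discharging to remove `|P| * dt / (xi * S_ess)`. The default values are the ones reported in the simulation section (175 kWh capacity, efficiencies of 0.9, 30 kW maximum exchange power, SOC limits of 0.2 and 0.9); the one-hour time step and all function names are hypothetical.

```python
def soc_step(soc_prev: float, p_ess: float, dt_h: float = 1.0,
             capacity_kwh: float = 175.0, eta: float = 0.9, xi: float = 0.9) -> float:
    """Advance the SOC by one time step.

    ASSUMPTION: p_ess > 0 charges (scaled by eta), p_ess < 0 discharges
    (divided by xi), p_ess == 0 leaves the SOC unchanged.
    """
    if p_ess > 0:
        return soc_prev + eta * p_ess * dt_h / capacity_kwh
    if p_ess < 0:
        return soc_prev + p_ess * dt_h / (xi * capacity_kwh)
    return soc_prev


def is_feasible(soc_next: float, p_ess: float, p_ch_max: float = 30.0,
                p_dis_max: float = 30.0, soc_min: float = 0.2,
                soc_max: float = 0.9) -> bool:
    """Check the exchange power constraint and the SOC constraint."""
    power_ok = -p_dis_max <= p_ess <= p_ch_max
    soc_ok = soc_min <= soc_next <= soc_max
    return power_ok and soc_ok
```

For example, starting from the initial SOC of 0.4, one hour of charging at 30 kW stays within the limits under these assumptions, while two consecutive hours of discharging at 30 kW would push the SOC below 0.2.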

Objective Function
In this paper, the optimization goal is to minimize the microgrid operating cost. The objective function is given by:
$$\min F = F_1 + F_2$$
where $F_1$ is the fuel cost of the diesel generator and $F_2$ is the cost of the power transacted between the microgrid and the main grid.
In $F_1$ and $F_2$, $T$ is the scheduling cycle; $\alpha^{buy}_t$ is the price of purchasing one unit of power from the main grid at time step $t$; $\alpha^{sell}_t$ is the price of selling one unit of power from the microgrid to the main grid at time step $t$; $\Delta t$ is the scheduling interval; $P^{grid}_t$ is the transaction power between the microgrid and the main grid. $P^{grid}_t < 0$ means that the microgrid sells power to the main grid, with $\beta = 0$; $P^{grid}_t > 0$ means that the microgrid buys power from the main grid, with $\beta = 1$. As given in Equation (10), $P^{grid}_t$ is calculated from $P^{load}_t$, $P^{pv}_t$, $P^{wt}_t$, $P^{die}_t$ and $P^{ess}_t$.
The transaction power calculated in this way does not include the network loss and therefore cannot reflect the actual transaction power, so this paper uses a conversion coefficient $\lambda$ to express the network loss.
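A minimal Python sketch of the operating cost may help make the roles of $\beta$, the two prices and the power balance concrete. Several parts are assumptions rather than the paper's exact expressions: the diesel fuel cost is taken as the quadratic `a*P**2 + b*P + c` per time step, the transaction power is taken from a simple power balance in which battery charging counts as extra demand, and the network-loss coefficient $\lambda$ is left out. All function names are hypothetical.

```python
from typing import Sequence, Tuple

# One time step: (p_load, p_pv, p_wt, p_die, p_ess), all in kW
Step = Tuple[float, float, float, float, float]


def diesel_cost(p_die: float, a: float, b: float, c: float) -> float:
    """ASSUMPTION: quadratic fuel cost in the diesel output."""
    return a * p_die ** 2 + b * p_die + c


def grid_power(p_load: float, p_pv: float, p_wt: float,
               p_die: float, p_ess: float) -> float:
    """ASSUMPTION: power balance, with p_ess > 0 (charging) as extra demand."""
    return p_load + p_ess - p_pv - p_wt - p_die


def operating_cost(profile: Sequence[Step], price_buy: Sequence[float],
                   price_sell: Sequence[float], a: float, b: float, c: float,
                   dt_h: float = 1.0) -> float:
    """Sum the fuel cost (F1) and the transaction cost (F2) over the cycle."""
    total = 0.0
    for t, (p_load, p_pv, p_wt, p_die, p_ess) in enumerate(profile):
        p_grid = grid_power(p_load, p_pv, p_wt, p_die, p_ess)
        beta = 1 if p_grid > 0 else 0  # 1: buying, 0: selling
        price = beta * price_buy[t] + (1 - beta) * price_sell[t]
        total += diesel_cost(p_die, a, b, c)  # F1 contribution
        total += price * p_grid * dt_h        # F2 contribution, negative when selling
    return total
```

With this sign convention, selling power lowers the total cost because `p_grid` is negative.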

Optimal Scheduling Method Based on Deep Deterministic Policy Gradient and Transfer Learning
The renewable energy output and the load demand are affected by climate and user behavior habits, respectively. Although both are highly uncertain, the probability of a sudden change in climate or user behavior habits within the same area or adjacent areas is relatively small. Therefore, the actual supply and demand curves of microgrids on similar days in the same area or adjacent areas are very similar. This paper therefore uses this similarity to accumulate and utilize scheduling knowledge effectively and to provide a priori knowledge for microgrid optimal scheduling. TL can establish knowledge connections between similar scheduling tasks, while the strong memory and self-learning abilities of RL support the learning, updating, and accumulation of knowledge. Combining RL with TL realizes the effective accumulation and utilization of scheduling knowledge. The schematic of the method is shown in Figure 1.

DDPG
RL is a class of artificial intelligence algorithms. In RL, an agent takes actions, based on the observed state, within a real or virtual environment and relies on reward feedback to find the most suitable policy for achieving its goal. Figure 2 shows the principle of RL.
However, traditional RL cannot deal with a continuous action space. This paper therefore introduces DDPG, a deep RL method, to solve the microgrid optimal scheduling problem and combines it with TL to utilize scheduling knowledge. DDPG is a policy learning method that integrates deep neural networks into the Deterministic Policy Gradient (DPG) [21]. DPG is an improved policy learning method based on the policy gradient in RL. The policy gradient describes the optimal policy in each state through a probability distribution function, and actions are selected according to that distribution, whereas DPG directly obtains a definite value of the decision action at each moment through the policy function, that is,
$$a = \mu(s)$$
The DDPG network structure is shown in Figure 3. It consists of two parts: the actor network and the critic network. DDPG uses the actor network $\mu(s|\theta^A)$ and the critic network $Q(s, a|\theta^C)$ to approximate the policy function $\mu(s)$ and the state-action value function $Q(s, a)$, respectively, where $\theta^A$ and $\theta^C$ are the weights of the actor network and the critic network. The main idea is that the actor network generates the action, the critic network evaluates the action using the state-action value function, and this evaluation then guides the update of both the critic's own weights and the actor's weights. The critic network uses Temporal-Difference learning to learn the state-action value function, so the loss function of the critic network can be defined as:
$$L(\theta^C) = \left[r + \gamma Q(s^-, a^-|\theta^C) - Q(s, a|\theta^C)\right]^2$$
where $Q(s, a|\theta^C)$ is the state-action value obtained by the agent through the critic network and represents the future cumulative reward of the agent after executing action $a$ in its current state $s$. Likewise, $Q(s^-, a^-|\theta^C)$ represents the future cumulative reward of the agent after executing action $a^-$ in the next state $s^-$. All executed actions are generated by the actor network. $r$ is the immediate reward obtained when the agent transitions from state $s$ to state $s^-$ by performing action $a$ at the current time, and $\gamma$ is the discount factor applied to the future cumulative reward.
The optimization goal of the critic network is given by:
$$\min_{\theta^C} L(\theta^C)$$
and its weights are updated as:
$$\theta^C \leftarrow \theta^C - \alpha^C \nabla_{\theta^C} L(\theta^C)$$
where $\alpha^C$ is a scalar step size, called the learning rate of the critic network. The action generated by the actor network is measured by the evaluation of the critic network. The measure function is given by:
$$J(\theta^A) = Q(s, a|\theta^C)\big|_{a = \mu(s|\theta^A)}$$
The purpose of the actor network is to learn the optimal policy, so that the action it generates obtains the maximum future cumulative reward. Therefore, the optimization goal of the actor network is given by:
$$\max_{\theta^A} J(\theta^A)$$
and its weights are updated using the chain rule:
$$\theta^A \leftarrow \theta^A + \alpha^A \, \nabla_a Q(s, a|\theta^C)\big|_{a = \mu(s|\theta^A)} \, \nabla_{\theta^A} \mu(s|\theta^A)$$
where $\alpha^A$ is a scalar step size, called the learning rate of the actor network.
In order to avoid the risk of overestimation, the DDPG network framework constructed in this paper (Figure 4) adopts the same double network structure as DDQN [22,23]. That is, the actor and the critic each consist of two networks with the same structure but different weights, namely the Evaluate net and the Target net. The double network structure separates the generation of the actions $a$ and $a^-$, and likewise the calculation of the state-action value $Q(s, a|\theta^C)$ and of the target $r + \gamma Q(s^-, a^-|\theta^C)$. The weight update mode also changes: the Evaluate net is updated every time a state transition is performed, while the Target net is updated in soft update mode [19].
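The Temporal-Difference target, the critic loss and the soft update of the Target net can be illustrated with plain Python lists standing in for network outputs and weights. This is a sketch under stated assumptions: the loss is averaged over a batch, the soft update uses the usual form `target = (1 - tau) * target + tau * evaluate`, and the rate `tau` is a hypothetical value that the text does not specify. The default `gamma = 0.9` is the initial discount factor reported in the simulation section.

```python
from typing import List


def td_target(reward: float, q_next: float, gamma: float = 0.9) -> float:
    """Target r + gamma * Q(s-, a-), with q_next produced by the Target nets."""
    return reward + gamma * q_next


def critic_loss(q_eval: List[float], targets: List[float]) -> float:
    """Mean squared Temporal-Difference error over a batch (batch mean is an assumption)."""
    errors = [(y - q) ** 2 for q, y in zip(q_eval, targets)]
    return sum(errors) / len(errors)


def soft_update(target_weights: List[float], eval_weights: List[float],
                tau: float = 0.01) -> List[float]:
    """Move each Target net weight a small step towards the Evaluate net weight.

    ASSUMPTION: tau is hypothetical; the text only names the soft update mode.
    """
    return [(1.0 - tau) * w_t + tau * w_e
            for w_t, w_e in zip(target_weights, eval_weights)]
```

Because `tau` is small, the Target net changes slowly, which keeps the target `r + gamma * Q(s-, a-)` stable while the Evaluate net is updated at every state transition.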

TL
TL builds on the idea of drawing inferences about other cases from one instance. It applies the knowledge learned from old tasks to similar but different new tasks, so as to improve both the utilization of knowledge and the efficiency of learning the new task. In TL, the old task is generally called the source domain, and the new task is called the target domain. The knowledge learned in the source domain is affected by the characteristics of the source domain. In the process of knowledge transfer and reuse, the knowledge transfer rules are therefore very important, especially with regard to the relationship between the characteristics of the source domain and the target domain. When the knowledge is not selected appropriately, knowledge transfer may interfere with the target domain, resulting in negative transfer and reduced learning efficiency in the target domain.

Knowledge Transfer Rules
In microgrid optimal scheduling, the similarity between tasks is taken as the basis for selecting the source domain knowledge to transfer. The rules are formulated as follows:
(1) According to the characteristics of the source domain, an appropriate similarity evaluation function is selected to evaluate the correlation between the characteristics of the source domain and the target domain.
(2) For the target domain, the similarity between the target domain and each of the $N$ source domains is calculated with the similarity evaluation function. The higher the value, the higher the similarity between the target domain and the source domain, which means that the source domain knowledge is more instructive for target domain learning.
(3) The source domain with the highest similarity is selected for knowledge transfer.

Similarity Evaluation Function
For the similarity evaluation function, we use the reciprocal of the Euclidean distance to reflect the similarity between the actual supply and demand curves of the target domain and the source domain. $P_m(t)$ $(m = 1, \cdots, N)$ and $P_{obj}(t)$ denote the actual supply and demand at each time in the $N$ source domains and in the target domain, respectively. The similarity $r_m$ is calculated from $P_m(t)$ and $P_{obj}(t)$, as shown in Equation (18):
$$r_m = \frac{1}{\sqrt{\sum_{t} \left(P_m(t) - P_{obj}(t)\right)^2}} \tag{18}$$
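The knowledge transfer rules and the similarity measure (the reciprocal of the Euclidean distance between supply and demand curves) can be sketched as follows. The function names and the dictionary layout are hypothetical, and returning infinity for two identical curves is an assumption made only to avoid a division by zero.

```python
import math
from typing import Dict, Hashable, Sequence


def similarity(p_source: Sequence[float], p_target: Sequence[float]) -> float:
    """Reciprocal of the Euclidean distance between two equally sampled curves."""
    dist = math.sqrt(sum((ps - pt) ** 2 for ps, pt in zip(p_source, p_target)))
    # ASSUMPTION: identical curves are treated as infinitely similar
    return math.inf if dist == 0.0 else 1.0 / dist


def select_source(sources: Dict[Hashable, Sequence[float]],
                  p_target: Sequence[float]) -> Hashable:
    """Rule (3): return the key of the source domain with the highest similarity."""
    return max(sources, key=lambda m: similarity(sources[m], p_target))
```

Each entry of `sources` would hold the supply and demand curve of one source domain over the scheduling cycle, sampled at the same time steps as the target curve.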

State-Action Space
The microgrid optimal scheduling based on DDPG can be formalized as a partially observable Markov decision process, where the microgrid is considered as an agent that interacts with its environment. In this paper, the state space $S$ consists of $P^{wt}$, $P^{pv}$, $P^{load}$ and the SOC of the battery. It can be expressed by:
$$S = \{P^{wt}, P^{pv}, P^{load}, SOC\}$$
where $P^{wt}$ and $P^{pv}$ are affected by climate and $P^{load}$ is affected by user behavior habits; all three are uncontrollable components and can be obtained by prediction. The battery is a controllable component, and its SOC is determined by its own dynamic characteristics, as shown in Equation (4).
As the controlled components of the microgrid, the battery and the diesel generator directly affect the scheduling strategy through their operating power, so the action space is composed of the action power spaces of the battery and the diesel generator. The action space $A$ can be expressed by:
$$A = \{P^{ess}, P^{die}\}$$

Reward Function
An effective reward function provides correct guidance for the action selection of the agent, so that the desired goal can be reached. The reward function in this paper corresponds to the instantaneous reward at time $t$, which is obtained by adding the operating cost of the microgrid $r^1_t(a_t)$ and the penalty $r^2_t(a_t)$ caused by the battery violating its constraints.
Here, $k$ is the penalty coefficient for violating the constraints. The instantaneous reward $r_t(a_t)$ is given by:
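One way to combine the operating cost and the constraint penalty into a single instantaneous reward is sketched below. The signs and the shape of the penalty are assumptions: the reward is taken as the negative of the cost plus penalty, so that maximizing reward minimizes cost, and the penalty is taken as proportional to how far the SOC leaves its permitted range. The value of `k` is hypothetical; the SOC limits are those of the simulation section.

```python
def instantaneous_reward(step_cost: float, soc_next: float,
                         soc_min: float = 0.2, soc_max: float = 0.9,
                         k: float = 100.0) -> float:
    """Reward for one time step.

    ASSUMPTIONS: reward = -(operating cost + k * SOC violation); k = 100 is
    a hypothetical penalty coefficient.
    """
    # Distance by which the SOC leaves [soc_min, soc_max]; zero when inside
    violation = max(0.0, soc_min - soc_next) + max(0.0, soc_next - soc_max)
    return -(step_cost + k * violation)
```

A large `k` makes constraint violations dominate the cost term, which steers the agent away from overcharging or over-discharging early in training.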

Algorithm Flow
The algorithm flow of the microgrid optimal scheduling method proposed in this paper is shown in Figure 5. The whole process consists of two parts: source domain learning and target domain learning, in which source domain learning adopts the DDPG to accumulate microgrid scheduling knowledge, while target domain learning adopts the TL and DDPG to utilize microgrid scheduling knowledge.

Simulation
In this paper, the microgrid model used as the simulation example includes solar power generation, wind power generation, a diesel generator, a battery, a load, and an energy conversion device. The experimental data of the solar photovoltaic panel output and load are based on the radiation intensity data and user consumption of GitHub Project [24]. The wind power generation output is based on the wind speed data of Wind Energy Database Project. The capacity of the battery is 175 kWh, the charge and discharge efficiencies are both 0.9, the maximum exchange power is 30 kW, the minimum SOC of the battery is 0.2, the maximum SOC of the battery is 0.9, and the initial SOC of the battery in this simulation is 0.4. In the DDPG, the actor network has two hidden layers with 50 neurons and 20 neurons, respectively. The activation function is the Rectified Linear Unit (ReLU) function. The hidden layer structure of the critic network is the same as that of the actor network. A variable learning rate and a variable discount factor are adopted in training; the initial learning rates of the actor network and the critic network are set to 0.005, and the initial value of the discount factor is 0.9.
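The actor architecture described above (four state inputs, hidden layers of 50 and 20 ReLU neurons, two action outputs) can be sketched as a forward pass in plain Python. Several details are assumptions: the `tanh` output activation, the mapping of its output to physical power ranges, the small uniform weight initialization, and the diesel limit `p_die_max`, which the text does not report. Only the 30 kW battery limit comes from the simulation settings.

```python
import math
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small random weights and zero biases (initialization is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: Sequence[float], layer: Layer, activation) -> List[float]:
    """Fully connected layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def relu(z: float) -> float:
    return max(0.0, z)


def build_actor(rng: random.Random, n_state: int = 4,
                hidden: Tuple[int, ...] = (50, 20), n_action: int = 2) -> List[Layer]:
    sizes = (n_state, *hidden, n_action)
    return [init_layer(sizes[i], sizes[i + 1], rng) for i in range(len(sizes) - 1)]


def actor_forward(layers: List[Layer], state: Sequence[float],
                  p_ess_max: float = 30.0, p_die_max: float = 40.0) -> Tuple[float, float]:
    """Map a state (P_wt, P_pv, P_load, SOC) to continuous actions (P_ess, P_die)."""
    h = list(state)
    for layer in layers[:-1]:
        h = dense(h, layer, relu)
    u_ess, u_die = dense(h, layers[-1], math.tanh)  # ASSUMPTION: tanh outputs in [-1, 1]
    p_ess = u_ess * p_ess_max                       # battery power in [-30, 30] kW
    p_die = (u_die + 1.0) / 2.0 * p_die_max         # diesel power in [0, p_die_max] (hypothetical limit)
    return p_ess, p_die
```

In practice a deep learning framework would hold these weights and compute the gradients; the sketch only shows how a bounded output layer yields continuous actions.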
The simulation sets up two experiments: source domain learning and target domain learning, which verify the effectiveness of DDPG in continuous action space, and the effective accumulation and utilization of scheduling knowledge based on DDPG and TL, respectively.
In consideration of actual operation, the electricity price adopts the time-sharing unitary electricity price model [18], as shown in Table 1. The neural network input is the microgrid observation information extracted from the experimental data set: the solar photovoltaic panel output, the wind power generation output, the load, and the SOC of the battery. The learning of the source domain and the target domain is completed according to the flow in Section 3.4.
In order to verify that the proposed method reduces the discretization error and obtains a better scheduling strategy, the method proposed in this paper and the method in [19] are both used in the source domain learning experiment. With the RL-based method in [19], named DDQN, the battery action space and the diesel generator action space are discrete, which introduces more error because a discrete action space cannot flexibly match the unbalanced power between renewable energy output and load demand. With the method proposed in this paper, named DDPG, both action spaces are continuous, which reduces the error because a continuous action space can flexibly match this unbalanced power.
(1) DDQN: the power of the battery and the power of the diesel generator are discretized into 13 and 5 fixed actions, respectively, so the action space is set as $A = \{a_1, a_2, \cdots, a_{13 \times 5}\}$.
In order to further verify the superiority of transfer learning, we designed a comparative experiment with and without TL. The best source domain is obtained according to the knowledge transfer rules in Section 3.2.3, and the scheduling knowledge of this source domain is then used for knowledge transfer. In addition, two randomly selected source domains are also used for knowledge transfer, in order to analyze the TL performance for source domains of different similarity.
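The size of the discrete DDQN action space and the mismatch it introduces can be illustrated with a short sketch. The evenly spaced power levels and the diesel limit `p_die_max` are assumptions, since the text only gives the numbers of levels (13 and 5) and the 30 kW battery limit; the function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def levels(lo: float, hi: float, n: int) -> List[float]:
    """n evenly spaced power levels from lo to hi (even spacing is an assumption)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def build_discrete_actions(p_ess_max: float = 30.0, p_die_max: float = 40.0,
                           n_ess: int = 13, n_die: int = 5) -> List[Tuple[float, float]]:
    """All (battery power, diesel power) pairs of the discretized action space."""
    ess_levels = levels(-p_ess_max, p_ess_max, n_ess)
    die_levels = levels(0.0, p_die_max, n_die)  # p_die_max is hypothetical
    return list(product(ess_levels, die_levels))


def nearest_level_error(value: float, candidates: Sequence[float]) -> float:
    """Gap between a desired continuous power and the closest discrete level."""
    return min(abs(value - c) for c in candidates)
```

With 13 evenly spaced battery levels between -30 kW and 30 kW, the levels are 5 kW apart, so a desired battery power of 12 kW can only be matched to within 2 kW; the remainder has to be covered by the diesel generator or the main grid. A continuous action space removes this gap.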

Source Domain Learning
In the source domain learning, knowledge is accumulated over one year. To analyze the performance of the DDPG-based scheduling method, however, this paper takes a typical day as an example. Figure 6 shows the scheduling strategy of a typical day for the different methods. Figure 7 clearly shows the differences between DDQN and DDPG in battery action, diesel generator action, and transaction power. From Figures 6 and 7, it can be concluded that during the whole scheduling cycle, the exchange power of the battery and the output of the diesel generator are more flexible in DDPG, and the transaction power between the microgrid and the main grid is smaller in DDPG than in DDQN. Between 0:00-7:00 and 11:00-14:00, the actual supply of renewable energy in the microgrid exceeds the load demand. During these periods, neither DDQN nor DDPG operates the diesel generator, and both absorb the excess energy by charging the battery. When the battery capacity reaches its limit, the battery remains idle in both methods. Compared with the discrete actions in DDQN, the choice of action in DDPG is more flexible, and DDPG also has less transaction power than DDQN. Between 7:00-10:00 and 14:00-0:00, the actual supply of renewable energy in the microgrid is lower than the load demand. During these periods, both DDQN and DDPG use the battery and the diesel generator to meet the energy shortfall. As shown in Figure 7, the diesel generator output and the transaction power between the microgrid and the main grid are smaller in DDPG than in DDQN. This is because the continuous action space improves the flexibility of action selection, enhances the reliability of the microgrid itself, reduces the dependence on the main grid, and further reduces the operating cost of the microgrid. Table 2 shows that, of the two methods, DDPG obtains the lower microgrid operating cost: 142.75 RMB.
The experiment verifies the effectiveness of the microgrid optimal scheduling method based on DDPG and shows that a continuous action space improves the flexibility of action selection, thus reducing the operating cost of the microgrid.

Target Domain Learning
In this part, we set the scheduling task of an adjacent area as the target domain task and verify the superiority of transfer learning in utilizing scheduling knowledge. The similarity between the target domain and each source domain is evaluated by Equation (18), as shown in Figure 8. Figure 9 shows the scheduling strategy of the target domain obtained by target domain learning. During the scheduling cycle, when the output of renewable energy exceeds the load demand, the battery is charged as much as possible within the constraint range; when the output of renewable energy is lower than the load demand, the battery discharges and cooperates with the diesel generator to meet the energy shortfall. In addition, the main grid is also used to absorb the unbalanced power. The scheduling strategy is fully in line with actual operation, which shows that the microgrid optimal scheduling method based on DDPG and TL proposed in this paper is feasible. Figure 10 shows the learning performance of transfer learning with scheduling knowledge from source domains of different similarity. Without knowledge transfer, learning converges at epoch = 505. With knowledge transfer, the agent can quickly lock onto the optimal strategy interval at the initial stage of training. After fine-tuning, the agent that transfers knowledge from source domain 330, which has the highest similarity, converges at epoch = 152, whereas for the agent that transfers knowledge from source domain 65, which has medium similarity, the relative advantage in convergence rate is small. For the agent that transfers knowledge from source domain 274, which has low similarity, the converged result deviates, and the strategy obtained is inferior to that of the agent without TL: because the similarity between the target domain and the source domain is low, the validity of the source domain knowledge cannot be guaranteed.
It can be concluded that the similarity between the target domain and the source domain is positively related to the effectiveness of the knowledge. The higher the similarity, the more effective the knowledge and the better the target domain reuses the transferred knowledge.

Conclusions
Current optimization methods for microgrid scheduling do not make effective use of scheduling knowledge. To solve this problem, this paper proposes an optimal scheduling method for microgrids based on DDPG and TL.
The findings are listed as follows.
(1) This paper flexibly and efficiently provides an optimal scheduling strategy for microgrids with complex and changeable operation modes, through the effective accumulation of scheduling knowledge and its utilization through knowledge transfer.
(2) The DDPG model is introduced into RL, and the action space of traditional RL is extended from a discrete space to a continuous space.
(3) A TL algorithm for microgrid optimal scheduling based on the similarity of actual supply and demand is proposed, which transfers scheduling knowledge and thereby utilizes it effectively.
The scheduling model in this paper does not consider the power flow constraints of the system, and its practicability in large-scale systems has not been verified. Improving the scheduling model and studying how to establish the state space of a large-scale system are therefore left for future work.

$P^{pv}$: Power generated by PV
$P^{load}$: The load demand
$P^{grid}$: Transaction power between microgrid and main network
$P^{wt}$: Power generated by WT
$P^{die}$: Power generated by diesel
$P^{ess}$: Exchange power of battery
SOC: The state of charge
$\beta$: Indicator variable of $P^{grid}$; $\beta = 1$ when $P^{grid} > 0$, $\beta = 0$ when $P^{grid} < 0$
$P_{obj}$: Difference power between renewable energy output and load demand at each time in the target domain
$P_m$: Difference power between renewable energy output and load demand at each time in the $m$-th source domain, $m = 1, 2, \cdots, N$
$r_m$: The similarity between target domain and source domain
$V_s$: Wind speed
$R_s$: Solar radiation intensity