Optimal Operation of a Microgrid with Hydrogen Storage Based on Deep Reinforcement Learning

Abstract: A microgrid with hydrogen storage is an effective way to integrate renewable energy and reduce carbon emissions. This paper proposes an optimal operation method for a microgrid with hydrogen storage. An electrolyzer efficiency characteristic model is established using linear interpolation and incorporated into the optimal operation model of the microgrid. The sequential decision-making problem of the optimal operation of the microgrid is solved by a deep deterministic policy gradient algorithm. Simulation results show that the proposed method reduces the operation cost of the microgrid by about 5% compared with traditional algorithms and has a certain generalization capability.


Introduction
Renewable energy, such as wind and solar energy, is essential for energy decarbonization [1]. The microgrid is an important form of renewable energy integration into power systems [2]. Hydrogen energy is another type of clean and low-carbon energy. The combustion product of hydrogen is water, with zero carbon emissions [3]. For microgrid systems with high renewable energy integration, hydrogen energy can be used as long-term energy storage to improve the utilization of renewable energy and reduce carbon emissions. However, renewable energy is intermittent and random, which brings great challenges to the operation of microgrids [4].
To address the economic dispatch problem in microgrids containing hydrogen storage, a mixed integer nonlinear dispatch model for a microgrid with 100% renewable energy generation is proposed in [5], and the GAMS solver is used to optimize the operation strategy of hydrogen storage and improve the economic efficiency of the microgrid in the day-ahead market. In [6], a nonlinear scheduling model for a microgrid containing fuel cell and hydrogen storage systems is proposed and the CONOPT solver is used to optimize the energy purchase cost of the microgrid. In [7], an optimization model to schedule an islanded microgrid with various resources, including photovoltaic generation and hydrogen energy system, is proposed. The problem is represented as a mixed integer linear program problem and solved by CPLEX. In [8], the retail price problem of the electricity energy retailer that owns plug-in electric vehicles and hydrogen storage systems is proposed. The proposed model is verified by simulation using GAMS. In [9], the harmony search algorithm is used to optimize the hydrogen production capacity of the hydrogen storage in the microgrid to reduce the operating cost. In [10], a hybrid AC-DC microgrid model containing electric vehicles and hydrogen fuel cells is presented, and the operating scheme is optimized using an improved teacher learning algorithm. In [11], the genetic algorithm is used to optimize the life cycle cost of the microgrid containing hybrid electric-hydrogen energy storage. In [12], the particle swarm algorithm is used to solve the multi-objective energy management problem of renewable energy microgrid containing electric-hydrogen hybrid energy storage to improve the system efficiency.
The conventional mathematical programming algorithms in the above literature are computationally efficient. However, these methods tend to be trapped in local optima when the problem is nonlinear and nonconvex. The heuristic algorithms have better global optimization capability, but suffer from slow convergence and poor generalization. In addition, the above literature mainly focuses on the day-ahead scheduling problem of the microgrid, and relies on accurate predictions of renewable energy and load.
Deep reinforcement learning is a machine learning method with the ability to perceive the environment and address uncertainties. Deep reinforcement learning has been applied in several areas, such as reactive power optimization [13,14], electric vehicles [15,16], and power markets [17,18]. In terms of optimal operation, deep reinforcement learning algorithms are used in [19] to solve the energy management problem of a residential energy system with electricity, heat, and gas demand. In [20], a microgrid scheduling model is proposed and deep reinforcement learning algorithms are adopted to reduce the power purchase cost. However, these studies fail to consider the impact of the hydrogen energy storage system on the microgrid operation. In [21], a coordinated control method for electrochemical and hydrogen energy storage in a microgrid based on deep reinforcement learning is proposed. However, the hydrogen storage model is simple and ignores the electrolyzer efficiency characteristics, which have a significant influence on the operation of the microgrid. Moreover, only a sub-optimal solution can be found because of the discretization of the action space.
In this paper, an optimal operation model for a microgrid with hydrogen energy storage system is developed. The efficiency-power model of the electrolyzer is established based on linear interpolation to evaluate the operating cost of the electrolyzer. The objective of the optimal operation model is to reduce the operation cost and guarantee the safety of the system. The deep deterministic policy gradient (DDPG) algorithm is used to optimize the operation scheme of the microgrid. The DDPG algorithm can deal with the continuous action space problem and obtains better operation scheme compared with the conventional algorithms. Additionally, the trained DDPG model is used in new scenarios. The simulation results show that the DDPG algorithm has generalization capabilities.
The main contributions can be summarized as:
• A refined model that represents the electrolyzer efficiency characteristics based on the linear interpolation method is proposed;
• An optimal operation model for a microgrid with hydrogen storage is proposed. The electrolyzer efficiency characteristics model is incorporated into the optimal operation model;
• The DDPG algorithm is adopted to solve the optimal operation model, which has a continuous action space.

Model of the Microgrid System
A microgrid can increase the integration of renewable energy and reduce the carbon emissions of the whole energy system. In this paper, an islanded microgrid was constructed. The structure of the microgrid is shown in Figure 1. The microgrid included the load, a microturbine, a photovoltaic (PV) generation device, a battery energy storage system (BESS), and a hydrogen storage system. The hydrogen storage system consisted of an electrolyzer, a hydrogen storage tank, and a solid oxide fuel cell (SOFC). The hydrogen storage system [22] can provide regulation capability to the microgrid and improve the system reliability.

Electrolyzer Efficiency
Electrolyzer efficiency $\eta_{el}$ is the efficiency of the water electrolysis reaction at constant temperature and pressure. The electrolyzer efficiency [23] consists of voltage efficiency $\eta_v$ and current efficiency $\eta_i$ as below:
$$\eta_{el} = \eta_v \eta_i \tag{1}$$
The current efficiency, also known as Faraday efficiency, can be expressed as:
$$\eta_i = 96.5\, e^{0.09/I - 75.5/I^2} \tag{2}$$
where $I$ is the stack current of the electrolyzer. Voltage efficiency is the ratio between the theoretical decomposition voltage of water and the actual decomposition voltage, which can be expressed as
$$\eta_v = \frac{U_{tn}}{U_{el}} \tag{3}$$
where $U_{tn}$ is the theoretical decomposition voltage, which is generally 1.482 V; $U_{el}$ is the actual decomposition voltage. Under the pressure $p$ of $1.01 \times 10^{5}$ Pa, $U_{el}$ depends on the unit current density during the electrolysis of water, as below:
$$U_{el} = U_{rev} + U_{ohm} + U_{h_2} + U_{o_2} \tag{4}$$
where $j$ is the unit current density; $T$ is the working temperature of the electrolyzer; $U_{rev}$ is the reversible voltage of the electrolytic water; $U_{ohm}$ is the voltage drop caused by the resistance of the electrolyte; $U_{h_2}$ and $U_{o_2}$ are the hydrogen overpotential and oxygen overpotential generated by the electrolytic water, respectively. $U_{rev}$, $U_{ohm}$, $U_{h_2}$ and $U_{o_2}$ are determined by
$$U_{rev}(T, p) = 1.5184 - 1.5421 \times 10^{-3}\, T + 9.523 \times 10^{-5}\, T \ln T + 9.84 \times 10^{-8}\, T^{2} \tag{5}$$
$$U_{ohm}(j) = j R_i \tag{6}$$
$$U_{h_2}(j, T) = \frac{RT}{\alpha_c n_c F} \ln \frac{j}{j_{co}} \tag{7}$$
$$U_{o_2}(j, T) = \frac{RT}{\alpha_a n_a F} \ln \frac{j}{j_{ao}} \tag{8}$$
where $R_i$ is the resistance of the electrolyte; $R$ is the universal gas constant; $F$ is the Faraday constant; $\alpha_a$ and $\alpha_c$ are the charge transfer coefficients of anode and cathode, respectively; $j_{ao}$ and $j_{co}$ are the exchange current densities of anode and cathode, respectively; $n_a$ and $n_c$ are the electron transfer numbers of anode and cathode, respectively.
Electronics 2022, 11, 196
The input power of the electrolyzer $P_{el}$ is related to the electrolyzer current $I$ as follows:
$$P_{el} = U_{el} I \tag{9}$$
The relation between the input power and the electrolyzer efficiency can be obtained by Equations (1)-(9). However, the relation is complicated and contains logarithmic calculations.
Thus, it is difficult to find the corresponding electrolyzer efficiency based on the input power of the electrolyzer in the microgrid scheduling problem.
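To make the two efficiency terms concrete, the following is a minimal sketch that evaluates the Faraday efficiency of Equation (2) and the voltage efficiency as the ratio of the theoretical to the actual decomposition voltage. It assumes that the overall efficiency is the product of the two terms and that Equation (2) returns a percentage; the function names are hypothetical, and the actual voltage `u_el` is taken as an input rather than computed from the overpotential model.

```python
import math

U_TN = 1.482  # theoretical decomposition voltage in volts (value given in the text)


def current_efficiency_percent(stack_current):
    """Faraday efficiency in percent, following Equation (2)."""
    return 96.5 * math.exp(0.09 / stack_current - 75.5 / stack_current ** 2)


def voltage_efficiency(u_el):
    """Ratio of theoretical to actual decomposition voltage."""
    return U_TN / u_el


def electrolyzer_efficiency(stack_current, u_el):
    """Assumed overall efficiency: product of voltage and current efficiency (as a fraction)."""
    return voltage_efficiency(u_el) * current_efficiency_percent(stack_current) / 100.0
```

Because the current efficiency depends on the current and the voltage efficiency depends on the current density through logarithms, inverting this relation for a given input power has no simple closed form, which motivates the table lookup used in this paper.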
In order to simplify the electrolyzer efficiency characteristics model, this paper first obtains $\eta_{el}$ and the corresponding $P_{el}$ according to $j$. Then, the electrolyzer efficiency characteristic curve is obtained based on $\eta_{el}$ and $P_{el}$, as shown in Figure 2. Twenty points on the efficiency characteristic curve are taken as the original data to form the data table. When solving the scheduling problem, the electrolyzer efficiency corresponding to $P_{el}$ can be quickly found by looking up the table and linear interpolation, as shown below:
$$\eta_{el} = \eta_0 + \frac{\eta_1 - \eta_0}{P_1 - P_0}\,(P_{el} - P_0) \tag{10}$$
where $P_0$ and $P_1$ are the two power values nearest to $P_{el}$ in the data table; $\eta_0$ and $\eta_1$ are the corresponding electrolyzer efficiencies of $P_0$ and $P_1$ in the data table.
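The table lookup with linear interpolation described above can be sketched as follows. The table values are placeholders chosen for illustration, not the twenty points used in the paper, and clamping to the first or last table entry outside the tabulated range is an assumption of this sketch. The function name `lookup_efficiency` is hypothetical.

```python
from bisect import bisect_right

# Placeholder table (ascending power in kW, efficiency as a fraction); not the paper's data.
POWERS = [0.0, 1.0, 2.0, 3.0]
EFFICIENCIES = [0.50, 0.70, 0.66, 0.60]


def lookup_efficiency(p_el, powers=POWERS, efficiencies=EFFICIENCIES):
    """Return the efficiency at input power p_el by linear interpolation in the table."""
    if p_el <= powers[0]:
        return efficiencies[0]   # assumption: clamp below the table
    if p_el >= powers[-1]:
        return efficiencies[-1]  # assumption: clamp above the table
    i = bisect_right(powers, p_el)           # powers[i-1] <= p_el < powers[i]
    p0, p1 = powers[i - 1], powers[i]
    e0, e1 = efficiencies[i - 1], efficiencies[i]
    return e0 + (e1 - e0) * (p_el - p0) / (p1 - p0)
```

A binary search keeps each lookup logarithmic in the table size, although with only twenty points a linear scan would work equally well.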
When the input power and the efficiency of the electrolyzer are determined, the hydrogen production power of the electrolyzer can be calculated according to Equation (11):
$$P_{el,out} = \eta_{el} P_{el} \tag{11}$$
where $P_{el,out}$ is the hydrogen production power of the electrolyzer. Different from the conventional fixed efficiency model of the electrolyzer, the hydrogen production power is obtained by multiplying the power consumption of the electrolyzer by the corresponding efficiency obtained from the electrolyzer efficiency characteristic model.
The total cost $F$ in all scheduling periods of a day is set as the objective function. This objective function covers not only the economic benefits of the microgrid, but also its environmental benefits, as below:
$$F = \sum_{t=1}^{T} \left[ C_{MT}(t) + C_{MT}^{co_2}(t) + C_{bat}(t) + C_{el}(t) + C_{fc}(t) \right]$$
where $T$ is the whole dispatching cycle; $t$ is the time step, and the scheduling interval is 1 h; $C_{MT}(t)$ is the operating cost of the microturbine at time $t$; $C_{MT}^{co_2}(t)$ is the CO2 emission cost of the microturbine at time $t$; $C_{bat}(t)$, $C_{el}(t)$ and $C_{fc}(t)$ are the operation costs of the BESS, electrolyzer, and fuel cell, respectively.
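The above operation costs can be determined by
$$C_{MT}(t) = \left[ \delta_2 \left(P^{MT}_t\right)^2 + \delta_1 P^{MT}_t + \delta_0 \right] \Delta t$$
$$C_{MT}^{co_2}(t) = c_{co_2}\, \lambda_{MT}^{co_2}\, P^{MT}_t\, \Delta t$$
$$C_{bat}(t) = c_{bat} \left| P^{b}_t \right| \Delta t$$
$$C_{el}(t) = c_{el}\, P^{el}_t\, \Delta t$$
$$C_{fc}(t) = c_{fc}\, P^{fc}_t\, \Delta t$$
where $\delta_2$, $\delta_1$, and $\delta_0$ are the power generation cost coefficients of the microturbine; $\Delta t$ is the scheduling interval; $c_{bat}$, $c_{el}$, and $c_{fc}$ are the operation and maintenance cost coefficients of the BESS, electrolyzer, and fuel cell, respectively; $\lambda_{MT}^{co_2}$ is the CO2 emission coefficient of the microturbine; $c_{co_2}$ is the carbon emission price of the carbon trading market; $P^{MT}_t$ is the power generation of the microturbine at time $t$; $P^{b}_t$ is the charging or discharging power of the BESS at time $t$, where a positive value means the BESS is charged and a negative value means it is discharged; $P^{el}_t$ and $P^{fc}_t$ are the input power of the electrolyzer and the output power of the fuel cell at time $t$, respectively.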
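A minimal sketch of the per-period cost terms defined above is given below. It assumes a quadratic microturbine fuel cost, an emission cost proportional to microturbine output, and operation and maintenance costs proportional to the absolute power of each device, each scaled by the scheduling interval. These functional forms and all function names are assumptions for illustration; the coefficient values must come from the parameter tables of the study.

```python
def microturbine_cost(p_mt, d2, d1, d0, dt=1.0):
    """Assumed quadratic generation cost of the microturbine over one period."""
    return (d2 * p_mt ** 2 + d1 * p_mt + d0) * dt


def emission_cost(p_mt, emission_factor, carbon_price, dt=1.0):
    """Assumed CO2 cost: price times emission factor times generated energy."""
    return carbon_price * emission_factor * p_mt * dt


def om_cost(power, coeff, dt=1.0):
    """Assumed operation and maintenance cost, proportional to absolute power."""
    return coeff * abs(power) * dt


def period_cost(p_mt, p_b, p_el, p_fc, params, dt=1.0):
    """Sum of the five cost terms for one scheduling period."""
    return (
        microturbine_cost(p_mt, params["d2"], params["d1"], params["d0"], dt)
        + emission_cost(p_mt, params["lam"], params["c_co2"], dt)
        + om_cost(p_b, params["c_bat"], dt)
        + om_cost(p_el, params["c_el"], dt)
        + om_cost(p_fc, params["c_fc"], dt)
    )
```

The daily objective is then the sum of `period_cost` over all scheduling periods.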

Constraints
Generally, in order to ensure the overall working efficiency of the hydrogen storage system, the electrolyzer and fuel cell cannot work at the same time. Therefore, the input power of the electrolyzer is regarded as the charging power of the whole hydrogen storage system, and the output power of the fuel cell is regarded as the discharging power of the whole hydrogen storage system, as below:
$$P^{h_2}_t = \begin{cases} P^{el}_t, & \text{hydrogen storage system charging} \\ -P^{fc}_t, & \text{hydrogen storage system discharging} \end{cases}$$
where $P^{h_2}_t$ is the charging/discharging power of the hydrogen storage system at time $t$. A positive value of $P^{h_2}_t$ means the hydrogen storage system is charged; otherwise, it is discharged.
In addition to economic efficiency, the operation safety of the microgrid also needs to be guaranteed. The operation constraints of the microgrid are as follows:

1. Power balance
The microgrid in this study is off grid. The power balance of the microgrid mainly relies on the output power of PV generation and the microturbine. The imbalance power is regulated by the BESS and the hydrogen storage system. The power balance equation is
$$P^{PV}_t - P^{curt}_t + P^{MT}_t = P^{load}_t - P^{loss}_t + P^{b}_t + P^{h_2}_t$$
where $P^{PV}_t$, $P^{curt}_t$, $P^{load}_t$ and $P^{loss}_t$ are the available PV generation, curtailment of PV generation, load power, and curtailment of load at time $t$, respectively.

2. Operating power constraints
To ensure the safety of the devices in the microgrid, the operating power constraints are as below:
$$P^{MT}_{min} \le P^{MT}_t \le P^{MT}_{max}$$
$$P^{b}_{min} \le P^{b}_t \le P^{b}_{max}$$
$$P^{el}_{min} \le P^{el}_t \le P^{el}_{max}$$
$$P^{fc}_{min} \le P^{fc}_t \le P^{fc}_{max}$$
where $P^{MT}_{max}$, $P^{b}_{max}$, $P^{el}_{max}$ and $P^{fc}_{max}$ are the upper power limits of the microturbine, BESS, electrolyzer, and fuel cell, respectively; $P^{MT}_{min}$, $P^{b}_{min}$, $P^{el}_{min}$ and $P^{fc}_{min}$ are the lower power limits of the microturbine, BESS, electrolyzer, and fuel cell, respectively.

3. Energy storage capacity
In order to avoid overcharging and over-discharging of energy storage, the states of charge (SOCs) of energy storage are constrained as:
$$SOC^{b}_{min} \le S^{b}_t \le SOC^{b}_{max}$$
$$SOC^{h_2}_{min} \le S^{h_2}_t \le SOC^{h_2}_{max}$$
where $S^{b}_t$ is the SOC of the BESS at time $t$; $SOC^{b}_{max}$ and $SOC^{b}_{min}$ are the upper and lower limits of the SOC of the BESS, respectively; $S^{h_2}_t$ is the SOC of the hydrogen storage system at time $t$; $SOC^{h_2}_{max}$ and $SOC^{h_2}_{min}$ are the upper and lower limits of the SOC of the hydrogen storage system. The SOCs of the two energy storage devices are updated at each time step from the charging/discharging power, the conversion efficiencies, and the storage capacities, where $\eta_b$ and $\zeta_b$ are the charging and discharging efficiencies of the BESS, respectively; $\eta_{h_2}$ and $\varsigma_{h_2}$ are the efficiencies of the electrolyzer and fuel cell, respectively; $E_b$ and $E_{h_2}$ are the capacities of the BESS and hydrogen storage tank, respectively.
Because the operating cost of the microturbine is a quadratic function, the objective function is nonlinear. All of the constraints are linear. Thus, the whole model is a quadratic programming model, which is nonlinear.
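The SOC bookkeeping can be sketched as follows. The update rule below is an assumption (a common charge/discharge form with separate efficiencies), not a formula quoted from this paper: charging adds the stored fraction of the input energy, while discharging removes more energy than is delivered. The function names are hypothetical, and power is positive when the storage is charged, matching the sign convention in the text.

```python
def next_soc(soc, power, capacity, eta_charge, eta_discharge, dt=1.0):
    """Assumed SOC update for one period; power > 0 charges, power < 0 discharges."""
    if power >= 0.0:
        delta = eta_charge * power * dt / capacity
    else:
        delta = power * dt / (eta_discharge * capacity)
    return soc + delta


def soc_within_limits(soc, soc_min, soc_max):
    """Check the SOC bound constraint."""
    return soc_min <= soc <= soc_max
```

For the hydrogen storage system, `eta_charge` would be the electrolyzer efficiency (for example, the value returned by the efficiency table) and `eta_discharge` the fuel cell efficiency.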

Deep Reinforcement Learning
Deep reinforcement learning is a data-driven method that can be used in high-dimensional sequential decision-making problems. The deep reinforcement learning model can be trained offline and applied online [24]. Thus, deep reinforcement learning is suitable for the optimal operation of the microgrid. The block diagram of the optimal operation of the microgrid with deep reinforcement learning is shown in Figure 3.

Reinforcement Learning
Reinforcement learning is the learning process where an intelligence agent interacts with the environment in order to maximize the cumulative reward. The schematic diagram of reinforcement learning is shown in Figure 4.

Q-learning is one of the main algorithms of reinforcement learning. Q-learning evaluates the merit of an action by the state-action value function and obtains the optimal policy by solving the optimal action value function. The action value function is calculated as below:
$$Q_{k+1}(s_t, a_t) = r_k + \gamma \max_{a'} Q_k(s_{t+1}, a')$$
where $Q_k(s_t, a_t)$ is the value function of the state-action pair at the $k$th iteration under the state $s_t$; $\gamma$ is the decay rate; $r_k$ is the reward value under the action $a_t$ at the $k$th iteration; $a'$ is an arbitrary action that can be selected for the state $s_{t+1}$.
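A minimal tabular sketch of the Q-learning update described above is shown below. The learning rate `alpha` is an assumption added for illustration (it is not defined in the text); with `alpha = 1.0` the update reduces to assigning the reward plus the discounted best next value. The function name and the dictionary-based table are hypothetical.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, gamma=0.9, alpha=1.0):
    """Update q[(state, action)] toward reward + gamma * max over next actions."""
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Usage: the table defaults to 0.0 for unseen state-action pairs.
q_table = defaultdict(float)
q_table[("s1", "charge")] = 2.0
q_update(q_table, "s0", "charge", 1.0, "s1", ["charge", "idle"], gamma=0.5)
```

A table of this kind needs one entry per discrete state-action pair, which is why the method does not scale to continuous power set points.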

Deep Deterministic Policy Gradient Algorithm
Conventional reinforcement learning methods, such as Q-learning, perform well in problems with small discrete spaces. However, when dealing with continuous state variable tasks, the number of states using discretization method increases exponentially as the dimensionality of the space increases. This results in the curse of dimensionality. With the development of machine learning, deep learning is combined with reinforcement learning to solve the curse of dimensionality problem. In this paper, the DDPG algorithm was used to solve the microgrid optimal operation problem. The DDPG algorithm [20] consists of two independent neural networks fitting the policy function and the action-value function. The two neural networks are called the policy network and the evaluation network.
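In addition, two target networks were used for the policy network and evaluation network to add stability to training. The network parameters of the strategy network, evaluation network, target strategy network, and target evaluation network are $\theta^{\pi}$, $\theta^{Q}$, $\theta^{\pi'}$ and $\theta^{Q'}$, respectively. The strategy network and the evaluation network were updated with the corresponding learning rates for the parameters. The evaluation network was updated by minimizing the loss function as below:
$$L(\theta^{Q}) = E\left[ \left( y_t - Q(s_t, a_t \mid \theta^{Q}) \right)^2 \right]$$
$$y_t = r_t + \gamma\, Q'\!\left(s_{t+1}, \pi'(s_{t+1} \mid \theta^{\pi'}) \mid \theta^{Q'}\right)$$
where $E$ is the expectation; $y_t$ is the target Q value; $Q'$ and $\pi'$ are the target Q value and target strategy, respectively. The policy network parameters were updated by sampling the policy gradient as:
$$\nabla_{\theta^{\pi}} \pi = \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_t,\, a=\pi(s_t)}\; \nabla_{\theta^{\pi}} \pi(s \mid \theta^{\pi})\big|_{s=s_t}$$
After the parameters of the strategy network and evaluation network were updated, the parameters of the two target networks were updated through the soft update technique as below:
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\, \theta^{Q'}$$
$$\theta^{\pi'} \leftarrow \tau \theta^{\pi} + (1 - \tau)\, \theta^{\pi'}$$
where $\tau$ is the soft update coefficient. In order to enhance the ability to explore the environment, random noise $\upsilon_t$ needs to be added to the actions as:
$$a_t = \pi(s_t \mid \theta^{\pi}) + \upsilon_t$$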
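Three building blocks of the DDPG update described above can be sketched without a deep learning framework: the target Q value, the soft update of target parameters, and exploration noise added to the action. Parameters are represented as plain lists of floats for illustration. Clipping the noisy action to the action bounds is an assumption of this sketch, and all function names are hypothetical.

```python
import random


def td_target(reward, gamma, target_q_next):
    """Target Q value: reward plus discounted value from the target networks."""
    return reward + gamma * target_q_next


def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online_params, target_params)]


def noisy_action(action, sigma, low, high, rng=random):
    """Add zero-mean Gaussian noise to each action component, then clip (assumed)."""
    return [
        min(max(a + rng.gauss(0.0, sigma), lo), hi)
        for a, lo, hi in zip(action, low, high)
    ]
```

A small `tau` makes the target networks change slowly, which keeps the regression target of the evaluation network from shifting too quickly during training.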

State Space
The state space needs to include the factors that impact the strategy. For the optimal operation of the PV-hydrogen energy system, the parameters of the state space include the power generation of PV, the load, the SOC of the BESS, and the SOC of hydrogen storage. Therefore, the state space contains four states and can be expressed as
$$S_t = \left[ P^{PV}_t,\; P^{load}_t,\; S^{b}_t,\; S^{h_2}_t \right]$$
where $S_t$ is the state space, which is the input of the policy network. Thus, the dimension of the input layer of the DDPG policy network is 4.

Action Space
The decision variables of the microgrid operation optimization include the output of the microturbine, the charging and discharging power of the BESS, the charging and discharging power of the hydrogen storage system, the curtailment of PV generation, and the curtailment of load power at time $t$. A high-dimensional action space makes it difficult for the agent to explore the feasible solutions. To avoid this, the action space of the microgrid operation optimization problem is expressed by the microturbine output and the hydrogen storage system charging/discharging power as:
$$a_t = \left[ P^{MT}_t,\; P^{h_2}_t \right]$$
After the agent selects the action, the values of the other decision variables are determined by the following rules. Firstly, the unbalanced power after the agent selects the action is calculated according to Equation (37):
$$P^{extra}_t = P^{PV}_t + P^{MT}_t - P^{load}_t - P^{h_2}_t \tag{37}$$
where $P^{extra}_t$ is the unbalanced power of the system. When the unbalanced power is positive, the system has surplus generation, and the BESS is set to charge. When the unbalanced power is negative, the generation of the system is insufficient, and the BESS is set to discharge. Since the output power of the BESS is limited by the SOC constraints, the maximum charging and discharging power under the current SOC are calculated first, where $P^{cha,max}_t$ is the maximum allowable charging power under the SOC at time $t$; $P^{dis,max}_t$ is the maximum allowable discharging power under the SOC at time $t$.
The charging and discharging power of BESS were calculated by comprehensively considering the power limits and SOC constraints, as shown in Equation (39). Finally, the curtailment of PV generation and the curtailment of load power of the system were calculated according to the charging/discharging power of BESS and the imbalance power of the system, as shown in Equations (41) and (42).
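The rule-based completion of the remaining decision variables can be sketched as below. The sketch assumes that the unbalanced power is PV plus microturbine output minus load and hydrogen storage charging power, that the BESS absorbs or supplies as much of the imbalance as its power limit and SOC-dependent limits allow, and that any remainder becomes PV curtailment or load shedding. The function names are hypothetical, and the SOC-dependent limits are taken as inputs.

```python
def unbalanced_power(p_pv, p_mt, p_load, p_h2):
    """Surplus (positive) or deficit (negative) after the agent's action."""
    return p_pv + p_mt - p_load - p_h2


def dispatch_bess(p_extra, p_cha_max, p_dis_max, p_b_max):
    """Return (p_b, p_curt, p_loss); p_b > 0 means the BESS is charging."""
    if p_extra >= 0.0:
        p_b = min(p_extra, p_cha_max, p_b_max)
        return p_b, p_extra - p_b, 0.0       # leftover surplus is curtailed PV
    deficit = -p_extra
    discharge = min(deficit, p_dis_max, p_b_max)
    return -discharge, 0.0, deficit - discharge  # leftover deficit is shed load
```

Fixing these variables by rule keeps the learned action two-dimensional while guaranteeing that the power balance holds in every period.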
The flowchart is shown in Figure 5. At each scheduling time $t$, the action vector $a_t$ with dimension 2 is generated by the strategy network under the state $s_t$. Therefore, the output layer dimension of the policy network is 2. Since $s_t$ and $a_t$ are both inputs of the evaluation network, the input layer dimension of the evaluation network is 6.

Reward Function
The goal of the agent in the learning process is to maximize the reward. The optimal policy must satisfy the constraints of the microgrid model. Thus, the constraints need to be reasonably transformed into the reward function. The equipment power is constrained by the upper and lower limits of the action space. The SOC constraints of the BESS are met in the decision-making process. Therefore, it is only necessary to add the SOC constraints of the hydrogen storage system to the reward function in the form of a penalty function $D_1$, where $D_1$ is the hydrogen storage system SOC penalty function. The microgrid operates in an off-grid mode. In order to reduce the amount of load shedding and PV curtailment and thereby improve the utilization of renewable energy, the costs of load shedding and PV curtailment are added into the reward function as part of the microgrid operation cost:
$$D_2 = \rho \left( P^{curt}_t + P^{loss}_t \right) \Delta t$$
where $D_2$ represents the total cost of load shedding and PV curtailment; $\rho$ is the cost coefficient.
Since the objective of the proposed model is to minimize the microgrid operating cost, the reward function for each dispatch period contains the power system operating cost F t , the hydrogen storage SOC penalty function D 1 , and the cost of load shedding and PV curtailment D 2 .
Moreover, deep reinforcement learning is a process of maximizing the cumulative reward, so the operating cost in the reward function needs to be expressed as a negative value as:
$$r_t = -\left( F_t + D_1 + D_2 \right)$$
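A minimal sketch of the per-period reward is given below. The curtailment cost is assumed to be the cost coefficient times the curtailed and shed energy, and the hydrogen SOC penalty is assumed to grow linearly with the amount by which the SOC leaves its allowed range. The penalty form, the `weight` parameter, and the function names are hypothetical and serve only to illustrate how the three terms combine with a negative sign.

```python
def soc_penalty(soc, soc_min, soc_max, weight):
    """Hypothetical penalty D1: proportional to the SOC bound violation."""
    violation = max(soc_min - soc, soc - soc_max, 0.0)
    return weight * violation


def curtailment_cost(p_curt, p_loss, rho, dt=1.0):
    """Assumed cost D2 of PV curtailment and load shedding."""
    return rho * (p_curt + p_loss) * dt


def reward(period_cost, d1, d2):
    """Negative of operating cost plus penalties, so maximizing reward minimizes cost."""
    return -(period_cost + d1 + d2)
```

With this sign convention, a schedule that keeps the hydrogen SOC within limits and avoids curtailment receives a reward equal to the negative operating cost.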

Process of the Optimal Operation Method
The flowchart of the proposed optimal operation method of the microgrid based on DDPG is shown in Figure 6.

Simulation Environment
The microgrid used for the study is shown in Figure 1. The parameters of the electrolyzer efficiency characteristics are shown in Table 1, and the power limits and cost parameters of the equipment in the microgrid are shown in Table 2 [25]. The capacity of the hydrogen storage tank is 200 kWh. The electrochemical storage capacity is 2.9 kWh. The efficiency of the fuel cell is 0.65, and the charging and discharging efficiencies of the electrochemical storage are both 0.95. The microturbine cost parameters $\delta_2$, $\delta_1$, $\delta_0$ are 0.001166 USD/kW², 0.03677 USD/kW, and 0.06829 USD/kW, respectively. The cost coefficient of load shedding and PV curtailment is 0.3152 USD/kWh. The factor of CO2 emission is 724 kg/kW, and the carbon emission price in the carbon trading market of Beijing in China is 0.009079 USD/kg. The data of PV and load are from [26]. The curves of PV generation and load forecast for a typical day are shown in Figure 7.

Simulation Results of Electrolyzer Efficiency Characteristics
In order to study the effect of the electrolyzer efficiency characteristic on the microgrid operation scheduling, the capacity of the hydrogen storage tank is set as 10 kWh. In the case where the efficiency characteristic is not considered, the efficiency of the electrolyzer is set as a constant that is 0.65 from the literature [21].
The scheduling scheme of the constant efficiency case is applied to the more accurate electrolyzer model considering the efficiency characteristic, and the resulting SOCs of the BESS and hydrogen storage are shown in Figure 8b. In contrast, the simulation results using the efficiency characteristic model are shown in Figure 8a. The microgrid operation costs under different electrolyzer models are shown in Table 3.
As shown in Table 3, the operating cost under the model considering the electrolyzer efficiency characteristic is the lowest. The constant efficiency models result in larger operating costs. It can be seen from Figure 8 that the maximum SOC of hydrogen storage under the constant efficiency model is much less than 1. Under the model considering the efficiency characteristic, the SOC of hydrogen storage reaches 1 at certain time steps. This means that adopting the model with efficiency characteristics can better utilize the hydrogen storage capacity and further reduce the operating cost.

Simulation Results of DDPG Algorithm
Deep reinforcement learning needs to train a neural network in a short time and use it for action decision making and value estimation. Thus, deep reinforcement learning usually has a relatively shallow network to ensure fast training. Moreover, a neural network structure that is too deep and wide can easily lead to over-fitting. Based on experiments, a network structure with two hidden layers is adopted. The strategy network in the DDPG algorithm in this study consists of a 4-dimensional input layer, 2 hidden layers with 64 neurons, and an output layer for actions. The evaluation network consists of a 4-dimensional input layer for states, a 2-dimensional input layer for actions, 2 hidden layers with 64 neurons, and an output layer for outputting Q values. The structure of the neural network is shown in Figure 9.
The decay rate of the DDPG algorithm $\gamma$ is 0.9. The learning rate of the strategy network is 0.0001. The learning rate of the evaluation network is 0.001. A total of 64 samples are selected for each learning process. The size of the experience pool is 10,000. The initial standard deviation of the Gaussian noise is 1, and it is multiplied by 0.9995 after each scheduling period during the learning process. Additionally, the number of iterations is set as 2000.
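The exploration noise schedule above is a geometric decay, and a short sketch shows how quickly it shrinks. It assumes that the factor 0.9995 is applied once per scheduling period and, for the final comparison, that each training iteration covers 24 hourly periods (an assumption based on the 1 h scheduling interval over one day). The function names are hypothetical.

```python
import math


def noise_std(step, sigma0=1.0, decay=0.9995):
    """Standard deviation of the exploration noise after `step` scheduling periods."""
    return sigma0 * decay ** step


def steps_to_reach(target, sigma0=1.0, decay=0.9995):
    """Smallest number of periods after which the standard deviation is at most `target`."""
    return math.ceil(math.log(target / sigma0) / math.log(decay))


half_life = steps_to_reach(0.5)          # periods until the noise is halved
final_std = noise_std(24 * 2000)         # after 2000 iterations of 24 periods (assumed)
```

Under these assumptions the noise is halved after roughly 1400 scheduling periods and is negligible by the end of training, so late iterations are almost purely exploitative.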
The reward value curve during the training of the algorithm is shown in Figure 10. After about 1000 rounds, the reward value is essentially stable and the algorithm has converged. The operating cost of the microgrid is USD 5.29.
From Figure 11a, we can see that, in the time period from 8:00 to 17:00, the PV generation increases and the BESS starts to charge. The electrolyzer also produces hydrogen, and the BESS stops acting after it is fully charged. In the time period from 18:00 to 23:00, when the PV generation decreases and the load demand is high, the hydrogen storage is mainly used for power generation because it has a larger capacity.
Figure 11b shows the curtailment of PV generation and load. A positive value means the microgrid has excess generation, resulting in curtailment of PV generation. A negative value means the load exceeds the generation and part of the load is shed. As can be seen, there is no load shedding in the microgrid, and all load demands are met. However, there is curtailment of PV generation during the time steps from 11:00 to 17:00.
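The sign convention of Figure 11b can be sketched as follows. The function name, the argument names, and the simple power balance are assumptions made for illustration; the balance equation of the paper is defined outside this section and may contain further terms.

```python
def split_imbalance(pv_kw, dispatch_kw, load_kw):
    """Return (pv_curtailment, load_shedding) for one time step.

    dispatch_kw is the assumed net output of the controllable units:
    positive when they supply power, negative when they absorb it.
    """
    imbalance = pv_kw + dispatch_kw - load_kw
    curtailment = max(imbalance, 0.0)  # positive imbalance: excess PV is curtailed
    shedding = max(-imbalance, 0.0)    # negative imbalance: part of the load is shed
    return curtailment, shedding


# Surplus case: storage absorbs 20 kW, yet 5 kW of PV is still left over.
print(split_imbalance(50.0, -20.0, 25.0))  # (5.0, 0.0)
# Deficit case: the units supply 10 kW, but 2 kW of load cannot be served.
print(split_imbalance(0.0, 10.0, 12.0))    # (0.0, 2.0)
```

With this convention, a day without load shedding corresponds to the second return value being zero at every time step.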

Performance Evaluation
In order to test the performance of the proposed optimal operation method for the microgrid, the proposed algorithm is compared with the genetic algorithm (GA) [27] and the interior point method [28]. The DDPG algorithm is implemented in Python using the TensorFlow framework. The interior point method and GA are conducted in MATLAB. The interior point method is implemented using the 'fmincon' function in the optimization toolbox. The genetic algorithm is implemented using the 'ga' function. The simulation results are shown in Figures 12 and 13. Table 4 summarizes the operating costs of the microgrid using the three methods.
1. Method 1: Optimize the operation of the microgrid using the DDPG algorithm;
2. Method 2: Optimize the operation of the microgrid using the GA;
3. Method 3: Optimize the operation of the microgrid using the interior point method.
It can be seen that the operating costs of Methods 2 and 3 are higher than that of Method 1. In Methods 2 and 3, the expensive microturbine is used too often, whereas in Method 1 the cheap fuel cell is used more often, so Method 1 has the lowest total operating cost. The operating cost covers not only the economic benefits of the microgrid operation but also the environmental benefits of the microgrid. In the simulation experiment, the proposed DDPG method achieves the lowest operating cost of the three methods.

Generalization Analysis
To investigate the generalization of the DDPG algorithm to new scenarios, the already trained DDPG model is tested in new winter and summer scenarios, whose load and PV generation curves differ in shape. The load and PV generation curves are shown in Figure 14. The trained model is applied to the new scenarios, and the results are shown in Figure 15.
From the simulation results, it can be seen that in winter the PV generation is not enough to support the load demand. The hydrogen storage system is discharged most of the time, and the microturbine is put into use at peak hours from 16:00 to 22:00. Since the PV generation power in winter is low, there is no PV curtailment in winter.
In summer, the PV generation is high. The load demand can be met under the regulation of the BESS and the hydrogen storage system. From 9:00 to 17:00, the PV generation is larger, and the hydrogen storage system is in the charging state. From 17:00 to 23:00, the load is at its peak, and the hydrogen storage system is in the discharging state.
To compare with the DDPG algorithm, the GA method is applied to the new scenarios, and the results are shown in Table 5.

DDPG
Operating cost of winter/USD: 2.07
Operating cost of summer/USD: 5.31
As shown in Table 5, the trained DDPG model can be applied to new scenarios directly without additional training, and the operating cost of the microgrid is lower than that obtained with the GA. This indicates that the proposed algorithm has a certain generalization capability after training and can reduce the operating cost of the hydrogen microgrid.
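Applying a trained model "directly without additional training" amounts to rolling out a fixed policy on the new scenario and summing the cost, with no exploration noise and no parameter update. The sketch below shows only this evaluation loop. The function `rollout_cost`, the toy policy, the toy storage model, and the two short scenarios are hypothetical stand-ins, not the microgrid model or the trained networks of this paper.

```python
def rollout_cost(policy, step, initial_state, scenario):
    """Total cost of a frozen policy over one scenario (no learning update)."""
    total_cost = 0.0
    state = initial_state
    for exogenous in scenario:
        action = policy(state, exogenous)  # deterministic, no exploration noise
        state, cost = step(state, action, exogenous)
        total_cost += cost
    return total_cost


def toy_policy(soc, exogenous):
    """Hypothetical rule: charge with the PV surplus, discharge on a deficit."""
    pv, load = exogenous
    return pv - load


def toy_step(soc, action, exogenous):
    """Hypothetical storage with capacity 10; unmet load costs 0.1 per unit."""
    pv, load = exogenous
    new_soc = min(max(soc + action, 0.0), 10.0)
    delivered = new_soc - soc  # negative when the storage discharges
    unmet = max(load - pv + delivered, 0.0)
    return new_soc, 0.1 * unmet


# Two made-up scenarios given as (PV, load) pairs per time step.
scenario_a = [(1.0, 3.0), (0.0, 2.0)]
scenario_b = [(5.0, 2.0), (0.0, 2.0)]

cost_a = rollout_cost(toy_policy, toy_step, 1.0, scenario_a)
cost_b = rollout_cost(toy_policy, toy_step, 1.0, scenario_b)
print(cost_a, cost_b)
```

In the study itself, the role of `toy_policy` would be played by the trained strategy network and the role of `toy_step` by the microgrid operation model; the same loop run on the GA schedule would give the comparison reported in Table 5.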

Conclusions
This paper proposes a refined model to represent the electrolyzer efficiency characteristics using the linear interpolation method. The electrolyzer efficiency characteristic model is combined with the model of the microgrid with hydrogen storage. Additionally, an optimal operation method based on the DDPG algorithm is proposed for the microgrid. According to the simulation results, the following conclusions can be drawn:

• The electrolyzer efficiency characteristic model based on the linear interpolation method describes the operation of the electrolyzer more accurately. The proposed optimal operation method for the microgrid, which considers the electrolyzer efficiency characteristics, can reduce the PV curtailment and the microgrid operation cost;
• The optimal microgrid operation method based on the DDPG algorithm can effectively reduce the operation cost and improve the microgrid efficiency compared with methods based on traditional algorithms, such as the GA and the interior point method;
• The optimal microgrid operation method based on the DDPG algorithm has a certain generalization capability and can be used in different scenarios.
However, the uncertainties of PV and load are not considered in this research, and the fuel cell efficiency is ignored. Future work will focus on the microgrid operation optimization strategy under uncertain environments and take into account the characteristics of fuel cell to make the operation model more realistic.