A Frequency and Voltage Coordinated Control Strategy of Island Microgrid including Electric Vehicles

Abstract: Frequency and voltage deviations are important indices of power quality, and maintaining voltage and frequency (VF) stability is essential for microgrids. To address VF regulation of a microgrid under wind disturbance and load fluctuation, this paper proposes a comprehensive VF control strategy for an islanded microgrid with electric vehicles (EVs) based on the Deep Deterministic Policy Gradient (DDPG) algorithm. First, considering the randomness of users' travel demand and charging behavior, SOC constraints of EVs are added to construct a cluster-EV charging model. A four-quadrant bidirectional charger capacity model is then introduced to build a microgrid VF control model that includes load, a micro gas turbine (MT), EVs, and their random power increment constraints. Second, the structure of the DDPG controller is designed around the two control goals of microgrid frequency and voltage. Then, the state and action spaces are defined, global and local reward functions are designed, and the optimal hyperparameters are selected. Finally, different scenarios are simulated in an islanded microgrid with EVs, and the results are compared with traditional PI control and R(λ) control. The simulation results show that the proposed DDPG controller can quickly and efficiently suppress the VF fluctuations caused by wind disturbance and load fluctuations simultaneously.


Introduction
A microgrid is a small-scale power generation and distribution system composed of distributed power sources, energy storage devices, energy conversion devices, loads, and monitoring and protection devices. It is an autonomous system capable of self-control, protection, and management, and it can operate in either grid-connected or islanded mode. In islanded mode, the power quality of the microgrid is usually maintained by the micro sources and flexible loads [1]. At the same time, with the development of vehicle-to-grid (V2G) technology, research on EVs for peak shaving and valley filling, suppression of power fluctuations, and microgrid stability control has deepened [2,3], which brings both opportunities and challenges to the VF regulation of microgrids.
Because the capacity of an islanded microgrid is limited, keeping frequency and voltage stable is key to its safe operation. In [4], a VF control strategy for an islanded microgrid based on a fuzzy logic controller is proposed; it controls active and reactive power, reduces power losses, and is shown to be more effective and robust than the conventional proportional-integral (PI) controller. In [5], a decoupled VF controller for distributed generators (DGs) is proposed that keeps the grid voltage and frequency magnitudes constant, thereby enhancing resilience and allowing higher penetration of renewable energy in a stand-alone microgrid. In [6], an optimized solution is proposed for minimizing both frequency and voltage deviations, achieving simultaneous VF control with proper load sharing among the DG units. However, parameter tuning for these traditional control strategies is complicated, and their performance needs further improvement under complex operating conditions such as wind disturbance and load fluctuation. Therefore, intelligent algorithms are increasingly used in microgrid control. In [7], a scheme for online minimization of harmonic distortion in an islanded microgrid based on a population-based optimization method is proposed: a central controller optimizes network voltage harmonics using the particle swarm optimization (PSO) algorithm while active power is shared among the distributed generation units. In [8], a coordinated load-shedding control scheme based on Double-Q learning is proposed for an islanded microgrid; by considering the relationship between the active power and frequency deviation of each distributed energy resource, it determines the appropriate load-shedding amount and objects when the frequency is disturbed.
However, each of the intelligent controllers mentioned above regulates only frequency or only voltage; none of them handles frequency recovery and voltage adjustment simultaneously.
Meanwhile, these microgrid models do not consider EV integration and ignore the output power limits of each unit, so both the microgrid model and the control strategy leave room for improvement. Owing to their energy saving, environmental friendliness, and flexibility, EVs have become a new type of distributed energy storage unit [9,10]; through V2G technology they can provide power support for an islanded microgrid and improve its operational flexibility. In [11], an islanded microgrid load frequency control (LFC) model including loads, distributed power sources, an MT, EVs, and their constraints is established. However, the output power boundary of the EV charging station model in that work is a fixed value, which does not match actual conditions. In [12], a microgrid including an MT, EVs, distributed power, and loads is established, and an improved robust model predictive frequency control strategy is proposed that suppresses frequency fluctuation with a faster response than other methods; however, the random output power boundary is not refined from the perspective of EV clusters. Moreover, none of the above references considers the reactive power contribution of EVs to voltage stability. In fact, the power boundary of a charging station is affected by user travel demand, charging behavior, and the characteristics of EV clusters. Furthermore, the active power P and reactive power Q output by an EV charging station can be adjusted through the control command and the charger's power factor angle, which makes it possible to use EVs for VF stability control.
In summary, user randomness affects the charging behavior at EV stations, and there is no suitable intelligent control algorithm that uses EVs to achieve coordinated VF control of an islanded microgrid. Thus, a VF coordinated control strategy based on Deep Deterministic Policy Gradient (DDPG) is proposed in this paper and applied to the VF control of an islanded microgrid with EVs. The main contributions are as follows: (1) To address the randomness of the EV charging boundary caused by user randomness, a VF control model of EVs is established: SOC constraints are imposed on the EVs and a four-quadrant bidirectional charger capacity model is introduced, yielding a microgrid VF control model that includes load, MT, EVs, and their random power boundaries. (2) Because voltage and frequency fluctuations are caused by wind disturbance and load fluctuations, a DDPG controller with online learning and experience replay is adopted; its good convergence characteristics allow it to coordinate frequency recovery and voltage regulation in the islanded microgrid. (3) To regulate voltage and frequency simultaneously, the structure of the DDPG controller is designed around the two control goals of microgrid frequency and voltage, and the state and action spaces, the global and local reward functions, and the optimal hyperparameters are specified so that the VF control requirements are met at the same time.

Microgrid Control Model with EVs
VF control in a microgrid can be provided by distributed power supplies, energy storage devices, and similar units, and EVs can also participate in VF regulation. For microgrid VF control, this section establishes a microgrid control model comprising distributed power supply, load, MT, and EVs.

Electric Vehicle Control Model
As flexible energy storage devices in microgrid control, EVs can regulate the charging and discharging power of their batteries according to the controller's instructions, thereby controlling the active power exchanged with the grid [13]. At the same time, charger scheduling can be used to regulate voltage or reactive power. A bidirectional charger can operate in all four quadrants [14]. Because the power factor cos φ takes the same value for φ and −φ, it does not determine the direction of reactive power transfer, so the operating quadrant of the charger cannot be identified from it. Using the power factor angle as the control variable instead determines both the direction and the magnitude of active and reactive power, which is more conducive to bidirectional control of active and reactive power between the grid and the EV.
The role of EVs in microgrid control is similar to that of energy storage devices. In terms of active power, the charging and discharging power of the EVs is limited to ±λ_e by the inverter capacity. E_max is the maximum capacity of the EV station, and a recommended maximum capacity E_rmax = 0.9E_max and a recommended minimum capacity E_rmin = 0.1E_max are set to ensure safe and stable operation of the station. When the current capacity E of the EV station is higher than E_rmax, the station can discharge to the microgrid with a discharge power in the range [0, λ_e]. Similarly, when E is lower than E_rmin, the station can be charged from the microgrid with a charging power in the range [−λ_e, 0]. In addition, the EV control model is affected by user uncertainty, such as the randomness of travel demand and charging behavior.
First, the randomness of users' travel demand makes the capacity and power limits of the charging station random. Therefore, SOC constraints must be established so that users' normal travel is still satisfied while EVs interact with the grid. The initial SOC of each battery is modeled as a random number obeying a Gaussian distribution [15], with probability density function given by Equation (1):

$$
f(S_0) = \frac{1}{\sigma_s\sqrt{2\pi}}\exp\left(-\frac{(S_0-\mu_s)^2}{2\sigma_s^2}\right)
\tag{1}
$$

where µ_s is the mean SOC and σ_s is its standard deviation. According to the 2017 National Household Travel Survey (NHTS) of the US Department of Transportation [16], the daily mileage L obeys a lognormal distribution with probability density function

$$
f(L) = \frac{1}{L\,\sigma_L\sqrt{2\pi}}\exp\left(-\frac{(\ln L-\mu_L)^2}{2\sigma_L^2}\right)
\tag{2}
$$

where µ_L and σ_L are the mean and standard deviation of ln L. From the daily mileage, the charging time T_c is calculated as

$$
T_c = \frac{L\,Q_{100}}{100\,P_c}
\tag{3}
$$

where P_c is the charging power and Q_100 is the energy consumption per 100 km. The leaving time T_leave must satisfy T_leave ≥ T_c, so it is set as

$$
T_{leave} = T_c + \sigma_T
\tag{4}
$$

where σ_T is a positive random number. Based on these parameters, the SOC required for future travel, SOC_m, can be calculated according to [17], where S_0 is the initial SOC of the EV. Therefore, the SOC of each EV in the station should be kept within the range [SOC_rmin, SOC_rmax], where SOC_rmax and SOC_rmin are the recommended maximum and minimum SOC values that protect battery life. To guarantee a sufficient SOC_m for subsequent driving when an EV leaves, constraints are imposed on the SOC of the EVs, as shown in Figure 1. The blue dotted line is the charging boundary: the EV can no longer charge once its SOC reaches SOC_rmax. The red dotted line is the discharging boundary: the EV can no longer discharge once its SOC reaches SOC_rmin.
The solid green line is the forced-charging boundary: the EV is forced to charge so that its SOC reaches SOC_m when it leaves the charging station.
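The stochastic EV model above (Gaussian initial SOC, lognormal daily mileage, then charging and leaving times) can be illustrated with a short Monte Carlo sketch. All numeric values (`MU_S`, `SIGMA_S`, `MU_L`, `SIGMA_L`, `P_C`, `Q_100`, and the range of σ_T) are hypothetical placeholders rather than the paper's settings; the charging-time form L·Q_100/(100·P_c), the leaving-time form T_c + σ_T, and the clipping of SOC to [0, 1] are likewise assumptions made for illustration.

```python
import random

# Hypothetical parameter values chosen only for illustration; they are NOT
# the values used in the paper.
MU_S, SIGMA_S = 0.5, 0.1     # initial SOC ~ N(mu_s, sigma_s^2)
MU_L, SIGMA_L = 3.2, 0.88    # ln(L) ~ N(mu_L, sigma_L^2), L in km
P_C = 7.0                    # charging power P_c in kW (assumed)
Q_100 = 15.0                 # energy use per 100 km in kWh (assumed)


def charging_time(mileage_km: float, p_c: float = P_C, q_100: float = Q_100) -> float:
    """Assumed form T_c = L * Q_100 / (100 * P_c): hours to recharge the day's energy."""
    return mileage_km * q_100 / (100.0 * p_c)


def sample_ev(rng: random.Random) -> dict:
    """Draw one EV's initial SOC, daily mileage, charging time and leaving time."""
    # A Gaussian draw can leave [0, 1], so clip it to a physical SOC.
    s0 = min(max(rng.gauss(MU_S, SIGMA_S), 0.0), 1.0)
    mileage = rng.lognormvariate(MU_L, SIGMA_L)
    t_c = charging_time(mileage)
    sigma_t = rng.uniform(0.1, 2.0)  # positive random slack sigma_T (assumed range, h)
    # Assumed form T_leave = T_c + sigma_T, which guarantees T_leave >= T_c.
    return {"S0": s0, "L": mileage, "Tc": t_c, "Tleave": t_c + sigma_t}


rng = random.Random(42)
fleet = [sample_ev(rng) for _ in range(5)]
for ev in fleet:
    print(f"S0={ev['S0']:.2f}  L={ev['L']:.1f} km  "
          f"Tc={ev['Tc']:.2f} h  Tleave={ev['Tleave']:.2f} h")
```

Using a seeded `random.Random` instance keeps each sampled fleet reproducible, which is convenient when the same EV population must be replayed across controller comparisons.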
Furthermore, in terms of active power, the rated charging power of a single EV is denoted P^ch_EV,i and the rated discharging power P^dis_EV,i. The relationship between the power increment of a single EV and its charging or discharging state depends on its SOC as follows. When SOC_i ≥ SOC_rmax, the EV can only discharge, with a positive power increment 0 < ΔP_EV,i < P^dis_EV,i, which keeps SOC_i below SOC_rmax. When SOC_i ≤ SOC_rmin, the EV can only charge, that is, only a negative power increment −P^ch_EV,i < ΔP_EV,i < 0 is allowed, which keeps SOC_i above SOC_rmin. When SOC_rmin < SOC_i < SOC_rmax, the EV can either charge or discharge.
Thus, the power increment satisfies −P^ch_EV,i < ΔP_EV,i < P^dis_EV,i. In summary, the distribution of instructions from the controller within the EV station is shown in Figure 2. The charging and discharging constraint boundaries of a single EV can then be written as:

$$
P^{+}_{EV,i} = \begin{cases} P^{dis}_{EV,i}, & SOC_i > SOC_{rmin} \\ 0, & SOC_i \le SOC_{rmin} \end{cases}
\qquad
P^{-}_{EV,i} = \begin{cases} -P^{ch}_{EV,i}, & SOC_i < SOC_{rmax} \\ 0, & SOC_i \ge SOC_{rmax} \end{cases}
\tag{7}
$$

Electronics 2022, 11, x FOR PEER REVIEW

The charging and discharging constraint boundaries of the cluster EVs' power P_EV are obtained from the single-EV boundaries:

$$
P^{+}_{EV} = \sum_{i=1}^{n_{EV}} P^{+}_{EV,i}, \qquad P^{-}_{EV} = \sum_{i=1}^{n_{EV}} P^{-}_{EV,i}
\tag{8}
$$

where n_EV is the number of EVs. In addition, the active power capacity depends on the number of EVs and their SOC states:

$$
E_{all} = \sum_{i=1}^{n_{EV}} E_i, \qquad E_{ct} = \sum_{i=1}^{n_{EV}} SOC_i\,E_i
\tag{9}
$$

where E_i is the active power capacity of a single EV, E_all is the total active power capacity of the EVs, and E_ct is the real-time active power capacity of the EV station. From this, the output power ΔP_EV of the EV charging station during charging and discharging must satisfy:

$$
\begin{cases}
0 \le \Delta P_{EV} \le \lambda_e, & E_{ct} > E_{rmax} \\
-\lambda_e \le \Delta P_{EV} \le \lambda_e, & E_{rmin} \le E_{ct} \le E_{rmax} \\
-\lambda_e \le \Delta P_{EV} \le 0, & E_{ct} < E_{rmin}
\end{cases}
\tag{10}
$$

When E_ct > E_rmax, the real-time capacity of the EV station exceeds the recommended maximum capacity E_rmax, for example because the number of EVs in the station has risen rapidly, so the station can only discharge. When E_ct < E_rmin, there are too few EVs in the station, or the EVs present are all at low battery, so the station can only charge. When E_rmin < E_ct < E_rmax, the EV station can either discharge to or charge from the microgrid.
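The single-EV SOC rules and the cluster summation of Equation (8) can be sketched as follows. This is a hypothetical illustration, not the authors' code: the `EV` dataclass, the helper names, and the SOC band 0.1 to 0.9 are assumptions, and positive values denote discharging, following the single-EV rules stated above.

```python
from dataclasses import dataclass


@dataclass
class EV:
    soc: float    # current state of charge SOC_i (0..1)
    p_ch: float   # rated charging power P^ch_EV,i in kW
    p_dis: float  # rated discharging power P^dis_EV,i in kW


# Assumed recommended SOC band; numeric values are not given in the text.
SOC_RMIN, SOC_RMAX = 0.1, 0.9


def ev_bounds(ev: EV) -> tuple:
    """(lower, upper) limit of one EV's power increment; positive means discharging."""
    upper = ev.p_dis if ev.soc > SOC_RMIN else 0.0   # no discharge at low SOC
    lower = -ev.p_ch if ev.soc < SOC_RMAX else 0.0   # no charge at high SOC
    return lower, upper


def cluster_bounds(evs: list) -> tuple:
    """Eq. (8): the cluster boundary is the sum of the single-EV boundaries."""
    bounds = [ev_bounds(ev) for ev in evs]
    return sum(lo for lo, _ in bounds), sum(hi for _, hi in bounds)


station = [EV(0.95, 7.0, 7.0), EV(0.50, 7.0, 7.0), EV(0.05, 7.0, 7.0)]
print(cluster_bounds(station))  # (-14.0, 14.0)
```

In the example, the nearly full EV can only discharge, the nearly empty one can only charge, and the middle one can do either, so the station-level range is the sum of the three one-sided or two-sided limits.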
Furthermore, the capacity state E of the EV station depends on which EVs are present in the station and on their SOC states. Therefore, combining Equations (8) and (10) yields the constraint on the active output power ΔP_EV that accounts for users' travel demand, the number of EVs, and their real-time SOC:

$$
\max\left(P^{-}_{EV},\ \Delta P^{min}_{EV}\right) \le \Delta P_{EV} \le \min\left(P^{+}_{EV},\ \Delta P^{max}_{EV}\right)
\tag{11}
$$

where ΔP^min_EV and ΔP^max_EV are the lower and upper limits given by Equation (10). After the boundary of the active power ΔP_EV of the EVs is obtained, the reactive power boundary can be derived from the power factor angle of the charger. The circuit topology of a four-quadrant bidirectional charger is usually a dual-buck AC-DC half-bridge converter, a traditional AC-DC half-bridge converter, or an AC-DC full-bridge converter. The capacity curve of the charger is shown in Figure 3 [18].
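Combining the cluster boundary with the station-capacity limits amounts to intersecting two intervals. The sketch below assumes that "combining" means taking this intersection, reuses the 0.1E_max and 0.9E_max thresholds stated earlier, and uses hypothetical function names; it is an illustration, not the paper's implementation.

```python
def station_limits(e_ct: float, e_max: float, lambda_e: float) -> tuple:
    """Eq. (10): limits on dP_EV set by the station's real-time capacity E_ct."""
    e_rmin, e_rmax = 0.1 * e_max, 0.9 * e_max  # recommended capacity band
    if e_ct > e_rmax:
        return 0.0, lambda_e        # high capacity: discharge only
    if e_ct < e_rmin:
        return -lambda_e, 0.0       # low capacity: charge only
    return -lambda_e, lambda_e      # normal band: either direction


def combined_limits(cluster: tuple, e_ct: float, e_max: float, lambda_e: float) -> tuple:
    """Assumed combination: intersect the cluster boundary (Eq. (8)) with Eq. (10)."""
    lo_c, hi_c = cluster
    lo_s, hi_s = station_limits(e_ct, e_max, lambda_e)
    return max(lo_c, lo_s), min(hi_c, hi_s)


print(combined_limits((-14.0, 14.0), e_ct=950.0, e_max=1000.0, lambda_e=10.0))  # (0.0, 10.0)
print(combined_limits((-14.0, 14.0), e_ct=50.0, e_max=1000.0, lambda_e=10.0))   # (-10.0, 0.0)
```

Because both intervals always contain zero, the intersection is never empty; at worst the station is restricted to a single direction.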
In Figure 3, φ is the power factor angle when the apparent rated power is ΔS_EV, and φ_min and φ_max are the minimum and maximum power factor angles of the charger. The positive directions of the P and Q axes represent energy transferred from the grid to the EV charger. When the active power is OA, the adjustable range of reactive power is CC', and the length of OB is the apparent rated power ΔS_EV.
In addition, the relationship between the active and reactive powers ΔP_EV and ΔQ_EV can be obtained from Figure 3, as in Formula (12):

$$
\Delta P_{EV} = \Delta S_{EV}\cos\varphi, \qquad \Delta Q_{EV} = \Delta S_{EV}\sin\varphi, \qquad \varphi_{min} \le \varphi \le \varphi_{max}
\tag{12}
$$

Thus, the power factor angle must satisfy the operating characteristics of the charger: when ΔP_EV > 0, the grid feeds active power to the EVs, and when ΔQ_EV > 0, the grid feeds reactive power to the EVs.
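The mapping from power factor angle to active and reactive power in Formula (12), together with the reactive range CC' at a given active power, can be sketched as below. The circular capability curve (|ΔQ| ≤ sqrt(ΔS² − ΔP²)) and the function names are assumptions for illustration; a real charger may have a non-circular or angle-limited curve.

```python
import math


def charger_pq(s_ev: float, phi: float, phi_min: float, phi_max: float) -> tuple:
    """Formula (12): (dP_EV, dQ_EV) for apparent power s_ev and angle phi in radians."""
    phi = min(max(phi, phi_min), phi_max)  # keep phi inside the charger's angle limits
    return s_ev * math.cos(phi), s_ev * math.sin(phi)


def reactive_range(s_ev: float, p: float) -> tuple:
    """Reactive range (segment CC' in Figure 3) at active power p, assuming a circular curve."""
    if abs(p) > s_ev:
        raise ValueError("active power exceeds the apparent power rating")
    q = math.sqrt(s_ev ** 2 - p ** 2)
    return -q, q


print(reactive_range(5.0, 3.0))  # (-4.0, 4.0)
print(charger_pq(10.0, math.pi / 6, -math.pi, math.pi))
```

Because the angle carries the sign of both components, a single variable φ selects the operating quadrant, which is the reason given earlier for controlling the angle instead of the power factor.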
In summary, the boundary of the output power increment of the EV charging station is affected by the number of EVs in the station n_EV, their SOC states, the real-time capacity E of the station, and the charger power factor angle.

VF Control Model of Microgrids with EVs
The outputs of distributed wind power and photovoltaic (PV) systems are random, and together with load fluctuations they affect both active and reactive power. Therefore, in the microgrid VF control considered in this paper, the wind power and PV systems are treated as equivalent disturbance sources [19]. Because the response characteristics of wind power and PV systems are similar, only VF control under wind power disturbance is considered, and the wind disturbance is reproduced from recorded historical data [20]. In addition, an MT is added to the microgrid as the main control unit to ensure the flexibility and effectiveness of microgrid regulation.
The structure of the microgrid is shown in Figure 4. The microgrid includes an MT, EVs, distributed wind power, and load.
In Figure 4, ΔP_L and ΔQ_L are the load disturbance powers, ΔP_W and ΔQ_W are the wind disturbance powers, ΔP_MT and ΔQ_MT are the power variations of the MT, and ΔP_EV and ΔQ_EV are the power variations of the EVs.

The Design of Microgrid VF Controller Based on DDPG
In an islanded microgrid, maintaining VF stability is important, but various uncertainties and nonlinearities introduced by DGs and EVs inevitably cause VF fluctuations and deviations from the reference values.
Deep Reinforcement Learning (DRL), which offers online learning and experience replay among other advantages, is suitable for nonlinear systems [21]. Therefore, a DDPG-based VF controller for an islanded microgrid with EVs is designed in this paper. The frequency and voltage deviations are fed back to the DDPG controller, which adjusts the power output of each unit to keep the system frequency and voltage stable.

Theoretical Analysis of DDPG
Q-learning and the Deep Q-Network (DQN) are typical value-based reinforcement learning algorithms that learn the optimal policy through value functions while interacting with the environment [22]. However, because these methods select actions from a discrete set, the action space must be discretized. This makes precise control of the MT, EVs, and chargers difficult, so these methods are not suitable for the problem studied in this paper.
In contrast, DDPG learns in a continuous action space [23]. DDPG contains four networks: the actor current network, the actor target network, the critic current network, and the critic target network. At time t, the parameters of the actor current network are θ, those of the actor target network are θ′, those of the critic current network are ω, and those of the critic target network are ω′.
Among these four networks, the actor current network generates the action a_t from the current state s_t. The actor target network generates the action a_{t+1} at time t + 1 from the subsequent state s_{t+1} of the environment. The critic current network calculates the value Q(s_t, a_t | ω) of the state s_t and action a_t. The critic target network produces the value Q′(s_{t+1}, a_{t+1} | ω′) from the subsequent state s_{t+1} and action a_{t+1}, which is used to calculate the target value y, as shown in Formula (14):

$$
y = R_t + \gamma\, Q'(s_{t+1}, a_{t+1} \mid \omega')
\tag{14}
$$

where R_t is the reward at time t, γ is a discount factor with 0 < γ < 1, and Q′(s_{t+1}, a_{t+1} | ω′) is the value generated from the subsequent state s_{t+1} and action a_{t+1}.
Meanwhile, the critic current network parameter ω is updated along the gradient of the mean-squared-error loss in Formula (15), and the actor current network parameter θ is updated along the policy gradient of Formula (16):

$$
J(\omega) = \frac{1}{m}\sum_{j=1}^{m}\left(y_j - Q(s_j, a_j \mid \omega)\right)^2
\tag{15}
$$

$$
J(\theta) = -\frac{1}{m}\sum_{j=1}^{m} Q\left(s_j, \pi_\theta(s_j) \mid \omega\right)
\tag{16}
$$

where m is the number of samples, y_j is the target value of the j-th sample, Q(s_j, a_j | ω) is the output of the critic current network for the j-th sample, and π_θ(·) is the output of the actor current network.
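Furthermore, the critic target network and actor target network parameters are updated softly by Equation (17):

$$
\omega' \leftarrow \tau\omega + (1-\tau)\,\omega', \qquad \theta' \leftarrow \tau\theta + (1-\tau)\,\theta'
\tag{17}
$$

where τ is the soft-update coefficient, which is generally small (τ ≪ 1).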
In addition, E is a termination flag that indicates whether the agent has reached a terminal state. If it has, the current iteration stops and a new episode (state sequence) begins; otherwise, the iteration of the current episode continues.
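The scalar arithmetic behind Formulas (14), (15), and (17) can be checked with the plain-Python sketch below, where network parameters are stood in for by flat lists of floats. Dropping the bootstrap term when the termination flag E is set is the common DDPG convention and an assumption here, and the function names are hypothetical; a real implementation would apply these updates to neural-network tensors.

```python
def td_target(reward: float, q_next: float, gamma: float, done: bool = False) -> float:
    """Formula (14): y = R_t + gamma * Q'(s_{t+1}, a_{t+1} | w'); no bootstrap if terminal."""
    return reward + (0.0 if done else gamma * q_next)


def critic_loss(targets: list, q_values: list) -> float:
    """Formula (15): mean squared error between targets y_j and critic outputs Q(s_j, a_j | w)."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(target: list, current: list, tau: float) -> list:
    """Formula (17): element-wise w' <- tau * w + (1 - tau) * w'."""
    return [tau * c + (1.0 - tau) * t for t, c in zip(target, current)]


y = td_target(reward=-1.0, q_next=10.0, gamma=0.9)
print(round(y, 6))                          # 8.0
print(critic_loss([1.0, 2.0], [0.0, 2.0]))  # 0.5
print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.01))
```

With a small τ the target parameters move only slightly toward the current ones at each step, which is what keeps the target value y from chasing a rapidly changing critic.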
In summary, the state, action, reward, next state, and termination flag {s, a, R, s′, E} form one sample unit, which is stored in the experience replay set D. Then, m sample units are drawn from D for training according to Formulas (14)-(17). Training runs for T episodes, each with T_m steps. The specific training process is shown in Figure 5.
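The experience replay set D can be sketched as a bounded queue of sample units with uniform minibatch sampling. The fixed capacity, the first-in-first-out eviction, and the uniform sampling are assumptions (the text does not specify them), and `ReplayBuffer` and `Transition` are hypothetical names.

```python
import random
from collections import deque, namedtuple

# One sample unit {s, a, R, s', E}.
Transition = namedtuple("Transition", ["s", "a", "r", "s_next", "done"])


class ReplayBuffer:
    """Experience replay set D with a fixed capacity and uniform minibatch sampling."""

    def __init__(self, capacity: int, seed: int = 0):
        self.buffer = deque(maxlen=capacity)  # the oldest units are dropped when full
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done) -> None:
        self.buffer.append(Transition(s, a, r, s_next, done))

    def sample(self, m: int) -> list:
        """Draw m distinct sample units for one update with Formulas (14)-(17)."""
        return self.rng.sample(list(self.buffer), m)

    def __len__(self) -> int:
        return len(self.buffer)


D = ReplayBuffer(capacity=3)
for k in range(5):
    D.push([float(k)], [0.0], -float(k), [float(k + 1)], False)
print(len(D), D.buffer[0].s)  # 3 [2.0]
batch = D.sample(2)
```

Sampling at random from past transitions breaks the strong correlation between consecutive control steps, which is the main reason DDPG trains from a replay set instead of from the latest transition alone.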

Design of DDPG VF Controller Structure
Considering the output power increment limits of the MT and EVs, a DDPG-based VF controller structure is proposed, as shown in Figure 6. The controller consists of two layers: a coordination layer and a control layer. The coordination layer provides the real-time regulation signal ΔA to the control layer according to the frequency deviation Δf, the voltage deviation ΔU, and the real-time output power boundary of the EV charging station; the output powers of the MT and EVs are then adjusted accordingly to quickly suppress the frequency and voltage deviations.
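The coordination layer's task of turning Δf, ΔU, and the real-time EV boundaries into a bounded regulation signal ΔA can be sketched as below. This is a hypothetical illustration: the state and action ordering follows the state and action sets used in this paper, while the clipping step, the MT limits, and the helper names (`build_state`, `clip_action`) are assumptions rather than the authors' implementation.

```python
import math


def build_state(d_f: float, d_u: float, p_bounds: tuple, q_bounds: tuple) -> list:
    """State vector: [dF, dU, P-_EV, P+_EV, Q-_EV, Q+_EV]."""
    return [d_f, d_u, *p_bounds, *q_bounds]


def clip(x: float, bounds: tuple) -> float:
    """Clamp x into the closed interval bounds = (lo, hi)."""
    lo, hi = bounds
    return min(max(x, lo), hi)


def clip_action(raw: list, p_ev_bounds: tuple, mt_p: tuple, mt_q: tuple,
                phi_lim: tuple) -> list:
    """Project the raw actor output [dA_P,MT, dA_Q,MT, dA_P,EV, dA_phi,EV] onto feasible ranges."""
    a_p_mt, a_q_mt, a_p_ev, a_phi = raw
    return [clip(a_p_mt, mt_p), clip(a_q_mt, mt_q),
            clip(a_p_ev, p_ev_bounds), clip(a_phi, phi_lim)]


s = build_state(0.12, -0.02, (-10.0, 10.0), (-5.0, 5.0))
a = clip_action([5.0, -3.0, 12.0, 2.0], p_ev_bounds=(-10.0, 10.0),
                mt_p=(-4.0, 4.0), mt_q=(-2.0, 2.0), phi_lim=(-math.pi / 2, math.pi / 2))
print(s)
print(a)
```

Feeding the EV boundaries into the state lets the agent learn how much EV support is currently available, while the clipping step guarantees that whatever the actor outputs, the dispatched increments stay within the real-time limits.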

Definition of Space and Reward Function
As mentioned above, the state set of the control system consists of the frequency deviation ΔF(t), the voltage deviation ΔU(t), and the real-time output power boundaries of the EV charging station P^±_EV(t) and Q^±_EV(t), so the state space S can be defined as:

$$
S = \left\{\Delta F(t),\ \Delta U(t),\ P^{\pm}_{EV}(t),\ Q^{\pm}_{EV}(t)\right\}
$$

The joint action set A of the DDPG controller, namely the controller output, is the real-time set of dispatch instructions for the active and reactive power output of the MT, the active power output of the EVs, and the power factor angle of the charger. Thus, the action space A can be defined as:

$$
A = \left\{\Delta A_{P,MT}(t),\ \Delta A_{Q,MT}(t),\ \Delta A_{P,EV}(t),\ \Delta A_{\varphi,EV}(t)\right\}
$$

Furthermore, China's power system safety regulations stipulate that the frequency during normal operation should be within 50 ± 0.2 Hz and that the voltage deviation should be within 5%. On this basis, an adjustment dead zone is considered, and the discrete set of real-…

Meanwhile, the control objectives of this paper are: (1) restore the frequency to its rated value; (2) regulate the voltage back to its optimal state. Accordingly, a comprehensive reward function composed of two local reward functions is set up to coordinate frequency recovery and voltage adjustment. In this reward function, R is the global reward, r_f is the frequency reward, r_u is the voltage reward, µ_1, µ_2, µ_3, and µ_4 are the weights of the reward in each control region of the frequency penalty term r_f, and δ_1, δ_2, δ_3, and δ_4 are the corresponding weights for the voltage control regions.
The frequency is controlled through r_f. When Δf lies in the adjustment dead zone [−0.05, 0.05] Hz, the frequency meets the minimum-error requirement of normal operation, so the DDPG controller receives the maximum reward of 0. When |Δf| lies in the normal control regions (0.05, 0.10) and (0.10, 0.15) Hz, the auxiliary control region (0.15, 0.2) Hz, or the emergency control region (0.2, +∞) Hz, the controller receives the corresponding negative reward, that is, a penalty. Similarly, the voltage is regulated through r_u: when ΔU (in per unit) lies in the dead zone [−0.01, 0.01], the controller receives the maximum reward of 0, and when |ΔU| lies in the normal control regions (0.01, 0.02) and (0.02, 0.03), the auxiliary control region (0.03, 0.05), or the emergency control region (0.05, 1), it receives the corresponding penalty.
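The piecewise structure of r_f and r_u described above can be sketched in Python as follows. The region edges are taken from the text, but the penalty shape (weight times |deviation|), the assignment of region endpoints, the combination R = r_f + r_u, and the example weights `MU` and `DELTA` are all assumptions, since the exact reward expressions are not reproduced here.

```python
# Region edges from the text: dead zone, two normal-control regions,
# auxiliary region; anything beyond the last edge is the emergency region.
F_EDGES = (0.05, 0.10, 0.15, 0.20)   # |df| in Hz
U_EDGES = (0.01, 0.02, 0.03, 0.05)   # |dU| in per unit


def region_penalty(dev: float, edges: tuple, weights: tuple) -> float:
    """0 inside the dead zone; otherwise -weight * |dev| for the region |dev| falls in (assumed shape)."""
    x = abs(dev)
    if x <= edges[0]:
        return 0.0
    for k in range(1, len(edges)):
        if x <= edges[k]:
            return -weights[k - 1] * x
    return -weights[-1] * x  # emergency region


def global_reward(d_f: float, d_u: float, mu: tuple, delta: tuple) -> float:
    """Assumed combination R = r_f + r_u of the frequency and voltage rewards."""
    r_f = region_penalty(d_f, F_EDGES, mu)      # weights mu_1..mu_4
    r_u = region_penalty(d_u, U_EDGES, delta)   # weights delta_1..delta_4
    return r_f + r_u


MU = (1.0, 2.0, 5.0, 10.0)       # hypothetical weights
DELTA = (1.0, 2.0, 5.0, 10.0)    # hypothetical weights
print(global_reward(0.03, 0.005, MU, DELTA))  # 0.0 (both inside the dead zones)
print(global_reward(0.12, 0.0, MU, DELTA))    # second normal-control frequency region
```

Increasing weights toward the emergency regions make large deviations disproportionately costly, so the agent is pushed hardest to act when the system is furthest from its limits.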
When determining these parameters, note that the magnitudes of the reward values affect both convergence and learning speed. The values are therefore tuned through simulation tests on the case study, as discussed later.
In summary, the state space and reward function designed in this paper allow voltage and frequency to be adjusted simultaneously: while restoring the frequency, the controller also accounts for whether the voltage exceeds its limits, and while adjusting the voltage, it also accounts for whether the frequency deviates from its rated value. This coordination improves the overall stability of the microgrid.

The Selection of Hyperparameter
In deep reinforcement learning (DRL), the agent must be given a well-chosen set of hyperparameters to improve the performance and effectiveness of learning [24].
First, the larger the discount factor γ, the more weight the agent places on future rewards, so it is more willing to give up immediate reward in pursuit of long-term return. However, if γ is too large, agent training may fail to converge. The larger the learning rate α, the faster the agent converges but the less stable training becomes; the smaller α is, the more stable training is, but the slower the agent converges. Convergence speed should therefore be increased only as far as training still converges. The network structure can be considered from two aspects: network type and network size. The choice of network type depends largely on the state space. The state of the control system in this paper consists of scalar quantities such as the frequency and voltage deviations, which form a one-dimensional vector, so fully connected layers are adequate for representing the policy. The network size, which comprises the number of layers h and the number of neurons per layer u, largely determines the generalization ability of the neural network.
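To make the role of γ concrete, the sketch below computes the standard discounted return G = Σ_k γ^k r_k for a short reward sequence. This is the generic reinforcement learning definition rather than code from the paper, and `discounted_return` is a hypothetical name.

```python
def discounted_return(rewards, gamma):
    """Return G = sum_k gamma**k * r_k, accumulated from the last reward backwards."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # fold each earlier reward in front of the discounted tail
    return g


# Three identical penalties: a larger gamma weights the later ones more heavily.
penalties = [-1.0, -1.0, -1.0]
short_sighted = discounted_return(penalties, 0.5)  # -1 - 0.5 - 0.25
far_sighted = discounted_return(penalties, 0.9)    # -1 - 0.9 - 0.81
```

With the larger γ, penalties that arrive later still carry most of their weight, which is why the agent becomes more willing to trade immediate reward for long-term frequency and voltage recovery.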
The specific values of γ, α, h, and u need to be selected according to the case study.

Summary of Control Strategy
In summary, the control strategy of this paper is carried out in the following steps:

1.
First, define the state set of the control system as ∆F(t), ∆U(t), P_EV^±(t), and Q_EV^±(t), and define the action space as ∆A_{P,MT}(t), ∆A_{P,EV}(t), ∆A_{Q,MT}(t), and ∆A_{ϕ,EV}(t).

2.
Second, tune the parameters on the actual case study to obtain the values of the reward function coefficients and the hyperparameters.

3.
Third, train the agent following the process in Figure 5 to obtain the optimal value function, i.e., the Q network Q_ϕ(s,a).

4.
Finally, apply disturbances to the islanded microgrid system under different cases; the agent generates corresponding actions in response to the disturbances and adjusts the output of each unit to keep the frequency and voltage of the islanded microgrid system balanced.
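In step 4, the EV active power instruction has to stay within the real-time boundary P_EV^±(t) that appears in the state. The paper does not state how the actor output is mapped onto that boundary, so this Python sketch assumes a tanh-bounded actor output a ∈ [−1, 1] that is scaled linearly onto [P_EV^−(t), P_EV^+(t)] and then clamped; `scale_ev_action` is a hypothetical helper and the per-unit values in the usage lines are illustrative.

```python
def scale_ev_action(a, p_lower, p_upper):
    """Map a normalized actor output a in [-1, 1] onto [p_lower, p_upper].

    Assumption: linear scaling followed by clamping, so an out-of-range actor
    output can never push the EV station beyond its real-time boundary.
    """
    if p_lower > p_upper:
        raise ValueError("lower boundary exceeds upper boundary")
    value = p_lower + (a + 1.0) * 0.5 * (p_upper - p_lower)
    return min(max(value, p_lower), p_upper)


# Illustrative boundary of the EV active power increment at one time step (p.u.).
mid_point = scale_ev_action(0.0, -0.2, 0.3)
saturated = scale_ev_action(5.0, -0.2, 0.3)
```

Clamping after scaling keeps the dispatch feasible even when the boundary shrinks between steps, which matters because the boundary here changes randomly with user behavior.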

Simulation Results
To evaluate the control effect of the above strategy, the coupled islanded microgrid system shown in Figure 7 is built, and the equipment parameters are listed in Table 1. The case studies in this paper are verified through simulation experiments. The computing platform is a PC with an i7-1165G7 @ 2.80 GHz CPU and 16 GB RAM, running Windows 10 Professional and MATLAB R2021a.
Electronics 2022, 11, x FOR PEER REVIEW

In the microgrid, there is an MT with a capacity of 40 kW, a WT with a capacity of 20 kW, EV station 1 with a capacity of 16 kW, EV station 2 with a capacity of 14 kW, and 60 kW of ordinary load. This paper assumes that the microgrid is initially stable, so when there is no external disturbance, the power outputs of the MT, EV stations, and WT are always balanced with the conventional loads. Therefore, in the following case studies, only the per-unit values of the power fluctuations of the MT, EV stations, WT, and load need to be considered.

Pre-Learning Stage
Before the controller is used, it must undergo a random trial-and-error learning process, called the pre-learning stage. At the beginning of pre-learning, the controller has accumulated no experience and has no intelligent control ability [25]. Only after experiencing a wide variety of states and actions can it obtain the optimal value function, i.e., the Q network Q_ϕ(s,a). Therefore, wind and load disturbances composed of superimposed functions of various types and amplitudes are set up to train the controller repeatedly. Meanwhile, based on the output capacity change data of the EV charging stations, a boundary function of the output power increment that changes randomly over time is set. Taking the active power disturbance and the active power output boundary of the EVs as examples, the random disturbance in one training run is shown in Figure 8.

Through extensive simulation studies, µ1, µ2, µ3, and µ4 are set to 1, 5, 10, and 20, respectively; δ1, δ2, δ3, and δ4 are set to 5, 20, 50, and 100, respectively; and α and γ are set to 0.01 and 0.09, respectively. The number of learning iterations of the DDPG controller is set to 500, each with 500 steps, and the step length is 0.1 s. Six groups of parameters (h, u) are then tested for convergence, and the learning results are shown in Table 2. The reward value of the system at convergence is highest when h = 5 and u = 50, and the corresponding pre-learning process of the agent is shown in Figure 9.
The agent basically converges after 80 iterations, and the system judges that learning is complete and stops training after 248 iterations. In this case, the average reward is −21.096 and the final reward is 0.65307, which shows that the controller is ready for the subsequent simulations.
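As a rough illustration of the selected network size (h = 5, u = 50), the following pure-Python sketch builds a fully connected actor with h hidden layers of u neurons and counts its parameters. Several details are assumptions not given in the paper: h is read as the number of hidden layers, the hidden activation is ReLU with a tanh output layer, the weights are initialized uniformly in ±1/√n_in, the state is taken to have six entries (∆F, ∆U, and upper and lower bounds of P_EV and Q_EV), and the action has the four entries of A. All function names are hypothetical.

```python
import math
import random

STATE_DIM, ACTION_DIM = 6, 4  # assumed state layout; four actions as in A
H, U = 5, 50                  # hidden layers and neurons per layer (Table 2 choice)
SIZES = [STATE_DIM] + [U] * H + [ACTION_DIM]


def init_mlp(sizes, seed=0):
    """Create (weights, biases) for each fully connected layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        w = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((w, b))
    return layers


def actor_forward(layers, state):
    """ReLU hidden layers, tanh output so every action lies in [-1, 1]."""
    x = list(state)
    for i, (w, b) in enumerate(layers):
        z = [sum(wij * xj for wij, xj in zip(row, x)) + bj for row, bj in zip(w, b)]
        is_last = i == len(layers) - 1
        x = [math.tanh(v) for v in z] if is_last else [max(0.0, v) for v in z]
    return x


def param_count(sizes):
    """Weights plus biases of all fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))


action = actor_forward(init_mlp(SIZES), [0.01, 0.002, 0.3, -0.3, 0.1, -0.1])
```

Under these assumptions the actor has 10,754 trainable parameters, small enough that a fully connected network is a reasonable fit for the low-dimensional state.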

The Implementation of Constraint Conditions in the EV Model
To verify the implementation of the constraint conditions in the EV model, this paper selects several typical SOC trajectories of individual EVs as examples, as shown in Figures 10 and 11. To protect battery life, the initial SOC is set between SOC_rmin = 0.2 and SOC_rmax = 0.8.
The first situation in Figure 10 shows that, when SOC < SOC_rmin, the EV is forced into the charging state; only when SOC > SOC_rmin can it participate in system regulation. The second situation in Figure 10 shows that, when the EV is close to its departure time and SOC < SOC_m, it switches to forced charging so that its SOC reaches the expected value SOC_m when it leaves the charging station. Overall, the SOC of the EVs participating in microgrid regulation, shown in Figure 11, remains within the constraint range.
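The two SOC rules above can be expressed as a small decision function. This Python sketch is illustrative only: the paper does not define when an EV counts as "close to the leaving time", so it assumes forced charging starts once the time needed to reach SOC_m at maximum charging power is no less than the time remaining before departure. The `EV` fields other than the SOC values, and the function `ev_mode`, are hypothetical.

```python
from dataclasses import dataclass

SOC_RMIN = 0.2  # lower SOC limit from the text


@dataclass
class EV:
    soc: float           # current state of charge, 0..1
    soc_m: float         # expected SOC at departure (SOC_m)
    t_leave: float       # departure time in hours (hypothetical field)
    capacity_kwh: float  # battery capacity (hypothetical field)
    p_max_kw: float      # maximum charging power (hypothetical field)


def ev_mode(ev: EV, t: float) -> str:
    """Return 'forced_charge' or 'regulating' for one EV at time t (hours)."""
    # Rule 1: below SOC_rmin the EV must charge and cannot regulate.
    if ev.soc < SOC_RMIN:
        return "forced_charge"
    # Rule 2 (assumed trigger): force charging when reaching SOC_m at full
    # power would take at least as long as the time left before departure.
    if ev.soc < ev.soc_m:
        hours_needed = (ev.soc_m - ev.soc) * ev.capacity_kwh / ev.p_max_kw
        if hours_needed >= ev.t_leave - t:
            return "forced_charge"
    return "regulating"


ev = EV(soc=0.5, soc_m=0.8, t_leave=10.0, capacity_kwh=40.0, p_max_kw=7.0)
early = ev_mode(ev, 0.0)  # plenty of time left: may regulate
late = ev_mode(ev, 9.0)   # about 1.7 h still needed, 1 h left: forced charging
```

Aggregating `ev_mode` over all connected EVs is one way a cluster model could decide which vehicles contribute to the station's regulating boundary at each step.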

Case Study
After the pre-learning stage and the verification of the EV SOC constraints are complete, the case study can be simulated under different operating scenarios. To evaluate the DDPG controller proposed in this paper, a traditional PID controller and an R(λ) controller are applied to the same scenarios, and the corresponding controller parameters are shown in Table 3.

First, a wind power disturbance, which mainly injects active power disturbances into the grid, is applied to the islanded microgrid system. To compare the regulation speed of the controllers, the wind power disturbance ends after 43 s. The disturbance setting is shown in Figure 12. Because this case involves no reactive power fluctuation, the impact of voltage fluctuation is not considered here. The variation of the frequency deviation under the wind power disturbance is shown in Figure 13. Based on the simulation results, |∆f| is taken as the evaluation metric, the threshold of the frequency deviation excellence rate is set to 2 × 10⁻⁴ Hz, and T_recover is defined as the time taken for |∆f| to recover to 5 × 10⁻⁵ Hz after the wind power disturbance ends.
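The two evaluation metrics can be computed from a sampled deviation trace as in the Python sketch below. The paper does not give explicit formulas, so this assumes that the excellence rate is the fraction of samples whose |∆f| is at or below the threshold (2 × 10⁻⁴ Hz here) and that T_recover runs from the end of the disturbance to the first sample after which |∆f| stays within 5 × 10⁻⁵ Hz. The function names and the sample trace are hypothetical. The same functions apply to |∆U| with the 0.01 p.u. and 0.002 p.u. thresholds of the load case, where the joint T_recover can be taken as the later of the two recovery times.

```python
def excellence_rate(devs, threshold):
    """Fraction of samples whose absolute deviation is within threshold (assumed definition)."""
    if not devs:
        raise ValueError("empty trace")
    return sum(1 for d in devs if abs(d) <= threshold) / len(devs)


def recovery_time(times, devs, t_end, band):
    """Time from t_end until |dev| enters band and stays there; None if it never does."""
    recovered_at = None
    for t, d in zip(times, devs):
        if t < t_end:
            continue  # only samples after the disturbance ends count
        if abs(d) <= band:
            if recovered_at is None:
                recovered_at = t  # candidate recovery instant
        else:
            recovered_at = None  # left the band again: reset the candidate
    return None if recovered_at is None else recovered_at - t_end


# Hypothetical 1 s samples of the frequency deviation (Hz) around t = 43 s.
times = [40, 41, 42, 43, 44, 45, 46]
delta_f = [3e-4, 1e-4, 2e-4, 1e-4, 6e-5, 4e-5, 1e-5]
rate = excellence_rate(delta_f, 2e-4)
t_rec = recovery_time(times, delta_f, 43.0, 5e-5)
```

Requiring the trace to stay inside the band, rather than merely touching it, avoids crediting a controller that crosses the threshold while still oscillating.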
The results of the control test under wind disturbance are shown in Table 4.
Figure 13. Performance of frequency control under wind power disturbance.
It can be seen from Figure 13 and Table 4 that, compared with the PID controller, the DDPG and R(λ) controllers, which have online learning and experience replay capabilities, deal more effectively with highly random disturbances. Under the wind disturbance, the DDPG controller limits the frequency fluctuation of the islanded microgrid to within 2 × 10⁻⁴ Hz, and the excellence rate reaches 98%, which is significantly better than the traditional controller. Judged purely on frequency control, both the DDPG and R(λ) controllers deliver better control, smaller frequency fluctuation amplitudes, and faster regulation than the traditional controller, and the DDPG controller regulates much faster than the R(λ) controller.
Furthermore, the power variation of each unit in the islanded microgrid under the DDPG controller is shown in Figure 14. When the system is disturbed, the MT undertakes most of the frequency regulation, while the output power of the EV charging stations is also significant. In addition, once their limits are reached, the power variations of the two charging stations differ.
The DDPG controller is compared with the traditional PID and R(λ) controllers, and the frequency and voltage fluctuations are shown in Figures 16 and 17. As in Case 1, |∆f| and |∆U| are taken as the evaluation metrics, and the excellence-rate thresholds are set to 2 × 10⁻⁴ Hz for |∆f| and 0.01 p.u. for |∆U|. T_recover is defined as the time taken for |∆f| to recover to 5 × 10⁻⁵ Hz and for |∆U| to recover to 0.002 p.u. after the load power disturbance stops changing. The statistical results of the control test under load disturbance are shown in Tables 5 and 6.
It can be seen from Figures 16 and 17 and Tables 5 and 6 that, when the load changes, the DDPG controller, compared with the PI and R(λ) controllers, keeps the frequency deviation of the microgrid within ±1 × 10⁻³ Hz, and the voltage deviation remains close to 0, well within the power quality limits of the grid. In addition, compared with the R(λ) controller, the DDPG controller coordinates the frequency recovery and voltage adjustment of the islanded microgrid and thus meets the VF control requirements simultaneously, giving it superior dynamic control characteristics.
Furthermore, the power variation of each unit is shown in Figure 18. The MT serves as the master source that keeps the voltage and frequency amplitudes of the microgrid stable, while EV1 and EV2, as slave sources, are mainly responsible for active power regulation and also participate in reactive power regulation. In addition, because of the randomness of user behavior, the output power boundaries of the EV charging stations vary randomly, producing a pronounced jagged profile.

Conclusions
The stability of islanded microgrids is strongly affected by random power sources, and frequency and voltage are difficult to control jointly. To address this, a VF control strategy for islanded microgrids with EVs is proposed in this paper. Considering the randomness of charging behavior, an islanded microgrid system including an MT, a WT, EV stations, and loads is established, and a DDPG-based VF coordinated control strategy is developed on this system. The simulation results show that:

1.
Compared with the PID controller, the DDPG controller, with its online learning and experience replay capabilities, deals more effectively with highly random disturbances. Under the wind disturbance, the DDPG controller limits the frequency fluctuation of the islanded microgrid to within 2 × 10⁻⁴ Hz, and the excellence rate reaches 98%, which is significantly better than the traditional controller.

2.
Compared with the R(λ) controller, the DDPG controller in this paper coordinates the frequency recovery and voltage adjustment of the islanded microgrid and meets the VF control requirements simultaneously, which makes it more suitable for the stable control of the microgrid. When the load changes, the DDPG controller keeps the frequency deviation of the microgrid within ±1 × 10⁻³ Hz, and the voltage deviation remains close to 0.

3.
The EV charging stations have small inertia and fast regulation speed in microgrid control, so they can play an important role in VF regulation.

4.
The constraint conditions in the EV model are implemented effectively: each individual EV determines whether to participate in microgrid regulation according to its SOC.
For microgrid systems with more complex structures and larger scales, multi-microgrid interconnection technology needs to be considered. In addition, multi-agent algorithms such as MA-DDPG, COMA, and CommNet will be applied to the control of multi-microgrids. Future work will focus on in-depth research in these directions and will add corresponding hardware circuit experiments or semi-physical simulation experiments.