Adaptive Control for Virtual Synchronous Generator Parameters Based on Soft Actor Critic

This paper introduces a model-free optimization method based on reinforcement learning (RL) aimed at resolving the active power and frequency oscillations present in a traditional virtual synchronous generator (VSG). The RL agent takes the active power and frequency response of the VSG as state inputs and generates actions that adjust the virtual inertia and damping coefficient for an optimal response. Distinctively, this study incorporates a settling-time term into the reward function, alongside power and frequency deviations, to avoid prolonged system transients caused by over-optimization. The soft actor critic (SAC) algorithm is used to determine the optimal strategy. SAC is model-free, converges quickly, and avoids policy overestimation bias, thus achieving superior convergence results. Finally, the proposed method is validated through MATLAB/Simulink simulation. Compared with other approaches, it more effectively suppresses oscillations in active power and frequency and significantly reduces the settling time.


Introduction
As environmental pollution and energy shortages have become increasingly serious, renewable energy generation technologies represented by wind power and photovoltaics have garnered significant attention [1]. Renewable energy sources are generally connected to the power grid through power electronic converters, which lack the inertia and damping characteristics inherent in a traditional synchronous generator (SG). Consequently, the large-scale integration of renewable energy poses significant challenges to the operation, security, and stability of the power grid [2][3][4]. To address this problem, some researchers have introduced virtual synchronous generator (VSG) control technology. The VSG emulates the rotor motion equation of an SG, providing the inverter with inertia characteristics that can enhance the stability of the power grid [5]. Nevertheless, although the VSG introduces inertia, it also inherits the low-frequency oscillation problem observed in the SG [6].
It is worth noting that the virtual synchronous generator is a control algorithm executed in software, and its parameters are flexible and adjustable. Consequently, some researchers enhance the dynamic response of the VSG by adaptively adjusting parameters such as the virtual inertia or the damping coefficient. In [7], an adaptive virtual inertia control strategy based on improved bang-bang control is proposed, which effectively suppresses frequency oscillation; however, it allows the control signal to change only within a constrained range of discrete values, resulting in significant jitter under interference. Ref. [8] analyzes the impact of virtual inertia on the dynamic response of the VSG and designs a continuous adaptive function of virtual inertia to make the control signal smoother, but it does not consider the damping coefficient. To achieve better optimization results, ref. [9] designs dual adaptive functions for the virtual inertia and damping coefficient. However, linear functions cannot accurately express the changing rules of the parameters. Ref. [10] uses a fuzzy controller to make up for the shortcomings of linear functions. Ref. [11] further adds adaptive control of the droop gain, which provides better state of charge (SOC) maintenance for the energy storage system and avoids additional oscillations under low power consumption. The above methods are all designed based on expert experience and therefore lack objectivity and universal applicability. As the external environment and technology change rapidly, expert experience may quickly become outdated, and updating the knowledge and rules in expert systems is a complex and time-consuming process.
To avoid relying too heavily on expert knowledge, some researchers use heuristic algorithms to optimize the parameters of the VSG. For example, in [12], Particle Swarm Optimization (PSO) is used to optimize the parameters and virtual impedance of the VSG, which improves the stability of the microgrid and reduces the current overshoot. Additionally, ref. [13] employs a multi-objective genetic algorithm to select optimal parameters of the VSG, leading to reduced maximum frequency deviation and shortened settling time. In the work of [14], a proportional-integral (PI) controller is designed in the virtual inertia control loop, and the authors use the Manta Ray Foraging Optimization (MRFO) algorithm to tune the PI controller. Moreover, ref. [15] takes into account the uncertainty of the VSG dynamics and system inertia and uses the whale optimization algorithm to improve the virtual inertia loop. In general, heuristic algorithms can not only achieve different optimization effects according to their objective functions, but can also flexibly handle various constraints. Compared with methods based on expert experience, they have wider applicability. Nonetheless, heuristics still have their limitations. Their iterative process requires a systematic mathematical model, but power systems usually have complex topologies and many uncertain factors, so establishing an accurate power system model is a very arduous task. Furthermore, heuristic algorithms are resource-intensive, requiring significant time and memory, which demands advanced hardware and makes it challenging to satisfy the real-time demands of power system control.
Reinforcement learning (RL) is a machine learning paradigm in which an agent iteratively interacts with its environment, making decisions based on feedback from past actions, with the ultimate goal of maximizing its cumulative reward over time [16][17][18]. In [19], the work employs a Q-learning algorithm to fine-tune the controller parameters of the VSG during the frequency control process. However, as a method with discrete states and actions, Q-learning relies on a Q-table to store the Q-value for every possible state-action combination. Consequently, as the number of state-action pairs grows, the efficiency of Q-learning diminishes. With the evolution of deep learning, the fusion of RL with neural networks, known as deep reinforcement learning (DRL), has delivered outstanding results. For example, the Deep Q-Network (DQN) adopts neural networks to approximate the Q-value, enabling it to manage complex, high-dimensional environments and significantly lowering the memory demands associated with Q-learning [20]. Proximal Policy Optimization (PPO) is a policy-based RL algorithm that optimizes the policy itself and can output a continuous distribution of action probabilities, making it particularly suitable for problems with continuous action spaces [21]. As for the VSG, ref. [22] introduced an adaptive tuning approach for the VSG's parameters using the DQN algorithm. However, the discrete action set of the DQN algorithm is a limitation for adjusting VSG parameters within a continuous value range. In [23], the authors use the Deep Deterministic Policy Gradient (DDPG) algorithm to adjust the virtual inertia and damping coefficient of an islanded VSG. DDPG, an RL algorithm based on the actor-critic framework, operates in continuous state and action spaces [24][25][26]. Nevertheless, it is well documented that DDPG is susceptible to overestimation bias, potentially resulting in suboptimal control policies. Twin Delayed Deep Deterministic Policy Gradient (TD3) is an extension of DDPG that reduces overestimation bias and enhances the stability and efficiency of learning by using twin Q-networks and delayed policy updates [27].
Based on the above inspiration, this paper investigates the optimal control of the VSG in a model-free context and employs the Soft Actor Critic (SAC) algorithm to identify the optimal strategy. SAC incorporates entropy regularization, which endows it with superior exploration capability, enhanced sample efficiency, and a more stable and faster training process compared to DDPG. Considering the over-optimization issue in traditional VSG optimization methods, a settling-time penalty is incorporated into the reward function to prevent excessive damping of frequency fluctuations, which could otherwise prolong adjustment periods and compromise system stability. The main contributions of this work are as follows:
1. The optimal adaptive control problem of the VSG is transformed into an RL task, obviating the need for intricate mathematical models and expert knowledge. The state-of-the-art SAC algorithm is then employed to train the agent, enabling it to discover the optimal strategy.
2. The traditional optimization objective of the VSG focuses solely on mitigating active power and frequency fluctuations, overlooking the system's transient response time. In the reward function design, this paper introduces a settling-time component, motivating the agent to further refine the strategy for enhanced system performance.
The rest of this paper is organized as follows. Section 2 delves into the fundamental principles of the VSG, formulates the optimization issue as a multi-objective function, and elucidates the principles behind parameter selection. Section 3 converts the optimization problem of the VSG into an RL task, detailing the design of the reward function and the configuration of the agent. Section 4 carries out simulation experiments in MATLAB/Simulink to validate the effectiveness of the proposed method. Finally, Section 5 concludes this paper, outlining prospects for future work.

VSG Control Principle
Figure 1 shows the schematic diagram of VSG control with RL, where R f , L f and C f represent the resistance, inductance and capacitance of the filter circuit; R g and L g represent the resistance and inductance of the output line; U dc is the DC terminal voltage; and U and I are the inverter output voltage and current. As shown in Figure 1, the VSG operates as an outer-loop control, primarily comprising an active power control loop (APL) and a reactive power control loop (RPL). The APL controls active power and frequency by simulating the primary frequency regulation characteristics of synchronous generators; the RPL employs droop control or a PI controller to regulate reactive power and voltage; and the voltage and current control loop employs a dual closed-loop PI controller to achieve zero steady-state error tracking of the inverter's output voltage and current. The core function of the VSG lies in the APL, which provides inertia and damping to the inverter by simulating the rotor motion equation of the SG [28,29]. Therefore, the purpose of this paper is to improve the APL. The rotor motion equation is described as follows:

J dω/dt = (P ref − P e )/ω 0 − D(ω − ω 0 ) (1)

where P ref and P e represent the reference active power and output active power, respectively; ω and ω 0 represent the rotor angular velocity and rated angular velocity, respectively; and J and D are the virtual inertia and damping coefficient. When the circuit impedance is inductive (or a virtual impedance is employed otherwise), the APL and RPL are approximately decoupled. In accordance with the power flow calculation equation, the active power under these conditions is given by

P e = (3E 0 U g /Z) sin δ (2)

where E 0 and U g represent the effective values of the inverter and grid phase voltages, respectively; Z is the total effective reactance of the line; and δ is the power angle of the VSG, which can be given by

δ = ∫ (ω − ω g ) dt (3)

where ω g is the grid frequency. Typically, the value of the power angle δ is small. For analytical simplification, it is
approximated by setting sin δ ≈ δ. To conduct a more in-depth analysis of the response characteristics of the VSG and examine the impact of the virtual inertia and damping coefficient on system stability, a small-signal model can be constructed based on Equations (1)-(3):

Jω 0 (dΔω/dt) = ΔP ref − ΔP e − Dω 0 Δω, ΔP e = KΔδ, dΔδ/dt = Δω (4)

where Δω, ΔP ref , ΔP e and Δδ represent the changes in VSG angular frequency, reference active power, output active power and power angle, respectively, and K = 3E 0 U g /Z. According to Equation (4), the transfer functions of the active power outer loop can be expressed as follows:

G P (s) = ΔP e (s)/ΔP ref (s) = K/(Jω 0 s² + Dω 0 s + K) (5)

G ω (s) = Δω(s)/ΔP ref (s) = s/(Jω 0 s² + Dω 0 s + K) (6)

where G P (s) and G ω (s) represent the transfer functions of the active power response and frequency response, respectively.
According to the transfer functions (5) and (6), the dynamic response of the VSG is affected by the virtual inertia and the damping coefficient. Figures 2 and 3 plot the response curves of the VSG frequency and output active power under different virtual inertia and damping coefficients, respectively. As depicted in Figure 2, with a fixed damping coefficient, a reduction in virtual inertia leads to a decrease in active power overshoot and a shorter settling time. However, this reduction is accompanied by an increase in frequency deviation. According to Figure 3, when the virtual inertia is fixed, as the damping coefficient increases, the active power overshoot decreases, the settling time shortens, and the frequency deviation also becomes smaller. Nevertheless, a progressive increase in the damping coefficient may transition the system from an under-damped to an over-damped state, thereby prolonging the settling time and degrading the system's stability. The analysis indicates that the virtual inertia governs the oscillation frequency during the dynamic response of the VSG, while the damping coefficient determines the attenuation rate of the oscillation process. Adjusting either the virtual inertia or the damping coefficient alone cannot comprehensively optimize active power, frequency, and the overall system response speed. Consequently, this paper employs RL algorithms to enhance VSG control and pursue the optimal control strategy. The entire control framework is depicted in Figure 1. In each control cycle, the RL agent receives the state information and reward signals from the environment and generates actions via the actor network. The actions set the virtual inertia J t and damping coefficient D t at the current instant. During training, the agent continuously interacts with the environment and improves the strategy through the rewards obtained, thereby finding the optimal strategy.
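The trade-off described above can be reproduced numerically. The sketch below is not the paper's code; the values of K, J and D are illustrative assumptions. It integrates the small-signal active power loop with forward Euler and compares the overshoot for two damping coefficients:

```python
import math

def power_step_response(J, D, K=3000.0, w0=2*math.pi*50, dt=1e-4, T=2.0):
    """Euler-integrate the small-signal APL for a unit step in P_ref;
    returns the peak overshoot of the output active power P_e."""
    delta, d_omega, peak = 0.0, 0.0, 0.0
    for _ in range(int(T/dt)):
        P_e = K*delta                                    # linearized power-angle relation
        d_omega += dt*(1.0 - P_e - D*w0*d_omega)/(J*w0)  # rotor motion equation
        delta += dt*d_omega                              # power angle integrates frequency
        peak = max(peak, P_e)
    return peak - 1.0

light = power_step_response(J=0.5, D=1.0)   # small damping: pronounced overshoot
heavy = power_step_response(J=0.5, D=3.0)   # larger damping: overshoot shrinks
print(light > heavy)
```

With the inertia fixed, the heavier damping coefficient yields the smaller power overshoot, matching the behavior reported for Figure 3.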

Objective Function
This paper describes the oscillation suppression process of the VSG as a multi-objective optimization problem. According to the optimization objective, the objective function can be described as follows: where N is the total number of steps; f p , f ω and f T represent the active power deviation cost, frequency deviation cost and settling-time cost, and τ 1 , τ 2 and τ 3 are the corresponding weight coefficients, respectively; and T s represents the settling time. f p and f ω contribute to the suppression of active power and frequency oscillations; however, excessive oscillation suppression might induce sluggish changes in power and frequency, resulting in longer settling times and impacting system stability. To address this concern, we introduce a settling-time cost f T into the objective function.
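As a concrete illustration, the weighted cost above can be sketched as follows. This is a hypothetical implementation: the quadratic deviation costs and the weight values are assumptions, not the paper's exact definitions.

```python
def vsg_objective(p_dev, w_dev, T_s, tau1=1.0, tau2=1.0, tau3=0.1):
    """Weighted multi-objective cost: active power deviation, frequency
    deviation, and settling time. p_dev/w_dev are per-step deviations."""
    f_p = sum(e**2 for e in p_dev)   # active power deviation cost
    f_w = sum(e**2 for e in w_dev)   # frequency deviation cost
    f_T = T_s                        # settling time cost
    return tau1*f_p + tau2*f_w + tau3*f_T

print(vsg_objective([0.5, 0.1], [0.02, 0.01], T_s=1.2))
```

Raising tau3 trades oscillation suppression for a faster settling time, which is exactly the tension the settling-time cost f T is introduced to balance.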

Constraint Conditions
The active power loop of the VSG, as described by Equation (5), conforms to the characteristics of a typical second-order system. The natural oscillation angular frequency ω n and the damping ratio ζ can be calculated as follows:

ω n = √(K/(Jω 0 )), ζ = (D/2)√(ω 0 /(JK)) (8)

where K = 3E 0 U g /Z. To guarantee the stability of the system, it is essential to restrict the damping ratio of the second-order system. For an under-damped second-order system, the relationship between the overshoot σ% and the damping ratio ζ is expressed as follows:

σ% = e^(−πζ/√(1−ζ²)) × 100% (9)

As Equation (9) shows, an excessively small damping ratio induces substantial power overshoot, potentially damaging power electronic devices. Conversely, an excessively large damping ratio may extend the system settling time, which is detrimental to system stability. Therefore, the damping ratio should be neither too large nor too small. In this paper, the chosen range for the damping ratio is 0.6 ≤ ζ ≤ 1.2. In accordance with the damping ratio range, the constraints for J and D can be determined as follows:

1.2√(JK/ω 0 ) ≤ D ≤ 2.4√(JK/ω 0 ) (10)

The relationship between the damping coefficient, frequency deviation and power deviation is given by

D = ΔP max /(ω 0 Δω max ) (11)

In adherence to the EN50438 standard [30] for renewable energy grid connection, a 1 Hz frequency change in the grid corresponds to a change in the output active power of the inverter in the range of 40% to 100% of the rated capacity [31]. This study sets the rated capacity of the inverter at 100 kW, with a stipulated maximum allowable frequency deviation of 1 Hz. According to Equation (11), the range of values for the damping coefficient D can be obtained as follows:

20 ≤ D ≤ 50 (12)

To meet the stability margin of the system, the characteristic roots of the system must not approach the imaginary axis too closely. Therefore, the real part of the system characteristic roots must adhere to the following constraint:

Re(x) = −D/(2J) ≤ Re(x) max (13)

In consideration of the grid capacity connected to the inverter, the range of values
for Re(x) max is specified as [−10, −5]. This study adopts Re(x) max = −10. According to Equations (10), (12) and (13), the range of values for J and D is graphically depicted in a two-dimensional plane, as shown in Figure 4. The shaded area within the plot indicates the feasible parameter range.
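The joint constraints can be checked programmatically. In the sketch below, the synchronizing power coefficient K and the numeric bounds are illustrative assumptions chosen so that the three conditions overlap; the paper's actual K depends on the line parameters.

```python
import math

def params_feasible(J, D, K=4.0e4, w0=2*math.pi*50,
                    zeta_lo=0.6, zeta_hi=1.2,
                    D_lo=20.0, D_hi=50.0, re_max=-10.0):
    """True if (J, D) satisfies the damping-ratio range, the damping
    coefficient range, and the stability-margin constraint together."""
    zeta = 0.5*D*math.sqrt(w0/(J*K))   # damping ratio of the APL
    re_root = -D/(2*J)                 # real part of the dominant roots
    return (zeta_lo <= zeta <= zeta_hi
            and D_lo <= D <= D_hi
            and re_root <= re_max)

print(params_feasible(1.0, 22.0))  # inside the feasible region
print(params_feasible(5.0, 22.0))  # too much inertia: margin constraint fails
```

Sweeping such a check over a (J, D) grid reproduces a shaded feasibility region of the kind shown in Figure 4.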

Markov Decision Model
The parameter adjustment process of the VSG can be regarded as a sequential decision-making problem. Furthermore, it satisfies the Markov property: the state of the system at the next moment is related only to the state at the current moment and is independent of past states. Consequently, the adjustment of the virtual inertia and damping coefficient can be reformulated as a Markov Decision Process (MDP) and addressed through a reinforcement learning algorithm. An MDP can be succinctly described by a quintuple ⟨S, A, p, γ, R⟩, where S is the state space, A is the action space, p is the transition probability, indicating the probability of taking action a t in state s t and transitioning to the subsequent state s t+1 , γ is the discount factor denoting the significance of future rewards relative to current rewards (typically ranging from 0 to 1), and R is the reward function representing the feedback provided by the environment after taking action a t in state s t [32,33].

State Space
The primary features of the dynamic response of the VSG active power loop are manifested in the responses of active power and frequency. To comprehensively depict this response process, the state space is selected as follows:

s t = [e p , e ω , e p ′, e ω ′], e p = (P ref − P e )/P * , e ω = (ω 0 − ω)/ω * (14)

where e p , e ω , P * and ω * represent the normalized active power deviation, normalized frequency deviation, active power normalization coefficient and frequency normalization coefficient, respectively; and e p ′ and e ω ′ represent the change rates of e p and e ω , respectively.
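A minimal sketch of constructing this state vector follows; the normalization constants P_star and w_star below are placeholders, not the paper's values.

```python
import math

def build_state(P_ref, P_e, w, w0, prev_e, dt, P_star=100e3, w_star=2*math.pi):
    """Return [e_p, e_w, e_p', e_w'] from the current measurements;
    prev_e holds (e_p, e_w) from the previous control step."""
    e_p = (P_ref - P_e)/P_star       # normalized active power deviation
    e_w = (w0 - w)/w_star            # normalized frequency deviation
    de_p = (e_p - prev_e[0])/dt      # change rate of e_p
    de_w = (e_w - prev_e[1])/dt      # change rate of e_w
    return [e_p, e_w, de_p, de_w]

w0 = 2*math.pi*50
s = build_state(P_ref=20e3, P_e=0.0, w=w0, w0=w0, prev_e=(0.0, 0.0), dt=0.01)
print(s)
```

Normalizing by the rated quantities keeps all four state components on comparable scales, which typically helps neural network training.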

Action Space
In this study, the control variables are the virtual inertia and damping coefficient of the VSG. Therefore, the action space is defined as the change in the virtual inertia and damping coefficient and can be expressed as follows:

a t = [ΔJ t , ΔD t ] (15)

where ΔJ t and ΔD t represent the changes in the virtual inertia and damping coefficient, respectively. Thus, the actual virtual inertia J t and damping coefficient D t can be obtained by

J t = J 0 + ΔJ t , D t = D 0 + ΔD t (16)

where J 0 and D 0 represent the initial values of the virtual inertia and damping coefficient, respectively.
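In practice the updated parameters must also respect the constraints of Section 2. The sketch below clips them to assumed feasible ranges; the initial values and range endpoints are illustrative, not the paper's.

```python
def apply_action(dJ, dD, J0=1.0, D0=30.0,
                 J_range=(0.1, 2.5), D_range=(20.0, 50.0)):
    """Map the agent's action (dJ, dD) to the actual VSG parameters,
    clipped to the assumed feasible region."""
    clip = lambda x, lo, hi: min(max(x, lo), hi)
    J_t = clip(J0 + dJ, *J_range)
    D_t = clip(D0 + dD, *D_range)
    return J_t, D_t

print(apply_action(0.5, 5.0))    # (1.5, 35.0)
print(apply_action(10.0, 100.0)) # clipped to the upper bounds: (2.5, 50.0)
```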

Reward Function
The objective of RL is to maximize the cumulative discounted reward, rendering the reward function a crucial element in steering the agent towards the optimal strategy. Existing adaptive control approaches for the VSG typically focus on optimizing the oscillations in active power and frequency. However, suppressing oscillations may prolong the settling time and negatively influence the stability of the system. Consequently, this paper incorporates active power, frequency and settling time into the reward function design. The design methodology is as follows. Active power deviation reward function r 1 : In the post-disturbance recovery phase, the active power fluctuates around the reference value. Since the objective is to minimize the oscillation amplitude and duration as much as possible, the reward function for active power is designed as follows: where c 1 represents the weight coefficient.
Frequency deviation reward function r 2 : In practical scenarios, the power system requires the grid frequency to operate around the rated frequency, with minimal deviation being ideal. Consequently, the frequency reward function is structured as follows: where c 2 represents the weight coefficient.
Settling time reward function r 3 : To prevent an excessive settling time, we incorporate the settling time into the reward function. Since the total settling time can only be ascertained once the system reaches a steady state, rewarding only the final step would be unreasonable. This paper therefore designs the settling time reward as a constant penalty applied at each non-steady-state step: where c 3 represents the penalty coefficient. Terminal reward function r 4 : When the active power and frequency reach a steady state, further adjustments to the virtual inertia and damping coefficient no longer alter the system state. Consequently, samples collected in the steady state hold no value for improving the strategy and are considered invalid. To improve sample efficiency, when a steady state is reached, the training episode is promptly terminated and a new episode begins. To encourage the agent to approach the steady state, a positive constant reward is given when the system reaches stability: where M is a state-independent constant and s end represents the stable state. By summing Equations (17)-(20), the final reward function can be expressed as follows: The roles of r 1 and r 2 are to suppress the oscillations of active power and frequency, and the function of r 3 is to reduce the settling time. When the agent is close to the steady state, the difference in rewards brought by different actions is very small. As a terminal reward, r 4 prevents the agent from fluctuating near the steady state and improves sample efficiency. Incorporating r 3 makes the optimal oscillation-suppression strategy adjustable, effectively circumventing excessively prolonged adjustment times. Nonetheless, the weight of r 3 must be judiciously selected; if it is too large, the frequency deviation grows and the oscillation suppression effect deteriorates. Conversely, if it is too small, its effect becomes negligible.
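Putting the four terms together, the composite reward can be sketched as below. The quadratic penalty shape and the coefficient values are assumptions for illustration, not the paper's exact forms.

```python
def vsg_reward(e_p, e_w, steady, c1=1.0, c2=1.0, c3=0.05, M=10.0):
    """Composite reward r = r1 + r2 + r3 + r4 for one control step."""
    r1 = -c1*e_p**2                 # active power deviation penalty
    r2 = -c2*e_w**2                 # frequency deviation penalty
    r3 = 0.0 if steady else -c3     # per-step settling-time penalty
    r4 = M if steady else 0.0       # terminal bonus; episode ends here
    return r1 + r2 + r3 + r4

print(vsg_reward(0.1, 0.02, steady=False))  # transient step: small negative reward
print(vsg_reward(0.0, 0.0, steady=True))    # steady state reached: +M
```

Because the per-step penalty c3 accrues until the steady state is reached, trajectories that settle faster accumulate less penalty, which is how the settling-time objective enters the return.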

Principle of Soft Actor Critic
This paper employs the soft actor critic (SAC) algorithm to address the MDP associated with optimizing the VSG dynamic response. SAC is a model-free, off-policy RL algorithm. By incorporating the concept of maximum entropy, it not only enhances the algorithm's robustness but also improves the agent's exploration capability, thereby accelerating learning [34,35]. Figure 5 shows the framework of the SAC algorithm.
The learning objective of SAC extends beyond maximizing the cumulative reward; it also aims to maximize the entropy of the policy at each state [36][37][38]. Therefore, the objective function of SAC is as follows: where ρ π represents the state transition distribution; π represents the policy function; and α is the temperature coefficient, delineating the relative significance of the entropy term compared to the reward term and modulating the level of randomness in the policy. H represents the entropy of policy π in state s t , given by H(π(·|s t )) = −E a t ∼π [log π(a t |s t )]. RL algorithms use the Q-value to represent the expected cumulative reward obtained by taking action a t in state s t . Different from general value-based RL algorithms, the Q-value in SAC contains both the reward value and the action entropy. According to the Bellman equation, the soft Q-value can be calculated as follows: where V(s t+1 ) is the soft state value function. The SAC algorithm comprises five neural networks in total: a policy network, two critic networks and two target critic networks. The role of the critic networks is to approximate the soft Q-value; employing dual critic networks helps mitigate Q-value overestimation. The parameters of the soft Q-function can be trained by minimizing the soft Bellman residual, which is expressed as follows: where θ is the parameter of the critic network and D represents the experience replay buffer. The policy network outputs the probability distribution of the action. This paper uses a Gaussian distribution, so the policy network outputs the mean and standard deviation of the Gaussian distribution. The loss function for the policy network is as follows: where ϕ represents the parameter of the policy network.
During training, SAC maintains a balance between exploration and exploitation by adjusting the temperature coefficient α. Larger temperature coefficients encourage exploration, while smaller coefficients favor exploitation. However, determining an appropriate temperature coefficient can be challenging. To solve this problem, the SAC algorithm sets a target entropy H for each task and automatically updates the temperature coefficient during training. The loss function of the temperature coefficient is designed as follows: The parameters of each neural network are updated through gradient descent: where ∇ θ i L Q (θ i ), ∇ ϕ L π (ϕ) and ∇ α L(α) represent the gradients of the critic networks, the policy network and the temperature coefficient, and λ Q , λ π and λ α represent the corresponding learning rates, respectively.
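The target value used in the critic loss can be illustrated with scalars; this is a toy sketch in which the stubbed numbers stand in for network outputs.

```python
def soft_q_target(r, gamma, q1_next, q2_next, log_pi_next, alpha):
    """SAC critic target: reward plus discounted soft state value, where
    the soft value takes the minimum of the two target critics and adds
    the entropy bonus -alpha*log_pi."""
    v_next = min(q1_next, q2_next) - alpha*log_pi_next
    return r + gamma*v_next

y = soft_q_target(r=1.0, gamma=0.99, q1_next=2.0, q2_next=1.5,
                  log_pi_next=-0.5, alpha=0.2)
print(y)  # r + gamma*(min of twin critics + entropy bonus)
```

Taking the minimum of the two target critics is the mechanism that counteracts Q-value overestimation, and the -alpha*log_pi term is what distinguishes the soft value from the ordinary Bellman target.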

Simulation Results
To assess the effectiveness and feasibility of the VSG parameter adaptive control strategy based on SAC proposed in this paper, a single grid-connected VSG simulation system shown in Figure 1 is constructed using Simulink.The key parameters of the simulation system and RL agent are detailed in Table 1.
Table 1. Key parameters of the simulation model.

The RL agent is developed using the RL toolbox in MATLAB, with the architecture of the actor network and the critic network illustrated in Figure 6. In this article, a Gaussian distribution is used as the action probability distribution. As depicted in Figure 6, the state information initially traverses the state path and subsequently splits into two branches. One branch outputs the mean µ of the Gaussian distribution, while the other outputs the standard deviation σ. The exponential function (exp) ensures that the output value of the standard deviation remains positive. The action is derived by randomly sampling from the Gaussian distribution N(µ, σ²). Given that the sampling process is non-differentiable, to allow backpropagation through the actor network, the following reparameterization is employed in practice to derive the action: where ϵ t is random noise obeying the standard Gaussian distribution. The tanh function, whose range is from −1 to 1, transforms the Gaussian sample, which has an infinite range, into a finite action value. The critic network is composed of three segments: the action path, the state path and the common path.
The action path receives action information as its input, while the state path is fed with state information. These two paths converge on the common path, which processes the combined information to output the Q-value. The activation functions used in the neural networks are all ReLU functions, represented by ReLU(x) = max(0, x). To enhance the generalization ability of the RL agent, random disturbances are introduced to the reference active power and load power during training. The reference active power varies from 0 kW to 50 kW, and the load power from 5 kW to 30 kW.
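The squashed-Gaussian sampling described above can be sketched as follows; this is a minimal stand-in for the actor's output layer, not the toolbox implementation.

```python
import math
import random

def sample_action(mu, log_sigma, rng):
    """Reparameterized sampling: a = tanh(mu + sigma*eps), with
    sigma = exp(log_sigma) kept positive and eps ~ N(0, 1)."""
    sigma = math.exp(log_sigma)
    eps = rng.gauss(0.0, 1.0)
    return math.tanh(mu + sigma*eps)

rng = random.Random(0)
actions = [sample_action(0.0, -1.0, rng) for _ in range(5)]
print(all(-1.0 < a < 1.0 for a in actions))  # tanh bounds every sample
```

Because the noise eps is sampled outside the deterministic transformation, gradients can flow through mu and sigma; in a full SAC implementation the tanh squashing also adds a correction term to log π.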

Comparison of Different Reward Functions
The optimal strategy that the agent ultimately learns is determined by the structure of the reward function. To assess the impact of the settling-time penalty term introduced in this article, we compare the training outcomes under various reward functions; the RL algorithm in all cases is SAC. There are three reward functions: the reward function of Case 1 includes only the oscillation suppression rewards r 1 , r 2 and the terminal reward r 4 , excluding the settling-time penalty term r 3 . This reward function is designed to suppress power and frequency oscillations to the greatest extent possible, but it may result in excessively prolonged settling times. The reward function of Case 2 encompasses all components, but the settling-time penalty term r 3 is given a larger proportion. This reward function expects a rapid system response while overlooking the optimization of the transient process. In Case 3, the reward function also comprises all elements, but it maintains a balanced proportion of r 1 , r 2 and r 3 , aiming to achieve equilibrium between process optimization and adjustment time.
Figure 7a–c shows the training results for the different reward functions. Under all reward functions, the cumulative rewards converge, suggesting that the agents have successfully identified their respective optimal strategies. Next, we conduct time-domain simulations for the trained agents in Simulink. The simulation scenario applies a step disturbance that changes the active power reference from 0 kW to 20 kW at 1.0 s. The simulation results are depicted in Figure 8. As observed in Figure 8, compared with the traditional VSG with fixed parameters (Fixed-VSG), there are no secondary oscillations in the active power and frequency in any of the three cases. This outcome is attributed to the inclusion of r 1 and r 2 in the reward function. The introduction of the settling-time component alters the agent's optimal strategy: an increase in the proportion of the settling-time penalty term results in a gradual reduction in the settling time of the system. Further analysis reveals that since the reward function of Case 1 lacks the r 3 term, it suppresses the maximum frequency deviation most effectively, yet the optimization of the settling time is not pronounced. The reward function of Case 2 incorporates r 3 with a substantial weighting; while this significantly reduces the settling time, it increases the frequency deviation and degrades the frequency response. Case 3 establishes a balanced ratio among the various rewards, effectively suppressing the maximum frequency deviation while considerably reducing the settling time, aligning closely with the desired optimization objectives. This demonstrates the effectiveness and flexibility of the proposed method.

Comparison of Different RL Algorithms
Furthermore, to ascertain the superiority of the SAC algorithm, this article also trains an agent with the DDPG algorithm, which likewise operates under the actor-critic framework, using the reward function specified in Case 3. To maintain fairness in the comparison, the shared structures and parameters across the agents are initialized with identical values. For instance, the critic networks of SAC and DDPG are configured with the same architecture, and the initial values of the neural networks are also aligned. Considering that SAC employs a stochastic policy while DDPG uses a deterministic policy, the pathway from input to mean output in SAC's actor network is identical to the actor network in DDPG. Moreover, the two algorithms share the same learning rates and soft update coefficient. The reward curve of DDPG is shown in Figure 7d.
Comparing Figure 7c,d, it is apparent that SAC converges to the optimal strategy in just 200 episodes, whereas DDPG takes up to 750 episodes, signifying a faster learning pace for SAC. The reward curve of SAC remains smooth throughout the learning process, in contrast with the large fluctuations in DDPG's reward curve, which indicates a more stable learning trajectory for SAC. SAC's cumulative reward ultimately settles at −67.4, while DDPG's converges at −71.6, further evidencing SAC's superior learning effectiveness. Together, these results demonstrate the advantage of the SAC algorithm.

Case Studies
In this subsection, we verify the effectiveness of the well-trained VSG controller and compare its control performance with that of other methods in different scenarios. The proposed method (SAC-VSG) is trained using the balanced reward function specified in Case 3. The simulation scenarios cover three distinct disturbance conditions: active power reference disturbance, load power disturbance, and grid frequency disturbance. The methods included in the comparison are traditional fixed-parameter control (Fixed-VSG), linear adaptive control (Adaptive-VSG) as in [9], fuzzy adaptive control (Fuzzy-VSG) from [10], and DDPG adaptive control (DDPG-VSG) trained with the reward function from Case 3.

Active Power Reference Disturbance
This case study compares the control effects of the different methods in response to a step disturbance in the reference active power. The simulation test conditions are as follows: initially, the system is in a stable state, and the reference active power of the VSG is set to 0 kW; at 1.0 s, the reference active power steps from 0 kW to 20 kW. The simulation results are shown in Figure 9. We evaluate the control methods by the active power overshoot σ (with the steady-state band set to ±1%), the maximum frequency deviation Δf, and the setting time t_s, as presented in Table 2.
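The indicators just defined can be computed directly from a simulated response trace. The sketch below assumes sampled time/response arrays and uses the ±1% steady-state band stated above:

```python
def overshoot_pct(y, y_ref):
    """Percent overshoot sigma of the response relative to the reference."""
    return max(0.0, (max(y) - y_ref) / y_ref * 100.0)

def settling_time(t, y, y_ref, band=0.01):
    """Setting time t_s: the last instant at which the response lies
    outside the +/- band * y_ref envelope around the reference."""
    tol = band * abs(y_ref)
    ts = t[0]
    for ti, yi in zip(t, y):
        if abs(yi - y_ref) > tol:
            ts = ti
    return ts
```

The maximum frequency deviation Δf is analogous: `max(abs(f - f_ref))` over the trace.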
It can be seen that when the reference active power changes, the control effect of Fixed-VSG is not ideal: there are large oscillations in both the active power and the frequency response. Adaptive-VSG and Fuzzy-VSG alleviate the small-signal oscillation problem of traditional VSG control to a certain extent, but their suppression of frequency fluctuations and shortening of the setting time are limited. DDPG-VSG further improves the overshoot, maximum frequency deviation, and setting time, yet there remains room for enhancement. SAC-VSG, by contrast, exhibits almost no active power overshoot, suppresses the maximum frequency deviation very well, and greatly shortens the setting time. It can be concluded that, under a reference active power change, the proposed method outperforms the other methods in all respects.

Load Power Disturbance
This case study compares the control effects of the different methods under load disturbances. The simulation test conditions are as follows: at the initial moment, the system is in a stable state, with the load active power set at 10 kW; at 1.0 s, an additional load with an active power of 20 kW is connected. The simulation results are shown in Figure 10, and the performance indicators of the different control methods are presented in Table 3. Figure 10 shows that when the load active power changes suddenly, the inverter output power first follows the load power to maintain the power balance and then gradually returns to the reference. Compared with the other four methods, SAC-VSG returns to the equilibrium state more quickly and more smoothly during power restoration. Table 3 shows that SAC-VSG also outperforms the other four methods in active power overshoot, maximum frequency deviation, and setting time under a load power disturbance.

Grid Frequency Disturbance
To further verify the robustness of the proposed method, this case simulates and compares the control effects of the different methods when a small disturbance occurs in the grid frequency. The simulation test conditions are as follows: the system is in a stable state at the initial moment, with the reference active power set to 20 kW; at 1.0 s, the grid frequency changes from 50 Hz to 50.1 Hz. The simulation results are compared in Figure 11. As shown in Figure 11, when the grid frequency changes, a static deviation arises between the output active power and the reference active power, quantified as ΔP = −D_0 ω_0 Δω_g, where D_0 denotes the steady-state damping coefficient, set to 20 in this case. Consequently, a 0.1 Hz increase in grid frequency results in an approximately 4 kW decrease in output active power relative to the reference and, conversely, a 0.1 Hz decrease produces an increase of about 4 kW. Figure 11 also shows that SAC-VSG suppresses the frequency and active power oscillations more effectively than the other four methods, indicating that the proposed method is also suitable for disturbances in the grid frequency.
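The static deviation formula above can be checked numerically; with D_0 = 20 and a 0.1 Hz frequency rise it yields roughly −3.95 kW, matching the ~4 kW drop quoted in the text:

```python
import math

def static_power_deviation(D0, f0_hz, df_hz):
    """Steady-state power deviation dP = -D0 * w0 * dwg caused by a grid
    frequency offset, with angular quantities in rad/s and the result in W."""
    w0 = 2.0 * math.pi * f0_hz   # nominal angular frequency
    dwg = 2.0 * math.pi * df_hz  # grid angular-frequency deviation
    return -D0 * w0 * dwg

dP = static_power_deviation(20, 50.0, 0.1)  # about -3947.8 W, i.e. ~ -4 kW
```

The sign convention follows the text: a frequency rise (positive Δω_g) reduces the output power below the reference, and a frequency drop raises it.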

Conclusions
Based on VSG control, this paper proposes an SAC-based adaptive control strategy for the virtual inertia and damping coefficient to address the oscillation problem in the dynamic process of grid-connected VSG control. To better achieve the expected control performance, the oscillation suppression problem of the VSG is formulated as a multi-objective optimization problem that jointly considers frequency, active power, and setting time. The problem is then cast as an RL task and solved with the SAC algorithm. RL algorithms learn strategies through interaction with the environment and can achieve optimization in different scenarios. Compared with existing VSG adaptive control methods, the proposed method achieves better optimization in various application scenarios. Moreover, it requires neither expert knowledge nor a mathematical model of the system, and is a fully automatic optimization algorithm.
The RL-based parameter adaptation algorithm in this paper currently considers only the VSG control of a single inverter; future work can extend it to multi-machine parallel systems.

Figure 1.
Figure 1. VSG control framework based on reinforcement learning.

Figure 4.
Figure 4. The value range of the virtual inertia and damping coefficient.

Figure 5.
Figure 5. The framework of the SAC algorithm.

Figure 6.
Figure 6. Structures of the actor and critic networks.

Figure 7. Figure 8.
Figure 7. Training results with different reward functions and RL algorithms. (a-c) Training results of SAC with different reward functions; (d) training results of DDPG with Case 3. (a) Case 1. (b) Case 2. (c) Case 3. (d) DDPG with Case 3.

Figure 9.
Figure 9. Performance comparison of different VSG controllers under reference active power change. (a) Active power response. (b) Frequency response.

Figure 10.
Figure 10. Performance comparison of different VSG controllers under load change. (a) Active power response. (b) Frequency response.

Figure 11.
Figure 11. Performance comparison of different VSG controllers under grid frequency change. (a) Active power response. (b) Frequency response.

Table 2.
Performance indicators under reference active power disturbance.

Table 3.
Performance indicators under load power disturbance.