1. Introduction
With the rapid increase in renewable energy penetration driven by global carbon neutrality goals, the modern power grid is undergoing a fundamental transformation. A large number of power electronic converters are replacing conventional synchronous generators (SGs), significantly reducing system inertia. This transition increases the system’s vulnerability to disturbances and poses serious challenges to transient stability [
1,
2,
3]. Traditional stability analysis and control methods, which rely on synchronous machine-based assumptions, are increasingly inadequate for this new paradigm. As a result, there is a growing demand for advanced control frameworks that can ensure system stability in a low-inertia, converter-dominated grid environment [
4,
5,
6].
To address these challenges, grid-forming (GFM) control technologies have gained significant attention. Among them, virtual synchronous generator (VSG) control is particularly prominent due to its ability to mimic the inertial and damping characteristics of conventional synchronous machines [
7,
8,
9,
10]. VSG-equipped inverters contribute to system frequency support and enhance transient stability by emulating generator dynamics. However, conventional VSG implementations typically rely on fixed control parameters, such as virtual inertia and damping, which limits their adaptability to varying operating conditions and network configurations. In practical systems, grid conditions fluctuate due to varying loads, faults, and renewable output. Fixed-parameter VSG controllers may lead to suboptimal performance or even system instability under such dynamic environments. This lack of adaptability restricts the capability of GFM-based systems to maintain rotor angle synchronization and overall system stability, especially in scenarios with high penetration of renewables and weak grid conditions.
Several methods have been proposed to tune VSG parameters, including exhaustive search [
11], equal-area criteria [
12], and worst-case scenario analysis frameworks [
13]. While these methods can provide reasonable performance in well-modeled systems, they often require precise knowledge of system parameters, network topology, and operating conditions. Such requirements are difficult to meet in practice, especially in complex or time-varying networks. Consequently, these approaches lack generalization capability and are often unsuitable for online or real-time applications [
5,
14].
To overcome the limitations of model-based tuning approaches, reinforcement learning (RL) has emerged as a promising data-driven alternative [
15,
16,
17,
18,
19]. RL enables online adaptation of control parameters by learning directly from system observations, without requiring detailed physical models. It offers fast decision-making and has demonstrated strong potential for real-time applications in power system control. However, despite these advantages, RL-based methods face significant challenges that hinder their deployment in practical systems [
20,
21].
(1) Scarcity of high-quality scenario samples: In data-driven power system methods, sample quality and quantity are critical for model accuracy. Reference [
22] introduces a dataset generation method using infeasibility certificates to reduce unsafe regions. However, insufficient initial sampling or poorly selected infeasible points can leave unsafe areas uncovered and limit safe samples in the dataset. Similarly, ref. [
23] highlights that insufficient or noisy historical data causes models to poorly represent real operating conditions, especially in systems with high renewable energy penetration. Samples often focus on high-probability regions, ignoring critical low-probability states. Moreover, ref. [
24] stresses that insufficient high-quality samples in large-scale systems or environments with many uncertainties result in inaccurate predictions for complex or rapidly changing conditions.
(2) Low learning efficiency and unreliable strategies of RL agents: The complexity of power systems poses challenges for learning efficiency and strategy reliability in data-driven models. While ref. [
25] shows that linearization simplifies power flow equations, it fails to capture nonlinear system behavior under extreme conditions, leading to unreliable strategies. Reference [
26] points out that linear regression-based distributed algorithms improve efficiency but lack physical mechanism support, making them inadequate for complex networks. Similarly, ref. [
27] highlights that high-dimensional observables increase computational complexity, reducing learning efficiency and reliability, particularly for discrete events or power electronic dynamics. Finally, ref. [
28] underscores that linearization improves computation but overlooks critical nonlinear characteristics, compromising accuracy.
To address these limitations, physics-informed learning (PIL) has been proposed as a hybrid approach that embeds physical laws into data-driven models. One common method involves using physical information as a separate component in the neural network for predictions. For example, ref. [
29] proposes a real-time voltage control method that embeds physical knowledge into reinforcement learning (RL) through a Trainable Action Mask (TAM), improving training efficiency and control performance. However, the TAM can increase model complexity and require more computational resources. It may also lead to suboptimal or unstable control strategies if training data is insufficient or if the system state changes frequently. In [
30], physical knowledge is directly embedded into the Actor design. The physics-informed Actor uses power flow equations to ensure that the generated actions satisfy the equality constraints of the Optimal Power Flow problem. Another approach incorporates physical knowledge as a constraint in the loss function during RL. Ref. [
31] introduces a voltage ratio constraint in the loss function to guide Multi-Agent Deep RL (MADRL) for solving distributed voltage control problems. Additionally, ref. [
32] proposes an analytical physical model for wind power output during extreme events, which is used in the Pinball error evaluation function during training. However, despite these efforts to integrate physical knowledge into RL, most methods fail to address transient processes in power systems, particularly the underlying dynamic mechanisms.
Physics-Informed Neural Networks (PINNs) are a widely used approach within the field of physics-informed learning (PIL) [
33,
34,
35]. By integrating data-driven approaches with fundamental physical laws, PINNs overcome the limitations of traditional deep neural networks (DNNs) [
36,
37,
38,
39], enabling them to effectively solve ordinary differential equations (ODEs), identify parameters, and tackle complex problems such as partial differential equations (PDEs). PINNs not only enhance robustness and reduce reliance on large datasets but are also particularly well-suited for modeling transient processes, where accurate and reliable predictions are critical. This makes PINNs highly deployable in the New Power System, where reliability and limited scenario samples are key considerations.
Compared with the learning-based approach pursued here, classical model-predictive control (MPC) and robust optimization implementations must solve a constrained optimization at every control step, and their runtime grows with model dimension, prediction horizon, and constraint set [
40]. Robust variants such as tube or min–max MPC further enlarge the problem size and introduce conservatism, which aggravates real-time feasibility issues [
41]. In power-electronic and renewable-integration applications, long horizons or detailed multi-machine dynamics often render conventional MPC too computationally heavy for sub-100 ms deadlines [
42]. Even recent work seeks to accelerate MPC by embedding machine-learning surrogates to alleviate its online latency, underscoring the difficulty of plain MPC in real-time grids [
43]. Moreover, grid-forming inverter control requires almost instantaneous response to disturbances, which exacerbates the challenge [
44]. In contrast, physics-informed reinforcement learning (PIRL) executes with (near) constant online latency and leverages embedded physical constraints to enforce consistency and generalization even under model mismatch [
45]. This motivates the development of a PIRL-based adaptive control framework that prioritizes lower latency, scalable online computation, and physical fidelity in uncertain, time-varying power-system environments.
In this paper, we propose an adaptive grid-forming control framework for VSGs based on PIRL. The proposed method leverages the expressiveness of RL for predictive adaptation based on periodically updated system knowledge and the physical consistency of PINNs to enhance learning robustness. An adaptive controller, referred to as 3N-D, is developed to periodically adjust virtual inertia and damping parameters in anticipation of potential transient events. The key contributions of this study are summarized as follows:
(1) An adaptive VSG control strategy based on differential observations is proposed, which periodically updates the virtual inertia and damping parameters to achieve predictive adjustment under evolving system conditions, without relying on explicit multi-machine models.
(2) A physics-informed neural network is integrated into the reinforcement learning framework to embed system dynamic characteristics, which guides policy optimization with physical consistency, reduces the dependence on large-scale datasets, and enhances learning efficiency and generalization in uncertain environments.
The remaining structure of this paper is as follows:
Section 2 introduces the mathematical model of the VSG.
Section 3 proposes a physics-informed reinforcement learning method for adaptive VSG parameter tuning.
Section 4 rigorously tests these methods on the modified IEEE-39 bus system. Finally,
Section 5 discusses the conclusions and future research directions.
2. System Model and Preliminaries
2.1. GFM-VSG Model
Grid-forming (GFM) control technologies have emerged as essential strategies for maintaining system stability in converter-dominated grids. Among them, the VSG is a widely adopted GFM implementation, aiming to replicate the dynamic behavior of conventional SGs, including their inertial and damping characteristics. By emulating synchronous generator dynamics, VSG-equipped inverters can provide frequency support and contribute to system robustness during disturbances as shown in
Figure 1a.
Figure 1b further details the internal loop, which manages frequency and power to maintain system equilibrium.
The system measures the power deviation between the reference power P_ref and the actual output power P_e, as well as the frequency deviation between the current frequency ω and the system's nominal frequency ω_n. The damping coefficient D regulates the system's response to frequency variations, and J and D jointly determine the rate of frequency change, preventing excessive fluctuations. The frequency signal is integrated and fed back for dynamic adjustment, and further integration provides the phase angle θ, ensuring synchronization. This closed-loop control of frequency and phase angle allows the VSG to maintain system stability during load changes, emulating SG behavior.
To further enhance adaptability, this paper introduces a nonlinear decision module named 3N-D into the VSG control loop. The 3N-D module periodically adjusts the virtual inertia J and damping coefficient D based on system state observations. Unlike conventional fixed-parameter designs, this approach enables predictive control by tuning the VSG parameters in advance of potential disturbances, thereby improving both dynamic performance and robustness.
In summary, the control equation of the VSG can be represented by the following rotor motion equation:

J dω/dt = P_ref − P_e − D (ω − ω_n),   dθ/dt = ω    (1)

Replacing the virtual angular frequency ω in Equation (1) with the rotor angle δ (using dδ/dt = ω − ω_n) yields the form corresponding to the traditional swing equation:

J d²δ/dt² = P_ref − P_e − D dδ/dt
This method enables RES to emulate the dynamic characteristics of SGs during grid connection by leveraging inverter-based control. The integrated 3N-D module utilizes current system state observations to periodically determine updated values of the virtual inertia J and virtual damping coefficient D, thereby enhancing both dynamic performance and overall system robustness. This control strategy effectively enhances the system’s robustness and reliability, reduces frequency fluctuations, and ensures efficient operation of the power system.
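To make the rotor-motion model concrete, the following minimal Python sketch integrates Equation (1) with an explicit Euler step; the time step, nominal frequency, and numerical values are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def vsg_swing_step(omega, theta, p_ref, p_e, J, D, omega_n=2 * np.pi * 50, dt=1e-3):
    """One explicit-Euler step of the VSG rotor-motion equation (1).

    J * d(omega)/dt = P_ref - P_e - D * (omega - omega_n)
    d(theta)/dt     = omega
    """
    domega = (p_ref - p_e - D * (omega - omega_n)) / J
    omega_next = omega + dt * domega
    theta_next = theta + dt * omega_next
    return omega_next, theta_next

# Example: response to a sudden power imbalance under two (J, D) settings
for J, D in [(2.0, 5.0), (10.0, 5.0)]:
    omega, theta = 2 * np.pi * 50, 0.0
    for _ in range(1000):                      # 1 s of simulation at dt = 1 ms
        omega, theta = vsg_swing_step(omega, theta, p_ref=1.0, p_e=0.8, J=J, D=D)
    print(f"J={J}, D={D}: frequency deviation {omega - 2*np.pi*50:.4f} rad/s")
```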
Although different RESs, such as wind and solar, exhibit diverse physical characteristics at the source level, their dynamic interaction with the power grid is ultimately shaped by the inverter-based interface control strategy. In this study, all RESs are interfaced through a GFM control mode based on the VSG approach, which standardizes their grid-facing behavior via tunable inertia and damping parameters. This abstraction allows the proposed adaptive tuning strategy to be generally applicable across various RES types.
2.2. Analysis of VSG Parameters for Stability
In the power angle stability issue discussed in this paper, integrating RESs does not change the fundamental definition of rotor angle stability. However, it affects rotor angle stability by altering power flow on major interconnections, replacing large SGs, influencing the damping torque of nearby SGs, and displacing SGs equipped with critical power system stabilizers (PSSs). On the other hand, as the proportion of RESs in the grid increases, the inertia of the power system decreases. When the grid experiences a large disturbance, it must provide power support quickly, often leading to larger and faster rotor swings. Unlike traditional generators, the selection of virtual inertia and damping coefficients under VSG is more flexible. Therefore, to enhance the power system's ability to cope with large disturbances, these two controllable parameters can be dynamically adjusted, leading to the following transformation of Equation (1):

dω/dt = K_J (P_ref − P_e) − K_D (ω − ω_n)

where K_J = 1/J and K_D = D/J. And when we set ΔP = P_ref − P_e and Δω = ω − ω_n, the equation becomes the following:

dω/dt = K_J ΔP − K_D Δω
From the above equations, it is evident that K_J and K_D are determined by J and D, representing the system's sensitivity to power error and angular velocity error, respectively. We can analyze the impact of J and D on system stability and select appropriate parameters to optimize system performance.
(1) Influence of Virtual Inertia J: A larger J results in a smaller K_J and K_D, increasing the system's virtual inertia. This helps suppress rapid frequency fluctuations and improves system stability by slowing down the rate of frequency changes. However, the downside is that the system's response becomes more sluggish, making it less suitable for applications requiring rapid adaptation to dynamic changes.
A smaller J results in a larger K_J and K_D, allowing the system to respond more quickly to changes in power or frequency deviation. This improves dynamic performance and responsiveness. However, reduced inertia may lead to larger frequency fluctuations and compromise system stability, as the system becomes more sensitive to disturbances.
(2) Influence of Damping Coefficient D: A larger D increases K_D, providing stronger damping effects. This helps reduce system oscillations and enhances stability by effectively counteracting frequency deviations. However, if D is too large, the system may become overly sluggish, reducing the speed of dynamic adjustments in response to frequency changes.
A smaller D results in a smaller K_D, which can improve the system's dynamic response speed. However, insufficient damping leads to increased oscillations and potential instability, as the system lacks sufficient resistance to rapid frequency deviations.
In summary, a larger J and D provide better stability but slower response. A smaller J and D provide faster dynamic response but may lead to system instability. While a larger D improves damping, if J is small, the effect of damping might not be sufficient to stabilize fast disturbances. Therefore, selecting these parameters requires balancing system stability and dynamic performance to meet specific application needs. Additionally, the impact of increasing renewable energy penetration on transient rotor angle stability depends on factors such as grid layout and the location and control of renewable energy generators, adding significant complexity to physical modeling.
From the above analysis, it is evident that appropriate tuning of J and D is essential for VSGs to emulate the dynamic characteristics of SGs. The virtual inertia J allows VSGs to replicate the inertial response of SGs during frequency disturbances, while the damping coefficient D mimics the damping torque provided by SG mechanical dynamics and power system stabilizers (PSSs). By properly adjusting these parameters, VSGs can provide fast and coordinated frequency and angle support, thus contributing to overall system stability under high renewable penetration.
3. Adaptive Transient Rotor Angle Control Strategy Based on PIRL
3.1. VSG Adaptive Transient Power Angle Control Framework
When VSG uses fixed control parameters
J and
D, it may not achieve optimal control effects for the following three main reasons: First, fixed control parameters cannot adapt to grid conditions, operating conditions, and various disturbances, leading to poor control performance. Second, due to the complexity and nonlinearity of the system, fixed parameters may fail at certain operating points, making it impossible to ensure global optimization. Third, the lack of adaptive capability and the ability to address the uncertainties of renewable energy means that fixed parameters cannot be adjusted promptly to respond to changes in system state and environment, affecting system stability and response effectiveness. Therefore, this paper proposes a novel VSG adaptive transient power angle control method, with its basic framework shown in
Figure 2.
The proposed adaptive transient rotor angle control method periodically updates the virtual inertia J and damping coefficient D to achieve online coordinated control of multiple VSGs. Considering the time-varying nature of the system model and operating conditions, the optimal control parameters J and D are calculated online at fixed intervals based on the latest grid model and operating-state information (such as data collected from the EMS) and deployed to the controllers of each VSG.
To implement the periodic updating of J and D, the proposed framework follows a two-stage mechanism within each update interval ΔT: (1) system state estimation and prediction, and (2) parameter computation and deployment. Specifically, at the beginning of each interval, system-level measurements (e.g., bus voltages, power angles, and load data) are collected from the EMS and fed into the trained physics-informed reinforcement learning (PIRL) agent. The PIRL agent embeds a physics-informed neural network (PINN) to predict the rotor angle trend δ̂, which is then combined with the current state vector to form an augmented state input for the Actor network. The Actor network outputs optimized J and D values for all VSGs through centralized policy inference. These values are then distributed to the corresponding local VSG controllers via the 3N-D module. This inference–deployment cycle is repeated periodically every ΔT (e.g., 5–15 min), allowing the system to maintain an updated and anticipatory control strategy.
It is important to note that this method does not focus on post-fault adjustments but rather on proactively optimizing system parameters to enhance adaptability and stability in the presence of potential disturbances. Unlike traditional event-triggered control strategies, this approach periodically adjusts VSG parameters so that the system remains in a more favorable operating state when disturbances occur, thereby reducing the impact of sudden events on system stability.
The parameter update interval ΔT is typically set to several minutes (5 to 15 min) to capture long-term dynamic changes in the system. Although short-term disturbances are not directly managed, the input data in the control process already reflects the real-time state of the system, including the effects of short-term variations. This ensures that the system remains stable most of the time, and by proactively adjusting parameters, we avoid the delays of handling disturbances only after they arise. This approach improves system robustness while avoiding the computational overhead and noise of frequent updates.
To further improve performance during transient events, this paper proposes a physics-informed reinforcement learning (PIRL)-based VSG parameter optimization method, which integrates reinforcement learning with physical system constraints to ensure efficient and timely online computation of optimal parameters.
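The inference–deployment cycle described above can be summarized in the following sketch; `ems`, `pinn`, `actor`, and the VSG controller interface are hypothetical placeholders for the EMS data feed, the trained PINN, the trained Actor network, and the 3N-D parameter distribution path.

```python
import time

UPDATE_INTERVAL_S = 10 * 60          # ΔT: refresh every 10 min (within the 5-15 min range)

def periodic_parameter_update(ems, pinn, actor, vsg_controllers):
    """Periodic inference-deployment cycle of the proposed framework (sketch)."""
    while True:
        state = ems.collect_snapshot()               # bus voltages, angles, loads, VSG states
        delta_trend = pinn.predict(state)            # PINN forecast of the rotor-angle trend
        augmented_state = list(state) + list(delta_trend)   # augmented state for the Actor
        params = actor.act(augmented_state)          # [J_1, D_1, ..., J_N, D_N]
        for ctrl, (J, D) in zip(vsg_controllers, zip(params[0::2], params[1::2])):
            ctrl.set_parameters(J=J, D=D)            # distributed via the 3N-D module
        time.sleep(UPDATE_INTERVAL_S)                # wait for the next update interval
```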
3.2. MDP Setting
Reinforcement learning is a machine learning method that learns strategies by interacting with the environment. An intelligent agent chooses an action in a given state and adjusts the strategy based on the rewards fed back from the environment. In the training of RL, the target is to find the optimal policy π*, with which the cumulative reward is maximized. We commonly model the problem as a Markov decision process (MDP) described by the five-tuple (S, A, P, R, γ).
For the transient power angle problem concerned in this paper, the following settings are made to PIRL:
(1) State Space
In this paper, the selected observations are the bus voltage V, bus voltage phase angle θ, bus active and reactive load P_L and Q_L, generator power angle δ, frequency ω, output power P_e, and the time series t (input time series data for 5 observation points within 1 s), specifically integrated into the following equation:

s_t = [V, θ, P_L, Q_L, δ, ω, P_e, t]
(2) Action Space
The action space consists of all possible actions that the agent can select at each time step. In this paper, the controlled variables are the J and D parameters of the VSG, as described earlier, specifically represented as follows:

a_t = [J_1, D_1, J_2, D_2, …, J_N, D_N]

where N is the number of VSGs.
That is, two coefficients are output for each VSG. If there is no model-based conclusion and the action space is continuous, it may lead to millions of different operations in each event.
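As an illustration, the state and action vectors could be assembled as in the following sketch; the flattening order and helper names are assumptions, not the paper's exact data layout.

```python
import numpy as np

def build_state(v, theta, p_load, q_load, delta, omega, p_e, t):
    """Stack the observations of the state space into a single vector s_t.

    Each argument is an array over buses/generators sampled at 5 observation
    points within 1 s; the flattening order is an illustrative choice.
    """
    return np.concatenate([np.ravel(x) for x in (v, theta, p_load, q_load, delta, omega, p_e, t)])

def split_action(a, n_vsg):
    """Interpret the Actor output as one (J, D) pair per VSG."""
    a = np.asarray(a).reshape(n_vsg, 2)
    return a[:, 0], a[:, 1]          # J values, D values
```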
(3) Reward Function
This paper primarily focuses on studying the transient stability of the system and the variation in active power output. The reward is designed based on the system's immediate response in terms of the VSG frequency ω, electrical power output P_e, and the virtual power angle δ after a disturbance.
For evaluating transient stability, we use the transient stability index (TSI) of the rotor angle, defined as follows:

TSI = (360° − |Δδ|_max) / (360° + |Δδ|_max)

where |Δδ|_max represents the maximum difference in rotor angle between any two generators during the simulation. As indicated by the formula, when TSI is greater than 0, the system is considered to be transiently stable; conversely, if TSI is less than or equal to 0, the system is at risk of instability.
Consequently, the system's reward at time t is formulated as r_t, which is the weighted sum of individual components for frequency, power angle, and power deviations:

r_t = σ (λ_δ r_δ + λ_ω r_ω + λ_P r_P)

where r_δ, r_ω, and r_P are the rotor angle, frequency, and active power components of the reward, and λ_δ, λ_ω, and λ_P are the penalty coefficients set to weight the effect of rotor angle, frequency, and active power in the transient state, respectively. To keep the overall reward well scaled, σ is set as a scaling factor for the total reward. It should be noted that the reward function is designed to balance the trade-off between system stability and dynamic response. The coefficients λ_ω, λ_δ, and λ_P are carefully chosen to prioritize system frequency stability, rotor angle stability, and power output accuracy. Importantly, when an action is infeasible or leads to system instability, the three data-dependent terms r_ω, r_δ, and r_P are not computed; instead, a constant penalty is directly applied. Likewise, when the action enables the system to recover stability (i.e., TSI > 0), a constant reward is added. We set σ as the global outer scaling factor applied to both penalties and bonuses. By adjusting these weights, the reinforcement learning agent is encouraged to maintain a stable power angle while minimizing frequency fluctuations and ensuring that the power output remains close to the setpoint. This ensures that both J and D are dynamically adjusted to maintain an optimal balance between stability and response speed.
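A hedged sketch of this reward logic is given below; the TSI expression follows the definition above, while the coefficient values, the constant penalty, and the recovery bonus are illustrative placeholders rather than the values used in the paper.

```python
import numpy as np

def transient_stability_index(rotor_angles_deg):
    """TSI based on the largest rotor-angle separation between any two generators (degrees)."""
    d_max = np.max(rotor_angles_deg) - np.min(rotor_angles_deg)
    return (360.0 - d_max) / (360.0 + d_max)

def reward(rotor_angles_deg, freq_dev, p_dev,
           lam_delta=1.0, lam_omega=0.5, lam_p=0.1, sigma=1.0,
           unstable_penalty=-10.0, recovery_bonus=5.0):
    """Reward sketch: TSI-dominated term minus frequency and power penalties.

    All coefficient values here are illustrative placeholders.
    """
    tsi = transient_stability_index(rotor_angles_deg)
    if tsi <= 0.0:                                   # loss of synchronism: constant penalty
        return sigma * unstable_penalty
    r = lam_delta * tsi - lam_omega * abs(freq_dev) - lam_p * abs(p_dev)
    return sigma * (r + recovery_bonus)              # bonus when the system remains/recovers stable
```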
3.3. DDPG Algorithm for Adaptive Control
In this work, the Deep Deterministic Policy Gradient (DDPG) is selected instead of alternatives like PPO or TD3 mainly due to its suitability for continuous and high-dimensional action spaces with direct physical interpretations. PPO relies on stochastic policy optimization, which may introduce unnecessary variance when precise parameter outputs are required for real-time grid control. TD3 improves upon DDPG but at the expense of additional critics and delayed policy updates, leading to higher computational burden and longer training time, which is less favorable under strict real-time constraints. By contrast, DDPG directly learns deterministic mappings from system states to the continuous control parameters (J, D), achieving fast convergence and low-latency decision-making. Furthermore, the stability concerns of DDPG are mitigated in our framework by embedding physical constraints via PINNs and employing target networks and experience replay, which enhance robustness during training.
DDPG is designed for continuous action spaces, utilizing an Actor–Critic architecture where the Actor generates actions and the Critic evaluates their quality. In this paper, we apply the DDPG algorithm to optimize the control parameters of VSG, ensuring that the system can respond adaptively to grid disturbances while maintaining stability. The novelty of our approach lies in how we integrate DDPG into the VSG control framework, customizing the algorithm to suit the specific requirements of power system dynamics.
3.3.1. Actor-Network
In the proposed PIRL framework, the Actor-network is designed to output the complete set of virtual inertia coefficients and damping coefficients for all VSGs based on the current system state s_t. These parameters are critical for VSGs to emulate the behavior of SGs and provide stability support to the grid. Unlike traditional DDPG applications with generic action spaces, the action space in this study directly corresponds to the physical control parameters of multiple VSGs, making the policy learned by the Actor directly applicable to real-world grid control.
A centralized Actor structure is adopted, as shown in
Figure 3. The network takes the complete system state s_t as input and simultaneously generates control parameters for
N VSGs through a deep neural network. This design enables coordinated control across VSGs by leveraging global state information.
As depicted in
Figure 3, the Actor is a multilayer neural network consisting of an input layer, hidden layers, and an output layer. The input layer receives the system state s_t and the output of the PINNs (discussed in the next subsection), which together include the operating states of all VSGs. After multiple layers of nonlinear transformations, the output layer produces N sets of parameters (J_i, D_i). This structure allows the Actor to learn the coupling relationships among multiple VSGs, particularly the strong nonlinearities exhibited during large grid disturbances. By sharing network parameters, the system effectively captures dynamic interactions between VSGs, enhancing transient regulation performance.
At the execution level, these parameters are transmitted to a single 3N-D module that serves as the parameter execution interface. The 3N-D module then distributes the appropriate parameter sets to each local VSG controller. This design maintains the characteristics of a distributed execution architecture, where each VSG responds quickly based on local information, while ensuring control strategy consistency through centralized decision-making.
The innovation of this architecture lies in its centralized–distributed hybrid control paradigm: the Actor-network updates its parameters based on the gradient information from the Critic-network to maximize the Q-value, achieving global optimization at the decision-making level; the execution layer performs periodic parameter updates through 3N-D modules based on system state snapshots. By periodically adjusting the parameter sets for all VSGs, the system can respond more sensitively to power angle fluctuations. The parameter update timing mechanism is adaptive, allowing the system to dynamically adjust control cycles based on grid conditions, balancing control accuracy and communication load.
The Actor-network updates its parameters based on the gradient information of the Critic-network to maximize the Q value of the Critic-network output:

∇_{θ^μ} J_π ≈ E[ ∇_a Q(s, a | θ^Q) |_{a = μ(s | θ^μ)} ∇_{θ^μ} μ(s | θ^μ) ]

Here, the policy μ(s | θ^μ) is determined by the parameters θ^μ.
3.3.2. Critic-Network
To ensure that the control actions output by the Actor network are optimal, the Critic network provides feedback by evaluating the quality of the Actor's actions. In the DDPG algorithm, the Critic network measures the quality of each state–action pair based on a Q-value function. The Q-value function is continuously updated by the following Bellman equation:

Q(s_t, a_t | θ^Q) = r_t + γ Q′(s_{t+1}, μ′(s_{t+1} | θ^{μ′}) | θ^{Q′})
The task of the Critic network is to minimize the prediction error of the Q-value, allowing the Q-value function to gradually approximate the true cumulative reward.
In practice, the Critic-network is updated by minimizing the following loss function:

L(θ^Q) = E[ ( y_t − Q(s_t, a_t | θ^Q) )² ],   with   y_t = r_t + γ Q′(s_{t+1}, μ′(s_{t+1} | θ^{μ′}) | θ^{Q′})
In the application scenario of this paper, the reward function of the Critic network is closely related to the transient power angle stability of the power system. Specifically, the Critic network evaluates the effectiveness of the Actor network’s control actions based on the transient response of the power grid. If the actions generated by the Actor can effectively suppress excessive oscillations of the transient power angle and restore the system to a stable state quickly after a disturbance, the Critic will assign a high Q-value to those actions. Conversely, if the system experiences large power angle oscillations or significant frequency deviations, the Critic will assign a lower Q-value, guiding the Actor to adjust its strategy accordingly.
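For reference, a single DDPG update consistent with the Bellman target and deterministic policy gradient above might look like the following PyTorch sketch; the paper does not specify its deep learning framework, and the network modules and optimizers are assumed to be defined elsewhere.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One DDPG update step (sketch); assumes critic(s, a) and actor(s) modules."""
    s, a, r, s_next = batch                      # tensors sampled from the replay buffer

    # Critic: minimize the Bellman error against the target networks
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the Q-value of its own actions (deterministic policy gradient)
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```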
3.4. Physics-Informed Learning Mechanism for Multi-VSG Systems
This paper employs physics-informed neural networks (PINNs) to embed physical laws and generator identity information into a unified framework for dynamic prediction and control of multi-VSG systems. This is accomplished by considering a partial differential equation of the general form:

𝒩[u](x, t) = 0

where 𝒩 is a differential operator, and u is the function to be solved. PINNs approximate the solution u by constructing a neural network.
Given the previously detailed dynamic equations, we define δ̂ as the predicted rotor angle at the next time step. Thus, the PINNs can be expressed as follows:
(1) δ̂ = NN(x; θ_N) represents the neural network's predicted rotor angle at the next moment;
(2) f = d²δ/dt² − (1/J)(P_ref − P_e − D dδ/dt) represents the physical residual of the rotor angular acceleration according to the traditional swing equation.
Therefore, the whole definition of PINNs is as follows: the network parameters θ_N are trained so that δ̂ fits the observed rotor angles while the residual f is driven toward zero.
The specific physical constraint enforced here is the classical swing equation of synchronous generators. It ensures that the predicted rotor angle trajectory not only fits the data but also satisfies the physical relationship between inertia J, damping D, power imbalance ΔP, and frequency deviation Δω. In practice, this constraint is embedded as a soft penalty in the loss function, guiding the network to produce physically consistent outputs while improving its generalization under limited or noisy samples.
As the power system comprises multiple VSGs operating in parallel, a generator identity encoding method is adopted within the PINNs framework to differentiate the dynamic characteristics of various generators. Assuming each generator's state is represented by a vector containing d features, we assign each generator a unique identifier using one-hot encoding. Specifically, for each generator i, its identity encoding can be represented as an N-dimensional vector e_i, as follows:

e_i = [0, …, 0, 1, 0, …, 0]

where the i-th position is 1, and all others are 0. This encoding ensures that each generator carries a unique identifier in the input, allowing the PINNs framework to accurately distinguish between different generators.
Next, to integrate the generator's identity information with its state features, we concatenate each generator's state vector s_i with its corresponding identity encoding e_i, forming a new input vector:

x_i = [s_i, e_i]

where x_i is the input vector containing both state information and identity information. By clearly representing both state dynamics and generator identities, the neural network can more effectively learn the individualized dynamics of each generator and improve prediction accuracy.
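A minimal sketch of this identity encoding is shown below; the feature values and dimensions are arbitrary examples.

```python
import numpy as np

def encode_generator(state_vec, gen_index, n_generators):
    """Concatenate a generator's d-dimensional state s_i with its one-hot identity e_i."""
    e = np.zeros(n_generators)
    e[gen_index] = 1.0                     # i-th position is 1, all others 0
    return np.concatenate([state_vec, e])  # x_i = [s_i, e_i]

# Example: generator 2 of an N = 10 machine system with d = 4 state features
x_2 = encode_generator(np.array([0.3, 1.0, 0.05, 0.98]), gen_index=2, n_generators=10)
print(x_2.shape)   # (14,) = d + N
```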
The unified total loss function comprises two parts: data loss and physics loss, assuming the same set of time points is used for both data fitting and enforcing physical constraints:
(1) Data loss term L_data: calculates the mean squared error (MSE) between the network's predicted rotor angles and actual observations, ensuring consistency with observed data.
(2) Physics loss term L_phys: calculates the MSE between the predicted rotor angular acceleration and the real angular acceleration derived from the traditional swing equation, ensuring predictions adhere to physical laws.
The loss functions are defined explicitly as follows:

L_data = (1/M) Σ_{k=1}^{M} ( δ̂_k − δ_k )²,   L_phys = (1/M) Σ_{k=1}^{M} f_k²,   L_total = w_data L_data + w_phys L_phys

It should be noted that the weights of the data loss w_data and the physics loss w_phys are assumed equal by default in this paper. However, in practical applications, it may be necessary to adjust these weights according to specific tasks to balance the strengths of data fitting and physical constraints appropriately. Moreover, since rotor acceleration fluctuations are expected during disturbances in practical power systems, strictly minimizing the physics loss to zero might not be reasonable. Instead, the physics loss should be controlled within a reasonable range to avoid over-constraining the network's predictive performance. The whole PINNs architecture is shown in
Figure 4.
By integrating data loss and physics loss, the approach effectively balances data-driven learning with adherence to physical laws, enabling accurate fitting of historical data while strictly following the dynamic characteristics of the power system.
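The following PyTorch sketch illustrates one way to combine the data and physics losses; the finite-difference discretization of the swing-equation residual and the input layout are illustrative assumptions, although the hidden-layer sizes match those reported later in the case study.

```python
import torch
import torch.nn as nn

class RotorAnglePINN(nn.Module):
    """Two hidden layers of 128 neurons, matching the sizes reported in Section 4.2."""
    def __init__(self, n_inputs):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x)

def pinn_loss(model, x, delta_obs, dp, domega, J, D, dt, w_data=1.0, w_phys=1.0):
    """Total loss = w_data * L_data + w_phys * L_phys (sketch with equal weights).

    The physics term penalizes deviation of the predicted rotor acceleration
    from the swing equation J * d2(delta)/dt2 = dP - D * d(omega), approximated
    here with finite differences; the discretization is an illustrative choice.
    """
    delta_pred = model(x).squeeze(-1)
    data_loss = torch.mean((delta_pred - delta_obs) ** 2)

    # second-order finite difference of the predicted angle over the time window
    accel_pred = (delta_pred[2:] - 2 * delta_pred[1:-1] + delta_pred[:-2]) / dt ** 2
    accel_phys = (dp[1:-1] - D * domega[1:-1]) / J
    phys_loss = torch.mean((accel_pred - accel_phys) ** 2)

    return w_data * data_loss + w_phys * phys_loss
```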
The input vector x_i is fed into the PINNs model, which outputs the predicted rotor angle δ̂ by learning the physical laws and constraints. Subsequently, this predicted rotor angle serves as a new state input for the Actor network within the RL framework, enabling the generation of corresponding control actions based on both predicted rotor angles and current system states. We refer to this integrated PINNs and Actor network approach as the 3N-D model, which ensures accurate prediction and effective control of generator rotor angles, optimizing overall power system stability and responsiveness. The pseudocode of the proposed method is given in Algorithm 1.
Algorithm 1 Physics-Informed Deep Deterministic Policy Gradient
1: Randomly initialize the physics-informed neural network;
2: Initialize policy network parameters θ^μ and value network parameters θ^Q;
3: Initialize target policy network parameters θ^{μ′} and target value network parameters θ^{Q′};
4: Initialize the replay buffer R;
5: for episode = 1, M do
6:     Initialize the action exploration noise N;
7:     Get the initial state s_1;
8:     for t = 1, T do
9:         Predict the next rotor angle trend from the PINNs: δ̂_{t+1};
10:        Combine s_t and δ̂_{t+1}, and select an action based on the online policy network and exploration noise: a_t = μ(s_t, δ̂_{t+1} | θ^μ) + N_t;
11:        Execute action a_t, and obtain the reward r_t from the environment and the next state s_{t+1};
12:        Save the experience to the replay buffer: (s_t, a_t, r_t, s_{t+1});
13:    end for
14:    for k = 1, K do
15:        Generate the identity information of each generator together with its state features by Equation (15);
16:        Calculate the PINNs value by Equation (13);
17:        Update the PINNs parameters by minimizing the loss function in Equation (16);
18:        Calculate the target value by Equation (9);
19:        Update the Critic-networks by minimizing the loss function in Equation (11);
20:        Update the Actor-networks and the target networks;
21:    end for
22: end for
The physics-informed DDPG implements three training techniques to prevent instability in DDPG training and uses physical information to reduce training complexity while avoiding local optima issues.
(1) Using Target Networks: Both the Actor- and Critic-networks have target networks to generate stable target values. The parameters of the target networks gradually approach the main network parameters through soft updates rather than being fully copied each time. This method reduces the changes in target values for the target networks, preventing drastic updates to the main network parameters and thus improving training stability.
(2) Experience Replay: Storing past experience samples and randomly sampling small batches for training. This method can break time correlations, improving the independence and identically distributed nature of the samples.
(3) Physics-Informed Action Mapping: Although DDPG performs well in handling continuous action spaces, it can still fall into local optima, especially in complex and multi-peak policy spaces. The embedding of physical information guides the training of the agent while avoiding local optima.
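The first two techniques can be captured in a few lines, as in the sketch below; the buffer capacity and soft-update rate τ are illustrative choices, not values reported by the authors.

```python
import random
from collections import deque

import torch

class ReplayBuffer:
    """Fixed-size experience replay; random mini-batches break time correlations."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        return [torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in items])
                for items in zip(*batch)]

def soft_update(target_net, online_net, tau=0.005):
    """Target parameters slowly track the online network: θ' ← τθ + (1 − τ)θ'."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```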
4. Case Study
4.1. Analysis of Parameter J and D for VSG
From
Section 2.2, we have already discussed the influence of virtual inertia
J and damping coefficient
D. Larger values of
J and
D enhance system stability but result in slower response times, while smaller values improve dynamic response but may compromise system stability. To further investigate these effects, we first conduct simulations based on a Single Machine Infinite Bus (SMIB) system, incorporating both a VSG and an SG. In the simulation where the
J was varied,
D was fixed at 5. Similarly, in the tests of
D,
J was fixed at 5. By fixing one parameter while adjusting the other, we can more clearly observe the effects of virtual inertia and virtual damping on the system’s dynamic response, ensuring that the simulation results are both targeted and comparable. All simulations are conducted using the open-source power system simulation software ANDES (version 1.9.2) [
46], programmed in Python 3.7. This platform is selected for its strong capabilities in multi-device dynamic modeling, support for modular controller integration, and symbolic formulation of differential-algebraic equations (DAEs), which make it well-suited for implementing custom VSG control strategies. Additionally, its open-source nature ensures transparency and reproducibility, which aligns with the research objectives of this work.
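A minimal ANDES driver consistent with this setup is sketched below; the case file name is a placeholder, and the SG, VSG, and fault models are assumed to be defined inside that file.

```python
import andes

# Load a case file in which the SG and grid-forming VSG models and the
# three-phase fault events are already defined; the file name is a placeholder.
ss = andes.run("smib_vsg.xlsx")        # solves the power flow and builds the system

ss.TDS.config.tf = 20.0                # simulate 20 s of post-fault dynamics
ss.TDS.run()                           # time-domain simulation with the configured faults
```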
The left side of
Figure 5 illustrates the impact of different virtual inertia values on system response. A smaller virtual inertia allows the system to respond more quickly, with the angular velocity ω reaching a steady state rapidly. However, due to the lack of sufficient inertia, the system exhibits larger oscillations, with significant frequency deviations and notable power oscillations in the output power P_e. As the virtual inertia increases, the system's oscillations decrease, and both angular velocity and power fluctuations are significantly reduced. This indicates that increasing the virtual inertia improves system stability. However, a larger inertia also results in a slower system response, prolonging the duration of oscillations.
The right side of
Figure 5 shows the effect of different damping coefficients D on the system's dynamic characteristics. In the absence of damping (i.e., D = 0), the system exhibits pronounced oscillations, with large-amplitude fluctuations in both the angular velocity ω and the output power P_e, making the system difficult to stabilize quickly. As the damping coefficient increases, the system's oscillations are effectively damped, and the fluctuations in angular velocity and power gradually diminish, leading to faster stabilization. Larger damping provides stronger suppression of frequency deviations, reducing system oscillations. However, excessive damping may slow down the system's response.
In summary, the virtual inertia J and damping coefficient D play complementary roles in balancing the trade-off between response speed and system stability. Increasing virtual inertia enhances system stability by reducing oscillations but slows down the dynamic response. Similarly, increasing damping D mitigates system oscillations and improves stability, but excessive damping can lead to slower system response. Therefore, selecting appropriate values for J and D is crucial to achieving an optimal balance between rapid response and stability in various applications.
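The qualitative trade-off summarized here can be reproduced with a simple single-machine swing-equation sweep; the sketch below uses illustrative per-unit values and is not the ANDES SMIB model used for Figure 5.

```python
import numpy as np

def smib_response(J, D, p_m=0.8, p_max=1.5, t_end=10.0, dt=1e-3):
    """SMIB swing response: J * d2(delta)/dt2 = P_m - P_max*sin(delta) - D * d(delta)/dt."""
    delta = np.arcsin(p_m / p_max) + 0.5      # start away from equilibrium to excite a swing
    omega = 0.0                               # frequency deviation
    traj = []
    for _ in range(int(t_end / dt)):
        domega = (p_m - p_max * np.sin(delta) - D * omega) / J
        omega += dt * domega
        delta += dt * omega
        traj.append(omega)
    return np.array(traj)

for J in (1.0, 5.0, 20.0):                    # vary J with D fixed at 5
    w = smib_response(J, D=5.0)
    print(f"J={J:5.1f}, D=5: peak frequency deviation {np.abs(w).max():.3f}")
for D in (0.0, 5.0, 20.0):                    # vary D with J fixed at 5
    w = smib_response(5.0, D=D)
    print(f"J=5, D={D:5.1f}: peak frequency deviation {np.abs(w).max():.3f}")
```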
4.2. Test System: Modified IEEE-39 Bus System
This section validates the 3N-D controller using the PI-DDPG algorithm on a modified IEEE 39-bus system. Random three-phase ground faults are introduced at buses 2, 9, 13, 17, 23, and 25, with fault occurrence time t_f and fault clearing time t_c. To emulate observation-to-action latency in practical EMS, the control action is applied with a delay of 0.2 s after the fault clearing time, while the RL agent only uses state data up to the fault clearing instant for decision-making. This setting effectively incorporates data latency into the simulation design. Three cases are simulated. The load levels in all cases are set to 120–126% of the original load level.
As shown in
Figure 6, Case 1 and Case 2 represent scenarios with relatively extreme proportions of renewable energy units (20% and 80%, respectively) to evaluate the effectiveness of the method under different renewable energy ratios. In Case 3, the proportion of renewable energy units is 50%, and the method's reliability is tested under high load by increasing the operational load. Case 1: Seven SGs (at buses 30, 31, 33, 34, 35, 36, and 37) and three VSGs (at buses 32, 38, and 39). Case 2: Two SGs (at buses 30 and 33) and eight VSGs (at buses 31, 32, 34, 35, 36, 37, 38, and 39). Case 3: Five SGs (at buses 30, 33, 34, 36, and 37) and five VSGs (at buses 31, 32, 35, 38, and 39).
In the simulation experiments of this paper, the PINNs include two fully connected layers, each containing 128 neurons; the Actor- and Critic-networks each have three fully connected layers, with 1024, 128, and 64 neurons per layer, respectively. Additionally, the activation function is the rectified linear unit (ReLU), the batch size for sampling is 64, the optimizer used is Adam, and the learning rate is 0.001.
Table 1 summarizes the key parameters of the VSGs in the 39-bus system, including the inertia constant
J, damping coefficient
D, and other control settings. Since there is no universal standard for selecting VSG parameters, a wide range of values can be applied. In this study, we selected the parameters based on simulations reported in [
17,
20,
21,
47], combined with the default settings of the simulation platform, to ensure reliable results and provide a basis for further optimization. Although detailed hardware constraints are not explicitly modeled in this work, the chosen values of
J and
D fall within the feasible ranges reported in the related literature, ensuring that the parameter settings remain physically meaningful and practically relevant. The table lists the configurations adopted in the case study.
The apparent power ratings are identical to those of the original conventional SGs.
The same controller settings are applied to all VSGs: the table lists the voltage controller's proportional and integral gains and the current controller's proportional and integral gains, respectively.
4.3. Training Efficiency and Deployment Performance
Figure 7 illustrates the convergence of PI-DDPG. The reward curve reflects the overall performance of the policy. As shown in
Figure 7a, the rewards for all cases increase rapidly during the early training phase (first 3000 epochs), indicating that the policy network quickly improves its adaptability to the environment. In Case 1, the reward value rises sharply from nearly 0 to over 10 within the first 3000 epochs and then gradually stabilizes throughout the training process, eventually converging around 15. In Case 2, the reward increases rapidly to approximately 14 in the early training phase and remains stable after 3000 epochs, ultimately converging around 15. In Case 3, the reward quickly rises to around 15 and remains stable throughout the training process.
Figure 7b depicts the training process of the value loss function, where the Critic loss for all cases exhibits good convergence. In Case 1, the loss value drops sharply from a high level within the first 3000 epochs and continues to decrease gradually during training, stabilizing after 9000 epochs with reduced fluctuations. In Case 2, the loss value rapidly decreases to around 100 in the early training phase and then remains stable after 3000 epochs, with minimal fluctuation, indicating a highly stable training process. In Case 3, the loss value decreases rapidly to around 80 within the first 3000 epochs and then remains stable throughout the training process, showing excellent convergence.
Figure 7c presents the declining trend of policy loss during training for all cases, demonstrating the optimization effectiveness of the policy network. In the early training stage (first 3000 epochs), the policy loss drops rapidly, followed by a steady decrease between 3000 and 9000 epochs. In the later training stage (after 9000 epochs), the loss value gradually stabilizes at a low level, and the loss curves of all cases converge smoothly, indicating the stability and convergence of the policy network.
In
Figure 7d, the physics loss exhibits excellent convergence across all cases, maintaining a stable state throughout training with almost no fluctuations. This performance indicates that the physical constraints are fully satisfied, ensuring that the network optimization consistently follows physical laws, resulting in a highly stable optimization process.
On the other hand,
Figure 8 demonstrates the application of the PI-DDPG-based 3N-D controller, particularly highlighting the adaptive adjustment of parameters over time to maintain stability. In
Figure 8a, without adaptive tuning, ‘VSG-10’ in Cases 1 and 2 experiences significant rotor angle increases from around 8 s until the end of the simulation, resulting in power angle instability. However, with PI-DDPG adaptive tuning (
Figure 8b), the rotor angles of all generators remain stable throughout the simulation. In Case 3 (right side of
Figure 8), the system is more vulnerable to large disturbances under high load. The rotor angles of SGs and VSGs deviate at around 2.5 s, maintaining three clusters of power angles until the end of the simulation. After a large disturbance, PI-DDPG adjusts the VSG parameters, transitioning the system rotor angles to synchronous reduction, achieving stability according to current standards. The PI-DDPG algorithm demonstrates effective and timely parameter adjustments under severe conditions, significantly enhancing power system stability through adaptive tuning of VSG parameters
J and
D.
The dynamic response of the system in terms of rotor speed ω and electrical power P_e under different scenarios is presented in
Figure 9. In all three cases without control, the system exhibits significant oscillations during the fault, with noticeable fluctuations in both ω and P_e and a delayed return to stability after the fault. Particularly in Case 2 and Case 3, the system experiences larger deviations and slower recovery, indicating its inability to maintain stability effectively without proper control.
However, with PI-DDPG control, the system's oscillations are significantly reduced and both ω and P_e stabilize more quickly across all cases. Even in Case 2 and Case 3, where the uncontrolled system experiences larger disturbances, PI-DDPG ensures a rapid recovery with minimal deviations from the steady-state values.
The virtual inertia J and virtual damping coefficient D are crucial for stabilizing the rotor speed ω, power angle δ, and electrical power output P_e. Without control, the system experiences instability, particularly in the power angle δ, due to insufficient inertia and damping, which causes significant power angle deviations. For example, in 'VSG-10' with fixed J and D, large power angle shifts occur during disturbances, leading to instability.
Therefore, the adaptive tuning of J and D is critical for balancing stability and dynamic performance. Larger values of J reduce frequency fluctuations, while larger values of D enhance damping to suppress oscillations. These parameter adjustments under the PI-DDPG framework ensure system stability across different scenarios. These results confirm that the PI-DDPG controller maintains robust performance even when decision-making is based on delayed observations, which demonstrates its adaptability to latency commonly encountered in EMS.
4.4. Validation of the Effectiveness of PINNs
To validate the feasibility and adaptability of the proposed PI-DDPG-based 3N-D controller, we compare its performance with the Soft Actor–Critic (SAC) algorithm, which relies on stochastic policy optimization, in this section.
Figure 10 illustrates the reward curves and Critic loss of different algorithms during the training process. In all cases, DDPG and PI-DDPG converge significantly faster than SAC. Due to the guidance of PINNs, PI-DDPG obtains higher rewards in the early stage, which helps guide policy optimization into the correct range, thereby accelerating convergence. Specifically, the reward functions of PI-DDPG and DDPG converge at approximately 200 epochs across all cases, whereas SAC shows signs of convergence around 8000 epochs in Case 1 and Case 2. Notably, in Case 3, SAC does not converge until 10,000 epochs.
On the other hand, regarding the convergence of the Critic loss function, PI-DDPG and DDPG converge at around 7500 epochs, while SAC continues until the end of the training process. Furthermore, compared to the other two algorithms, PI-DDPG exhibits significantly smaller fluctuations, indicating that the inclusion of PINNs improves the convergence of the reinforcement learning algorithm.
According to the algorithm data in
Table 2, the PI-DDPG algorithm quickly achieves and maintains high reward values and success rates across all cases and training epochs, indicating its superior ability to optimize the objective function. In this study, the “success rate” is defined as the proportion of fault scenarios in which the system maintains transient rotor angle stability after adaptive tuning of VSG parameters. A trial is considered successful if the rotor angles of all generators remain synchronized without loss of stability during the post-fault period. This definition directly corresponds to practical transient stability criteria used in grid operation.
In Case 1, where SGs dominate the system, the system has high inertia and damping characteristics. Due to the unbalanced proportion of VSGs, the adaptive control of VSG parameters has certain limitations in improving power angle stability. Under this scenario, PI-DDPG demonstrates a relatively high reward value (−0.03) and success rate (55.43%) even in the early stage (1000 epochs), significantly outperforming DDPG and SAC. This suggests that PI-DDPG effectively enhances system stability by adjusting the limited number of VSG parameters. In the later stage (15,000 epochs), PI-DDPG maintains stable performance with a reward value of 1.91 and a success rate of 60.41%. The relatively lower success rate in Case 1 is attributed to the dominance of SGs, which already provide strong inertia and damping, thus limiting the contribution of VSG-based adaptive control. Since only three VSGs are included in this case, the ability of the proposed controller to further improve system stability is inherently constrained. Nevertheless, PI-DDPG still significantly outperforms DDPG and SAC under the same conditions, validating the effectiveness of the proposed approach.
In Case 2, the system has a high penetration of renewable energy, leading to significantly weakened inertia and damping characteristics, making power angle stability highly dependent on the dynamic adjustment of VSGs and increasing control difficulty. PI-DDPG demonstrates exceptional adaptability from the early stage (1000 epochs), achieving a reward value of 15.00 and a success rate of 96.88%, significantly outperforming DDPG and SAC. This indicates that PI-DDPG can quickly optimize the inertia and damping characteristics of VSGs, effectively compensating for the system’s inertia deficiency and enhancing stability. In the later stage (15,000 epochs), PI-DDPG reaches a near-theoretical optimal performance with a reward value of 15.06 and a success rate of 97.50%, significantly surpassing DDPG (−2.79 reward value; 48.13% success rate). This highlights PI-DDPG’s superior performance and rapid convergence capability in complex scenarios.
In Case 3, the system’s inertia and damping characteristics fall between those of Case 1 and Case 2. PI-DDPG quickly identifies a coordinated control strategy for SGs and VSGs in the early stage (1000 epochs), achieving a reward value of 15.78 and an impressive success rate of 99.38%, far exceeding DDPG (−2.74 reward value; 47.83% success rate) and SAC (−1.65 reward value; 50.93% success rate). In the later stage (15,000 epochs), PI-DDPG’s reward value and success rate remain at the theoretical optimum (15.80 reward value; 99.38% success rate), closely matching SAC’s performance but with significantly faster convergence. This demonstrates PI-DDPG’s efficiency and stability in control.
The PI-DDPG algorithm outperforms others in terms of both performance and learning efficiency, quickly converging to optimal solutions and maximizing rewards across various test scenarios. In contrast, SAC converges more slowly due to its broader strategy search range, as it balances maximizing expected rewards with policy entropy for exploration. While this entropy-based exploration enhances robustness in some environments, it results in slower convergence, particularly in tasks where control actions are more deterministic, as in our setup. DDPG, on the other hand, focuses on deterministic policy optimization, allowing for faster convergence in environments with well-defined control action spaces. By optimizing the policy directly without entropy-based exploration, DDPG efficiently finds the optimal solution in our test cases, where system dynamics require less exploration.
As shown in
Figure 7d, the Physics Loss in all three cases remains at a low level of around 0.1 throughout the training process and gradually stabilizes without noticeable oscillations or divergence. Case 1 exhibits slightly lower values, while Case 2 and Case 3 show marginally higher losses due to the increased complexity of system dynamics under high renewable penetration and heavy-load conditions. It should be noted that the Physics Loss is not required to converge to zero; maintaining a stable and low level is sufficient to ensure a balance between physical consistency and data fitting capability. This trend confirms that the embedded physical constraints are effectively learned by the network, ensuring that PINNs capture the swing equation dynamics while avoiding overfitting. The convergence of PINNs in this study is determined when the total loss (composed of data loss and physics loss) becomes stable within a predefined threshold over consecutive training epochs. Unlike methods that enforce the physics loss to converge to a very small value, we adopt this criterion because the physics loss may retain residual errors due to factors such as numerical precision limits in power system simulation, discretization of differential equations, and inevitable approximation errors in the neural network representation. Therefore, instead of forcing the physics loss to diminish excessively, which may over-constrain the model and degrade prediction accuracy, convergence is recognized once the total loss reaches stability. This approach ensures both feasibility in training and consistency with physical laws. To enhance generalization, the PINNs are trained with diverse scenarios including multiple fault locations, clearing times, and different renewable penetration levels. Moreover, the embedded physical constraints derived from the swing equation act as an inherent regularization, ensuring that the learned solutions remain consistent with physical laws even when encountering unseen operating conditions.
4.5. Sensitivity Analysis of Reward Penalty Coefficients
To further evaluate the robustness of the proposed physics-informed reinforcement learning (PIRL) framework, we conduct a sensitivity analysis of the reward weight coefficients. In particular, the rotor-angle-based transient stability index (TSI) is adopted as the dominant term of the reward, while the penalty terms associated with frequency deviation (λ_ω) and active power deviation (λ_P) are treated as auxiliary constraints. To investigate the influence of these penalty coefficients on the performance and convergence of the control policy, three representative settings are compared:
Group A: default penalty coefficients;
Group B: weakened penalties;
Group C: strengthened penalties.
Each configuration was trained under consistent conditions for 15,000 episodes. Evaluation metrics included cumulative reward, success rate, TSI, and tracking accuracy of power and frequency. The results are shown in
Figure 11.
The results indicate that TSI consistently dominates the reward shaping across all configurations. Under the default setting, TSI reaches 0.727, which is significantly larger than the contribution from active power deviation (0.056). This confirms that the controller prioritizes rotor angle stability. When the penalties are weakened (Group B), the weight of TSI slightly decreases to 0.668, while the contribution of active power deviation increases markedly to 1.741. This suggests that a looser penalty on power mismatch allows the agent to relax its tracking requirement, thereby allocating more learning capacity to stability optimization. Conversely, when penalties are strengthened (Group C), active power deviation is effectively suppressed to 0.158, while TSI (0.723) and frequency deviation (≈10) remain nearly unchanged compared to the default case. These results demonstrate that the TSI-dominant reward structure guarantees rotor angle stability even under different penalty settings, while power tracking can be flexibly adjusted.
The underlying mechanism behind these observations can be explained by the design of the reward function. The dominance of TSI originates from the positive offset applied in its formulation and the large negative penalty incurred upon instability, which ensures that maintaining synchronism is always prioritized. In contrast, the frequency deviation term shows little variation because the VSG dynamics inherently damp frequency oscillations, and the reward function only considers terminal states, thereby limiting the impact of λ_ω. The active power deviation term is the most sensitive component, as it is penalized linearly without any offset, making it directly proportional to λ_P. Reducing λ_P lowers the cost of power mismatch, resulting in larger deviations, while increasing λ_P effectively suppresses the mismatch.
In summary, the sensitivity analysis confirms that the proposed reward design is robust against variations in penalty coefficients. TSI consistently dominates the learning process and ensures rotor angle stability, while the active power deviation is the most sensitive term, influenced mainly by λ_P. The default configuration thus provides a balanced trade-off between system stability and power tracking performance.
4.6. Complex System Validation
To validate the capability and generalization of the proposed PI-DDPG method in handling complex systems, simulations were conducted on the WECC-179 bus case with a high proportion of renewable energy generation, as illustrated in
Figure 10. This case consists of 15 SGs and 14 VSGs, accounting for approximately 50% of the total generation (details shown in
Figure 12). Faults were introduced at buses 16, 47, 53, 75, 85, 110, and 119, respectively, with t_f and t_c set as before. The load levels were consistent with the base case of WECC-179.
From
Figure 13a, it can be seen that PI-DDPG still performs consistently well, with DDPG slightly lagging behind. However, the SAC algorithm consistently fails to achieve satisfactory results. This aligns with the conclusions drawn from the simulation experiments on the modified IEEE-39 bus system discussed earlier. Similarly, these trends are reflected in the critic loss function, indicating that PI-DDPG converges the fastest.
On the other hand,
Figure 13 depicts the deployment performance of PI-DDPG based on the modified WECC-179 bus system. In
Figure 13c, the system's time-domain simulation is shown without adaptive VSG control. The system exhibits angular fluctuations from 2 s to 7 s, and instability tendencies emerge around 8 s, amplifying continuously until the end of the simulation. However, in
Figure 13d, with the VSG parameter decisions made by PI-DDPG, the angular fluctuations in the system are delayed until 4 s, with lower amplitudes and frequencies. Throughout the simulation, the fluctuation amplitude decreases continuously, indicating that the system's final angular stability is achieved.
In summary, the PI-DDPG algorithm’s ability to significantly reduce training time and achieve high-quality results highlights its effectiveness in practical applications. It demonstrates robustness and scalability across different complexity levels, making it an efficient decision-making tool in real-world applications by rapidly converging to optimal solutions and maximizing rewards.