Article

Reactive Power Optimization of a Distribution Network Based on Graph Security Reinforcement Learning

1. School of Electronic and Information, Xi’an Jiaotong University, Xi’an 710049, China
2. Gansu Tongxing Intelligent Technology Development Co., Ltd., Lanzhou 730050, China
3. School of Electrical Engineering, Xi’an University of Technology, Xi’an 710048, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8209; https://doi.org/10.3390/app15158209
Submission received: 24 June 2025 / Revised: 17 July 2025 / Accepted: 21 July 2025 / Published: 23 July 2025
(This article belongs to the Special Issue IoT Technology and Information Security)

Abstract

With the increasing integration of renewable energy, the secure operation of distribution networks faces significant challenges, such as voltage limit violations and increased power losses. To address reactive power and voltage security under renewable generation uncertainty, this paper proposes a graph-based security reinforcement learning method. First, a graph-enhanced neural network is designed to extract both topological and node-level features from the distribution network. Then, a primal-dual approach incorporates voltage security constraints into the agent’s critic by constructing a cost critic that guides safe policy learning. Finally, a dual-critic framework is adopted to train the actor network and derive an optimal policy. Experiments conducted on real load profiles demonstrated that the proposed method reduced the voltage violation rate to 0%, compared with 4.92% for the Deep Deterministic Policy Gradient (DDPG) algorithm and 5.14% for the Twin Delayed DDPG (TD3) algorithm, while keeping the average node voltage deviation within 0.0073 per unit.

1. Introduction

With the accelerating clean, low-carbon energy transition and the continued construction of new power systems, integrating a high share of renewable energy into the grid has become a defining trend of China’s power system. The strong randomness and volatility of renewable outputs (the daily fluctuation of photovoltaic power can reach 80% of rated capacity), superimposed on the time-varying nature of load [1], have heightened the risk of frequent voltage limit violations at distribution network nodes.
In power systems, reactive power plays a pivotal role in maintaining voltage stability, ensuring power quality, and enhancing grid reliability. Adequate reactive power support is essential to keep voltages within acceptable limits, especially in distribution networks with high penetration of distributed energy resources (DERs). With the increasing integration of DERs such as photovoltaic and wind power, traditional reactive power optimization methods face new challenges from the intermittent and uncertain nature of these sources. Events such as the 2003 power crises in North America and Europe demonstrated that voltage instability has become a critical trigger for cascading failures in power grids [2]. Research on reactive power and voltage optimization and control for renewable grid-connected systems is therefore urgently needed: node voltage violations caused by faults or other security factors often trigger cascading failures that can end in catastrophic accidents, so maintaining safe node voltages is of utmost importance for the secure and stable operation of power systems [3].
Recent studies have made significant progress in reactive power control. For instance, a three-phase shift power level controller (TPSPC) was proposed in [4], which can independently and effectively control active and reactive power in single-phase dual active bridge (DAB) DC–DC converters. This method demonstrates the potential for improving reactive power management in power systems. Another study [5] highlighted the importance of combining flexible alternating current transmission system (FACTS) devices and renewable energy sources (RES) for optimal reactive power dispatch (ORPD), aiming to reduce active power losses and enhance voltage profiles. Additionally, a reactive power controller for single-phase DAB DC/DC converters was presented in [6], which can regulate reactive power, while controlling the DC output voltage or active power. A comprehensive review of reactive power planning (RPP) methods was provided in [7], emphasizing the effectiveness of strategic RPP approaches in reducing power losses, minimizing equipment failures, and improving power quality.
In response to the voltage instability problems arising from the integration of DERs, severe load fluctuations, and extensive deployment of power electronic devices, traditional reactive power optimization methods face three major challenges, as detailed below: First, centralized optimization based on static topology assumptions struggles to adapt to the time-varying network structure resulting from the random switching of DERs [8]. Second, conventional model predictive control (MPC) methods are constrained by the forecasting accuracy of renewable energy outputs, leading to insufficient regulation timeliness at the minute scale [9]. Third, the multi-timescale dynamic interactions among power electronic devices significantly increase control dimensionality [10], thereby posing a risk of dimensional explosion for traditional mathematical programming approaches.
Current research on reactive power and voltage optimization (RPO) primarily follows three directions: deterministic mathematical programming (DMP), heuristic intelligent algorithms, and deep reinforcement learning. The first category is based on DMP, such as second-order cone programming (SOCP) [11] and mixed-integer nonlinear programming (MINLP) [12], which handle non-convex constraints through convex relaxation. However, the computational efficiency of these methods drops significantly for multi-timestep coupled decisions: in a distribution network with 100 nodes, the MINLP solution time can exceed 10 min [13], which makes it difficult to meet second-scale reactive power regulation requirements. DMP methods also rely on accurate grid parameters, so optimization results may deviate from the practical feasible region. Additionally, in the presence of uncertainties, mechanisms such as scenario reduction and chance constraints must be introduced through stochastic programming or robust optimization; for example, the number of decision variables in two-stage robust optimization grows by a factor of 3–5 [14], significantly increasing the computational burden.
The second category employs heuristic intelligent algorithms, such as particle swarm optimization (PSO) [15] and genetic algorithms (GA) [16]. These algorithms do not rely on accurate power system model parameters or assume differentiability or continuity of objective functions, thus enabling them to handle highly non-convex and discrete problems [17]. However, heuristic algorithms struggle to strictly satisfy voltage security constraints and often use penalty functions to address violations. Experiments show that improper setting of penalty coefficients can lead to an increase in the violation rate [18].
The third-generation methods adopt deep reinforcement learning (DRL), such as the Deep Q-Network (DQN) [19] and Proximal Policy Optimization (PPO) [20], to enable end-to-end control policy generation. However, they commonly suffer from an insufficient state-space representation capacity [21].
Existing DRL approaches have also attempted to integrate graph neural networks (GNNs). For example, the Graph-QN architecture proposed in [22] captures power system topological features to cope with grid topology dynamics, but it establishes no explicit security constraint mechanism. Early Deep Deterministic Policy Gradient (DDPG) algorithms approximate both the value function and the policy with deep neural networks (DNNs), enabling direct handling of continuous state and action spaces. By integrating the strengths of Q-learning and deterministic policy gradients, DDPG learns optimal policies through interaction with the environment. In robotics, DDPG has achieved notable success in complex control tasks such as bipedal locomotion and robotic arm manipulation [23]. In the power systems domain, DDPG has also been explored for various control challenges [24]; for example, it has been applied to regulate hybrid distribution transformers [25] and photovoltaic (PV) smart inverters [26] to keep voltage levels within permissible limits. Furthermore, the Twin Delayed DDPG (TD3) algorithm [27] improves upon DDPG with a double Q-network mechanism that reduces overestimation bias and enhances value-estimation accuracy. While DDPG and TD3 reduce manual tuning and improve training efficiency, their performance remains limited in reactive power and voltage optimization for distribution networks with high DER penetration, particularly in state representation and in enforcing safety constraints.
In response to the voltage instability problems arising from the integration of DERs, severe load fluctuations, and the extensive deployment of power electronic devices, this paper studies reactive power distribution and voltage regulation strategies that improve the voltage stability and quality of the grid and prevent voltage collapse.
Notably, recent studies have made breakthroughs in addressing security constraints. Liang et al. [28] integrated voltage constraints into a reward function via the Lagrangian multiplier method, yet manual adjustment of penalty coefficients is required. Zhang et al. [29] designed a safety layer to filter hazardous actions, but this comes at the cost of overly conservative policies. In terms of state representation, Wang et al. [22,30] demonstrated the superiority of graph convolutional networks (GCNs) in capturing topological coupling relationships among nodes.
To address the shortcomings of previous studies in mitigating the voltage safety impacts of renewable energy on power grids, and to enhance feature extraction of distribution network structures, this paper proposes a Security Graph Deep Deterministic Policy Gradient (SG-DDPG) algorithm with voltage security constraints. The main contributions are listed below:
  • We model Secure Reactive Power Optimization under DER fluctuations as a Constrained Markov Decision Process, explicitly incorporating voltage security limits.
  • A graph-enhanced neural network extracts both topological and nodal features, improving agent awareness of spatial dependencies.
  • We introduce a cost critic network alongside the reward critic, using a primal-dual update to enforce safety constraints without manual penalty weight tuning.
  • On an improved IEEE 33-bus test case with real load and PV data, our method outperformed standard DDPG and TD3 in both safety (no voltage violations) and efficiency (lower network losses).
The remainder of this paper is organized as follows: Section 2 describes the Secure Reactive Power Optimization (SRPO) problem in ADNs and formulates it as a Constrained Markov Decision Process (CMDP). Section 3 details the proposed methodology. Section 4 presents the implementation and experimental validation of the proposed solution. Finally, Section 5 provides discussions and concludes the paper.

2. Problem Description and System Models

2.1. Problem Formulation

In the context of reactive power and voltage security control in distribution networks with high penetration of renewable energy sources, the various nodes are equipped with PV units and capacitor banks. The PV units are integrated into the grid through controllable smart inverters, which, like conventional capacitors, provide reactive power support to regulate the bus voltage at each node within a stable threshold.
The structure of reactive power and voltage security control in the distribution network is illustrated in Figure 1. Under the control of the trained agent, the reactive power compensation devices are coordinated to ensure the voltage security of the distribution network, while simultaneously minimizing device regulation costs and active power losses. Therefore, in a distribution network of N + 1 nodes with m PV units and k capacitor banks, the objective of reactive power-based voltage security optimization over T time steps is
$$\min \sum_{t=1}^{T} \left[ \frac{1}{N} \sum_{i \in \mathcal{N}} \left| v_{i,t} - v_{\mathrm{ref}} \right| + \sum_{(i,j) \in \Omega} r_{ij} I_{ij}^{2} + \sum_{a} c_{a}\, \mathbb{1}(d_{t}^{a}) \right] \tag{1}$$
In the above formulation, $v_{i,t}$ denotes the voltage magnitude at node i at time t, and $v_{\mathrm{ref}}$ is the reference voltage. The pair $(i,j)$ denotes a branch connecting nodes i and j, with $r_{ij}$ and $I_{ij}$ the branch resistance and current magnitude, respectively. $c_a$ is the switching cost of the a-th mechanical device, and the indicator function $\mathbb{1}(d_t^a)$ equals 1 if the a-th device is actuated at time t and 0 otherwise.
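To make Equation (1) concrete, the following minimal Python sketch evaluates the per-step objective; it assumes per-unit voltages in a NumPy array, branches given as (resistance, current) pairs, and a hypothetical `switched` flag list marking actuated devices; none of these names come from the authors' implementation.

```python
import numpy as np

def step_objective(v, v_ref, branches, switch_costs, switched):
    """One time step of Eq. (1): mean |v_i - v_ref| + sum r_ij * I_ij^2
    + switching costs of the devices actuated at this step."""
    voltage_dev = np.mean(np.abs(v - v_ref))
    line_losses = sum(r * i ** 2 for r, i in branches)
    action_cost = sum(c for c, acted in zip(switch_costs, switched) if acted)
    return voltage_dev + line_losses + action_cost

# 4 buses, 3 branches, one capacitor bank switched in this step
v = np.array([1.01, 0.98, 0.97, 1.02])
print(step_objective(v, 1.0, [(0.1, 0.5), (0.08, 0.4), (0.12, 0.3)], [0.05], [True]))
```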

2.2. Distribution Network Models and Equipment Models

In our CMDP formulation, the physical constraints of the system are represented by two sets of conditions: the power flow equations (see Equation (2)) and the operational limits of the equipment (see Equation (3)). These constraints are strictly enforced by the PYPOWER solver and a custom simulation environment, respectively.
$$\begin{aligned}
P_{\mathrm{total}} &= U_{i} \sum_{j=1}^{N} U_{j}\left(G_{ij}\cos\theta_{ij} + B_{ij}\sin\theta_{ij}\right)\\
Q_{\mathrm{total}} &= U_{i} \sum_{j=1}^{N} U_{j}\left(G_{ij}\sin\theta_{ij} - B_{ij}\cos\theta_{ij}\right)\\
P_{\mathrm{total}} &= P_{i}^{G} + P_{i}^{PV} - P_{i}^{L}\\
Q_{\mathrm{total}} &= Q_{i}^{G} + Q_{i}^{PV} + Q_{i}^{CB} - Q_{i}^{L}
\end{aligned} \tag{2}$$
In the power flow constraints of the distribution system, $P_{\mathrm{total}}$ and $Q_{\mathrm{total}}$ denote the total active and reactive power injections at node i, $U_i$ and $U_j$ the voltage magnitudes at nodes i and j, $G_{ij}$ and $B_{ij}$ the conductance and susceptance of the branch connecting nodes i and j, and $\theta_{ij}$ the voltage angle difference between these nodes. $P_i^{G}$, $P_i^{PV}$, and $P_i^{L}$ represent the active power of the generator, photovoltaic unit, and load at node i, respectively, while $Q_i^{G}$, $Q_i^{PV}$, and $Q_i^{CB}$ represent the reactive power from the generator, photovoltaic unit, and capacitor bank, and $Q_i^{L}$ is the reactive power demand at node i.
$$\begin{aligned}
&\left(p_{m,t}^{PV}\right)^{2} + \left(q_{m,t}^{PV}\right)^{2} \le \left(s_{m}^{PV}\right)^{2}\\
&P_{m,\min}^{PV} \le p_{m,t}^{PV} \le P_{m,\max}^{PV}\\
&Q_{m,\min}^{PV} \le q_{m,t}^{PV} \le Q_{m,\max}^{PV}\\
&0 \le q_{k,t}^{CB} \le Q_{k,\max}^{CB}\\
&V_{i,\min} \le v_{i,t} \le V_{i,\max}
\end{aligned} \tag{3}$$
For the equipment operational constraints, $p_{m,t}^{PV}$, $q_{m,t}^{PV}$, and $s_m^{PV}$ denote the active power, reactive power, and apparent power capacity of the m-th PV unit at time t, respectively; $q_{k,t}^{CB}$ denotes the reactive power supplied by the k-th capacitor bank at time t; and $v_{i,t}$ is the voltage magnitude at node i at time t, bounded by $V_{i,\min}$ and $V_{i,\max}$.
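As a rough illustration of how the power flow constraints of Equation (2) are enforced in practice, the sketch below runs a Newton power flow with PYPOWER and reads back bus voltages and branch losses. `case30` is only a stand-in: the paper's modified IEEE 33-bus feeder would be supplied as a custom PYPOWER case dictionary.

```python
from pypower.api import case30, ppoption, runpf

ppc = case30()                                     # stand-in for the 33-bus case
ppopt = ppoption(PF_ALG=1, VERBOSE=0, OUT_ALL=0)   # Newton's method, silent output

results, success = runpf(ppc, ppopt)
assert success, "power flow did not converge"

vm = results['bus'][:, 7]                          # VM column: voltage magnitude (p.u.)
pf, pt = results['branch'][:, 13], results['branch'][:, 15]
total_loss = (pf + pt).sum()                       # active power lost on all branches (MW)
print(f"V range: {vm.min():.4f}-{vm.max():.4f} p.u., loss: {total_loss:.3f} MW")
```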

2.3. Safety Constraints for Reactive Power Optimization

The secure operation constraints include voltage constraints for all bus nodes and current constraints for network branches. Under safe operating conditions, the bus voltage magnitudes and branch currents must satisfy the following conditions:
$$\begin{aligned}
0.95 \le U_{i} \le 1.05, \quad &\forall\, i \in \mathcal{N}\\
I_{il} \le I_{il}^{\max}, \quad &\forall\, il \in \Omega
\end{aligned} \tag{4}$$
where $U_i$ denotes the bus voltage magnitude at node i, $\mathcal{N}$ the set of nodes, $I_{il}$ the current flowing through branch il, $I_{il}^{\max}$ the maximum allowable current of that branch, and $\Omega$ the set of branches.

2.4. Constrained Markov Decision Process

A CMDP is a mathematical framework used to model decision-making problems where an agent aims to maximize a cumulative reward while satisfying certain constraints. It extends the standard Markov Decision Process (MDP) by incorporating constraints that must be satisfied throughout the decision-making process. In the context of reactive power optimization, CMDP is suitable for addressing the voltage security control problem, which involves optimizing control actions, while ensuring safe operation of the distribution network.
The reactive power and voltage security control problem is formulated as a CMDP, where the device scheduling center is regarded as an agent. We define the CMDP as a tuple S , A , P , R , C . As illustrated in Figure 2, the agent interacts with the distribution network environment within the established CMDP framework. The distribution network, acting as the environment, receives actions from the agent and updates its state based on system dynamics and constraint conditions. The new state is then fed back to the agent, and the environment generates both reward and cost signals based on the agent’s action and the resulting state.

2.4.1. State Space

This study leverages GCNs to extract spatial graph features such as node voltages and power-related information. In the CMDP framework, the state space is jointly defined by the node feature matrix and the topological adjacency matrix, denoted $G(C, \Lambda)$. The node feature matrix $C = [O_{1,t}, O_{2,t}, \ldots, O_{N,t}]$, with $O_{i,t} = [V_{i,t}, P_{i,t}^{L}, Q_{i,t}^{L}, Q_{i,t}^{PV}, Q_{i,t}^{CB}]$, collects the voltage magnitude of each node, the active and reactive load power, the reactive power output of the PV unit, and the reactive power support of the capacitor bank, together with branch current magnitudes and the operational states of all controllable devices. The adjacency matrix $\Lambda$ captures the connectivity among nodes in the distribution network, thereby reflecting the system’s topological structure.
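A minimal sketch of how the graph state $G(C, \Lambda)$ could be assembled from per-node measurements and the feeder's branch list; the function and argument names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def build_state(vm, p_load, q_load, q_pv, q_cb, edges, n_nodes):
    """Node feature matrix C (row O_{i,t} per node) and adjacency matrix Lambda."""
    features = np.stack([vm, p_load, q_load, q_pv, q_cb], axis=1)  # shape (N, 5)
    adj = np.zeros((n_nodes, n_nodes))
    for i, j in edges:                    # radial feeder treated as undirected graph
        adj[i, j] = adj[j, i] = 1.0
    return features, adj
```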

2.4.2. Action Space

The action space $A = [A_{PV}, A_{CB}]$ consists of the control actions for reactive power compensation: the reactive power output adjustments of the PV inverters and the switching operations of the capacitor banks. Here, $A_{PV}$ is the action set of the PV inverters; since the inverters support continuous control, each inverter action is defined within the range $[-1, 1]$. In contrast, capacitor banks (CBs) are discrete devices, so the action variables in $A_{CB}$ take binary values in $\{0, 1\}$. Moreover, operating a discrete device not only affects the system state but also incurs a corresponding action cost.
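One plausible way to decode the agent's raw output vector into device commands is sketched below: continuous entries in [−1, 1] are scaled to inverter reactive setpoints, and the capacitor-bank entries are thresholded to on/off states. The 0.5 threshold and all names are assumptions for illustration.

```python
import numpy as np

def decode_action(raw, q_pv_max, n_pv, n_cb):
    """Split a flat action vector into PV setpoints (MVar) and CB on/off states."""
    a_pv = np.clip(raw[:n_pv], -1.0, 1.0) * q_pv_max   # continuous reactive output
    a_cb = (raw[n_pv:n_pv + n_cb] > 0.5).astype(int)   # discrete switching {0, 1}
    return a_pv, a_cb
```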

2.4.3. Transition Probability

The transition probability describes the probabilistic relationship between the current state, taken action, and the next state. It is determined by the system dynamics, including the power flow equations and equipment operational characteristics. In power systems, the evolution of the environment and its state transitions are determined by the system model. In this study, the PYPOWER simulation tool is employed to model and analyze the state transition probabilities within the power system.

2.4.4. Reward Function

The reward function is designed to reflect the optimization objectives, including the voltage deviation from reference values, active power losses, and the cost of device regulation. The objective of the agent’s action is to maintain the voltage within a secure range around the reference level, while ensuring that inverters and capacitor banks provide appropriate reactive power for voltage regulation. The reward function can be defined as follows:
$$R = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_{t}\right] \tag{5}$$
$$r_{t} = -\left[ \frac{1}{N} \sum_{i \in \mathcal{N}} \left| v_{i,t} - v_{\mathrm{ref}} \right| + \sum_{(i,j) \in \Omega} r_{ij} I_{ij}^{2} + \sum_{a} c_{a}\, \mathbb{1}(d_{t}^{a}) \right] \tag{6}$$
where $\gamma$ is the discount factor, which determines the importance assigned to future rewards, and $r_t$ is the reward at time t; it is the negative of the per-step optimization objective in Equation (1), so maximizing the return minimizes the objective.

2.4.5. Constraints

The CMDP incorporates safety constraints, including the voltage magnitude limits at all bus nodes. These constraints are critical for ensuring the secure operation of the distribution network. Therefore, an additional set of constraint cost functions is introduced as follows:
$$R_{c} = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} c_{t}\right] \le d \tag{7}$$
where $c_{t} = \sum_{i \in \mathcal{N}} \left[ \max\left(0,\, V_{i,t} - 1.05\right) + \max\left(0,\, 0.95 - V_{i,t}\right) \right]$ and d denotes the predefined constraint threshold.
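The per-step cost counts only the magnitude by which node voltages leave the [0.95, 1.05] p.u. band, so it is exactly zero whenever operation is safe; a short sketch, assuming voltages as a NumPy array:

```python
import numpy as np

def step_cost(vm, v_min=0.95, v_max=1.05):
    """c_t = sum_i max(0, V_i - 1.05) + max(0, 0.95 - V_i)."""
    return float(np.maximum(0.0, vm - v_max).sum()
                 + np.maximum(0.0, v_min - vm).sum())

print(step_cost(np.array([1.06, 1.00, 0.93])))  # 0.01 + 0.02 = 0.03
```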
The detailed interaction process between the agent and the environment is illustrated in Figure 3. At each time step, the agent first perceives the current state of the distribution network environment, which includes key information such as node voltages, PV states, and capacitor states. Based on the current state, the agent selects an action a according to its decision-making policy. This action influences the reactive power outputs of the PV inverters, as well as the reactive power compensation provided by the capacitor banks. Subsequently, the environment updates its internal state according to the action, including changes in power flow, PV status, and capacitor status. The updated states are then fed back to the agent, along with corresponding reward and cost signals.

3. Reactive Power Optimization Method Based on Graph Security Reinforcement Learning

3.1. Security Graph DDPG Algorithm

The conventional DDPG algorithm employs random exploration during training to expand the search space. However, in the context of voltage control, the actions taken by the agent are directly applied to the distribution network, which may lead to safety issues such as voltage violations. Therefore, it is necessary to further constrain the actions output by the agent’s policy network. This paper proposes an improved approach that incorporates a constraint critic network into the deep policy network. The constraint critic evaluates the safety cost of the actions generated by the actor network and restricts them to a low-cost region, enabling the agent to accumulate experience that adheres to the safety constraints during training. The neural network architecture of the proposed algorithm is illustrated in Figure 4, and a detailed description of SG-DDPG is given in Algorithm 1.
Algorithm 1 Safe Graph DDPG algorithm.
Require: Hyperparameters: EPISODES, STEPS, TEST, LR_ACTOR, LR_CRITIC, dual_variable, dual_variable_lr, obs_dim, act_dim, memory_capacity, batch_size, var, $\gamma$, $\tau$, constraint_limit
Ensure: Trained Actor network $\pi$
 1: Initialize Actor network $\pi$, Reward Critic network $Q^{R}_{\omega_R}$, and Cost Critic network $Q^{C}_{\omega_C}$
 2: Initialize target networks $\pi'$, $Q^{R}_{\omega'_R}$, and $Q^{C}_{\omega'_C}$
 3: Initialize replay buffer $B$
 4: Initialize dual variable $\lambda = 0$
 5: for episode = 1 to EPISODES do
 6:     Initialize environment and obtain initial state $s_0$
 7:     for step = 1 to STEPS do
 8:         Select action $a_t = \pi(s_t)$
 9:         Execute action $a_t$; observe reward $r_t$, cost $c_t$, and next state $s_{t+1}$
10:         Store transition $(s_t, a_t, r_t, c_t, s_{t+1})$ in $B$
11:         Update current state $s_t \leftarrow s_{t+1}$
12:         if episode > memory_warmup_steps then
13:             Sample a random mini-batch of transitions from $B$
14:             Update Reward Critic network by minimizing the loss in Equation (12)
15:             Update Cost Critic network by minimizing the loss in Equation (14)
16:             Update Actor network by ascending $\nabla_{\Psi} J_{\pi}$ (Equation (9))
17:             Soft update target networks:
18:                 $\Psi' \leftarrow \tau \Psi + (1 - \tau)\Psi'$
19:                 $\omega'_R \leftarrow \tau \omega_R + (1 - \tau)\omega'_R$
20:                 $\omega'_C \leftarrow \tau \omega_C + (1 - \tau)\omega'_C$
21:             Update dual variable $\lambda$ based on the constraint violation (Equations (16)–(19))
22:         end if
23:     end for
24:     Evaluate the performance of the current policy
25:     if performance meets criteria then
26:         Save the current Actor network parameters
27:     end if
28: end for
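To connect line 16 of Algorithm 1 with Equation (9): the actor ascends $\mathbb{E}[Q^R - \lambda Q^C]$, which in an automatic-differentiation framework is implemented by descending its negative. A hedged PyTorch sketch follows; `actor`, `critic_r`, `critic_c`, and `lam` are assumed names, not the authors' code.

```python
import torch

def actor_update(actor, critic_r, critic_c, actor_opt, states, lam):
    """One primal step: maximize E[Q_R(s, pi(s)) - lambda * Q_C(s, pi(s))]."""
    actions = actor(states)
    loss = -(critic_r(states, actions) - lam * critic_c(states, actions)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()
```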

3.2. Graph Convolutional Network

GCNs have emerged as a powerful tool for processing graph-structured data. Unlike traditional neural networks designed for grid-like data (e.g., images), GCNs can effectively handle irregular graph structures, making them suitable for applications such as social network analysis, recommendation systems, and power grid optimization.
The input to a GCN is a graph $G(C, \Lambda)$, where C is the set of nodes and $\Lambda$ the set of edges. Each node $i \in C$ has an associated feature vector $H_i$. The graph structure is typically encoded by an adjacency matrix $\mathbf{A}$, where $A_{ij} = 1$ if there is an edge between nodes i and j and 0 otherwise, together with a degree matrix $\mathbf{D}$, a diagonal matrix whose entry $D_{ii}$ is the degree of node i (i.e., the number of edges connected to node i).
A GCN consists of multiple layers, each performing a graph convolution operation. The core idea behind GCN layers is to aggregate information from a node’s neighbors and update the node’s features based on this aggregated information. The propagation rule for each layer can be described as follows:
$$H^{(l+1)} = \mathrm{ReLU}\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right) \tag{8}$$
where $H^{(l)}$ is the feature matrix of the l-th layer, each row being the feature vector of one node; $\tilde{A} = A + I$ is the adjacency matrix with added self-loops so that a node’s own features are included; $\tilde{D}$ is the degree matrix of $\tilde{A}$; $W^{(l)}$ is the trainable weight matrix of the l-th layer; and ReLU is the rectified linear unit activation function, which introduces non-linearity into the model.
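A compact PyTorch rendering of the propagation rule in Equation (8), with self-loops added and symmetric degree normalization; this follows the standard GCN formulation rather than the authors' exact layer.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """H' = ReLU(D~^(-1/2) (A + I) D~^(-1/2) H W), as in Equation (8)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h, adj):
        a_tilde = adj + torch.eye(adj.size(0))       # add self-loops
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)    # D~^(-1/2) as a vector
        norm = d_inv_sqrt.unsqueeze(1) * a_tilde * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm @ self.weight(h))
```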

3.3. Actor Network

The agent’s actor network, also referred to as the policy network, implements a parameterized mapping from observed states to actions: it takes the observed state of the environment as input and computes the corresponding action by forward propagation. The deterministic policy is denoted $\mu$; the trainable parameters of the online policy network are denoted $\Psi$, and those of the target policy network $\Psi'$. The primary objective of the actor network is to maximize the policy objective function, which is associated with the expected long-term cumulative reward, guiding the agent toward the policy that yields the highest possible return over time. The mathematical formulation is given as follows:
$$J_{\pi}(\mu) = \mathbb{E}\left[ Q_{\omega_{R}}^{R}\big(s, \mu(s)\big) - \lambda\, Q_{\omega_{C}}^{C}\big(s, \mu(s)\big) \right] \tag{9}$$
where $\mathbb{E}[\cdot]$ denotes the expectation operator, $Q_{\omega_R}^{R}$ represents the expected reward value (Q-value), $Q_{\omega_C}^{C}$ denotes the expected cost value, and $\lambda$ is the dual variable introduced in Section 3.5. The two Q-values are approximated by the parameterized reward critic and cost critic networks, respectively.
The parameters of the online and target policy networks are updated according to the following rules:
$$\Psi \leftarrow \Psi + \eta \nabla_{\Psi} J_{\pi}(\Psi) \tag{10}$$
$$\Psi' \leftarrow \tau \Psi + (1 - \tau)\Psi' \tag{11}$$
where $\eta$ is the learning rate and $\tau$ is a soft update parameter.
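Equation (11) is the familiar Polyak averaging of target parameters; a minimal sketch that applies equally to the actor and to both critics:

```python
import torch

@torch.no_grad()
def soft_update(target, online, tau=0.002):
    """Psi' <- tau * Psi + (1 - tau) * Psi', Equation (11)."""
    for p_tgt, p in zip(target.parameters(), online.parameters()):
        p_tgt.mul_(1.0 - tau).add_(tau * p)
```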

3.4. Reward Critic and Cost Critic Networks

In the DDPG algorithm, to ensure that the policy network parameters are updated in the direction of maximizing cumulative rewards, a critic network is typically employed to evaluate the quality of the actions taken by the actor network. The critic network calculates the action-state Q-value, which represents the expected cumulative reward for taking a specific action in a given state. Similarly to the actor network, the critic network also utilizes a primary network and a target network for stable parameter updates.
In this study, in addition to the reward critic network that computes Q-values, a cost critic network has been established to calculate cost constraints. The cost critic network shares the same architecture as the reward critic network but is trained using cost-related signals instead of reward signals. This dual critic network setup allows for a more comprehensive evaluation of both the benefits and costs associated with different actions, enabling the agent to make more informed decisions that balance reward maximization and cost minimization.
The reward critic network is trained by minimizing the squared error between predicted and target values. The parameters $\omega_R$ of the primary reward critic network are updated by minimizing the following mean squared error (MSE) loss function:
$$L(\omega_{R}) = \mathbb{E}\left[ \left( r + \gamma\, Q_{\omega_{R}'}^{R}\big(s', \mu'(s')\big) - Q_{\omega_{R}}^{R}(s, a) \right)^{2} \right] \tag{12}$$
The target reward critic network performs soft updates through the soft update parameter $\tau$:
$$\omega_{R}' \leftarrow \tau \omega_{R} + (1 - \tau)\omega_{R}' \tag{13}$$
Similarly, the cost critic network is updated by minimizing the MSE loss function of the cost:
$$L(\omega_{C}) = \mathbb{E}\left[ \left( c + \gamma\, Q_{\omega_{C}'}^{C}\big(s', \mu'(s')\big) - Q_{\omega_{C}}^{C}(s, a) \right)^{2} \right] \tag{14}$$
The target cost critic network likewise performs soft updates through the soft update parameter $\tau$:
$$\omega_{C}' \leftarrow \tau \omega_{C} + (1 - \tau)\omega_{C}' \tag{15}$$
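The two critic updates in Equations (12) and (14) share the same temporal-difference structure and differ only in the training signal (reward versus cost). A sketch under assumed names, with the mini-batch given as a dictionary of tensors:

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, critic_tgt, actor_tgt, batch, signal, gamma=0.9):
    """MSE between Q(s, a) and signal + gamma * Q'(s', mu'(s')), Eqs. (12)/(14).
    Pass rewards as `signal` for the reward critic, costs for the cost critic."""
    s, a, s_next = batch['s'], batch['a'], batch['s_next']
    with torch.no_grad():
        target = signal + gamma * critic_tgt(s_next, actor_tgt(s_next))
    return F.mse_loss(critic(s, a), target)
```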

3.5. Dual Variable

During the optimization process, we introduce a dual variable and adjust the policy by computing a dual gradient so that the constraint conditions are satisfied. The dual gradient reflects the discrepancy between the incurred cost and the predefined constraint limit; whenever a cost exceeds the safety threshold d, the corresponding dual variable must be corrected. To strictly ensure voltage security at every node, we set d = 0 in this study. To maintain the stability of the dual variable updates, we adopt the mechanism defined by Equations (16)–(19): at each update step, the average deviation between the current cost $R_c$ and the constraint limit d is computed, and the dual variable is adjusted using a predefined learning rate $\alpha$. Specifically, we first compute the difference between the cost and the constraint limit:
$$\nabla\lambda = R_{c} - d \tag{16}$$
The computed dual gradient is then clamped with a lower bound of zero:
$$\nabla\lambda \leftarrow \max(\nabla\lambda,\, 0) \tag{17}$$
This clamping is necessary because when the cost is below the constraint limit (i.e., $\nabla\lambda < 0$), the constraint is already satisfied and no adjustment of the dual variable is needed; the operation ensures that the dual variable is updated only when the cost exceeds the constraint threshold. Subsequently, the dual variable is updated using the mean of the dual gradients:
$$\lambda \leftarrow \lambda + \alpha\, \mathbb{E}[\nabla\lambda] \tag{18}$$
where $\alpha$ denotes the learning rate of the dual variable, which controls the step size of the update. After the update, the dual variable must remain non-negative, since it represents a non-negative multiplier reflecting the degree of constraint violation; maintaining its non-negativity preserves both its physical interpretation and its mathematical validity:
$$\lambda \leftarrow \max(0,\, \lambda) \tag{19}$$
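Taken together, Equations (16)–(19) amount to a projected gradient-ascent step on the dual variable; a minimal sketch, assuming the batch costs arrive as a tensor and `constraint_limit` plays the role of d:

```python
import torch

def dual_update(lam, batch_costs, constraint_limit=0.0, lr=1e-3):
    """grad = Rc - d, clamped at 0; ascend; project back to lambda >= 0."""
    grad = torch.clamp(batch_costs - constraint_limit, min=0.0)  # Eqs. (16)-(17)
    lam = lam + lr * grad.mean().item()                          # Eq. (18)
    return max(0.0, lam)                                         # Eq. (19)
```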

4. Case Study

4.1. Introduction to Testing System

To verify the effectiveness of the proposed method, simulation experiments were conducted on a modified IEEE 33-bus distribution system, as illustrated in Figure 5. Two PV inverters were installed at buses 8 and 13, each with a capacity of 3 MVA, and two controllable capacitor banks were placed at buses 6 and 32, each with a capacity of 1 MVar. The voltage reference was set to 10.5 kV, and the voltage safety range was defined as [0.95 p.u., 1.05 p.u.].
The experimental simulations were implemented using PyCharm 2023.1.2, with the PyTorch framework, Python 3.11, and the PyPower 5.1.16 library. The hardware platform used for training and testing included an NVIDIA RTX 4060 GPU (NVIDIA Corporation, Santa Clara, CA, USA) and a 12th Gen Intel(R) Core(TM) i7-12650 processor (Intel Corporation, Santa Clara, CA, USA) operating at 2.30 GHz.
The experiment used a 15-min scheduling interval; load and PV output data were collected from a regional power grid at the same interval over three typical days as training data, as shown in Figure 6 and Figure 7. The scheduling period was one day, equivalent to 96 time steps. The control hyperparameters of the SG-DDPG algorithm are listed in Table 1.

4.2. Result and Analysis

To demonstrate the advantages of the proposed method, we compared four methods: Without RPO (no reactive power control), DDPG without GCN (basic reinforcement learning without topology modeling), TD3 (an enhanced DDPG with twin critics and delayed updates), and the proposed SG-DDPG (which incorporates both safety constraints and graph-based feature extraction). Without RPO serves as a reference point that highlights the necessity of reactive power optimization, especially under fluctuating load and generation conditions. DDPG and TD3 represent general reinforcement learning approaches that help assess the effectiveness of DRL-based RPO strategies but do not account for spatial features. After 1000 training episodes, the training rewards of the three reinforcement learning algorithms are shown in Figure 8.
As shown in Figure 8, the Safe Graph DDPG (SG-DDPG) algorithm proposed in this study exhibited a consistent training trend with the two comparative algorithms. As the number of training episodes increased, the agent continuously learned control strategies satisfying the voltage security constraints, leading to a gradual convergence of rewards. The proposed method achieved higher cumulative rewards and demonstrated improved convergence stability compared to the DDPG and TD3 algorithms.
To quantitatively evaluate the training effectiveness, three comparison metrics were defined: voltage deviation, network loss, and voltage violation rate. Voltage deviation measures the absolute value of the deviation of node voltages from the reference voltage at each time step during the control period, whereby a smaller value indicates greater voltage stability. Network loss quantifies the total active power loss across the network during the control period. The voltage violation rate is defined as the proportion of time steps in which any node voltage exceeds the safety threshold over the total number of time steps in each episode.
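Given logged per-step voltages and losses from the simulation environment, the three metrics could be computed along the following lines; array shapes and names are assumptions for illustration.

```python
import numpy as np

def evaluate_episode(vm_log, loss_log, v_ref=1.0, v_min=0.95, v_max=1.05):
    """vm_log: (T, N) voltages in p.u.; loss_log: (T,) active power losses in MW."""
    voltage_dev = np.abs(vm_log - v_ref).mean()              # average deviation
    network_loss = loss_log.sum()                            # total active loss
    unsafe = ((vm_log < v_min) | (vm_log > v_max)).any(axis=1)
    violation_rate = unsafe.mean()                           # fraction of unsafe steps
    return voltage_dev, network_loss, violation_rate
```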
Figure 9 and Figure 10 illustrate the voltage magnitudes at all 33 buses over a typical day under the four different control strategies. Without effective voltage regulation, the system experienced widespread voltage violations. Although the application of the DDPG and TD3 algorithms mitigated some of these violations, a number of limit breaches still persisted. In contrast, the proposed safety-aware graph-based DDPG algorithm effectively regulated the control devices, maintaining all bus voltages within the secure range of [0.95, 1.05] p.u. This highlights the reliability of the proposed method in ensuring voltage safety. Detailed evaluation results are provided in Table 2.
Figure 11 presents the total active power losses of the 33-bus system over a daily scheduling horizon under four different control strategies. The results indicate that all three reinforcement learning-based reactive power control methods significantly reduced network losses compared to the conventional approach. Notably, the proposed method achieved the lowest active power loss, while effectively maintaining voltage security, demonstrating its superior efficiency and constraint-awareness.
To determine the optimal GCN depth, we conducted an ablation experiment comparing models with 1-, 2-, 3-, and 4-layer GCN configurations. Under identical training conditions, each model was trained multiple times; the results are presented in Figure 12 and Figure 13. Figure 12 shows the reward curves during training for each GCN variant, while Figure 13 summarizes the peak reward, training time, and number of episodes to convergence. As illustrated, the 2-layer GCN model achieved the highest reward and the fastest learning, converging in approximately 45 min and 100 episodes. Although deeper GCNs could extract more detailed features, they offered negligible reward gains at the cost of substantially longer training times. Conversely, the single-layer GCN performed significantly worse: despite enabling basic feature extraction, its limited topological awareness slowed convergence and reduced the final reward.
These findings highlight the effectiveness of integrating GCN-based feature extraction with a cooperative reward mechanism to enhance the agent’s representational capacity. The ablation study confirmed that each component independently contributed to performance gains, and their combined implementation produced the most robust and efficient control strategy.

5. Discussion and Conclusions

This paper has proposed an SG-DDPG-based voltage control method that integrates safety constraints into the reinforcement learning framework and leverages graph-based state representations to capture the topological characteristics of power networks. Four strategies have been compared: Without RPO (a baseline without any reactive power control), DDPG (without GCN), TD3, and the proposed SG-DDPG. DDPG served as a fundamental deterministic policy gradient algorithm for continuous control, while TD3 improved upon DDPG by incorporating a double Q-network and delayed policy updates to mitigate overestimation bias. However, neither DDPG nor TD3 explicitly enforced action constraints, which often resulted in non-zero voltage violation rates, even after convergence. In contrast, SG-DDPG incorporated a dual-variable safety layer that penalized voltage deviations exceeding ± 5 % of the nominal value, thereby guiding the learned policy to maintain all bus voltages within the [0.95, 1.05] p.u. range. The DDPG method (without GCN) reduced both power losses and voltage violations but still exceeded voltage limits in 4.9% of time steps. The TD3 algorithm further improved the stability but failed to eliminate violations entirely, resulting in a 5.1% violation rate. In comparison, the proposed SG-DDPG achieved zero violations and the lowest active power loss (11.5 MW), demonstrating the effectiveness of combining safety constraints with graph-based feature representations. The experimental results confirmed the critical role of constraint-awareness, with DDPG and TD3 serving as representative DRL baselines. The No-RPO case further quantified the fundamental benefits of reactive power control. Moreover, the integration of GCN enhanced the spatial generalization capability, while the dual-variable mechanism ensured operational safety. Future work will explore dynamic graph partitioning and multi-agent coordination to further improve scalability and robustness.

Author Contributions

Conceptualization, X.Z. and X.G.; methodology, X.G.; software, P.S.; validation, X.G., P.S. and X.L. (Xing Li); formal analysis, X.Z.; investigation, X.Z.; resources, X.L. (Xinghua Liu); data curation, Y.Z.; writing—original draft preparation, X.Z.; writing—review and editing, X.G.; visualization, X.W.; supervision, C.D.; project administration, X.Z.; funding acquisition, X.L. (Xinghua Liu). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Basic Research Program of Shaanxi Province under Grant 2023-JC-ZD-38, and in part by the National Key Research and Development Program of China under Grant 2022YFB3305502 and Grant 2022YFB3305503.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are included in the article.

Acknowledgments

The authors appreciate the comments and suggestions by the editors and anonymous reviewers.

Conflicts of Interest

Authors Xu Zhang, Pei Sun, Xing Li and Yuan Zhang were employed by the company Gansu Tongxing Intelligent Technology Development Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Nomenclature

$CB_1, CB_2$: The 1st and 2nd capacitor banks installed in the distribution network
$PV_1, PV_2$: The 1st and 2nd photovoltaic inverters installed in the distribution network
$I_{ij}$: Branch current magnitude between nodes i and j
$v_{i,t}$: Voltage magnitude at node i at time t
$v_{\mathrm{ref}}$: Reference voltage value
$r_{ij}$: Resistance of the branch between nodes i and j
$c_a$: Switching cost of the a-th mechanical device
$A_{PV}, A_{CB}$: Action sets of PV inverters and capacitor banks
$\gamma$: Discount factor
$r_t$: Reward value at time t
$c_t$: Cost value at time t
$d$: Predefined constraint threshold
$\pi$: Policy network parameterized by trainable parameters
$\Psi, \Psi'$: Parameters of the online and target policy networks
$\nabla_{\Psi}$: Gradient with respect to the online policy network parameters
$\tau$: Soft update parameter
$\omega_R, \omega_C$: Critic network parameters for reward and cost
$\lambda$: Dual variable for constraint handling
$\nabla\lambda$: Gradient of the dual variable

References

  1. Singh, B.; Das, S. Adaptive Control for Undisturbed Functioning and Smooth Mode Transition in Utility-Interactive Wind–Solar Based AC/DC Microgrid. IEEE Trans. Power Electron. 2024, 39, 15011–15020. [Google Scholar] [CrossRef]
  2. Andersson, G.; Donalek, P.; Farmer, R.; Hatziargyriou, N.; Kamwa, I.; Kundur, P.; Martins, N.; Paserba, J.; Pourbeik, P.; Sanchez-Gasca, J.; et al. Causes of the 2003 major grid blackouts in North America and Europe, and recommended means to improve system dynamic performance. IEEE Trans. Power Syst. 2005, 20, 1922–1928. [Google Scholar] [CrossRef]
  3. Wang, C.; Mishra, C.; Centeno, V.A. A Scalable Method of Adaptive LVRT Settings Adjustment for Voltage Security Enhancement in Power Systems with High Renewable Penetration. IEEE Trans. Sustain. Energy 2022, 13, 440–451. [Google Scholar] [CrossRef]
  4. Naseem, H.; Seok, J.-K. Reactive Power Controller for Single Phase Dual Active Bridge DC–DC Converters. IEEE Access 2023, 11, 141537–141546. [Google Scholar] [CrossRef]
  5. Adegoke, S.A.; Sun, Y.; Wang, Z.; Stephen, O. A mini review on optimal reactive power dispatch incorporating renewable energy sources and flexible alternating current transmission system. Electr. Eng. 2024, 106, 3961–3982. [Google Scholar] [CrossRef]
  6. Naseem, H.; Seok, J.-K. Reactive Power Control to Minimize Inductor Current for Single Phase Dual Active Bridge DC/DC Converters. In Proceedings of the 2021 IEEE Energy Conversion Congress and Exposition (ECCE), Vancouver, BC, Canada, 10–14 October 2021; pp. 3261–3266. [Google Scholar]
  7. Mohammed, A.; Sakr, E.K.; Abo-Adma, M.; Elazab, R. A comprehensive review of advancements and challenges in reactive power planning for microgrids. Energy Inform. 2024, 7, 63. [Google Scholar] [CrossRef]
  8. Jabr, R.A.; Džafić, I. Sensitivity-Based Discrete Coordinate-Descent for Volt/VAr Control in Distribution Networks. IEEE Trans. Power Syst. 2016, 31, 4670–4678. [Google Scholar] [CrossRef]
  9. Dutta, A.; Ganguly, S.; Kumar, C. MPC-Based Coordinated Voltage Control in Active Distribution Networks Incorporating CVR and DR. IEEE Trans. Ind. Appl. 2022, 58, 4309–4318. [Google Scholar] [CrossRef]
  10. Liu, H.; Wu, W.; Wang, Y. Bi-Level Off-Policy Reinforcement Learning for Two-Timescale Volt/VAR Control in Active Distribution Networks. IEEE Trans. Power Syst. 2023, 38, 385–395. [Google Scholar] [CrossRef]
  11. Farivar, M.; Low, S.H. Branch Flow Model: Relaxations and Convexification—Part I. IEEE Trans. Power Syst. 2013, 28, 2554–2564. [Google Scholar] [CrossRef]
  12. Gholizadeh-Roshanagh, R.; Zare, K.; Marzband, M. An A-Posteriori Multi-Objective Optimization Method for MILP-Based Distribution Expansion Planning. IEEE Access 2020, 8, 60279–60292. [Google Scholar] [CrossRef]
  13. Kaur, S.; Kumbhar, G.; Sharma, J. A MINLP technique for optimal placement of multiple DG units in distribution systems. Int. J. Electr. Power Energy Syst. 2014, 63, 609–617. [Google Scholar] [CrossRef]
  14. Byeon, G.; Kim, K. Distributionally Robust Decentralized Volt-Var Control with Network Reconfiguration. IEEE Trans. Smart Grid 2024, 15, 4705–4718. [Google Scholar] [CrossRef]
  15. Anilkumar, R.; Devriese, G.; Srivastava, A.K. Voltage and Reactive Power Control to Maximize the Energy Savings in Power Distribution System with Wind Energy. IEEE Trans. Ind. Appl. 2018, 54, 656–664. [Google Scholar] [CrossRef]
  16. Padilha-Feltrin, A.; Rodezno, D.A.Q.; Mantovani, J.R.S. Volt-VAR Multiobjective Optimization to Peak-Load Relief and Energy Efficiency in Distribution Networks. IEEE Trans. Power Deliv. 2015, 30, 618–626. [Google Scholar] [CrossRef]
  17. Qiu, W.; Yadav, A.; You, S.; Dong, J.; Kuruganti, T.; Liu, Y.; Yin, H. Neural Networks-Based Inverter Control: Modeling and Adaptive Optimization for Smart Distribution Networks. IEEE Trans. Sustain. Energy 2024, 15, 1039–1049. [Google Scholar] [CrossRef]
  18. Alonso, A.M.S.; Arenas, L.D.O.; Brandao, D.I.; Tedeschi, E.; Marafao, F.P. Integrated Local and Coordinated Overvoltage Control to Increase Energy Feed-In and Expand DER Participation in Low-Voltage Networks. IEEE Trans. Sustain. Energy 2022, 13, 1049–1061. [Google Scholar] [CrossRef]
  19. Zhang, Y.; Wang, X.; Wang, J.; Zhang, Y. Deep Reinforcement Learning Based Volt-VAR Optimization in Smart Distribution Systems. IEEE Trans. Smart Grid 2021, 12, 361–371. [Google Scholar] [CrossRef]
  20. Gu, Y.; Cheng, Y.; Chen, C.L.P.; Wang, X. Proximal Policy Optimization with Policy Feedback. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 4600–4610. [Google Scholar] [CrossRef]
  21. Cao, D.; Zhao, J.; Hu, W.; Yu, N.; Ding, F.; Huang, Q.; Chen, Z. Deep Reinforcement Learning Enabled Physical-Model-Free Two-Timescale Voltage Control Method for Active Distribution Systems. IEEE Trans. Smart Grid 2022, 13, 149–165. [Google Scholar] [CrossRef]
  22. Wang, R.; Bi, X.; Bu, S. Real-Time Coordination of Dynamic Network Reconfiguration and Volt-VAR Control in Active Distribution Network: A Graph-Aware Deep Reinforcement Learning Approach. IEEE Trans. Smart Grid 2024, 15, 3288–3302. [Google Scholar] [CrossRef]
  23. Guo, C.; Luk, W. FPGA-Accelerated Sim-to-Real Control Policy Learning for Robotic Arms. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 1690–1694. [Google Scholar] [CrossRef]
  24. Liu, Y.-C.; Huang, C.-Y. DDPG-Based Adaptive Robust Tracking Control for Aerial Manipulators with Decoupling Approach. IEEE Trans. Cybern. 2022, 52, 8258–8271. [Google Scholar] [CrossRef] [PubMed]
  25. Kou, P.; Liang, D.; Wang, C.; Wu, Z.; Gao, L. Safe deep reinforcement learning-based constrained optimal control scheme for active distribution networks. Appl. Energy 2020, 264, 114772. [Google Scholar] [CrossRef]
  26. Li, C.; Jin, C.; Sharma, R. Coordination of PV smart inverters using deep reinforcement learning for grid voltage regulation. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; pp. 1930–1937. [Google Scholar]
  27. Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Volume 4, pp. 2587–2601. [Google Scholar]
  28. Liang, Q.; Que, F.; Modiano, E. Accelerated primal-dual policy optimization for safe reinforcement learning. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 3–8 December 2018; pp. 1–7. [Google Scholar]
  29. Khalaf, M.; Ayad, A.; Tushar, M.H.K.; Kassouf, M.; Kundur, D. A Survey on Cyber-Physical Security of Active Distribution Networks in Smart Grids. IEEE Access 2024, 12, 29414–29444. [Google Scholar] [CrossRef]
  30. Liu, X.; Wang, X.; Fan, B.; Xiao, G.; Wen, S.; Chen, B.; Wang, P. Multi-Agent Primal Dual DDPG based Reactive Power Optimization of Active Distribution Networks via Graph Reinforcement Learning. IEEE Internet Things J. 2025. [Google Scholar] [CrossRef]
  31. Liu, X.; Wang, X.; Fan, B.; Li, B.; Deng, R.; Kong, M. DDPG-based Reactive Power Optimization Strategy For Active Distribution Network. In Proceedings of the 2024 7th International Conference on Intelligent Robotics and Control Engineering (IRCE), Xi’an, China, 7–9 August 2024; pp. 220–225. [Google Scholar]
Figure 1. System architecture diagram.
Figure 2. Block diagram of the CMDP process for a voltage-secure distribution network.
Figure 3. Illustration of the agent–environment interaction process.
Figure 4. The structure of the safe-DDPG algorithm.
Figure 5. Modified IEEE 33-bus system.
Figure 6. Photovoltaic active power profiles over three consecutive days.
Figure 7. Real-time load profiles over three consecutive days.
Figure 8. The learning process of agents under the three DRL approaches: Proposed, DDPG, and TD3.
Figure 9. Comparison of test results: Without RPO and DDPG.
Figure 10. Comparison of test results: TD3 and Proposed.
Figure 11. Active power loss over the 96 time intervals of the following day under the different methods.
Figure 12. Reward curves for training rounds under different GCN layers.
Figure 13. Comparison curves of rewards, training time, and convergence rounds under different GCN layers.
Table 1. Hyperparameter values.

Hyperparameter       Actor Network   Reward Critic Network   Cost Critic Network
Layer number         3               3                       3
Learning rate        1 × 10⁻⁵        1 × 10⁻⁴                1 × 10⁻⁴
Discount factor      0.9             0.9                     0.9
Soft update factor   0.002           0.002                   0.002
Table 2. Performance comparison of the four control methods.

Method        Average Voltage Deviation (p.u.)   Active Power Loss   Voltage Violation Rate
Without RPO   0.0386                             24.833 MW           36.55%
DDPG [31]     0.0131                             17.202 MW           4.92%
TD3 [27]      0.0250                             16.817 MW           5.14%
Proposed      0.0073                             11.544 MW           0.00%

