Topology-Aware Graph Reinforcement Learning for Voltage-Reactive Power Control in Grid-Connected Microgrids

Zhang, Yunfei; Bao, Kefan; Liang, Gaige; Zhuang, Wennan; Qiang, Longlong; Tang, Difei; Lu, Xiangyu; Zhang, Mingxiao

doi:10.3390/electricity7020060


by Yunfei Zhang ¹, Kefan Bao ¹,*, Gaige Liang ¹, Wennan Zhuang ², Longlong Qiang ¹, Difei Tang ², Xiangyu Lu ² and Mingxiao Zhang ²

¹ Xuzhou Power Supply Company, State Grid Jiangsu Electric Power Co., Ltd., Xuzhou 221000, China
² School of Electrical and Automation Engineering, Nanjing Normal University, Nanjing 210023, China
* Author to whom correspondence should be addressed.

Electricity 2026, 7(2), 60; https://doi.org/10.3390/electricity7020060 (registering DOI)

Submission received: 23 April 2026 / Revised: 14 June 2026 / Accepted: 20 June 2026 / Published: 22 June 2026


Abstract

As the global energy transition accelerates, distribution systems are integrating increasing shares of inverter-interfaced renewables, making reliable voltage support a key operational requirement. In grid-connected microgrids, especially weak radial feeders in rural and remote areas, voltage-reactive power (Volt/Var) control must coordinate multiple inverters under uncertainty from photovoltaic (PV) intermittency, load volatility, and point-of-common-coupling (PCC) disturbances. Existing droop, model-based optimization, and non-graph reinforcement learning (RL) approaches often rely on fixed rules or do not explicitly exploit electrical topology, which limits adaptive coordination. To address this gap, we propose a topology-aware graph reinforcement learning framework for voltage-reactive power control in grid-connected microgrids under uncertainty. The method encodes node states with a graph convolutional network (GCN) and learns coordinated PV/storage reactive-power actions via proximal policy optimization (PPO) with a multi-objective reward balancing voltage quality, control effort, and action smoothness. In a controlled comparison against a multilayer perceptron (MLP)-PPO baseline with identical action space, reward, and PPO objective, our method reduces voltage violation rate (VVR) from 0.0316 ± 0.0086 to 0.0048 ± 0.0019. Additional validation on a modified IEEE 33-bus feeder further reduces VVR from 0.00726 for MLP-PPO and 0.02999 for Droop control to 0.00095, supporting the effectiveness of topology-aware state representation on a larger radial benchmark feeder.

Keywords: microgrid; voltage control; reactive power control; graph reinforcement learning; topology awareness

1. Introduction

With the ongoing transition toward low-carbon energy systems, distributed energy resources such as photovoltaic (PV) generation and battery energy storage are being increasingly integrated into distribution networks, driving power systems from passive electricity delivery toward more flexible and actively coordinated operation [1]. In this context, grid-connected microgrids have emerged as an important operational paradigm for integrating distributed generation, storage units, and local loads, and are widely regarded as key building blocks of future smart grids [2]. This transition is also highly relevant to rural microgrids, where long radial feeders, relatively weak grid support, and high shares of inverter-interfaced resources often make voltage regulation more sensitive to operating uncertainty. However, the increasing penetration of inverter-interfaced renewable generation introduces pronounced uncertainty and time-varying operating conditions caused by PV intermittency, load variability, and disturbances propagated from the point of common coupling (PCC), which can aggravate voltage deviations, reactive power mismatch, and power quality issues in distribution-level systems [3]. A conceptual illustration of the grid-connected microgrid setting considered in this study is shown in Figure 1. Consequently, achieving efficient, coordinated, and adaptive voltage and reactive power (Volt/Var) control has become a critical requirement for the secure and stable operation of grid-connected microgrids.

Existing studies on Volt/Var control in microgrids can be broadly classified into local or decentralized control, model-based optimization, and learning-based approaches [2]. Among these, droop-based control and its variants have been widely adopted for inverter coordination because of their simple structure and low communication requirements, and they have been continuously improved through controller tuning and decentralized control designs [4,5,6]. Reactive power compensation has likewise long been recognized as a key requirement for maintaining acceptable voltage profiles and power quality in microgrids [7]. However, these methods are typically built on fixed control laws or local measurements, which may limit system-level coordination capability and adaptability under rapidly varying operating conditions. To enhance coordination performance, model-based optimization and predictive control have been extensively investigated for microgrid operation and energy management [1]. In particular, model predictive control (MPC) has attracted considerable attention because it can explicitly incorporate system dynamics, operational constraints, and multi-objective regulation [8]. Existing studies have applied MPC to optimal power flow in grid-connected microgrids with coupled active and reactive power regulation [9], while more recent work has extended this line to real-time secondary voltage control architectures [10]. Data-driven and distributionally robust Volt/Var optimization methods have also been introduced to better address uncertainty [11]. Nevertheless, as renewable generation and distributed resources continue to intensify voltage regulation challenges in active distribution systems [3], these approaches still rely heavily on explicit models, parameter identification, and forecast quality, which may constrain their adaptability in uncertain and time-varying environments. 
More recently, reinforcement learning (RL) has attracted increasing interest as an alternative for control under uncertainty. For example, RL has shown promising performance for robust voltage control in distribution grids with uncertain parameters and renewable variability [12], and deep RL (DRL) has also been introduced to improve inverter droop strategies in microgrids [13]. Meanwhile, graph-enhanced RL has begun to demonstrate potential in decentralized power-grid control tasks by exploiting the structural information of networked systems [14]. Despite these advances, existing learning-based studies rarely address coordinated Volt/Var control in grid-connected microgrids with explicit topology-aware state representation and uncertainty-oriented adaptive decision-making. Therefore, a topology-aware and adaptive Volt/Var control framework for grid-connected microgrids that can effectively coordinate multiple inverter-interfaced resources under uncertain operating conditions is still lacking.

To address this gap, we focus on the adaptive Volt/Var control problem in grid-connected microgrids under uncertain and time-varying operating conditions caused by PV fluctuations, load variations, and disturbances at the PCC. Under such conditions, maintaining satisfactory voltage profiles and reactive power balance is particularly challenging because multiple inverter-interfaced resources must respond to continuously changing system states in a coordinated manner. Therefore, the key research question we address is how to achieve adaptive and coordinated Volt/Var regulation in grid-connected microgrids without strong reliance on fixed control rules or highly accurate system models. Accordingly, the objective of this work is to improve the adaptive coordination capability of multiple inverter-interfaced resources for voltage/reactive power regulation under uncertainty.

To achieve this objective, we develop a topology-aware graph RL (GRL) approach for voltage-reactive power control in grid-connected microgrids. In the proposed framework, the microgrid is represented as a graph to explicitly capture the structural coupling among buses and inverter-interfaced resources. On this basis, graph-based state representation is integrated with RL-based policy learning to enable adaptive and coordinated control under uncertain and time-varying operating conditions. We further evaluate the proposed approach under multiple disturbance scenarios and benchmark it against conventional control strategies.

The main contributions of this paper are summarized as follows:

We formulate an uncertainty-aware coordinated Volt/Var control problem for grid-connected microgrids that explicitly captures the coupled impacts of PV intermittency, load variability, and PCC disturbances on multi-inverter regulation. This formulation shifts the focus from fixed local rules to system-level adaptive coordination and provides a unified problem setting for topology-informed sequential decision making.
We propose a topology-aware graph reinforcement learning mechanism that embeds electrical network coupling into policy learning through graph-based state encoding and PPO-based closed-loop control. In addition, we design a multi-objective reward that jointly accounts for voltage quality, reactive-power effort, and control smoothness to improve both regulation effectiveness and operational practicality.
We evaluate the proposed method under standard, disturbance-escalation, unseen mixed-disturbance, and IEEE 33-bus validation scenarios. Under the standard 5-bus scenario, the proposed method reduces AVD/VVR/RPF to 0.0138/0.0048/0.0008, while the IEEE 33-bus validation further shows lower AVD, VVR, MVD, and recovery time than both Droop control and non-graph MLP-PPO under the same larger-feeder protocol.

2. Related Work

Recent studies have shown that microgrid control is increasingly shaped by the need to maintain stability, power quality, and resilient operation in inverter-dominated environments with high renewable penetration [2,15,16,17,18]. Within this broad context, voltage regulation and reactive power coordination remain core technical issues, because converter-interfaced resources, storage units, and varying loads must be coordinated under dynamic operating conditions [6,19,20]. Existing reviews further indicate that, despite substantial progress in hierarchical, decentralized, and intelligent control architectures, voltage-reactive power regulation in microgrids is still challenging when uncertainty, strong coupling, and multiple control resources are simultaneously present [21,22]. This motivates a closer examination of control strategies that are directly relevant to voltage-reactive power control.

A first line of research addresses this problem through droop-based and model-driven control strategies. Recent droop-oriented studies have continued to improve voltage stability and reactive power sharing by tuning controller parameters or adapting control gains online [5,23]. Beyond local control, MPC and other optimization-based methods have been widely adopted for secondary voltage control, active/reactive power management, microgrid energy management, and uncertainty-aware Volt/Var optimization [10,11,24,25,26,27,28]. Related developments in active distribution network Volt/Var control also confirm the continued relevance of optimization-oriented formulations under renewable uncertainty [29]. These methods offer important advantages in handling operational constraints and coordinated regulation. However, they commonly depend on fixed control structures, parameter tuning, accurate system models, forecast quality, or repeated online optimization, which may reduce adaptability when operating conditions vary rapidly and system uncertainty increases.

To alleviate strong model dependence, RL and DRL have recently emerged as promising alternatives for control under uncertainty. Existing studies have applied RL to robust voltage regulation in uncertain grids [12], dynamic enhancement of droop control in microgrids [13], and broader data-driven operational control of microgrids [30]. Multi-agent RL has also been explored for coordinated voltage control in coupled distribution networks and multi-microgrid settings [31], while RL-based regulation has been extended to other control tasks in active distribution systems [32]. In parallel, recent reviews have emphasized the growing relevance of safe and uncertainty-aware RL for power-system operation and control [33,34]. Nevertheless, most existing learning-based studies target either general distribution networks, islanded microgrids, or broader operational tasks, rather than explicitly focusing on coordinated voltage-reactive power control in grid-connected microgrids. Moreover, the coordinated regulation of multiple inverter-interfaced resources is often not modeled in a way that fully exploits the structural coupling of the underlying electrical network [35].

This limitation has motivated growing interest in topology-aware learning and GRL for power-system control. Recent studies have incorporated graph attention or graph-based multi-agent RL into fast voltage regulation for PV-rich active distribution networks [36], graph-convolutional RL into decentralized grid dispatch [14], and graph-enhanced or attention-based RL into coordinated active/reactive power optimization and Volt/Var control in distribution systems [37,38,39]. Related work has further shown that graph neural networks and topology-aware learning can improve distributed voltage control, topology adaptation, and overvoltage management in renewable-dominated networks [40,41,42,43]. A recent survey on GRL for power grids also highlights the growing potential of graph-based representation learning for grid control tasks [44]. However, current graph-based studies are still mainly centered on active distribution networks, generic grid dispatch, or multi-microgrid coordination. A topology-aware GRL framework explicitly designed for coordinated voltage-reactive power control of multiple inverter-interfaced resources in grid-connected microgrids under uncertain operating conditions remains insufficiently explored. This gap motivates the present study.

3. Preliminaries

This section establishes the physical and mathematical foundations for topology-aware voltage-reactive power control in a grid-connected microgrid. Section 3.1 specifies the network configuration, component models, and operating assumptions; Section 3.2 formulates a constrained multi-objective control problem with explicit voltage-quality requirements and inverter operating limits; and Section 3.3 reformulates the control task as a Markov decision process by defining states, actions, transition dynamics, and rewards. Collectively, these preliminaries provide an explicit mapping from power-system dynamics to learnable control variables and establish the conceptual basis for the subsequent technical development in this paper.

3.1. System Description

We consider a radial grid-connected AC microgrid linked to the upstream utility grid through the PCC. As shown in Figure 2, the primary study system is modeled as a representative five-bus system for studying voltage-reactive power control under uncertain operating conditions. Specifically, Bus 1 denotes the PCC bus, Bus 2 is a load bus, Bus 3 is a PV inverter bus, Bus 4 is a battery energy storage system (BESS) inverter bus, and Bus 5 is a terminal load bus. This radial configuration reflects a typical feeder-based microgrid structure and provides a physically meaningful setting for analyzing interactions among multiple inverter-interfaced resources.

The utility grid exchanges power with the microgrid through the PCC, while local power interactions occur among renewable generation, storage resources, and load buses. The PV unit injects active power into the microgrid and can provide reactive power support within its inverter operating capability. The battery energy storage system is also interfaced through an inverter and serves as an additional flexible resource for local regulation. Meanwhile, the load buses represent time-varying power demand within the microgrid. Through feeder impedances and nodal power injections, each bus voltage profile is jointly influenced by upstream grid conditions, local renewable generation, storage support, and load behavior.

During operation, the microgrid is subject to multiple sources of uncertainty, including photovoltaic power fluctuations, load variations, and voltage disturbances propagated from the PCC. These factors jointly lead to time-varying voltage deviations and reactive power imbalances across the microgrid, especially at electrically sensitive buses. In this work, the PV inverter and storage inverter are regarded as the main controllable resources for voltage-reactive power control because they provide flexible reactive power support under varying operating conditions. Therefore, the considered system constitutes a representative physical setting for subsequent modeling and analysis of coordinated voltage-reactive power control in grid-connected microgrids.

3.2. Problem Definition

Based on the above system description, we consider an uncertainty-aware coordinated voltage-reactive power control problem for a grid-connected microgrid. The main challenge is that bus voltages are jointly affected by feeder coupling, photovoltaic power fluctuations, load variations, and disturbances propagated from the PCC, whereas the available control actions are limited to reactive power support from multiple inverter-interfaced resources. Therefore, the control task is to coordinate the reactive power outputs of the PV inverter and the storage inverter to maintain acceptable voltage profiles and alleviate reactive power imbalance under time-varying operating conditions. In the considered microgrid, the controllable input vector is defined as

$$u_t = \begin{bmatrix} Q_{pv,t} \\ Q_{es,t} \end{bmatrix}, \tag{1}$$

where $Q_{pv,t}$ and $Q_{es,t}$ denote the reactive power outputs of the PV inverter and the storage inverter at time step $t$, respectively. Let $N_b$ denote the set of voltage-regulated buses and $N_c$ denote the set of controllable inverter-interfaced resources. In this work, $N_c = \{\mathrm{PV}, \mathrm{ES}\}$, and $Q_{j,t}$ denotes the reactive power output of controllable resource $j \in N_c$ at time step $t$.

To characterize the above control objective, we define the following stage cost:

$$J_t = \alpha \sum_{i \in N_b} w_i \left(V_{i,t} - V^{ref}\right)^2 + \beta \sum_{j \in N_c} \left(\frac{Q_{j,t}}{Q_j^{max}}\right)^2 + \lambda_{\Delta q} \sum_{j \in N_c} \left(\frac{Q_{j,t} - Q_{j,t-1}}{\Delta Q_j^{scale}}\right)^2, \tag{2}$$

where $V_{i,t}$ is the voltage magnitude of bus $i$ at time step $t$, $V^{ref}$ is the nominal voltage reference, $w_i$ is the weighting coefficient of bus $i$, $Q_j^{max}$ is the reactive power capability limit of controllable resource $j$, and $\Delta Q_j^{scale}$ is the normalization scale used for consecutive reactive-power changes. The coefficients $\alpha$, $\beta$, and $\lambda_{\Delta q}$ weight voltage deviation, reactive-power effort, and action smoothness, respectively. The first term penalizes voltage deviations from the nominal operating point, the second term penalizes excessive use of reactive power support, and the third term penalizes normalized changes between consecutive reactive-power setpoints. The smoothness coefficient is denoted by $\lambda_{\Delta q}$ to distinguish it from the RL discount factor $\gamma$ and the GAE parameter $\lambda_{GAE}$ used in PPO training. The weighting coefficient $w_i$ allows bus-level priorities to be specified; in the reported experiments, uniform bus-priority weights are used, as summarized in Table 3. Accordingly, the instantaneous objective at time step $t$ is expressed as

$$\min_{u_t} J_t, \tag{3}$$

where $J_t$ is introduced as a stage cost to quantify the control objective at each time step, while the corresponding sequential decision-making formulation is presented in the subsequent subsection.
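The stage cost in (2) can be evaluated with a few lines of NumPy. The following is a minimal sketch rather than the authors' implementation: the function name `stage_cost` and the default weights `alpha`, `beta`, and `lam_dq` are hypothetical placeholders (the actual values are reported in Table 3), and all quantities are assumed to be in per unit.

```python
import numpy as np

def stage_cost(v, q, q_prev, q_max, w=None, v_ref=1.0,
               alpha=1.0, beta=0.1, lam_dq=0.1):
    """Evaluate the stage cost J_t of Eq. (2).

    v      : bus voltage magnitudes V_{i,t} (p.u.)
    q      : current reactive setpoints Q_{j,t} (p.u.)
    q_prev : previous reactive setpoints Q_{j,t-1} (p.u.)
    q_max  : capability limits Q_j^max (p.u.)
    w      : bus weights w_i (uniform if None)
    alpha, beta, lam_dq are placeholder weights, not the paper's values.
    """
    v = np.asarray(v, dtype=float)
    q = np.asarray(q, dtype=float)
    q_prev = np.asarray(q_prev, dtype=float)
    q_max = np.asarray(q_max, dtype=float)
    w = np.ones_like(v) if w is None else np.asarray(w, dtype=float)

    dq_scale = 2.0 * q_max                          # normalization scale, Eq. (6)
    c_v = np.sum(w * (v - v_ref) ** 2)              # voltage deviation term
    c_q = np.sum((q / q_max) ** 2)                  # reactive-power effort term
    c_dq = np.sum(((q - q_prev) / dq_scale) ** 2)   # action smoothness term
    return float(alpha * c_v + beta * c_q + lam_dq * c_dq)

j_t = stage_cost(v=[1.0, 0.98, 1.02, 1.0, 0.97],
                 q=[0.2, -0.1], q_prev=[0.0, 0.0], q_max=[0.4, 0.4])
```

With these illustrative inputs the three terms are 0.0017, 0.3125, and 0.078125 before weighting, which shows how the normalization keeps the effort and smoothness terms on a comparable scale.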

The above control task is subject to physical and operational constraints. First, the bus voltages must remain within the allowable range:

$$V_i^{min} \leq V_{i,t} \leq V_i^{max}, \quad \forall i \in N_b. \tag{4}$$

Second, the reactive power support of each inverter must satisfy its capability limit:

$$|Q_{pv,t}| \leq Q_{pv,t}^{max}, \quad |Q_{es,t}| \leq Q_{es}^{max}, \tag{5}$$

where $Q_{pv,t}^{max}$ denotes the available reactive power capacity of the PV inverter at time step $t$, and $Q_{es}^{max}$ denotes the reactive power limit of the storage inverter. In a detailed device-level implementation, $Q_{pv,t}^{max}$ may be coupled to the instantaneous active power output through the inverter apparent-power capability curve. In the reported simulations, however, the PV inverter is modeled with sufficient apparent-power reserve and a fixed reactive-power control range, with $Q_{pv}^{max} = Q_{es}^{max} = 0.4$ per unit for all learning-based and droop controllers. This approximation keeps the comparison focused on coordinated Volt/Var control rather than device-level capability scheduling. No explicit hard ramp-rate constraint is imposed on consecutive reactive-power commands. Instead, temporal smoothness is encouraged through the normalized soft penalty in (2). The corresponding normalization scale is set to the full reactive-power action span,

$$\Delta Q_j^{scale} = 2 Q_j^{max}, \quad \forall j \in N_c, \tag{6}$$

which makes the smoothness term dimensionless without enforcing a physical ramp-rate limit.
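As a brief check on the normalization in (6), the bound below follows from the capability limits in (5), assuming both consecutive setpoints respect those limits. The 0.1 per unit step in the last line is purely illustrative and is not taken from the reported experiments.

```latex
\begin{aligned}
|Q_{j,t} - Q_{j,t-1}| &\le |Q_{j,t}| + |Q_{j,t-1}| \le 2 Q_j^{max} = \Delta Q_j^{scale}, \\
0 \le \left(\frac{Q_{j,t} - Q_{j,t-1}}{\Delta Q_j^{scale}}\right)^2 &\le 1, \\
Q_j^{max} = 0.4,\; Q_{j,t} - Q_{j,t-1} = 0.1 \;&\Rightarrow\; \left(\frac{0.1}{0.8}\right)^2 = 0.015625 .
\end{aligned}
```

Each controllable resource therefore contributes at most 1 to the smoothness sum, so with the two resources considered here the sum is at most 2, reached only when both setpoints swing across their full range in one step.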

Accordingly, the problem considered in this work can be formulated as a constrained coordinated voltage-reactive power control problem under uncertainty. In this formulation, multiple inverter-interfaced resources are jointly regulated to maintain acceptable bus voltages and coordinated reactive power support in the presence of photovoltaic fluctuations, load variations, and PCC disturbances. Owing to the sequential and time-varying nature of the operating environment, the problem is further reformulated as a Markov decision process in the next subsection.

3.3. Markov Decision Process Formulation

To address the voltage-reactive power control problem in a sequential manner, we reformulate it as a Markov decision process (MDP), given by

$$M = (S, A, P, R, \gamma), \tag{7}$$

where $S$ is the state space, $A$ is the action space, $P$ is the state transition probability, $R$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. This formulation characterizes the sequential decision-making nature of voltage-reactive power control under uncertain and time-varying operating conditions.

3.3.1. State

At time step $t$, the state $s_t \in S$ describes the operating condition of the grid-connected microgrid, including bus voltages, renewable generation, load demand, and previous control actions. For the primary five-bus test case, it is instantiated as

$$s_t = [V_{pcc,t}, V_{2,t}, V_{3,t}, V_{4,t}, V_{5,t}, P_{pv,t}, P_{L2,t}, Q_{L2,t}, P_{L5,t}, Q_{L5,t}, Q_{pv,t-1}, Q_{es,t-1}]^{\top}, \tag{8}$$

where $V_{pcc,t}$ denotes the PCC voltage, $V_{i,t}$ denotes the voltage magnitude of bus $i$, $P_{pv,t}$ denotes the active power output of the PV unit, and $P_{Lk,t}$ and $Q_{Lk,t}$ denote the active and reactive load demands at load bus $k$, respectively. The inclusion of previous control actions allows the controller to account for action smoothness during decision-making. For larger feeder cases, the same information is organized through node-wise graph features rather than a fixed-length five-bus vector. In the subsequent method section, these variables are mapped to graph features to capture the topological coupling of the microgrid.
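The ordering of the twelve entries in (8) can be made concrete with a small helper. This is only a sketch of one way to assemble the vector; the function name `build_state` and its argument layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def build_state(v_bus, p_pv, load_2, load_5, q_prev):
    """Assemble the 12-dimensional five-bus state of Eq. (8).

    v_bus  : [V_pcc, V_2, V_3, V_4, V_5] voltage magnitudes (p.u.)
    p_pv   : PV active power output P_pv
    load_2 : (P_L2, Q_L2) demand at load bus 2
    load_5 : (P_L5, Q_L5) demand at terminal load bus 5
    q_prev : (Q_pv, Q_es) setpoints applied at the previous step
    """
    v_bus = np.asarray(v_bus, dtype=float)
    if v_bus.shape != (5,):
        raise ValueError("expected exactly five bus voltages")
    # Concatenate in the order used in Eq. (8).
    return np.concatenate([v_bus, [p_pv], load_2, load_5, q_prev]).astype(float)

s_t = build_state([1.0, 0.99, 1.01, 1.0, 0.98], 0.5,
                  (0.3, 0.1), (0.2, 0.05), (0.1, -0.1))
```

A flat vector of this kind is what a non-graph baseline would consume directly, whereas the graph-based controller reorganizes the same quantities per node.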

3.3.2. Action

The action $a_t \in A$ corresponds to the reactive power setpoints issued to the controllable inverter-interfaced resources, namely

$$a_t = \begin{bmatrix} Q_{pv,t} \\ Q_{es,t} \end{bmatrix}, \tag{9}$$

where $Q_{pv,t}$ and $Q_{es,t}$ are the reactive power outputs of the PV inverter and the storage inverter, respectively. The action space is continuous and constrained by the inverter capability limits:

$$|Q_{pv,t}| \leq Q_{pv,t}^{max}, \quad |Q_{es,t}| \leq Q_{es}^{max}. \tag{10}$$
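One simple way to guarantee that (10) holds for any policy output is to clip and rescale a normalized action. The sketch below assumes the policy emits values nominally in [-1, 1], which is a common convention but is not stated in the text; the helper name `to_setpoints` is hypothetical, while the 0.4 per unit limits are the ones used in the reported simulations.

```python
import numpy as np

# [Q_pv^max, Q_es^max] in per unit, as used in the reported simulations.
Q_MAX = np.array([0.4, 0.4])

def to_setpoints(raw_action, q_max=Q_MAX):
    """Map a raw policy output to reactive power setpoints.

    Assumption: raw_action is a normalized action in [-1, 1]^2.
    Out-of-range values are clipped so that |Q_j| <= Q_j^max always holds.
    """
    raw = np.clip(np.asarray(raw_action, dtype=float), -1.0, 1.0)
    return raw * q_max

a_t = to_setpoints([2.0, -0.5])   # first entry saturates at the limit
```

Here the first entry saturates at 0.4 per unit and the second maps to -0.2 per unit.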

3.3.3. State Transition

Given the current state $s_t$ and action $a_t$, the microgrid evolves to the next state $s_{t+1}$ according to

$$s_{t+1} \sim P(s_{t+1} \mid s_t, a_t), \tag{11}$$

where the transition is jointly determined by feeder coupling, inverter control actions, and external disturbances, including photovoltaic fluctuations, load variations, and PCC voltage disturbances. In practice, the transition is generated at each decision step by a balanced AC load-flow solver based on the forward/backward sweep procedure, which is commonly used for radial distribution-system power-flow analysis [45,46].
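To illustrate the forward/backward sweep idea on a chain feeder, the following minimal sketch solves a single-phase-equivalent load flow with constant-power loads. It is not the solver used in the paper: the function name `fbs_chain`, the line impedances, and the load values are hypothetical, and the feeder parameters of Table 1 are not reproduced here.

```python
import numpy as np

def fbs_chain(v_source, s_load, z_line, tol=1e-8, max_iter=100):
    """Forward/backward sweep for a chain feeder (bus 1 is the slack/PCC).

    v_source : complex PCC voltage (p.u.)
    s_load   : complex net demand P + jQ at each bus (consumption positive,
               injection negative); the entry for bus 1 is ignored
    z_line   : complex impedance of each section k -> k+1 (length N-1)
    Returns the complex bus voltages.
    """
    s_load = np.asarray(s_load, dtype=complex)
    z_line = np.asarray(z_line, dtype=complex)
    n = len(s_load)
    v = np.full(n, v_source, dtype=complex)           # flat start
    for _ in range(max_iter):
        i_node = np.conj(s_load / v)                  # currents drawn at each bus
        i_node[0] = 0.0
        # Backward sweep: each section carries the sum of downstream currents.
        i_branch = np.cumsum(i_node[::-1])[::-1][1:]
        # Forward sweep: propagate voltage drops from the PCC outward.
        v_new = np.empty(n, dtype=complex)
        v_new[0] = v_source
        for k in range(n - 1):
            v_new[k + 1] = v_new[k] - z_line[k] * i_branch[k]
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v  # last iterate if the tolerance was not reached

z = [0.01 + 0.02j] * 4                                # hypothetical impedances
loads = [0, 0.3 + 0.1j, 0, 0, 0.3 + 0.1j]             # hypothetical demands
v_base = np.abs(fbs_chain(1.0 + 0j, loads, z))
# Inject 0.2 p.u. of reactive power at bus 3 (negative reactive demand).
loads_q = [0, 0.3 + 0.1j, -0.2j, 0, 0.3 + 0.1j]
v_support = np.abs(fbs_chain(1.0 + 0j, loads_q, z))
```

In this toy case the terminal-bus voltage is higher with reactive injection at Bus 3 than without it, which is the physical effect the controller exploits.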

3.3.4. Reward

To ensure consistency with the control objective defined in Section 3.2, the reward is constructed as the negative weighted sum of voltage deviation, reactive power effort, action variation, and voltage-limit violation penalties:

$$r_t = -\left(\alpha c_t^{v} + \beta c_t^{q} + \lambda_{\Delta q} c_t^{\Delta q} + \xi c_t^{viol}\right), \tag{12}$$

where

$$c_t^{v} = \sum_{i \in N_b} w_i \left(V_{i,t} - V^{ref}\right)^2, \tag{13}$$

$$c_t^{q} = \sum_{j \in N_c} \left(\frac{Q_{j,t}}{Q_j^{max}}\right)^2, \tag{14}$$

$$c_t^{\Delta q} = \sum_{j \in N_c} \left(\frac{Q_{j,t} - Q_{j,t-1}}{\Delta Q_j^{scale}}\right)^2, \tag{15}$$

and

$$c_t^{viol} = \sum_{i \in N_b} \left[\max\left(0, V_{i,t} - V_i^{max}\right)^2 + \max\left(0, V_i^{min} - V_{i,t}\right)^2\right], \tag{16}$$

where $c_t^{v}$ penalizes deviations of bus voltages from the nominal reference, $c_t^{q}$ penalizes excessive use of reactive power support, $c_t^{\Delta q}$ penalizes normalized changes between consecutive reactive-power setpoints and therefore encourages smoother control actions, and $c_t^{viol}$ imposes an additional safety-oriented penalty when bus voltages violate their allowable limits. The first three terms correspond directly to the stage cost defined in Section 3.2, while the violation term is introduced to explicitly strengthen the penalty on unsafe voltage excursions. The numerical values of $\alpha$, $\beta$, $\lambda_{\Delta q}$, $\xi$, $w_i$, and $\Delta Q_j^{scale}$ are reported in Table 3. The same weighting coefficients are used for both GCN-PPO and MLP-PPO, so that the comparison focuses on the effect of topology-aware graph representation rather than reward tuning.
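Because the first three reward terms coincide with the stage cost of Section 3.2, the reward can be sketched as the negative of that cost plus the weighted violation penalty of (16). In the snippet below, the voltage band of 0.95 to 1.05 per unit and the weight `xi=10.0` are assumptions chosen for illustration (the paper's values are in Table 3), and the function names are hypothetical.

```python
import numpy as np

def violation_cost(v, v_min=0.95, v_max=1.05):
    """Squared hinge penalty c_t^viol of Eq. (16).

    v_min and v_max are assumed limits in per unit, not the paper's values.
    """
    v = np.asarray(v, dtype=float)
    over = np.maximum(0.0, v - v_max)    # amount above the upper limit
    under = np.maximum(0.0, v_min - v)   # amount below the lower limit
    return float(np.sum(over ** 2 + under ** 2))

def reward(stage_cost_value, v, xi=10.0, v_min=0.95, v_max=1.05):
    """Reward of Eq. (12), given the already weighted stage cost J_t.

    stage_cost_value : alpha*c_v + beta*c_q + lam_dq*c_dq from Eq. (2)
    xi               : placeholder violation weight
    """
    return -(stage_cost_value + xi * violation_cost(v, v_min, v_max))

r_t = reward(0.04, [1.0, 0.93, 1.06])
```

When every voltage lies inside the band, the penalty vanishes and the reward reduces to the negative stage cost.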

3.3.5. Policy Objective

The objective is to learn a control policy $\pi(a_t \mid s_t)$ that maximizes the expected cumulative discounted reward over an episode, i.e.,

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right], \tag{17}$$

where $T$ denotes the time horizon of one episode. Through this MDP formulation, the voltage-reactive power control problem is transformed into a sequential decision-making problem, which provides the basis for the topology-aware graph reinforcement learning method developed in the next section.
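For a single recorded episode, the sum inside the expectation in (17) can be computed with a backward recursion. The sketch below is illustrative only: the helper name `returns_to_go` and the default `gamma=0.99` are assumptions, since the discount factor used in training is not given in this section.

```python
def returns_to_go(rewards, gamma=0.99):
    """Discounted return from every step of one episode.

    The first element equals sum_t gamma^t * r_t, the quantity inside
    the expectation of Eq. (17). gamma is a placeholder value.
    """
    out = [0.0] * len(rewards)
    g = 0.0
    # Work backwards so each step reuses the return of the next step.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        out[t] = g
    return out

episode_returns = returns_to_go([1.0, 1.0, 1.0], gamma=0.5)
```

For three unit rewards and a discount of 0.5, the list is [1.75, 1.5, 1.0]. Averaging the first element over many sampled episodes gives a Monte Carlo estimate of the objective.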

4. Method

This section presents the proposed method in four parts. Section 4.1 first provides an overview of the overall framework and control workflow. Section 4.2 then describes the topology-aware graph modeling process, Section 4.3 details the graph reinforcement learning controller design, and Section 4.4 introduces the closed-loop training and online control implementation.

4.1. Overview of the Proposed Approach

To address the uncertainty-aware voltage-reactive power control problem in the considered grid-connected microgrid, we propose a topology-aware graph reinforcement learning approach. The proposed framework integrates physical system evolution, graph-based state representation, and adaptive control learning into a unified closed-loop structure. The overall architecture is illustrated in Figure 3.

As shown in Figure 3, the physical grid-connected AC microgrid serves as the operating environment of the control task. At each decision step, the microgrid state evolves under photovoltaic power fluctuations, load variations, and PCC disturbances, while the PV inverter and the storage inverter provide the main controllable reactive power support. Therefore, the physical system not only determines the instantaneous voltage profile and reactive power balance, but also provides the operating observations required for subsequent graph-based representation and closed-loop control interaction.

To explicitly capture the structural coupling among buses and inverter-interfaced resources, the observed system condition is processed by a topology-aware graph modeling module. In this module, the microgrid is first abstracted as a graph and then encoded into a topology-aware embedding, such that nodal operating conditions and network connectivity are jointly represented. This design is motivated by the fact that voltage-reactive power interactions in the microgrid are inherently topology-coupled and cannot be sufficiently characterized by conventional vector-based state representations.

Based on the resulting graph embedding, the graph reinforcement learning controller generates continuous reactive power control actions for the PV inverter and the storage inverter. Specifically, the controller adopts an actor–critic structure, in which the actor produces the control commands and the critic evaluates the current policy to support parameter update. By combining topology-aware representation learning with reinforcement learning-based policy optimization, the controller is able to learn an adaptive coordinated control policy under uncertain and time-varying operating conditions.

After the control actions are issued, they are applied to the physical microgrid, and the system state is updated through power-flow calculation and environment transition. The updated state and the corresponding reward are then fed back to the controller during training, thereby forming a closed-loop interaction among physical system dynamics, topology-aware representation learning, and adaptive policy optimization. Overall, the proposed approach provides the methodological basis for the graph modeling, policy learning, and algorithm design presented in the following subsections.

4.2. Graph Modeling

Voltage-reactive power control in a grid-connected microgrid is inherently shaped by electrical topology. In particular, voltage deviations at one bus are determined not only by local operating conditions but also by power injections, feeder impedances, and reactive power responses at electrically coupled buses. A topology-aware state representation is therefore necessary to preserve these structural dependencies for coordinated control. To this end, the microgrid is modeled as a graph, and its operating state is embedded into a topology-aware representation for subsequent policy learning.

4.2.1. Microgrid Graph Representation

The grid-connected microgrid at time step t is represented as

$$G_t = (V, E, X_t, \mathbf{E}), \tag{18}$$

where $V$ is the node set, $E$ is the edge set, $X_t$ is the node feature matrix, and $\mathbf{E}$ is the edge feature set. In general, each bus in the considered feeder is mapped to a node:

$$V = \{v_i\}_{i=1}^{N}, \tag{19}$$

where $N$ denotes the number of buses. In the primary five-bus test case, $N = 5$, and $v_1$, $v_2$, $v_3$, $v_4$, and $v_5$ correspond to the PCC bus, the load bus, the PV inverter bus, the storage inverter bus, and the terminal load bus, respectively. In the IEEE 33-bus validation case introduced in Section 5.2, the same construction is applied with $N = 33$. Each feeder connection between buses is represented as an edge:

$$(v_i, v_j) \in E \;\Leftrightarrow\; \text{buses } i \text{ and } j \text{ are electrically connected}. \tag{20}$$

The corresponding adjacency relation is defined as

$$A_{ij} = \begin{cases} 1, & (v_i, v_j) \in E, \\ 0, & \text{otherwise}, \end{cases} \tag{21}$$

where $A_{ij}$ denotes the $(i, j)$th entry of the adjacency matrix $A$.

The graph is treated as undirected because voltage-reactive power coupling is bidirectional along each physical feeder section, although the feeder itself has a radial structure; therefore, $A_{i j} = A_{j i}$. In the primary five-bus case used in Figure 2 and in the simulations, the physical feeder follows the chain topology Bus 1–Bus 2–Bus 3–Bus 4–Bus 5. Hence, before adding self-loops for graph convolution, the exact adjacency matrix is

A_{5} = \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}.

(22)

This matrix corresponds directly to the four feeder sections listed in Table 1. Thus, the reported experiments do not use a star-shaped topology in which Bus 1 is connected simultaneously to Buses 2, 3, and 4. For the operating scenario considered here, the graph topology remains unchanged, whereas nodal operating conditions evolve over time under photovoltaic fluctuations, load variations, and PCC disturbances. The resulting formulation therefore separates time-invariant structural connectivity from time-varying operating states, which is essential for topology-aware modeling of voltage-reactive power interactions.
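As a minimal illustration of (20)–(22), the following sketch builds the adjacency matrix from the four feeder sections of the five-bus chain. The function name `build_adjacency` and the 1-based edge list are illustrative choices and are not taken from the reported implementation.

```python
def build_adjacency(num_buses, feeder_sections):
    """Return the symmetric 0/1 adjacency matrix for 1-based bus pairs."""
    adjacency = [[0] * num_buses for _ in range(num_buses)]
    for bus_i, bus_j in feeder_sections:
        # Undirected graph: each feeder section couples both end buses.
        adjacency[bus_i - 1][bus_j - 1] = 1
        adjacency[bus_j - 1][bus_i - 1] = 1
    return adjacency


# Chain topology Bus 1-Bus 2-Bus 3-Bus 4-Bus 5 of the primary test case.
FIVE_BUS_SECTIONS = [(1, 2), (2, 3), (3, 4), (4, 5)]
A5 = build_adjacency(5, FIVE_BUS_SECTIONS)

# A star layout (Bus 1 tied to Buses 2, 3, and 4) would set A5[0][2] = 1,
# which the chain topology does not.
assert A5[0][2] == 0
for row in A5:
    print(row)
```

Because the matrix is generated from the section list alone, the same helper could in principle be reused for the 33-bus feeder by supplying its branch list.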

4.2.2. Graph Feature Construction

The graph features are designed to encode local electrical states, heterogeneous node roles, control-history information, and electrical coupling characteristics. For each node $v_{i} \in \mathcal{V}$, the node feature vector at time step $t$ is defined as

x_{i, t} = [V_{i, t}, P_{i, t}, Q_{i, t}, \delta_{i}^{\mathrm{PCC}}, \delta_{i}^{\mathrm{PV}}, \delta_{i}^{\mathrm{ES}}, Q_{i, t - 1}]^{\top},

(23)

where $V_{i, t}$ denotes the voltage magnitude of node $i$; $P_{i, t}$ and $Q_{i, t}$ denote the active and reactive power injection or demand at node $i$; $\delta_{i}^{\mathrm{PCC}}$, $\delta_{i}^{\mathrm{PV}}$, and $\delta_{i}^{\mathrm{ES}}$ are node-type indicators for the PCC node, the PV inverter node, and the storage inverter node, respectively; and $Q_{i, t - 1}$ denotes the previous reactive power control action at node $i$. For uncontrollable nodes, $Q_{i, t - 1}$ is set to zero.

The feature construction in (23) serves three purposes. First, $V_{i, t}$, $P_{i, t}$, and $Q_{i, t}$ characterize the instantaneous local operating condition. Second, the node-type indicators distinguish heterogeneous physical roles in the microgrid, since PCC, load, PV, and storage nodes participate in voltage-reactive power interactions in different ways. Third, the inclusion of $Q_{i, t - 1}$ retains control-history information that supports smooth reactive power regulation. Stacking the node feature vectors for all nodes yields

X_{t} = [x_{1, t}, x_{2, t}, \dots, x_{N, t}]^{\top},

(24)

where $N$ denotes the number of nodes.
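The construction in (23) and (24) can be sketched for the five-bus case as follows. The bus roles follow Section 4.2.1; the helper names, the dictionary layout, and all numerical values are hypothetical and serve only to show how the seven entries are assembled.

```python
# Bus roles of the five-bus case as listed in Section 4.2.1.
BUS_ROLES = {1: "PCC", 2: "LOAD", 3: "PV", 4: "ES", 5: "LOAD"}
CONTROLLABLE = ("PV", "ES")


def node_feature(bus, voltage, p, q, q_prev_cmd=0.0):
    """Assemble x_{i,t} of Eq. (23) as a plain list of seven floats."""
    role = BUS_ROLES[bus]
    indicators = [float(role == "PCC"), float(role == "PV"), float(role == "ES")]
    # Uncontrollable nodes carry a zero control-history entry.
    q_history = q_prev_cmd if role in CONTROLLABLE else 0.0
    return [voltage, p, q] + indicators + [q_history]


def node_feature_matrix(measurements, prev_commands):
    """Stack the node vectors row by row, as in Eq. (24)."""
    return [
        node_feature(bus, *measurements[bus], prev_commands.get(bus, 0.0))
        for bus in sorted(measurements)
    ]


# Hypothetical per-unit snapshot: bus -> (V, P, Q).
snapshot = {
    1: (1.00, 0.00, 0.00),
    2: (0.99, -0.30, -0.10),
    3: (0.98, 0.50, 0.05),
    4: (0.97, 0.00, -0.02),
    5: (0.96, -0.40, -0.15),
}
previous_q = {3: 0.05, 4: -0.02}
X_t = node_feature_matrix(snapshot, previous_q)
print(len(X_t), len(X_t[0]))
```

Load buses receive all-zero type indicators in this sketch, so the load role is encoded implicitly by the absence of the other three flags.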

For each edge $(v_{i}, v_{j}) \in \mathcal{E}$, the edge feature vector is defined as

e_{i j} = [R_{i j}, X_{i j}]^{\top},

(25)

where $R_{i j}$ and $X_{i j}$ denote the resistance and reactance of the feeder connecting nodes $i$ and $j$, respectively. The corresponding edge feature set is

E = \{e_{i j} \mid (v_{i}, v_{j}) \in \mathcal{E}\}.

(26)

Unlike a purely binary adjacency description, the edge attributes in (25) preserve information about the strength of electrical coupling among connected buses. The graph state therefore encodes not only local operating variables but also the physical pathways through which voltage and reactive power interactions propagate across the microgrid.

4.2.3. Topology-Aware State Embedding

Although the graph state $\mathcal{G}_{t}$ preserves the structural and operational information of the microgrid, policy learning requires a compact representation in which topology-dependent interactions are pre-aggregated. A graph encoder is therefore employed to transform node and edge features into a topology-aware state embedding.

Let $h_{i, t}^{(l)}$ denote the hidden representation of node $i$ at graph encoding layer $l$, with

h_{i, t}^{(0)} = x_{i, t}.

(27)

In implementation, self-loops are added before graph convolution:

\tilde{A} = A + I, \quad \hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2},

(28)

where $\tilde{D}_{i i} = \sum_{j} \tilde{A}_{i j}$. The normalized matrix $\hat{A}$ is used for symmetric neighborhood aggregation. Edge impedance information is incorporated into the message-passing calculation through a trainable linear projection of the edge feature vector. Specifically, the node representation is updated as

h_{i, t}^{(l + 1)} = \sigma \left( \sum_{j = 1}^{N} \hat{A}_{i j} \left( W_{h}^{(l)} h_{j, t}^{(l)} + W_{e}^{(l)} \bar{e}_{i j} \right) + b^{(l)} \right),

(29)

where $\sigma(\cdot)$ denotes the ReLU activation function, $W_{h}^{(l)}$ and $W_{e}^{(l)}$ are trainable weight matrices, $b^{(l)}$ is a trainable bias vector, and $\bar{e}_{i j} = [R_{i j}, X_{i j}]^{\top}$ for a physical feeder edge and $\bar{e}_{i j} = 0$ for self-loops or non-edge entries. The impedance attributes are expressed in per unit and normalized with the same base quantities as the power-flow model. Non-adjacent nodes are masked by $\hat{A}_{i j} = 0$, so their zero edge features do not contribute to the summation. This formulation uses the normalized adjacency matrix for topology propagation while allowing feeder impedance attributes to affect the aggregated messages. The update is an instance of the generic neighborhood aggregation form

h_{i, t}^{(l + 1)} = \phi^{(l)} \left( h_{i, t}^{(l)}, \mathrm{AGG} \left( \{ (h_{j, t}^{(l)}, e_{i j}) \mid j \in \mathcal{N}(i) \} \right) \right),

(30)

where $\mathcal{N}(i)$ is the neighbor set of node $i$, $\mathrm{AGG}(\cdot)$ is a neighborhood aggregation operator, and $\phi^{(l)}(\cdot)$ is a nonlinear transformation at layer $l$.

Equation (30) aggregates information from electrically connected neighbors together with the associated feeder attributes. Consequently, the hidden representation of each node is determined not only by its own operating state but also by topology-dependent interactions propagated through the microgrid graph. This mechanism enables the embedding to capture both local electrical conditions and nonlocal coupling patterns relevant to coordinated reactive power control.
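To make the propagation rule concrete, the sketch below evaluates the normalization in (28) and one layer update of (29) for the five-bus chain using only the Python standard library. It is a didactic sketch rather than the reported implementation: the function names, the placeholder impedances, the one-dimensional input features, and the weight values are all assumptions chosen for readability.

```python
import math


def normalized_adjacency(adjacency):
    """Return A_hat = D~^{-1/2} (A + I) D~^{-1/2}, following Eq. (28)."""
    n = len(adjacency)
    a_tilde = [
        [adjacency[i][j] + (1 if i == j else 0) for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in a_tilde]
    return [
        [a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def matvec(weight, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def gcn_layer(a_hat, h, edge_features, w_h, w_e, bias):
    """Apply one update of Eq. (29) with ReLU activation.

    edge_features maps 0-based (i, j) pairs to [R_ij, X_ij]; pairs that are
    missing (self-loops, non-edges) fall back to the zero vector.
    """
    n = len(h)
    updated = []
    for i in range(n):
        pre_activation = list(bias)
        for j in range(n):
            if a_hat[i][j] == 0.0:
                continue  # masked non-adjacent node
            node_msg = matvec(w_h, h[j])
            edge_msg = matvec(w_e, edge_features.get((i, j), [0.0, 0.0]))
            for k in range(len(pre_activation)):
                pre_activation[k] += a_hat[i][j] * (node_msg[k] + edge_msg[k])
        updated.append([max(0.0, value) for value in pre_activation])
    return updated


# Five-bus chain of Eq. (22); impedances below are placeholders in per unit.
A5 = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
A_HAT = normalized_adjacency(A5)
EDGE_FEATURES = {}
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    EDGE_FEATURES[(i, j)] = EDGE_FEATURES[(j, i)] = [0.01, 0.02]

h0 = [[1.00], [0.99], [0.98], [0.97], [0.96]]  # one input feature per node
h1 = gcn_layer(A_HAT, h0, EDGE_FEATURES,
               w_h=[[1.0], [-1.0]], w_e=[[0.5, 0.5], [0.0, 0.0]],
               bias=[0.0, 0.0])
```

With two stacked layers of this kind, each bus embedding depends on buses up to two feeder sections away, which is one way to read the "nonlocal coupling" noted above.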

After $L$ encoding layers, the node-wise embeddings are collected as

H_{t} = [h_{1, t}^{(L)}, h_{2, t}^{(L)}, \dots, h_{N, t}^{(L)}]^{\top},

(31)

and a graph-level embedding is then obtained through

z_{t} = \mathrm{READOUT}(H_{t}) = \frac{1}{N} \sum_{i = 1}^{N} h_{i, t}^{(L)},

(32)

where $z_{t}$ denotes the topology-aware state embedding at time step $t$. Thus, the readout layer is implemented as mean pooling over all bus embeddings. This embedding summarizes system-wide operating conditions and structural interactions in compact form, thereby providing a physically informed state representation for coordinated control policy learning.
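The mean-pooling readout in (32) reduces to a feature-wise average, as the short sketch below shows. The embedding values are hypothetical, and `mean_readout` is an illustrative name.

```python
def mean_readout(node_embeddings):
    """Average the final-layer node embeddings feature by feature, Eq. (32)."""
    num_nodes = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(h[k] for h in node_embeddings) / num_nodes for k in range(dim)]


# Hypothetical two-dimensional embeddings for the five buses.
H_t = [[0.2, 0.0], [0.4, 0.1], [0.6, 0.0], [0.4, 0.3], [0.4, 0.1]]
z_t = mean_readout(H_t)

# Mean pooling ignores node ordering, so relabeling buses leaves z_t unchanged
# up to floating-point rounding.
z_perm = mean_readout(list(reversed(H_t)))
```

The length of the pooled vector equals the hidden dimension and does not depend on the number of buses, so the actor and critic input size stays the same for the five-bus and 33-bus cases.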

4.3. Graph Reinforcement Learning Controller

The objective of the controller is to map the topology-aware embedding $z_{t}$ to coordinated reactive power commands for the controllable inverter-interfaced resources. Since $z_{t}$ already encodes both system-wide operating conditions and topology-dependent electrical interactions, the control policy can be learned on the basis of a structured state representation rather than independent local measurements. To this end, an actor–critic architecture is adopted, in which the actor parameterizes the control policy and the critic evaluates the encoded state for policy improvement. Let $\pi_{\theta}(a_{t} \mid z_{t})$ denote the policy with parameters $\theta$, and let $V_{\psi}(z_{t})$ denote the value function with parameters $\psi$. The policy is expressed as

a_{t} \sim \pi_{\theta}(\cdot \mid z_{t}),

(33)

and the corresponding state-value estimate is given by

V_{\psi}(z_{t}).

(34)

In this formulation, the graph embedding serves as the decision-oriented state representation through which the controller captures topology-coupled voltage-reactive power interactions and learns coordinated regulation across the microgrid.

The control outputs correspond to the reactive power commands of the PV inverter and the storage inverter. Rather than generating these commands independently from isolated local states, the actor produces them jointly from the same topology-aware embedding, thereby enabling coordinated reactive power support. For numerical robustness and capacity normalization, the actor outputs a normalized action vector

\hat{a}_{t} = \begin{bmatrix} \hat{a}_{pv, t} \\ \hat{a}_{es, t} \end{bmatrix}, \quad \hat{a}_{t} \in [-1, 1]^{2},

(35)

which is mapped to the physical control commands by

Q_{pv, t} = \hat{a}_{pv, t} Q_{pv, t}^{\max}, \quad Q_{es, t} = \hat{a}_{es, t} Q_{es}^{\max},

(36)

where $Q_{pv, t}^{\max}$ and $Q_{es}^{\max}$ denote the available reactive power limits of the PV inverter and the storage inverter, respectively. This parameterization preserves direct consistency with the control variables of the original voltage-reactive power control problem while ensuring that the generated actions remain physically feasible.
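The mapping in (35) and (36) can be sketched as a two-line scaling. The clipping step is an assumption of this sketch (the text states only that the normalized action lies in $[-1, 1]^{2}$), and the function name and sample actions are illustrative.

```python
def to_physical_commands(a_hat_pv, a_hat_es, q_pv_max, q_es_max):
    """Map normalized actions in [-1, 1] to reactive-power setpoints, Eq. (36)."""
    def clip_unit(value):
        # Clipping is an assumption of this sketch: a Gaussian policy sample
        # can fall outside [-1, 1] before it is mapped to a physical command.
        return max(-1.0, min(1.0, value))

    q_pv = clip_unit(a_hat_pv) * q_pv_max
    q_es = clip_unit(a_hat_es) * q_es_max
    return q_pv, q_es


# Limits of 0.4 per unit as in the five-bus case; the actions are made up.
q_pv, q_es = to_physical_commands(0.5, -1.3, q_pv_max=0.4, q_es_max=0.4)
```

Scaling by the capacity limit means the policy network never needs to learn the physical rating; only the environment-side limits change between test systems.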

Policy learning is performed using PPO. This choice is motivated by the continuous nature of inverter reactive power control, the stochastic disturbances induced by renewable fluctuations and load variations, and the need for stable policy updates in closed-loop operation. The PPO clipped surrogate objective is

L^{\mathrm{clip}}(\theta) = \mathbb{E}_{t} \left[ \min \left( r_{t}(\theta) \hat{A}_{t}, \mathrm{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{t} \right) \right],

(37)

where

r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid z_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t} \mid z_{t})},

(38)

$\hat{A}_{t}$ is the estimated advantage (not to be confused with the normalized adjacency matrix $\hat{A}$ in (28)), $\epsilon$ is the clipping coefficient, and $\theta_{\mathrm{old}}$ denotes the policy parameters before update. The critic is optimized through the value regression loss

L^{\mathrm{value}}(\psi) = \mathbb{E}_{t} \left[ (V_{\psi}(z_{t}) - \hat{R}_{t})^{2} \right],

(39)

where $\hat{R}_{t}$ denotes the estimated return. Through this optimization process, the controller learns a topology-aware policy that adaptively coordinates inverter reactive power responses under uncertain operating conditions and thus provides the decision-making core of the proposed voltage-reactive power control framework.
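For reference, the two losses in (37)–(39) can be evaluated on a small batch as sketched below. The functions operate on plain Python lists of log-probabilities, advantages, value predictions, and returns; their names and the batch layout are illustrative, and the default clipping coefficient follows the value of 0.2 reported in Section 5.4.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch average of the clipped surrogate in Eq. (37)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t(theta) of Eq. (38)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def value_loss(values, returns):
    """Batch average of the squared value error in Eq. (39)."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)


# A ratio of 1.5 with a positive advantage is capped at 1 + eps = 1.2, so the
# objective stops rewarding further movement away from the old policy.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [1.0])
```

The objective in (37) is maximized with respect to the actor parameters, whereas the loss in (39) is minimized with respect to the critic parameters.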

4.4. Control Implementation

The proposed method is implemented through closed-loop interaction between the graph reinforcement learning controller and the grid-connected microgrid. At each control interval, the current operating state is collected from the physical environment and transformed into a graph state, which is then encoded into a topology-aware embedding. Based on this embedding, the controller infers reactive power commands for the PV inverter and the storage inverter. After these control actions are applied, the microgrid evolves through a power-flow update, and the resulting state transition is used to evaluate control performance under the current operating conditions. In this way, policy learning is driven directly by the physical response of the microgrid under stochastic photovoltaic fluctuations, load variations, and PCC disturbances. The PPO-based closed-loop training structure is illustrated in Figure 4.

During training, the controller interacts with the microgrid over multiple episodes, and the resulting trajectories are stored for PPO-based updates of the actor and critic. This process enables the policy to improve progressively through repeated interaction with diverse operating scenarios. After convergence, the learned policy is fixed and deployed for online control. In the online stage, the controller performs graph-state construction, state embedding, and action inference at each decision step without iterative optimization or parameter updates. The complete training and online execution procedures are summarized in Algorithm 1.

Algorithm 1: PPO-based training and online execution of the proposed method

5. Experimental Setup

5.1. Five-Bus Microgrid Test Case

The proposed method is evaluated on a radial grid-connected AC microgrid represented by a five-bus test system. The feeder topology is fixed and consists of four line segments connecting Bus 1–Bus 2, Bus 2–Bus 3, Bus 3–Bus 4, and Bus 4–Bus 5. Bus 1 is connected to the utility grid through the PCC; Buses 2 and 5 are load buses; Bus 3 hosts the PV inverter; and Bus 4 hosts the storage inverter. This topology preserves the essential electrical coupling among buses and inverter-interfaced resources and therefore provides a compact yet representative test environment for coordinated voltage-reactive power control in a grid-connected microgrid.

All variables are expressed in per unit. The primary five-bus case is a normalized synthetic microgrid, so the reported quantities are dimensionless values on a common internally consistent power and voltage base rather than site-specific kVA and kV ratings. The power base is denoted by $S_{\mathrm{base}} = 1.0$ per unit, and the voltage base is the nominal bus-voltage reference, i.e., $V_{\mathrm{base}} = V^{\mathrm{ref}} = 1.0$ per unit. The feeder impedances, DER ratings, load settings, and operating constraints are summarized in Table 1. The PV active-power profile is treated as an exogenous renewable input normalized to the nominal PV active-power base, while the PV inverter is modeled as an oversized inverter with apparent-power rating $S_{pv, inv}^{\mathrm{rated}} = 1.3$ per unit and controllable reactive-power command range $[-0.4, 0.4]$ per unit. This setting keeps the reactive-power command feasible over the adopted PV active-power range. The storage inverter is modeled as a reactive-power support resource with $S_{es, inv}^{\mathrm{rated}} = Q_{es}^{\max} = 0.4$ per unit; its active-power dispatch and state-of-charge dynamics are outside the scope of this study. Nominal active and reactive loads are assigned to Buses 2 and 5, while the acceptable voltage range is specified as $[0.95, 1.05]$ per unit. A severe voltage violation is identified when the bus voltage falls below $0.92$ per unit or exceeds $1.08$ per unit. No explicit hard ramp-rate constraint is imposed on inverter reactive power actions; instead, control smoothness is encouraged through the normalized soft penalty in the reward formulation.
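The feasibility statement for the oversized PV inverter and the two voltage bands can be checked with a short sketch. It assumes the usual circular capability limit $P^{2} + Q^{2} \le S^{2}$, which is not stated explicitly in this section, and the function names and band labels are illustrative.

```python
import math

S_PV_RATED = 1.3   # per unit, apparent-power rating of the PV inverter
Q_LIMIT = 0.4      # per unit, reactive-power command limit
V_MIN, V_MAX = 0.95, 1.05
V_SEVERE_MIN, V_SEVERE_MAX = 0.92, 1.08


def max_active_power_at_full_q(s_rated=S_PV_RATED, q_limit=Q_LIMIT):
    """Largest |P| that still leaves room for |Q| = q_limit.

    Assumes a circular capability curve P^2 + Q^2 <= S^2.
    """
    return math.sqrt(s_rated ** 2 - q_limit ** 2)


def classify_voltage(v_pu):
    """Label a bus voltage as 'normal', 'violation', or 'severe'."""
    if v_pu < V_SEVERE_MIN or v_pu > V_SEVERE_MAX:
        return "severe"
    if v_pu < V_MIN or v_pu > V_MAX:
        return "violation"
    return "normal"


p_headroom = max_active_power_at_full_q()
labels = [classify_voltage(v) for v in (1.00, 0.95, 0.93, 0.91, 1.09)]
```

Under this assumption the full 0.4 per unit reactive command remains available for PV active power up to about 1.24 per unit, which is consistent with the description of the inverter as oversized.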

The simulation environment is driven by three classes of time-varying disturbances, namely photovoltaic fluctuations, load variations, and PCC disturbances. For each disturbance source, low-, medium-, and high-intensity scenarios are considered. Photovoltaic and load variations are generated by superimposing stochastic perturbations on periodic modulation, whereas PCC disturbances include both random boundary fluctuations and event-triggered voltage deviations. The control interval is fixed at $\Delta t = 15$ min, and each episode contains 96 control steps, corresponding to a 24 h operating horizon. At each step, the issued inverter reactive-power commands are applied to the AC network model, and the next bus-voltage state is obtained from the forward/backward-sweep load-flow solver described in Section 3.3.3. This setting enables systematic evaluation of voltage regulation performance, reactive power coordination, and control robustness under uncertain operating conditions. The same simulation environment and power-flow solver are used for all compared methods.
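The episode timing and the "periodic modulation plus stochastic perturbation" construction can be sketched as follows. Only the 15 min interval and the 96-step horizon are taken from the text; the sinusoidal daily shape, the Gaussian noise model, the amplitudes, and the function name are assumptions made purely for illustration and do not reproduce the disturbance generator used in the experiments.

```python
import math
import random

STEP_MINUTES = 15
STEPS_PER_EPISODE = 96  # 96 x 15 min = 24 h


def load_profile(nominal, periodic_amp, noise_std, seed=0):
    """Return one episode of load multipliers applied to a nominal value.

    The 24 h sinusoid and the Gaussian perturbation are illustrative choices.
    """
    rng = random.Random(seed)
    profile = []
    for step in range(STEPS_PER_EPISODE):
        hour = step * STEP_MINUTES / 60.0
        periodic = 1.0 + periodic_amp * math.sin(2.0 * math.pi * hour / 24.0)
        noise = rng.gauss(0.0, noise_std)
        # Demand cannot become negative in this sketch.
        profile.append(max(0.0, nominal * (periodic + noise)))
    return profile


# Hypothetical "medium" intensity for a 0.3 per unit nominal load.
medium_load = load_profile(nominal=0.3, periodic_amp=0.2, noise_std=0.05, seed=1)
```

Fixing the seed makes an episode reproducible, which is one simple way to present identical disturbance realizations to all compared controllers.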

5.2. IEEE 33-Bus Validation Setup

To further examine the scalability of the proposed topology-aware controller and its applicability to a standard benchmark beyond the compact five-bus case, an additional validation protocol is designed on a modified IEEE 33-bus radial distribution feeder. The feeder is based on the widely used Baran–Wu benchmark system [47], with Bus 1 modeled as the PCC and slack bus connected to the upstream grid. The original radial topology, branch impedance data, and nominal bus load data of the benchmark feeder are retained. Electrical quantities are converted consistently to the standard benchmark per-unit base for power-flow simulation, with $S_{\mathrm{base}} = 10$ MVA and $V_{\mathrm{base}} = 12.66$ kV. The validation topology is shown in Figure 5.

To create a Volt/Var control scenario with inverter-interfaced resources, a PV inverter is connected at Bus 18 and an ESS inverter is connected at Bus 33. These downstream locations are selected to stress voltage regulation in the radial feeder, where voltage deviations are more affected by feeder impedance accumulation, load variation, and renewable fluctuation. The controllable variables remain the reactive power setpoints of the PV and ESS inverters, denoted by $Q_{pv, t}$ and $Q_{es, t}$, respectively. Consistent with the five-bus case, the active-power profile of the PV unit is treated as an exogenous time-varying input normalized on the IEEE 33-bus benchmark base, while the RL controller acts only on inverter reactive-power commands. The PV and ESS reactive-power capability ratings are both set to $0.4$ per unit on the IEEE 33-bus system base, corresponding to $\pm 4$ Mvar on the 10 MVA benchmark base. The inverter apparent-power reserve is assumed sufficient to provide this fixed Volt/Var capability over the adopted exogenous PV active-power profile, and the ESS is modeled as a reactive-power support inverter without active-power dispatch or state-of-charge dynamics. This common reactive-power range is used for GCN-PPO, MLP-PPO, and Droop control to preserve a matched control setting.
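The per-unit conversions implied by the stated bases can be verified with a few lines. The base values are those given above; the helper names are illustrative, and the base-impedance formula is the standard three-phase relation $Z_{\mathrm{base}} = V_{\mathrm{base}}^{2} / S_{\mathrm{base}}$ with line-to-line voltage in kV and power in MVA.

```python
S_BASE_MVA = 10.0
V_BASE_KV = 12.66


def pu_to_mvar(q_pu):
    """Convert a per-unit reactive power to Mvar on the 10 MVA base."""
    return q_pu * S_BASE_MVA


def base_impedance_ohm():
    """Z_base = V_base^2 / S_base (kV^2 / MVA gives ohms)."""
    return V_BASE_KV ** 2 / S_BASE_MVA


def ohm_to_pu(z_ohm):
    """Convert a branch impedance magnitude in ohms to per unit."""
    return z_ohm / base_impedance_ohm()


q_limit_mvar = pu_to_mvar(0.4)
z_base = base_impedance_ohm()
```

On these bases the 0.4 per unit capability corresponds to 4 Mvar, and the base impedance is roughly 16.03 ohms, which is the factor by which the benchmark branch impedances are divided before entering the graph edge features.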

The control task in the IEEE 33-bus validation remains coordinated voltage-reactive power regulation under uncertainty. At each control interval, the controller observes the feeder operating state and issues reactive-power setpoints to the PV and ESS inverters. The objective is to maintain bus voltages within the acceptable operating range while reducing voltage deviation, voltage violation, and unnecessary reactive-power fluctuation. The operating uncertainty follows the same categories as in the five-bus experiments, namely PV power fluctuation, load variation, and PCC voltage disturbance. Unless otherwise stated, the control interval is kept at 15 min and each episode contains 96 control steps, corresponding to a 24 h operating horizon.

For the proposed GCN-PPO controller, the IEEE 33-bus feeder is modeled as a graph. The 33 buses are represented as graph nodes, and the normally closed distribution lines are represented as graph edges. The adjacency matrix is generated directly from the IEEE 33-bus radial topology and is treated as undirected for graph convolution, with the same self-loop addition and symmetric normalization as in (28). The node features follow the same construction principle as in the five-bus case, including bus voltage, active/reactive power injection or demand, node-type indicators, and previous reactive-power control actions at controllable inverter buses. The branch impedance data are used in the power-flow environment and are also incorporated as edge features in the graph encoder according to (29). In contrast, the MLP-PPO baseline uses a flattened vector representation of the same system-level observations. Therefore, the intended methodological difference between GCN-PPO and MLP-PPO in the IEEE 33-bus validation remains the use of topology-aware graph representation versus non-graph flat state representation.

For fair comparison on the IEEE 33-bus feeder, GCN-PPO, MLP-PPO, and droop control are evaluated under the same feeder model, disturbance profiles, PV/ESS placement, and reactive-power action limits. The two learning-based controllers use the same reward function, PPO objective, episode length, minibatch size, update epochs, learning rate, and evaluation episodes. In this validation, the training budget is also kept identical between GCN-PPO and MLP-PPO. Thus, the comparison is conducted under a matched PPO and control setting, with the main intended difference being the topology-aware graph representation versus the non-graph flat state representation. The droop controller is not trained and is evaluated under the same test episodes and inverter reactive-power limits. Performance is evaluated using AVD, VVR, RPF, MVD, and supplementary indicators such as RPU and average computation time.

5.3. Benchmark Methods

Two benchmark methods are considered to evaluate the proposed method from the perspectives of conventional local control and non-graph deep reinforcement learning. Specifically, a droop-based controller is adopted as a conventional local-control baseline, whereas an MLP-PPO controller is employed to assess the contribution of topology-aware graph representation within the same reinforcement learning framework. The same benchmark logic is used in both the five-bus test case and the IEEE 33-bus validation case.

The droop baseline regulates the reactive power outputs of the PV inverter and the storage inverter according to local voltage deviations. The corresponding droop coefficients are set to $k_{pv} = 0.25$ and $k_{es} = 0.35$ for the PV and storage inverters, respectively, without introducing a deadband. The resulting reactive power commands are clipped to the same range of $[-0.4, 0.4]$ per unit as used by the proposed method. Since this controller relies exclusively on local voltage feedback, it serves as a conventional benchmark without topology awareness or data-driven adaptation.
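A linear droop rule of this kind can be sketched as below. The coefficients, the absence of a deadband, and the clipping range are taken from the text; the specific form $Q = k (V^{\mathrm{ref}} - V)$ and the sign convention (positive $Q$ meaning injection that raises the local voltage) are assumptions of this sketch, since the exact droop law is not restated in this section.

```python
V_REF = 1.0
Q_LIMIT = 0.4
K_PV = 0.25
K_ES = 0.35


def droop_q(v_local, gain, v_ref=V_REF, q_limit=Q_LIMIT):
    """Reactive-power command from local voltage feedback, no deadband.

    Assumed convention: undervoltage yields positive Q (injection).
    """
    q_cmd = gain * (v_ref - v_local)
    return max(-q_limit, min(q_limit, q_cmd))


# Hypothetical local measurements at the PV and storage buses.
q_pv = droop_q(0.95, K_PV)
q_es = droop_q(1.04, K_ES)
```

Each inverter responds only to its own bus voltage here, which is the property that distinguishes this baseline from the jointly generated commands of the learned controllers.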

The MLP-PPO baseline shares the same simulation environment, reward formulation, control variables, action bounds, and PPO training framework as the proposed method, but replaces the topology-aware graph representation with a flattened vector representation of the same observations. In the 5-bus case, this corresponds to the 12-dimensional flat state vector defined in Section 3.3.1; in the IEEE 33-bus validation, the vector dimension increases with the number of observed feeder variables. The actor and critic are implemented as multilayer perceptrons with two hidden layers of 128 units and ReLU activation, and the policy is parameterized as a Gaussian distribution with learnable log-standard deviation. Under this setting, the main intended methodological difference between MLP-PPO and the proposed method is the use of graph-based topology-aware state encoding rather than non-graph flat state encoding. The comparison therefore supports an assessment of the role of topology-aware graph modeling in voltage-reactive power control under a matched PPO and control setting.

5.4. Experimental Settings

For the 5-bus experiments, the proposed GCN-PPO method and the MLP-PPO baseline are both trained for 4000 episodes using a learning rate of $2 \times 10^{-4}$ and a 96-step episode horizon. This 4000-episode budget is the training budget summarized in Table 2 and is the same budget used to generate the reward trajectories in Figure 6. Both methods are trained under medium disturbance levels for photovoltaic fluctuations, load variations, and PCC disturbances (PV/load/PCC = medium/medium/medium). The main comparison uses this same disturbance setting, while robustness is evaluated under low-, medium-, and high-intensity disturbances. Generalization is further evaluated under an unseen mixed-disturbance scenario (PV/load/PCC = high/medium/low). Unless otherwise stated, all reported 5-bus results are computed over 30 test episodes and 5 random seeds under the corresponding evaluation setting and are presented as mean and standard deviation.

The proposed method employs a 2-layer GCN encoder (hidden dimension 64), followed by actor and critic networks with two 128-unit hidden layers and ReLU activation. Training uses Adam with discount factor $\gamma = 0.99$, GAE parameter $\lambda_{\mathrm{GAE}} = 0.95$, PPO clipping coefficient $\epsilon = 0.2$, minibatch size 64, and 8 update epochs. The MLP-PPO baseline uses the same reward formulation, control variables, action bounds, PPO objective, learning rate, minibatch size, and update epochs, but replaces the topology-aware graph representation with a 12-dimensional flat state vector. Performance is evaluated primarily using average voltage deviation (AVD), voltage violation rate (VVR), reactive power utilization (RPU), and reactive power fluctuation (RPF). Maximum voltage deviation (MVD), recovery time, and average computation time per control step are reported as supplementary indicators. The main five-bus training and implementation settings are summarized in Table 2.
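To indicate how the discount factor and the GAE parameter enter the advantage and return estimates used by PPO, the standard generalized advantage estimation recursion is sketched below with the reported values as defaults. The function name, the list-based interface, and the toy trajectory are illustrative; the sketch assumes a single trajectory that is bootstrapped with `last_value` at its end.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one trajectory using GAE.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    # The return target for the critic is the advantage plus the baseline.
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


# Toy two-step trajectory with zero value estimates.
adv, ret = gae_advantages([1.0, 1.0], [0.0, 0.0], last_value=0.0)
```

With $\lambda_{\mathrm{GAE}}$ close to one the estimate leans on observed rewards over longer horizons, while smaller values rely more on the critic; 0.95 is a common intermediate setting.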

The weighting coefficients in the stage cost and reward function are fixed for all learning-based controllers and all evaluation scenarios. They are selected to encode the following control priority: voltage-limit safety, voltage-deviation reduction, action smoothness, and reactive-power economy. Specifically, the voltage-limit violation term is assigned the largest weight to strongly penalize unsafe operating conditions, while the reactive-power effort and action-smoothness terms are assigned smaller weights to discourage excessive inverter utilization and abrupt control changes without suppressing necessary voltage support. The action-smoothness term is implemented as a normalized soft penalty rather than a hard ramp-rate constraint. The adopted values are summarized in Table 3.

6. Results

This section presents a comprehensive evaluation of the proposed controller from six complementary perspectives. Section 6.1 analyzes learning dynamics, Section 6.2 benchmarks control performance under the standard scenario, and Section 6.3 examines behavior under progressively stronger disturbances. Section 6.4 evaluates policy transferability under an unseen mixed-disturbance condition, Section 6.5 validates the same control logic on a modified IEEE 33-bus feeder, and Section 6.6 quantifies the contribution of graph modeling and smoothness-aware reward design.

6.1. Training Convergence

Figure 6 compares the training reward trajectories of the proposed method and the MLP-PPO baseline under a matched 4000-episode training budget and the same learning rate. Both methods improve during the early stage of training, and their smoothed rewards stabilize toward the end of training, supporting a matched-budget comparison before evaluation. The proposed method reaches a higher reward level earlier and exhibits smaller late-stage fluctuations, whereas MLP-PPO stabilizes at a lower reward level.

These trends are consistent with the model designs. The proposed method employs a topology-aware graph representation that captures electrical coupling among buses and inverter-interfaced resources, thereby providing topology-structured state information for policy optimization. By contrast, MLP-PPO uses a flattened state representation that does not explicitly encode topology-dependent Volt/Var interactions. The observed training behavior is therefore consistent with the benefit of using topology-aware state representation in the considered Volt/Var control task.

6.2. Comparative Performance Evaluation

Table 4 and Figure 7 present the quantitative comparison of the proposed method, Droop control, and MLP-PPO under the standard test scenario. The proposed method achieves the best overall voltage-regulation performance, yielding the lowest AVD, VVR, MVD, and RPF among all compared methods. These results indicate that the proposed controller more effectively maintains bus voltages within the desired operating range while reducing temporal fluctuations in reactive power commands. Although its RPU is higher than that of Droop control and MLP-PPO, this additional reactive power support is accompanied by clear improvements in voltage quality and control smoothness. In contrast, Droop control exhibits the lowest reactive power utilization and computation time, but its voltage-regulation performance is the weakest. MLP-PPO improves upon Droop control in all voltage-quality-related metrics, yet remains inferior to the proposed method. The computation time of the proposed method is higher than that of the two baselines, but it remains at the millisecond level per control step, which is negligible relative to the 15 min control interval.
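For readers who wish to reproduce indicators of this kind from logged trajectories, the sketch below computes AVD, VVR, MVD, and RPF under plausible definitions. These definitions are assumptions of the sketch (mean and maximum absolute deviation from 1.0 per unit, fraction of bus-time samples outside $[0.95, 1.05]$ per unit, and mean absolute step-to-step change of the inverter commands); the formal metric definitions of the paper are given outside this section and may differ in normalization.

```python
V_REF = 1.0
V_MIN, V_MAX = 0.95, 1.05


def voltage_metrics(voltages):
    """Return (AVD, VVR, MVD); voltages[t][i] is bus i at step t in per unit."""
    samples = [v for step in voltages for v in step]
    deviations = [abs(v - V_REF) for v in samples]
    avd = sum(deviations) / len(samples)
    mvd = max(deviations)
    vvr = sum(1 for v in samples if v < V_MIN or v > V_MAX) / len(samples)
    return avd, vvr, mvd


def reactive_power_fluctuation(commands):
    """Mean |Q_t - Q_{t-1}| over steps and inverters.

    commands[t] holds the setpoints (Q_pv, Q_es) at step t.
    """
    changes = [
        abs(curr - prev)
        for prev_step, curr_step in zip(commands, commands[1:])
        for prev, curr in zip(prev_step, curr_step)
    ]
    return sum(changes) / len(changes)


# Tiny made-up log: two steps, two buses, and three command samples.
avd, vvr, mvd = voltage_metrics([[1.00, 0.94], [1.06, 1.00]])
rpf = reactive_power_fluctuation([(0.0, 0.0), (0.1, -0.1), (0.1, 0.1)])
```

Under these definitions a lower RPF directly reflects smaller consecutive command changes, which is the behavior the smoothness penalty in the reward is meant to encourage.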

Figure 8 shows the voltage trajectories at Bus 5, the terminal load bus and therefore the node most sensitive to feeder-impedance accumulation and voltage fluctuations in the radial microgrid. Compared with Droop control and MLP-PPO, the proposed method maintains the bus voltage closer to the nominal operating range throughout the episode and exhibits smaller temporal excursions. This observation is consistent with the lower AVD, VVR, and MVD values reported in Table 4. The result suggests that the proposed topology-aware policy is more effective in regulating voltage at this electrically sensitive terminal node under the considered operating condition.

Figure 9 compares the reactive power trajectories of the PV inverter and the storage inverter. The proposed method yields smoother inverter responses than the two baselines, consistent with its lower RPF value in Table 4. Droop control reacts directly to local voltage deviations and therefore shows less coordinated actuation across controllable resources. MLP-PPO, despite its adaptive learning capability, still exhibits less effective temporal regulation than the proposed method. By contrast, the proposed method produces more balanced and less fluctuating reactive-power support from both inverters, contributing to the observed improvements in voltage-regulation quality and control smoothness.

6.3. Robustness Evaluation

Table 5 and Figure 10 evaluate robustness under low-, medium-, and high-disturbance conditions. As disturbance intensity increases, performance deteriorates for all methods, but the deterioration rates differ. The proposed method remains the best performer across all disturbance levels, consistently yielding the lowest AVD, VVR, RPF, and MVD. By comparison, Droop control deteriorates most severely, while MLP-PPO remains between the proposed method and Droop control. Overall, these results indicate that the topology-aware graph reinforcement learning policy is less sensitive to disturbance escalation and maintains better voltage-regulation quality under operating uncertainty.

The advantage of the proposed method is most evident in the high-disturbance regime. From low to high disturbance, the absolute performance deterioration is milder for the proposed method than for Droop control and MLP-PPO. In particular, the proposed method preserves a substantially lower voltage violation rate in the most challenging condition, indicating a stronger ability to maintain secure voltage operation when photovoltaic fluctuations, load variations, and PCC disturbances intensify simultaneously.

Figure 11 and Figure 12 provide trajectory-level evidence under a representative high-disturbance episode. At Bus 5, the electrically sensitive terminal load bus, the proposed method keeps voltages closer to the admissible range and reduces temporal excursions relative to both baselines, consistent with the lower AVD, VVR, and MVD values in Table 5. The corresponding inverter responses are also smoother and less oscillatory, in agreement with the lower RPF values. Taken together, these observations show that the proposed controller sustains effective voltage support under severe disturbances without relying on highly fluctuating reactive-power commands.

6.4. Generalization Evaluation

Table 6 and Figure 13 report the performance of the proposed method, Droop control, and MLP-PPO under an unseen mixed-disturbance scenario. This test condition differs from the training setting and is used to examine whether the learned controller can retain effective voltage-reactive power control under a shifted operating distribution. The proposed method remains the best-performing controller in terms of AVD, VVR, RPF, MVD, and recovery time, indicating that its performance advantage is preserved under the considered unseen operating condition. MLP-PPO improves upon Droop control in all voltage-quality-related metrics, but it remains inferior to the proposed method. These results suggest that the topology-aware graph representation improves not only control performance under the nominal setting, but also the transfer of the learned policy to an unseen mixed-disturbance condition.

A closer examination of the quantitative results shows that the proposed method achieves the lowest average voltage deviation and the lowest voltage violation rate, while simultaneously maintaining the smallest reactive power fluctuation. It also attains the shortest recovery time, indicating faster voltage restoration after disturbance-induced deviation under the unseen scenario. Although its reactive power utilization is higher than that of either Droop control or MLP-PPO, the additional reactive support is accompanied by clear improvements in voltage quality, recovery performance, and control smoothness. The proposed method therefore provides a more effective voltage-regulation response under the considered unseen operating condition.

Figure 14 shows the voltage trajectories at Bus 5 under a representative episode of the unseen mixed-disturbance scenario. As the terminal load bus, Bus 5 is the node most sensitive to feeder impedance accumulation and disturbance propagation in the radial microgrid. Compared with Droop control and MLP-PPO, the proposed method keeps the bus voltage closer to the admissible range and exhibits smaller temporal excursions throughout the episode. This observation is consistent with the lower AVD, VVR, and MVD values reported in Table 6. The result indicates that the proposed controller retains stronger voltage regulation capability at the electrically sensitive terminal node under the considered unseen operating condition.

The reactive power responses of the PV inverter and the storage inverter are presented in Figure 15. The proposed method yields smoother and more temporally consistent inverter actuation than the two baselines, which agrees with its lower RPF value in Table 6. Droop control reacts directly to local voltage deviations and therefore exhibits less coordinated responses, while MLP-PPO, despite its adaptive learning capability, still shows less stable temporal regulation than the proposed method. By contrast, the proposed method provides more balanced reactive power support from the two controllable inverters, thereby contributing to improved voltage regulation under the unseen scenario without inducing excessive control fluctuation.

6.5. Validation on the IEEE 33-Bus Feeder

The IEEE 33-bus case is used as an additional topology-scale validation rather than as a replacement for the controlled five-bus experiments. In this validation, the proposed GCN-PPO controller and the MLP-PPO baseline are trained for 400 episodes with the same learning rate of $2 \times 10^{-4}$, the same PPO objective, the same reward formulation, and the same 96-step episode horizon. The Droop baseline is not trained and is evaluated under the same feeder model, disturbance profiles, PV/ESS placement, and inverter reactive-power limits. The evaluation uses 30 test episodes and five random seeds.

Table 7 and Figure 16 summarize the aggregate performance on the modified IEEE 33-bus feeder. The proposed method achieves the lowest AVD, VVR, MVD, and recovery time among the compared methods. In particular, VVR is reduced from 0.02999 for Droop control and 0.00726 for MLP-PPO to 0.00095, indicating a substantial reduction in voltage-limit violations on the larger radial feeder. AVD is also reduced from 0.02061 and 0.01664 to 0.01277, while MVD decreases from 0.03996 and 0.03197 to 0.02418. These results show that the benefit of topology-aware state representation is preserved when the same Volt/Var control logic is applied to a larger benchmark topology.
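The absolute values quoted above can also be expressed as relative reductions. The short sketch below recomputes them from the numbers in this paragraph; the helper name `relative_reduction` and the dictionary `TABLE7` are hypothetical, and the definition (baseline minus proposed, divided by baseline) is one common convention rather than a metric defined in the paper.

```python
def relative_reduction(baseline, proposed):
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - proposed) / baseline


# Values quoted in the text from Table 7: (Droop, MLP-PPO, proposed).
TABLE7 = {
    "VVR": (0.02999, 0.00726, 0.00095),
    "AVD": (0.02061, 0.01664, 0.01277),
    "MVD": (0.03996, 0.03197, 0.02418),
}

if __name__ == "__main__":
    for metric, (droop, mlp, proposed) in TABLE7.items():
        vs_droop = 100 * relative_reduction(droop, proposed)
        vs_mlp = 100 * relative_reduction(mlp, proposed)
        print(f"{metric}: {vs_droop:.1f}% vs Droop, {vs_mlp:.1f}% vs MLP-PPO")
```

Under this convention, the VVR reduction is roughly 97% relative to Droop control and roughly 87% relative to MLP-PPO, whereas the AVD and MVD reductions lie between about 23% and 40%.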

The training reward trajectories in Figure 17 further support the comparability of the learning-based methods. Under the same 400-episode training budget, the proposed method converges to a higher smoothed reward level than MLP-PPO. This observation is consistent with the quantitative results in Table 7 and indicates that the improvement is not due to a larger training budget.

Figure 18 presents the voltage trajectory at Bus 18, where the PV inverter is connected and low-voltage excursions are pronounced in the evaluation traces. Compared with Droop control and MLP-PPO, the proposed method keeps the Bus 18 voltage closer to the admissible range and substantially reduces the number of time steps outside $[0.95, 1.05]$ per unit. Specifically, the Bus 18 trajectory contains one voltage-limit violation under the proposed method, compared with 25 under Droop control and six under MLP-PPO. This trajectory-level evidence is consistent with the lower AVD, VVR, and MVD values reported in Table 7.
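The per-trajectory count used above is simply the number of time steps whose voltage falls outside the admissible band. A minimal sketch follows, assuming that values exactly on the band limits count as admissible (an assumption, since the boundary treatment is not stated here); the function names and the example trajectory are hypothetical and do not reproduce the paper's data. This single-bus count is a trajectory-level illustration and should not be confused with the aggregate VVR in Table 7.

```python
def count_violations(voltages, v_min=0.95, v_max=1.05):
    """Count time steps whose per-unit voltage lies outside [v_min, v_max]."""
    return sum(1 for v in voltages if v < v_min or v > v_max)


def violation_fraction(voltages, v_min=0.95, v_max=1.05):
    """Fraction of time steps outside the admissible band for one trajectory."""
    return count_violations(voltages, v_min, v_max) / len(voltages)


if __name__ == "__main__":
    # Synthetic 96-step trajectory (not the paper's data): one low-voltage step.
    trajectory = [1.0] * 95 + [0.94]
    print(count_violations(trajectory), violation_fraction(trajectory))
```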

The corresponding reactive-power trajectories are shown in Figure 19. The proposed controller provides stronger reactive-power support from both the PV and ESS inverters, which explains its higher RPU in Table 7. Its RPF is substantially lower than that of Droop control and remains close to that of MLP-PPO, although MLP-PPO attains the smallest RPF in this validation. Therefore, the IEEE 33-bus results should be interpreted as a voltage-quality improvement achieved with higher reactive-power utilization and slightly higher computational cost, rather than as evidence of uniform superiority across all metrics. Since the proposed method still requires only 3.28 ms per control step on average, the additional graph-processing cost remains negligible relative to the 15 min control interval considered in this study.
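As a rough check of the timing claim, the reported mean inference time can be compared with the 15 min control interval from Table 1. The symbol $t_{\mathrm{infer}}$ is introduced here for illustration only and does not appear elsewhere in the paper.

```latex
\frac{t_{\mathrm{infer}}}{\Delta t}
  = \frac{3.28 \times 10^{-3}\ \mathrm{s}}{15 \times 60\ \mathrm{s}}
  \approx 3.6 \times 10^{-6}
```

The per-step inference cost therefore occupies only a few millionths of each control interval.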

Overall, the IEEE 33-bus validation complements the five-bus experiments by showing that the proposed graph-based formulation remains effective when applied to a larger radial benchmark feeder. The results support the same qualitative conclusion as the main experiments: topology-aware state representation is associated with lower voltage-deviation and voltage-violation metrics under the tested conditions. The contribution of this representation is further isolated in the ablation study below.

6.6. Ablation Analysis

To further examine the contribution of the key design components in the proposed method, ablation studies are conducted under the standard test scenario. Two reduced variants are considered: w/o graph modeling, which replaces the topology-aware graph representation with the flat state representation used in the non-graph controller, and w/o smoothness term, which removes the smoothness penalty from the reward formulation. The quantitative results are summarized in Table 8 and Figure 20. Overall, the full model achieves the most favorable balance between voltage regulation quality and control behavior.

Removing graph modeling leads to the largest degradation in voltage-quality-related metrics. Compared with the full model, w/o graph modeling yields higher AVD, VVR, and MVD, indicating that the controller becomes less effective in maintaining bus voltages close to the admissible operating range once topology-aware state representation is removed. This result suggests that graph modeling is a major contributor to the performance gain of the proposed method. By contrast, removing the smoothness term has a more moderate effect on AVD, VVR, and MVD, but it causes a substantial increase in RPF, indicating that the smoothness-aware reward design plays a direct role in suppressing unnecessary temporal fluctuation in reactive power control. The increase in RPU observed for w/o smoothness term further indicates that removing smoothness regularization is associated with more intensive use of reactive power support.

Figure 21 further illustrates the role of the smoothness term through the reactive power trajectories of the PV and storage inverters. Compared with the full model, w/o smoothness term exhibits more fluctuating control actions over time, which is consistent with its significantly higher RPF value in Table 8. This observation indicates that the smoothness penalty not only regularizes training but also improves the temporal consistency of inverter actuation in the learned policy.
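To make the notion of temporal smoothness concrete, the sketch below scales step-to-step changes in a reactive-power command by $2 Q_{j}^{max}$, following the normalization stated in the footnote to Table 1. The penalty form used here (mean squared normalized step) is an assumption chosen for illustration; the exact expressions in (2) and (15) may differ, and the function names and trajectories are hypothetical.

```python
def normalized_steps(q, q_max):
    """Step-to-step changes of a reactive-power trajectory, scaled by 2*q_max.

    The scale 2*q_max follows the Table 1 footnote; with |q| <= q_max every
    normalized step lies in [-1, 1].
    """
    scale = 2.0 * q_max
    return [(b - a) / scale for a, b in zip(q, q[1:])]


def smoothness_penalty(q, q_max):
    """ASSUMED form: mean squared normalized step (the paper's form may differ)."""
    steps = normalized_steps(q, q_max)
    return sum(s * s for s in steps) / len(steps)


if __name__ == "__main__":
    # Synthetic trajectories (not the paper's data), Q_max = 0.4 per unit.
    smooth = [0.10, 0.12, 0.14, 0.15, 0.15]
    jittery = [0.10, -0.20, 0.25, -0.15, 0.30]
    print(smoothness_penalty(smooth, 0.4), smoothness_penalty(jittery, 0.4))
```

A command sequence that oscillates between consecutive steps accumulates a larger penalty than one that drifts gradually, which is the qualitative behavior the smoothness term is meant to discourage.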

The effect of graph modeling is further illustrated in Figure 22, which compares the voltage trajectories at Bus 5 for the full model and w/o graph modeling. As the terminal load bus, Bus 5 is the node most sensitive to feeder impedance accumulation and voltage fluctuation in the radial microgrid. The full model maintains the terminal-bus voltage closer to the desired operating range, whereas removing graph modeling leads to larger temporal deviation from the nominal profile. This result is consistent with the increases in AVD, VVR, and MVD reported in Table 8, and suggests that topology-aware graph representation helps the controller utilize network-dependent information more effectively in voltage-reactive power regulation.

7. Discussion

The findings of this study indicate that topology-aware graph reinforcement learning provides an effective framework for coordinated voltage-reactive power control in grid-connected microgrids under uncertainty. The main significance of the proposed method lies not merely in replacing a conventional controller with a learning-based one, but in embedding electrical topology, resource heterogeneity, and sequential control objectives into a unified closed-loop decision process. Under the standard five-bus test scenario, the proposed method consistently outperforms Droop control and the non-graph MLP-PPO baseline in voltage-quality-related metrics, while also achieving smoother reactive power control. The IEEE 33-bus validation further shows that the same graph-based formulation retains lower voltage-deviation and voltage-violation metrics on a larger radial benchmark feeder. These results suggest that the performance gain cannot be attributed to the PPO framework alone. Rather, the explicit incorporation of topology-aware graph representation enables the controller to preserve network-dependent information that is directly relevant to coordinated Volt/Var regulation.

The robustness and generalization results further strengthen this interpretation. As disturbance intensity increases, all compared methods exhibit performance degradation, yet the proposed method shows the least deterioration in average voltage deviation, voltage violation rate, control fluctuation, and worst-case voltage deviation. In the unseen mixed-disturbance scenario, the proposed method also retains superior performance relative to Droop control and MLP-PPO. Taken together, these results suggest that the learned graph-based controller captures structural control regularities that remain effective beyond the nominal training setting. From a practical perspective, this is important because grid-connected microgrids rarely operate under a single fixed scenario; instead, they are continuously exposed to renewable intermittency, load uncertainty, and upstream grid disturbances. The proposed method therefore appears to improve not only nominal control quality, but also the adaptability of inverter coordination under distribution shift and disturbance escalation.

The ablation study provides additional insight into why the proposed framework is effective. Removing graph modeling leads to the largest degradation in voltage-quality-related metrics, indicating that topology-aware representation is a major contributor to the overall performance improvement. By contrast, removing the smoothness term causes the most pronounced increase in reactive power fluctuation and also weakens voltage regulation to a lesser extent, showing that reward design plays an important role in shaping temporally consistent control behavior. These observations imply that the two main design components of the proposed framework serve complementary functions: graph modeling improves the controller’s ability to utilize network-dependent information for voltage regulation, whereas the smoothness term enhances practical control behavior by suppressing unnecessary oscillatory actuation. At the same time, the results reveal a meaningful trade-off. The proposed method generally uses more reactive power support, and its per-step computation cost is higher than that of either baseline. In the IEEE 33-bus validation, its RPF is also slightly higher than that of MLP-PPO, although it remains much lower than that of Droop control. Therefore, the main advantage of the proposed controller should be interpreted primarily as improved voltage quality and reduced voltage violations with acceptable control-effort and computation trade-offs. Since the inference time remains at the millisecond level, the added graph-encoding cost is still compatible with the adopted control interval.

Several limitations should also be noted. First, the five-bus system remains the primary controlled experimental environment, while the IEEE 33-bus feeder serves as an additional benchmark validation rather than a full representation of heterogeneous practical distribution systems. Second, the control setting focuses on reactive power coordination of one PV inverter and one storage inverter, whereas future applications may involve a larger number of controllable resources and more complex hierarchical interactions. Third, all evaluations are performed in simulation, where state observation, power-flow evolution, and disturbance generation remain internally consistent; real deployment may introduce additional challenges such as model mismatch, measurement noise, communication delay, and device-level constraints. These limitations point to several directions for future work, including extension to larger and more heterogeneous feeder topologies, incorporation of explicit safety or constraint-aware reinforcement learning, and validation in hardware-in-the-loop or real-time digital simulation environments. Nevertheless, the present results support the central conclusion of this work: topology-aware learning constitutes a meaningful and physically grounded direction for uncertainty-aware Volt/Var control in renewable-rich grid-connected microgrids.

8. Conclusions

In this paper, we investigated the coordinated voltage-reactive power control problem in grid-connected microgrids under photovoltaic fluctuations, load variations, and PCC disturbances, and we proposed a topology-aware graph reinforcement learning framework to address this challenge. We formulated the control task as a constrained sequential decision problem and integrated topology-aware graph representation with PPO-based actor–critic learning to generate continuous reactive power commands for the PV and storage inverters in a closed-loop manner. In this way, we combined physical system evolution, graph-based state embedding, and adaptive policy optimization within a unified control framework.

Our results show that the proposed method achieves the best overall performance among the compared methods under the standard five-bus test scenario. Relative to Droop control and the non-graph MLP-PPO baseline, it improves voltage regulation quality, reduces voltage-limit violations, and maintains smoother reactive power control. The robustness evaluation further shows that the proposed method degrades more gracefully as disturbance intensity increases, while the generalization evaluation indicates that it remains effective under an unseen mixed-disturbance operating condition. Additional validation on a modified IEEE 33-bus feeder confirms that the proposed graph-based formulation also reduces AVD, VVR, MVD, and recovery time on a larger radial benchmark feeder, although this improvement is accompanied by higher reactive-power utilization and higher per-step computation time. In addition, the ablation study shows that topology-aware graph modeling is a major contributor to the observed performance gain, whereas the smoothness-aware reward design plays an important role in suppressing unnecessary control fluctuation and improving practical control behavior.

Overall, we find that explicitly incorporating electrical topology into reinforcement learning improves the adaptive coordination of inverter-interfaced resources for Volt/Var control in uncertain microgrid environments and remains effective under the tested IEEE 33-bus benchmark setting. The proposed framework therefore provides a physically informed basis for uncertainty-aware voltage-reactive power control in renewable-rich grid-connected microgrids. In future work, we will extend the approach to larger and more heterogeneous network topologies with multiple controllable resources, investigate stronger safety-aware learning mechanisms, and further examine practical deployment through real-time simulation or hardware-in-the-loop validation.

Author Contributions

Conceptualization, Y.Z. and K.B.; methodology, Y.Z., G.L. and W.Z.; software, Y.Z., K.B. and W.Z.; validation, Y.Z., K.B., G.L. and L.Q.; formal analysis, Y.Z., L.Q. and D.T.; investigation, K.B., W.Z., D.T. and X.L.; resources, L.Q., D.T. and M.Z.; data curation, X.L. and M.Z.; writing—original draft, Y.Z. and K.B.; writing—review and editing, G.L. and W.Z.; visualization, K.B., G.L. and W.Z.; supervision, Y.Z. and K.B.; project administration, Y.Z.; funding acquisition, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Project of State Grid Jiangsu Electric Power Company (No. J2025160).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

Y.Z., K.B., G.L. and L.Q. are employees of Xuzhou Power Supply Company, State Grid Jiangsu Electric Power Co., Ltd. All authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figure 1. Schematic of a grid-connected microgrid with PCC, PV, battery storage, and local loads, where PV intermittency, load variation, and PCC disturbances motivate coordinated voltage–reactive power control.

Figure 2. Configuration of the considered five-bus grid-connected microgrid with the chain radial topology used for simulation and graph construction.

Figure 3. Overall framework of the proposed approach.

Figure 4. PPO-based closed-loop training framework of the proposed method.

Figure 5. Modified IEEE 33-bus radial distribution feeder used for additional validation. Bus 1 is the PCC/slack bus, a PV inverter is connected at Bus 18, and an ESS inverter is connected at Bus 33.

Figure 6. Training reward trajectories of the proposed method and the MLP-PPO baseline under a matched 4000-episode training budget with learning rate $2 \times 10^{-4}$.

Figure 7. Comparison of core performance metrics under the standard test scenario.

Figure 8. Voltage trajectories at Bus 5 under the standard test scenario.

Figure 9. Reactive power trajectories of the PV and storage inverters under the standard test scenario.

Figure 10. Robustness comparison of core performance metrics under different disturbance intensities.

Figure 11. Voltage trajectories at Bus 5 under the high-disturbance scenario.

Figure 12. Reactive power trajectories of the PV and storage inverters under the high-disturbance scenario.

Figure 13. Comparison of core performance metrics under the unseen mixed-disturbance scenario.

Figure 14. Voltage trajectories at Bus 5 under the unseen mixed-disturbance scenario.

Figure 15. Reactive power trajectories of the PV and storage inverters under the unseen mixed-disturbance scenario.

Figure 16. Comparison of aggregate performance metrics on the IEEE 33-bus validation feeder.

Figure 17. Training reward trajectories of the proposed method and MLP-PPO on the IEEE 33-bus feeder over 400 episodes.

Figure 18. Voltage trajectory at Bus 18 on the IEEE 33-bus validation feeder.

Figure 19. Reactive power trajectories of the PV and ESS inverters on the IEEE 33-bus validation feeder.

Figure 20. Comparison of ablation results under the standard test scenario.

Figure 21. Reactive power trajectories with and without the smoothness term under the standard test scenario.

Figure 22. Voltage trajectories with and without graph modeling under the standard test scenario.

Table 1. Key parameters of the simulation environment.

Parameter	Value
Feeder parameters
Power base	$S_{base} = 1.0$ per unit (normalized synthetic base)
Voltage base	$V_{base} = V^{ref} = 1.0$ per unit
Line 1–2 impedance	$(0.010, 0.050)$ per unit
Line 2–3 impedance	$(0.018, 0.075)$ per unit
Line 3–4 impedance	$(0.015, 0.060)$ per unit
Line 4–5 impedance	$(0.012, 0.045)$ per unit
DER and load configuration
PV inverter apparent-power rating	$S_{pv,inv}^{rated} = 1.3$ per unit
Initial PV active power	$P_{pv,0} = 0.82$ per unit
PV active power range ^a	$[0, 1.2]$ per unit
PV reactive power control range	$Q_{pv} \in [-0.4, 0.4]$ per unit
Storage inverter apparent-power rating	$S_{es,inv}^{rated} = 0.4$ per unit
Storage reactive power control range	$Q_{es} \in [-0.4, 0.4]$ per unit
Bus 2 nominal load	$(P_{L2}, Q_{L2}) = (0.48, 0.18)$ per unit
Bus 5 nominal load	$(P_{L5}, Q_{L5}) = (0.35, 0.14)$ per unit
Operational constraints
Voltage range	$[0.95, 1.05]$ per unit
Severe voltage violation threshold	$V < 0.92$ per unit or $V > 1.08$ per unit
PV reactive power action range	$[-0.4, 0.4]$ per unit
ESS reactive power action range	$[-0.4, 0.4]$ per unit
Ramp-rate constraint	Not imposed as a hard constraint ^b
Simulation horizon
Control interval	$\Delta t = 15$ min
Episode length	$T = 96$ steps
PV fluctuation
Low	std $= 0.02$, periodic $= 0.03$
Medium	std $= 0.05$, periodic $= 0.06$
High	std $= 0.10$, periodic $= 0.10$
Load variation
Low	std $= 0.01$, periodic $= 0.02$
Medium	std $= 0.025$, periodic $= 0.04$
High	std $= 0.050$, periodic $= 0.08$
PCC disturbance
Low	std $= 0.005$, prob $= 0.01$, mag $= 0.010$
Medium	std $= 0.010$, prob $= 0.02$, mag $= 0.020$
High	std $= 0.020$, prob $= 0.04$, mag $= 0.040$

^a The PV active power is an exogenous renewable input normalized to the nominal PV active-power base and clipped to $[0, 1.2]$ per unit in the simulation environment. With $S_{p v, inv}^{rated} = 1.3$ per unit, the fixed $| Q_{p v} | \leq 0.4$ per unit control range remains within the modeled inverter apparent-power reserve over this range.

^b Temporal control smoothness is handled by the normalized soft penalty in (2) and (15), with $Δ Q_{j}^{scale} = 2 Q_{j}^{max}$.
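The reserve claim in footnote a can be checked with a few lines of arithmetic. The sketch below assumes the usual circular inverter capability limit, $P^2 + Q^2 \leq S^2$; this constraint form is an assumption made for illustration, since the footnote only states that the fixed range stays within the modeled reserve. The function name `reactive_headroom` is hypothetical.

```python
import math

S_PV_RATED = 1.3   # PV inverter apparent-power rating (per unit), Table 1
P_PV_MAX = 1.2     # upper clip of PV active power (per unit), Table 1
Q_PV_MAX = 0.4     # PV reactive-power control limit (per unit), Table 1


def reactive_headroom(s_rated: float, p: float) -> float:
    """Reactive power available at active power p.

    Assumes a circular capability curve, P^2 + Q^2 <= S^2 (illustrative
    assumption, not taken from the paper's equations).
    """
    return math.sqrt(max(s_rated**2 - p**2, 0.0))


# Worst case for reactive headroom is the largest admissible active power.
worst_case = reactive_headroom(S_PV_RATED, P_PV_MAX)
print(f"headroom at P = {P_PV_MAX}: {worst_case:.3f} per unit")
print("fixed Q range feasible:", worst_case >= Q_PV_MAX)
```

Under this assumption the headroom at the upper clip of $1.2$ per unit is $\sqrt{1.69 - 1.44} = 0.5$ per unit, which exceeds the $0.4$ per unit control limit, so the fixed range is feasible over the whole active-power range.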

Table 2. Main training and implementation settings for the five-bus experiments.

Parameter	Value
Five-bus training protocol
Training episodes (GCN-PPO/MLP-PPO)	4000/4000
Episode length	96 steps
Training disturbances (PV/load/PCC)	medium/medium/medium
Main comparison disturbances (PV/load/PCC)	medium/medium/medium
Robustness disturbances	low/medium/high
Generalization disturbances (PV/load/PCC)	high/medium/low
Proposed method
Graph encoder	2-layer edge-augmented GCN, hidden dimension 64
Graph directionality	Undirected physical feeder graph
Adjacency normalization	Symmetric normalization with self-loops, (28)
Edge feature incorporation	Feeder $(R_{i j}, X_{i j})$ projected into GCN messages, (29)
Readout layer	Mean pooling over bus embeddings
Actor hidden layers	128, 128
Critic hidden layers	128, 128
Discount factor $γ$	0.99
GAE parameter $λ_{GAE}$	0.95
PPO clip coefficient $ϵ$	0.2
Minibatch size/update epochs	64/8
MLP-PPO
State representation (five-bus)	12-dimensional flat state vector
Actor hidden layers	128, 128
Critic hidden layers	128, 128
Shared settings
Reward and action space	identical for both learning-based methods
PPO objective and Gaussian policy form	identical for both learning-based methods
Shared learning rate	$2 \times 10^{- 4}$ for both learning-based methods
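
The "symmetric normalization with self-loops" entry in Table 2 can be illustrated on the five-bus feeder. The sketch below uses the standard GCN form $D^{-1/2} (A + I) D^{-1/2}$ on the chain 1–2–3–4–5 implied by the line list in Table 1. Whether this matches (28) term by term is an assumption, and the edge-feature projection of $(R_{i j}, X_{i j})$ in (29) is not modeled here. The names `EDGES` and `normalized_adjacency` are hypothetical.

```python
import math

# Five-bus radial feeder: lines 1-2, 2-3, 3-4, 4-5 (Table 1), 0-indexed here.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
N_BUS = 5


def normalized_adjacency(n, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected graph.

    Standard GCN normalization; assumed to correspond to the paper's (28).
    """
    a = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0        # undirected physical feeder graph
    for i in range(n):
        a[i][i] = 1.0                  # self-loops
    deg = [sum(row) for row in a]      # degree including the self-loop
    return [
        [a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


A_HAT = normalized_adjacency(N_BUS, EDGES)
for row in A_HAT:
    print(" ".join(f"{x:.3f}" for x in row))
```

With self-loops, the end buses have degree 2 and the interior buses degree 3, so each row mixes a bus's own features with those of its electrical neighbors only; non-adjacent buses interact through the second GCN layer.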

Table 3. Reward and cost weighting coefficients used in the experiments.

Parameter	Value	Description
$α$	100	Weight of voltage-deviation penalty
$β$	0.10	Weight of reactive-power effort penalty
$λ_{Δ q}$	0.50	Weight of action-smoothness penalty
$ξ$	1000	Weight of voltage-limit violation penalty
$w_{i}$	1	Bus-priority weight for each regulated bus
$Δ Q_{j}^{scale}$	$2 Q_{j}^{max}$	Normalization scale for action-smoothness cost

For the five-bus system, $w_{i} = 1$ is used for all regulated buses. For the IEEE 33-bus validation, $w_{i} = 1$ is used for all non-slack regulated buses, i.e., Bus 2–Bus 33, while Bus 1 is treated as the PCC/slack bus. The same weighting coefficients are used for GCN-PPO and MLP-PPO. For the five-bus PV and ESS inverters, $Q_{j}^{max} = 0.4$ per unit and therefore $Δ Q_{j}^{scale} = 0.8$ per unit. The coefficient $λ_{Δ q}$ is distinct from the PPO GAE parameter $λ_{GAE}$.
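
The role of the normalization scale can be seen in a small sketch. With $Δ Q_{j}^{scale} = 2 Q_{j}^{max}$, the largest possible setpoint change, from $- Q_{j}^{max}$ to $+ Q_{j}^{max}$, maps to a normalized step of exactly 1. The code below shows only this scaling; how the normalized step enters the penalty (for example, squared or absolute) is defined in (2) and (15) and is not reproduced here. The function name `normalized_step` is hypothetical.

```python
Q_MAX = 0.4                 # per-unit reactive limit of the five-bus inverters
DELTA_Q_SCALE = 2 * Q_MAX   # 0.8 per unit, as stated in the Table 3 note


def normalized_step(q_now: float, q_prev: float,
                    scale: float = DELTA_Q_SCALE) -> float:
    """Change in reactive-power setpoint, scaled to lie in [-1, 1]."""
    return (q_now - q_prev) / scale


full_swing = normalized_step(Q_MAX, -Q_MAX)   # largest possible step
small_move = normalized_step(0.10, 0.05)      # a gentle adjustment

print(f"full swing: {full_swing:.4f}")
print(f"small move: {small_move:.4f}")
```

Because the step is dimensionless and bounded, the weight $λ_{Δ q}$ keeps the same meaning for inverters with different reactive-power limits.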

Table 4. Quantitative comparison under the standard test scenario.

Metric	Proposed	Droop	MLP-PPO
AVD	$0.0138 \pm 0.0006$	$0.0226 \pm 0.0010$	$0.0198 \pm 0.0008$
VVR	$0.0048 \pm 0.0019$	$0.0578 \pm 0.0107$	$0.0316 \pm 0.0086$
RPU (trade-off)	$0.0724 \pm 0.0033$	$0.0116 \pm 0.0018$	$0.0579 \pm 0.0026$
RPF	$0.0008 \pm 0.0001$	$0.0030 \pm 0.0004$	$0.0015 \pm 0.0003$
MVD	$0.0386 \pm 0.0024$	$0.0612 \pm 0.0049$	$0.0518 \pm 0.0038$
Comp. Time (ms/step)	$1.800 \pm 0.200$	$0.200 \pm 0.000$	$1.300 \pm 0.100$

For AVD, VVR, RPF, MVD, and time-related metrics, lower values indicate better performance. RPU is reported as a trade-off metric that reflects reactive-power support effort.
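
As a reading aid for Table 4, the sketch below converts the reported means into relative changes. It uses only the mean values from the table and ignores the reported standard deviations, so the percentages are indicative rather than statistical statements. The helper name `relative_reduction` is hypothetical.

```python
# Mean values copied from Table 4 (standard test scenario).
VVR = {"Proposed": 0.0048, "Droop": 0.0578, "MLP-PPO": 0.0316}
RPU = {"Proposed": 0.0724, "Droop": 0.0116, "MLP-PPO": 0.0579}


def relative_reduction(baseline: float, value: float) -> float:
    """Fractional reduction of value relative to baseline."""
    return 1.0 - value / baseline


vvr_vs_droop = relative_reduction(VVR["Droop"], VVR["Proposed"])
vvr_vs_mlp = relative_reduction(VVR["MLP-PPO"], VVR["Proposed"])
# Negative reduction means the proposed method uses more reactive power.
rpu_vs_mlp = relative_reduction(RPU["MLP-PPO"], RPU["Proposed"])

print(f"VVR reduction vs Droop:   {100 * vvr_vs_droop:.1f}%")
print(f"VVR reduction vs MLP-PPO: {100 * vvr_vs_mlp:.1f}%")
print(f"RPU change vs MLP-PPO:    {-100 * rpu_vs_mlp:+.1f}%")
```

On the means, VVR falls by roughly 92% relative to Droop and 85% relative to MLP-PPO, while reactive-power usage rises by about 25% relative to MLP-PPO, which is why RPU is labeled a trade-off metric.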

Table 5. Performance comparison under different disturbance intensities.

Disturbance	Metric	Proposed	Droop	MLP-PPO
Low	AVD	$0.0109 \pm 0.0005$	$0.0185 \pm 0.0008$	$0.0164 \pm 0.0007$
	VVR	$0.0016 \pm 0.0007$	$0.0280 \pm 0.0065$	$0.0152 \pm 0.0044$
	RPF	$0.0006 \pm 0.0001$	$0.0022 \pm 0.0003$	$0.0011 \pm 0.0002$
	MVD	$0.0315 \pm 0.0018$	$0.0503 \pm 0.0035$	$0.0436 \pm 0.0031$
Medium	AVD	$0.0138 \pm 0.0006$	$0.0226 \pm 0.0010$	$0.0198 \pm 0.0008$
	VVR	$0.0048 \pm 0.0019$	$0.0578 \pm 0.0107$	$0.0316 \pm 0.0086$
	RPF	$0.0008 \pm 0.0001$	$0.0030 \pm 0.0004$	$0.0015 \pm 0.0003$
	MVD	$0.0386 \pm 0.0024$	$0.0612 \pm 0.0049$	$0.0518 \pm 0.0038$
High	AVD	$0.0186 \pm 0.0009$	$0.0318 \pm 0.0015$	$0.0264 \pm 0.0011$
	VVR	$0.0132 \pm 0.0042$	$0.1145 \pm 0.0178$	$0.0681 \pm 0.0129$
	RPF	$0.0011 \pm 0.0002$	$0.0046 \pm 0.0006$	$0.0023 \pm 0.0004$
	MVD	$0.0479 \pm 0.0031$	$0.0794 \pm 0.0060$	$0.0647 \pm 0.0047$

For AVD, VVR, RPF, and MVD, lower values indicate better performance.

Table 6. Performance comparison under the unseen mixed-disturbance scenario.

Metric	Proposed	Droop	MLP-PPO
AVD	$0.0162 \pm 0.0007$	$0.0269 \pm 0.0012$	$0.0221 \pm 0.0010$
VVR	$0.0098 \pm 0.0031$	$0.0815 \pm 0.0144$	$0.0456 \pm 0.0108$
RPU (trade-off)	$0.0768 \pm 0.0034$	$0.0124 \pm 0.0018$	$0.0615 \pm 0.0028$
RPF	$0.0010 \pm 0.002$	$0.0037 \pm 0.0005$	$0.0019 \pm 0.0003$
MVD	$0.0438 \pm 0.0028$	$0.0708 \pm 0.0053$	$0.0579 \pm 0.0042$
Recovery Time	$5.6 \pm 0.7$	$15.2 \pm 1.9$	$9.1 \pm 1.1$

For AVD, VVR, RPF, MVD, and time-related metrics, lower values indicate better performance. RPU is reported as a trade-off metric that reflects reactive-power support effort.

Table 7. Quantitative comparison on the IEEE 33-bus validation feeder.

Method	AVD	VVR	RPU	RPF	MVD	Recovery (steps)	Time (ms/step)
Proposed	$0.0128 \pm 0.0009$	$0.0010 \pm 0.0015$	$0.0678 \pm 0.0054$	$0.0024 \pm 0.0003$	$0.0242 \pm 0.0019$	$5.90 \pm 0.83$	$3.28 \pm 0.38$
Droop	$0.0206 \pm 0.0021$	$0.0300 \pm 0.0054$	$0.0141 \pm 0.0017$	$0.0043 \pm 0.0007$	$0.0400 \pm 0.0040$	$13.40 \pm 1.88$	$0.32 \pm 0.06$
MLP-PPO	$0.0166 \pm 0.0015$	$0.0073 \pm 0.0016$	$0.0366 \pm 0.0029$	$0.0023 \pm 0.0003$	$0.0320 \pm 0.0029$	$8.70 \pm 1.22$	$1.74 \pm 0.21$

Values are reported as mean ± standard deviation over 30 test episodes and five random seeds. Lower is better for AVD, VVR, RPF, MVD, recovery time, and computation time. RPU is reported as a control-effort trade-off metric. VVR is computed over bus–time samples, and MVD is computed as the temporal mean of the instantaneous maximum bus-voltage deviation, i.e., $\frac{1}{T} \sum_{t} \max_{i} | V_{i, t} - 1 |$. The plotted trajectories provide trajectory-level evidence rather than substitutes for the aggregate statistics.
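
The two metric definitions in the note can be sketched directly. The code below computes VVR as the fraction of bus–time samples outside the voltage band and MVD as the temporal mean of the per-step maximum deviation from 1 per unit. Treating a violation as any sample outside the $[0.95, 1.05]$ per unit range of Table 1 is an assumption, and the trajectory values are made up for illustration, not taken from the experiments. The function names `vvr` and `mvd` are hypothetical.

```python
V_MIN, V_MAX = 0.95, 1.05   # admissible voltage band (per unit), Table 1


def vvr(voltages):
    """Fraction of bus-time samples outside [V_MIN, V_MAX].

    voltages[t][i] is the per-unit voltage of regulated bus i at step t.
    """
    samples = [v for step in voltages for v in step]
    violations = sum(1 for v in samples if v < V_MIN or v > V_MAX)
    return violations / len(samples)


def mvd(voltages, v_ref=1.0):
    """(1/T) * sum_t max_i |V[i, t] - v_ref|."""
    per_step_max = [max(abs(v - v_ref) for v in step) for step in voltages]
    return sum(per_step_max) / len(per_step_max)


# Hypothetical 3-step, 4-bus trajectory (illustrative values only).
traj = [
    [1.00, 0.99, 0.97, 0.96],
    [1.00, 0.98, 0.96, 0.94],
    [1.00, 1.01, 1.03, 1.06],
]

print(f"VVR = {vvr(traj):.4f}")
print(f"MVD = {mvd(traj):.4f}")
```

In this toy trajectory, 2 of the 12 samples violate the band, and the per-step maxima are 0.04, 0.06, and 0.06 per unit. Note that MVD averages the worst bus at each step, so it can be dominated by a single weak bus even when VVR is small.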

Table 8. Ablation comparison under the standard test scenario.

Metric	Proposed	w/o Graph	w/o Smoothness
AVD	$0.0138 \pm 0.0006$	$0.0198 \pm 0.0008$	$0.0156 \pm 0.0007$
VVR	$0.0048 \pm 0.0019$	$0.0316 \pm 0.0086$	$0.0109 \pm 0.0038$
RPU (trade-off)	$0.0724 \pm 0.0033$	$0.0579 \pm 0.0026$	$0.0798 \pm 0.0036$
RPF	$0.0008 \pm 0.0001$	$0.0015 \pm 0.0003$	$0.0022 \pm 0.0004$
MVD	$0.0386 \pm 0.0024$	$0.0518 \pm 0.0038$	$0.0439 \pm 0.0029$

For AVD, VVR, RPF, and MVD, lower values indicate better performance. RPU is reported as a trade-off metric that reflects reactive-power support effort. Variant labels: "w/o Graph" denotes the variant without graph modeling, and "w/o Smoothness" denotes the variant without the smoothness term in the reward.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, Y.; Bao, K.; Liang, G.; Zhuang, W.; Qiang, L.; Tang, D.; Lu, X.; Zhang, M. Topology-Aware Graph Reinforcement Learning for Voltage-Reactive Power Control in Grid-Connected Microgrids. Electricity 2026, 7, 60. https://doi.org/10.3390/electricity7020060

