Multi-Agent Proximal Policy Optimization for Coordinated Adaptive Control of Photovoltaic Inverter Clusters in Active Distribution Networks

Wang, Gongrun; Sun, Shumin; Cheng, Yan; Yu, Peng; Wang, Shibo; Zhao, Xueshen

doi:10.3390/en19040978

by Gongrun Wang ¹, Shumin Sun ¹,*, Yan Cheng ¹, Peng Yu ¹, Shibo Wang ¹ and Xueshen Zhao ²

¹ State Grid Shandong Electric Power Research Institute, Jinan 250002, China

² School of Electrical and Electronic Engineering, Shandong University of Technology, Zibo 255000, China

* Author to whom correspondence should be addressed.

Energies 2026, 19(4), 978; https://doi.org/10.3390/en19040978

Submission received: 21 December 2025 / Revised: 14 January 2026 / Accepted: 26 January 2026 / Published: 13 February 2026

(This article belongs to the Topic Intelligent, Flexible, and Effective Operation of Smart Grids with Novel Energy Technologies and Equipment)

Abstract

High penetration of distributed photovoltaic (PV) generation has transformed active distribution networks into inverter-dominated systems, where maintaining voltage stability, minimizing power losses, and maximizing renewable utilization under uncertainty remain significant challenges. Conventional centralized optimal power flow (OPF) and ADMM-based distributed optimization methods suffer from scalability limitations, high computational latency, and reliance on accurate system models, while single-agent reinforcement learning approaches such as PPO struggle with non-stationarity and lack of coordination in multi-inverter settings. To address these limitations, this paper proposes a coordinated control framework based on Multi-Agent Proximal Policy Optimization (MAPPO) for photovoltaic inverter clusters. By adopting centralized training with decentralized execution, the proposed approach enables effective coordination among heterogeneous inverter agents while preserving real-time autonomy. The framework explicitly incorporates network-level objectives, inverter operational constraints, and stochastic irradiance and load uncertainties, allowing agents to learn adaptive and robust control strategies. Simulation studies on a modified IEEE 33-bus active distribution network demonstrate that the proposed MAPPO-based method reduces voltage deviations by more than 40%, decreases network losses by approximately 25%, and lowers photovoltaic curtailment ratios by nearly 50% compared with centralized optimization approaches. In addition, MAPPO achieves significantly faster and more stable convergence than independent PPO under highly variable operating conditions. These results indicate that MAPPO provides a scalable and resilient alternative to conventional optimization and single-agent learning methods, offering a practical pathway to enhance hosting capacity, operational robustness, and renewable integration in future active distribution networks.

Keywords:

multi-agent reinforcement learning; proximal policy optimization; photovoltaic inverter clusters; active distribution networks; stochastic robust optimization; voltage stability and loss minimization

1. Introduction

The rapid growth of distributed photovoltaic (PV) generation has fundamentally altered the operational characteristics of modern distribution networks, transforming them into inverter-dominated systems with reduced inertia, strong coupling among control actions, and increased sensitivity to disturbances [1,2,3]. Under high PV penetration, distribution networks are required to simultaneously maintain voltage stability, reduce network losses, and maximize renewable utilization, all while operating under significant irradiance and load uncertainty [4]. These challenges become particularly pronounced in weak-grid regions, where low short-circuit ratios amplify the interaction between inverter dynamics and network voltages, increasing the risk of instability and excessive curtailment [5,6].

In recent years, reinforcement learning (RL) and, more specifically, multi-agent reinforcement learning (MARL) have attracted growing attention as promising tools for adaptive control in active distribution networks [7,8,9]. By learning control policies directly through interaction with the environment, RL-based approaches offer the potential to overcome the limitations of model-dependent optimization and fixed-parameter control schemes. However, despite this growing body of work, existing MARL approaches have not yet been able to simultaneously ensure physical feasibility, scalability, and robustness under weak-grid conditions. This unresolved trade-off between learning flexibility and power-system realism constitutes a critical research gap that remains insufficiently addressed in the current literature.

Independent learning strategies, such as independent Proximal Policy Optimization (PPO), model each inverter or local controller as an autonomous agent that learns its policy independently of others [10]. While this paradigm benefits from simplicity and decentralized implementation, it suffers from severe non-stationarity in multi-inverter environments, as the learning dynamics of each agent continuously alter the environment experienced by others. As a result, independent PPO-based methods often exhibit unstable training behavior and fail to achieve reliable coordination, particularly in networks with dense inverter deployment and strong electrical coupling [11].

To mitigate coordination issues, several studies have explored centralized deep reinforcement learning (DRL) frameworks in which a single global policy is trained using full system information [12]. Although centralized DRL can, in principle, capture network-wide interactions, its applicability is limited by scalability constraints, high-dimensional state–action spaces, and reliance on centralized communication and computation. These characteristics hinder real-time deployment in large-scale distribution networks and conflict with the decentralized nature of practical inverter control architectures [13].

In parallel, optimization-based distributed control methods—most notably ADMM-based Volt/Var and OPF formulations—have been widely investigated as a means to coordinate distributed energy resources while enforcing physical constraints [14]. By decomposing centralized optimization problems into coordinated subproblems, ADMM-based approaches provide theoretical convergence guarantees under convex assumptions. However, their performance relies on synchronized iterations, accurate system models, and convex relaxations of inherently nonlinear power flow equations. Under fast stochastic variations and weak-grid conditions, these requirements often lead to excessive communication overhead and computational latency, limiting their effectiveness for real-time adaptive control [15].

Within the MARL family of methods, value decomposition networks, actor–critic frameworks, and independent learners with coordination signals have been explored. These approaches offer scalability and adaptability but bring new challenges related to instability in high-dimensional action spaces, difficulty in exploration, and the risk of violating hard physical constraints. The literature has grappled with these issues by experimenting with centralized training structures, reward shaping mechanisms, and hybrid formulations that combine learning with explicit constraint handling. Yet none of these strategies has fully resolved the tension between local autonomy, global coordination, and strict adherence to grid physics [16,17].

Proximal Policy Optimization (PPO) has emerged as one of the most effective reinforcement learning algorithms for complex, nonlinear environments due to its clipped surrogate objective and stability in updating neural policies. While initially developed for robotics and continuous control tasks, PPO has found applications in energy systems, particularly for energy storage scheduling, microgrid resilience, and distributed resource coordination. Its advantages include sample efficiency, resistance to policy collapse, and compatibility with actor–critic structures. Still, most existing works have applied PPO either as a single-agent framework or within centralized formulations, leaving its multi-agent variant relatively unexplored in the context of inverter-dominated distribution networks. This variant, Multi-Agent PPO (MAPPO), extends the stability of PPO into multi-agent settings by employing a centralized critic that evaluates joint states and actions during training, while retaining decentralized execution for real-time decision-making. This hybrid structure provides agents with coherent training signals that mitigate non-stationarity, yet it allows each inverter to act autonomously based on its local observations once deployed [18,19,20].

The promise of MAPPO resonates particularly strongly when combined with detailed physical modeling of inverter operations. Contemporary research has already developed comprehensive representations of inverter constraints, including quadratic apparent power limits, ramp-rate restrictions, frequency–active power and voltage–reactive power droop behaviors, and operational switching between grid-following and grid-forming modes. Such models are crucial for embedding realism into learning frameworks, preventing agents from pursuing policies that are technically infeasible or unsafe. Parallel advances in stochastic modeling of solar irradiance and load demand further enrich this integration. Chance-constrained formulations, probabilistic scenario generation, and distributionally robust optimization provide a means of accounting for variability without oversimplifying uncertainty into deterministic margins. By combining these modeling traditions with advanced reinforcement learning algorithms, researchers have laid the groundwork for a new generation of methods capable of addressing both the algorithmic and physical challenges of inverter coordination [21,22,23].

The approach developed in this paper builds upon these foundations but also departs from them in decisive ways. It conceptualizes distributed photovoltaic inverter clusters as autonomous decision-making agents situated in a multi-agent reinforcement learning environment, thereby shifting the focus away from static control laws or rigid centralized optimizations. Each agent’s actions encompass setpoint adjustments, droop coefficient selection, and operational mode switching, with the overarching objective of enhancing grid resilience by minimizing voltage deviations, reducing system losses, and maximizing solar utilization. The optimization problem is formally defined in a mathematical framework that incorporates power flow feasibility, inverter limits, curtailment mechanisms, and network strength metrics such as the short-circuit ratio (SCR). Uncertainty from irradiance and load volatility is explicitly modeled through stochastic constraints, ensuring robustness. The MAPPO algorithm orchestrates this environment by coupling a centralized critic for training stability with decentralized execution for real-time autonomy, thereby achieving both coordination and scalability. Simulation studies reported in related works have shown that centralized optimization methods, while rigorous, suffer from computation time bottlenecks, and independent deep reinforcement learning agents fail to stabilize under non-stationary conditions. By contrast, the MAPPO approach has been shown in other domains to outperform baseline strategies by leveraging coordinated learning without sacrificing autonomy. Translating this paradigm into power systems represents a significant conceptual shift, as it allows distributed photovoltaic inverters not merely to be controlled but to learn adaptive strategies that evolve with the system.
It is precisely this dual emphasis on rigorous modeling and advanced learning algorithms that situates the proposed work within the frontier of both operations research and reinforcement learning applications. In developing this framework, four intertwined contributions are advanced. The first lies in the introduction of a novel MAPPO-based optimization paradigm that directly addresses the nonlinearity, heterogeneity, and large-scale coordination challenges inherent in inverter-dominated distribution systems. The second is the explicit formalization of inverter operations as a multi-agent decision process, providing a rigorous methodological foundation for bringing state-of-the-art reinforcement learning techniques into distribution network operation. The third is the construction of a mathematical model that integrates power flow equations, inverter operational constraints, droop behaviors, and stochastic irradiance variability into a unified decision-making problem, ensuring that learned policies remain technically feasible. The fourth is the demonstration, through extensive case studies, that the MAPPO-based optimization consistently outperforms conventional methods such as centralized mixed-integer programming and independent reinforcement learning, yielding superior technical performance, improved resilience, and greater operational flexibility under diverse scenarios.

2. Mathematical Modeling

To rigorously characterize the adaptive operation of inverter-dominated distribution networks, we begin by formulating a mathematical model that integrates the multiple, and often conflicting, objectives encountered in photovoltaic (PV)-rich systems. The formulation is designed to reflect the physical reality of network operations, where voltage stability, power loss minimization, and renewable utilization must be addressed simultaneously while ensuring compliance with operational limits. In this context, the decision space encompasses inverter-level control parameters, such as active and reactive power setpoints, droop coefficients, and grid-forming or grid-following modes, which collectively influence both local bus performance and global network stability. The objective function consolidates these diverse goals into a single optimization criterion, while the constraints enforce power flow physics, inverter capacity boundaries, and resilience indices such as SCR. To further enhance realism, the model incorporates stochastic robustness by embedding irradiance and load uncertainties, thereby ensuring that learned policies remain effective across a wide spectrum of possible operating conditions. By constructing such a detailed mathematical foundation, the section provides the necessary framework upon which the subsequent learning-based optimization methodology is developed.

Figure 1 shows how the distributed photovoltaic inverter clusters interact with environmental states through multi-agent reinforcement learning to adaptively determine grid-supporting actions.

\begin{matrix} \min_{Π, Θ, Υ} \sum_{τ = 1}^{T} \Big\{ ϖ_{1} \sum_{n \in N} (α_{n, τ}^{ν} \cdot {| V_{n, τ} - V_{n}^{ref} |}^{2}) \\ + ϖ_{2} \sum_{(n, m) \in E} (β_{n m, τ}^{ℓ} \cdot ℜ (Y_{n m}) \cdot {| V_{n, τ} - V_{m, τ} |}^{2}) \\ + ϖ_{3} \sum_{g \in G} (χ_{g, τ}^{pv} \cdot \max \{0, Λ_{g, τ}^{irr} - P_{g, τ}^{inj}\}) \\ + ϖ_{4} \sum_{k \in K} (ρ_{k, τ}^{σ} \cdot {∥ ϕ_{k, τ}^{p} - θ_{k, τ}^{q} ∥}_{2}^{2}) \\ + \int_{Ω} Γ (ω) \cdot \Big[ \sum_{n \in N} (υ_{n, τ}^{ω} \cdot {| Δ V_{n, τ} (ω) |}^{2}) + \sum_{g \in G} (ζ_{g, τ}^{ω} \cdot | Δ P_{g, τ} (ω) |) \Big] d P (ω) \Big\} \end{matrix}

(1)

As shown in Table 1, to ensure clarity and consistency of notation, all set-based symbols appearing in the objective function are explicitly aligned with the terminology used throughout the manuscript. Specifically, the set of photovoltaic inverters is denoted by $G$, the set of network buses by $N$, the discrete time horizon by $T$, and the set of stochastic scenarios by $Ω$.

The opportunity cost associated with photovoltaic curtailment in (1) is evaluated over the inverter set $G$ and the time set $T$, ensuring consistency with the system model and simulation setup. All summation indices and set notations in the objective function have been carefully reviewed and, where necessary, revised so that each symbol uniquely corresponds to its defined physical entity.

This notation alignment avoids ambiguity between inverter-level, bus-level, and scenario-level quantities, and clarifies the interpretation of the photovoltaic curtailment cost within the overall optimization objective.

This composite optimization target synthesizes several competing yet interlinked performance measures into a unified formulation. The first summation, weighted by the scalar coefficient $ϖ_{1}$, penalizes quadratic deviations of nodal voltages $V_{n, τ}$ from their prescribed references $V_{n}^{ref}$, reflecting the emphasis on sustaining voltage stability across the distribution network. The second summation, associated with $ϖ_{2}$, captures resistive losses in each branch $(n, m)$ by embedding the real component of the admittance matrix $Y_{n m}$ and the squared voltage magnitude differences, thereby quantifying Joule heating effects and their temporal accumulation. A third term, weighted by $ϖ_{3}$, regulates the curtailment of photovoltaic injections: whenever the available irradiance-driven potential $Λ_{g, τ}^{irr}$ exceeds the actual injected power $P_{g, τ}^{inj}$, the surplus is treated as an opportunity cost that erodes utilization efficiency. To capture dynamic coordination penalties among inverter setpoints, droop coefficients, and other adaptive operational parameters, the fourth summation incorporates quadratic norms of discrepancies between generic decision variables $ϕ_{k, τ}^{p}$ and $θ_{k, τ}^{q}$, modulated by weights $ρ_{k, τ}^{σ}$, thus embedding intra-agent heterogeneity directly into the optimization landscape. Finally, in order to faithfully integrate exogenous uncertainty due to stochastic irradiance and load variations, an expectation integral over the probability space $Ω$ is introduced, with distribution $P (ω)$ and scenario weight $Γ (ω)$. This risk-sensitive term ensures that both voltage deviations $Δ V_{n, τ} (ω)$ and power injection fluctuations $Δ P_{g, τ} (ω)$ contribute explicitly to the optimization criterion, with coefficients $υ_{n, τ}^{ω}$ and $ζ_{g, τ}^{ω}$ governing the degree of robustness. By aggregating these multifaceted objectives (voltage stability, loss minimization, PV utilization, coordination penalties, and stochastic robustness) into a single scalar functional, the expression encapsulates the fundamental trade-offs in adaptive operation of inverter-dominated distribution networks while maintaining a mathematically rigorous multi-layered structure.
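To make the structure of (1) concrete, the following minimal sketch evaluates its first three deterministic terms for a single time step on a toy three-bus feeder. The function name `objective_terms`, the unit weights, and all numerical values are illustrative assumptions, not data from the case study; the coordination and scenario terms are omitted for brevity.

```python
def objective_terms(voltages, v_ref, branches, pv_available, pv_injected,
                    weights=(1.0, 1.0, 1.0)):
    """Voltage, loss, and curtailment terms of (1) for one time step.

    voltages     : dict bus -> complex voltage (pu)
    branches     : dict (n, m) -> complex series admittance Y_nm (pu)
    pv_available : dict inverter -> irradiance-driven available power (pu)
    pv_injected  : dict inverter -> injected active power (pu)
    The element-wise weights (alpha, beta, chi) are assumed equal to 1 here.
    """
    w1, w2, w3 = weights
    voltage_term = sum(abs(v - v_ref) ** 2 for v in voltages.values())
    loss_term = sum(y.real * abs(voltages[n] - voltages[m]) ** 2
                    for (n, m), y in branches.items())
    curtail_term = sum(max(0.0, pv_available[g] - pv_injected[g])
                       for g in pv_available)
    return w1 * voltage_term, w2 * loss_term, w3 * curtail_term


# Toy data (hypothetical values, in per unit)
voltages = {1: 1.00 + 0.00j, 2: 0.98 - 0.01j, 3: 1.03 + 0.01j}
branches = {(1, 2): 4.0 - 8.0j, (2, 3): 3.0 - 6.0j}
pv_available = {"g1": 0.80, "g2": 0.50}
pv_injected = {"g1": 0.65, "g2": 0.50}

terms = objective_terms(voltages, 1.0 + 0.0j, branches, pv_available, pv_injected)
total = sum(terms)
print(terms, total)
```

In this toy case only inverter `g1` is curtailed, so the third term is driven entirely by its unused 0.15 pu; raising the corresponding weight shifts the trade-off toward PV utilization at the expense of the voltage and loss terms.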

\begin{matrix} P_{n, τ}^{inj} + j Q_{n, τ}^{inj} = V_{n, τ} \cdot \sum_{m \in N} (V_{m, τ}^{*} \cdot Y_{n m}^{*}), \forall n \in N, \forall τ \in T \end{matrix}

(2)

This complex-valued AC power flow equation enforces Kirchhoff’s current law at every bus $n$, linking the injected active and reactive powers $(P_{n, τ}^{inj}, Q_{n, τ}^{inj})$ with the product of bus voltages $V_{n, τ}$, conjugate voltages $V_{m, τ}^{*}$, and the conjugate admittance coefficients $Y_{n m}^{*}$ for all network connections. The representation ensures network feasibility and captures nonlinearity inherent in AC load flow formulations across all time intervals.
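As an illustration of (2), the sketch below evaluates the complex injections for a two-bus system directly from the bus admittance matrix. The line admittance and bus voltages are hypothetical values, and the helper names are assumptions of this example; in the study itself, power flow is solved with Pandapower rather than evaluated by hand.

```python
def bus_injections(v, y_bus):
    """Complex injections S_n = V_n * sum_m conj(V_m) * conj(Y_nm), as in (2)."""
    n = len(v)
    return [v[i] * sum(v[j].conjugate() * y_bus[i][j].conjugate() for j in range(n))
            for i in range(n)]


def two_bus_admittance(y_series):
    """Bus admittance matrix of a single line without shunt elements."""
    return [[y_series, -y_series], [-y_series, y_series]]


y_bus = two_bus_admittance(4.0 - 8.0j)   # hypothetical line admittance (pu)
v = [1.00 + 0.00j, 0.98 - 0.02j]         # hypothetical bus voltages (pu)
s = bus_injections(v, y_bus)
p_loss = sum(s).real                      # the injections sum to the line loss
print(s, p_loss)
```

Because the toy network has no shunt elements, the active injections sum to the resistive loss of the single line, which provides a quick consistency check on the sign conventions.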

\begin{matrix} \sum_{g \in G_{n}} P_{g, τ}^{inj} - P_{n, τ}^{load} = P_{n, τ}^{inj}, \sum_{g \in G_{n}} Q_{g, τ}^{inj} - Q_{n, τ}^{load} = Q_{n, τ}^{inj}, \end{matrix}

(3)

These active and reactive power balance conditions ensure that the sum of all photovoltaic and inverter-based injections at bus n equals the net nodal injections after accounting for local demand. The formulation couples resource-level injections with aggregated bus-level quantities, preserving conservation of energy at each node.

\begin{matrix} V_{n}^{min} \leq | V_{n, τ} | \leq V_{n}^{max}, \forall n \in N, \forall τ \end{matrix}

(4)

This inequality constraint enforces operational voltage limits at every bus, bounded by predefined lower and upper thresholds $V_{n}^{min}$ and $V_{n}^{max}$, thereby ensuring voltage stability and compliance with distribution code standards under all operating conditions.

\begin{matrix} | I_{n m, τ} |^{2} \leq {(I_{n m}^{max})}^{2}, I_{n m, τ} = Y_{n m} \cdot (V_{n, τ} - V_{m, τ}), \end{matrix}

(5)

Here, current magnitudes along branch $(n, m)$ are constrained by thermal ampacity limits $I_{n m}^{max}$, linking line admittance $Y_{n m}$ with bus voltage differences. This prevents overloading and ensures safe operation under various load and generation conditions.
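The branch current in (5) also connects the loss term of (1), written with the real part of the admittance and the voltage difference, to the familiar resistance-times-current-squared form. The sketch below checks this numerically and tests the ampacity limit; all values are hypothetical per-unit quantities and the function names are assumptions of this example.

```python
def branch_current(y, v_n, v_m):
    """Branch current from (5): I_nm = Y_nm * (V_n - V_m)."""
    return y * (v_n - v_m)


def loss_from_voltage(y, v_n, v_m):
    """Loss as in the second term of (1): Re(Y_nm) * |V_n - V_m|^2."""
    return y.real * abs(v_n - v_m) ** 2


def loss_from_current(y, v_n, v_m):
    """Loss as R * |I|^2, with R the real part of the series impedance 1/Y."""
    resistance = (1.0 / y).real
    return resistance * abs(branch_current(y, v_n, v_m)) ** 2


def within_ampacity(y, v_n, v_m, i_max):
    """Thermal limit of (5): |I_nm|^2 <= (I_nm^max)^2."""
    return abs(branch_current(y, v_n, v_m)) ** 2 <= i_max ** 2


y = 4.0 - 8.0j                           # hypothetical admittance (pu)
v_n, v_m = 1.00 + 0.00j, 0.98 - 0.02j    # hypothetical voltages (pu)
loss_v = loss_from_voltage(y, v_n, v_m)
loss_i = loss_from_current(y, v_n, v_m)
print(loss_v, loss_i, within_ampacity(y, v_n, v_m, i_max=0.3))
```

Both expressions agree because the real part of 1/Y equals Re(Y)/|Y|² while |I|² equals |Y|²|V_n − V_m|².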

\begin{matrix} {(P_{g, τ}^{inj})}^{2} + {(Q_{g, τ}^{inj})}^{2} \leq {(S_{g}^{max})}^{2}, \forall g \in G, \forall τ \end{matrix}

(6)

The quadratic inverter capacity constraint restricts active and reactive power outputs to lie within the apparent power rating $S_{g}^{max}$ of each inverter, preserving device safety margins and avoiding over-excitation.

\begin{matrix} Δ f_{τ} = f_{τ} - f^{ref}, Δ f_{τ} = - κ_{g}^{p} \cdot (P_{g, τ}^{inj} - P_{g}^{set}), \end{matrix}

(7)

This pair of relations introduces the active power–frequency droop mechanism, where frequency deviations $Δ f_{τ}$ are linearly coupled with variations of injected active power from the nominal setpoint through droop coefficient $κ_{g}^{p}$.

\begin{matrix} Δ V_{n, τ} = V_{n, τ} - V_{n}^{ref}, Δ V_{n, τ} = - κ_{g}^{q} \cdot (Q_{g, τ}^{inj} - Q_{g}^{set}), \end{matrix}

(8)

This voltage–reactive power droop relation ties bus voltage deviations with inverter reactive power adjustments, with slope determined by $κ_{g}^{q}$. The equation ensures distributed voltage support capability across heterogeneous inverter agents.
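The droop relations (7) and (8) can be inverted to obtain the injection that corresponds to a measured deviation. The following sketch does so for both channels; the function name `droop_response`, the per-unit scaling of the deviations, and the numerical values are assumptions chosen within the coefficient ranges reported for the case study.

```python
def droop_response(deviation, setpoint, kappa):
    """Solve deviation = -kappa * (x - setpoint) for the injected quantity x.

    With deviation = delta f and kappa = kappa_p this inverts (7);
    with deviation = delta V and kappa = kappa_q it inverts (8).
    """
    if kappa <= 0:
        raise ValueError("droop coefficient must be positive")
    return setpoint - deviation / kappa


# Voltage 0.02 pu below its reference: reactive injection rises above the setpoint.
q = droop_response(deviation=-0.02, setpoint=0.0, kappa=0.05)
# Frequency 0.01 pu above nominal: active injection falls below the setpoint.
p = droop_response(deviation=0.01, setpoint=0.6, kappa=0.04)
print(q, p)
```

A smaller droop coefficient produces a stronger power response to the same deviation, which is why heterogeneous coefficients lead to unequal sharing of the support effort among agents.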

\begin{matrix} δ_{g, τ}^{mode} \in \{0, 1\}, P_{g, τ}^{inj} \leq δ_{g, τ}^{mode} \cdot P_{g}^{max} + (1 - δ_{g, τ}^{mode}) \cdot ϵ, \end{matrix}

(9)

Here, binary variable $δ_{g, τ}^{mode}$ encodes operational switching between grid-forming and grid-following modes. The inequality constrains injected active power accordingly, allowing flexible representation of mode-dependent operational regimes.

\begin{matrix} \frac{S_{sc, τ}}{P_{g, τ}^{inj}} \geq Ψ^{min}, \forall g, τ \end{matrix}

(10)

This constraint requires that the SCR of the network relative to inverter injections remains above a minimum threshold $Ψ^{min}$, guaranteeing small-signal stability and system strength adequacy, particularly under weak-grid scenarios.

\begin{matrix} 0 \leq P_{g, τ}^{inj} \leq Λ_{g, τ}^{irr}, \forall g, τ \end{matrix}

(11)

The curtailment condition restricts inverter active power outputs not to exceed irradiance-driven available generation $Λ_{g, τ}^{irr}$, thus formalizing solar availability limits and embedding variability directly.

\begin{matrix} | P_{g, τ + 1}^{inj} - P_{g, τ}^{inj} | \leq Δ_{g}^{ramp}, \forall g, τ \end{matrix}

(12)

This inequality imposes ramp-rate limits, ensuring that inverter active power changes between consecutive periods remain within admissible range $Δ_{g}^{ramp}$, thereby preventing stress on power electronic devices.
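One common way to keep learned actions inside the inverter limits (6), (11), and (12) is to clip a requested setpoint before it is applied. The sketch below is one such projection and is not taken from the paper: the function name, the order in which the limits are applied, and the per-step interpretation of the ramp limit are all assumptions.

```python
import math


def project_setpoint(p_req, q_req, p_prev, p_avail, s_max, ramp):
    """Clip a requested (P, Q) pair to satisfy (11), (12), and (6).

    Assumed order: active power is limited by availability and ramp rate
    first, then reactive power is reduced to stay inside the apparent-power
    circle. If availability drops faster than the ramp limit allows, this
    sketch lets availability win, so (12) may be violated in that corner case.
    """
    p_low = max(0.0, p_prev - ramp)
    p_high = min(p_avail, p_prev + ramp, s_max)
    p = min(max(p_req, p_low), p_high)
    q_cap = math.sqrt(max(0.0, s_max ** 2 - p ** 2))
    q = min(max(q_req, -q_cap), q_cap)
    return p, q


# Hypothetical request that violates both the ramp and the capacity limit.
p, q = project_setpoint(p_req=0.9, q_req=0.9, p_prev=0.5,
                        p_avail=0.8, s_max=1.0, ramp=0.15)
print(p, q)
```

Prioritizing active power is only one possible design choice; a reactive-priority variant would clip Q first and then reduce P, which trades additional curtailment for stronger voltage support.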

\begin{matrix} \sum_{g \in G} ϕ_{g, τ}^{p} = Ξ_{τ}^{p}, \sum_{g \in G} θ_{g, τ}^{q} = Ξ_{τ}^{q}, \forall τ \end{matrix}

(13)

This coordination constraint couples agent-level decision variables $ϕ_{g, τ}^{p}$ and $θ_{g, τ}^{q}$ with global system requirements $(Ξ_{τ}^{p}, Ξ_{τ}^{q})$, ensuring that local inverter actions remain consistent with aggregate operational feasibility.

\begin{matrix} P_{g, τ}^{inj} (ω) = P_{g, τ}^{inj} + Δ P_{g, τ} (ω), P (| Δ P_{g, τ} (ω) | \leq Υ_{g, τ}) \geq 1 - ϵ \end{matrix}

(14)

This stochastic constraint defines uncertain injections under scenario $ω$, bounding deviations $Δ P_{g, τ} (ω)$ within chance tolerance $Υ_{g, τ}$ with high probability $1 - ϵ$. It captures the stochastic impact of renewable intermittency.
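When the distribution of the deviations is only available through sampled scenarios, the tolerance in (14) can be approximated by an empirical quantile. The sketch below illustrates this under the assumption of 500 zero-mean Gaussian deviation samples with a hypothetical standard deviation of 0.05 pu; the helper names are not from the paper.

```python
import math
import random


def chance_bound(samples, epsilon):
    """Smallest tolerance covering at least a (1 - epsilon) share of |samples|."""
    ordered = sorted(abs(x) for x in samples)
    k = math.ceil((1.0 - epsilon) * len(ordered))   # number of samples to cover
    return ordered[max(k, 1) - 1]


def coverage(samples, bound):
    """Empirical probability that |deviation| stays within the bound."""
    return sum(abs(x) <= bound for x in samples) / len(samples)


rng = random.Random(0)                                    # fixed seed for repeatability
deviations = [rng.gauss(0.0, 0.05) for _ in range(500)]   # hypothetical samples (pu)
upsilon = chance_bound(deviations, epsilon=0.05)
print(upsilon, coverage(deviations, upsilon))
```

By construction the empirical coverage is at least 1 − ε on the samples used; out-of-sample coverage depends on how representative the scenarios are.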

\begin{matrix} L_{τ} = \sum_{(n, m) \in E} ℜ (Y_{n m}^{- 1}) \cdot {| I_{n m, τ} |}^{2} \end{matrix}

(15)

The system loss definition quantifies resistive power dissipation $L_{τ}$ across all lines $(n, m)$, combining squared current magnitudes with the branch resistance $ℜ (Y_{n m}^{- 1})$. With $I_{n m, τ}$ given by (5), this is equal to the loss term $ℜ (Y_{n m}) \cdot {| V_{n, τ} - V_{m, τ} |}^{2}$ in (1), thereby completing the explicit cost representation of technical losses.

3. Solution Approach

Building upon the multi-objective and uncertainty-aware optimization model, this section introduces the proposed learning-based solution, which leverages MAPPO. In this approach, each inverter cluster is modeled as an autonomous agent capable of interacting with the environment, observing system states, and executing control actions. To ensure both scalability and coordination, MAPPO adopts a hybrid structure of centralized training with decentralized execution, whereby a shared global critic stabilizes learning while local agents retain autonomy in real-time operation. This structure is particularly well-suited for active distribution networks, where localized dynamics must be accommodated without sacrificing system-wide performance. The algorithm is implemented with reward functions explicitly linked to the mathematical objectives (penalizing voltage deviations, curtailment, and losses) while also embedding robustness through scenario-based sampling of uncertain renewable generation and load patterns. By systematically aligning the reinforcement learning framework with the formulated optimization problem, this methodology ensures that the resulting policies are not only technically feasible but also adaptive, scalable, and robust under diverse operating conditions.

\begin{matrix} M = (S, \{A_{ι}\}_{ι \in I}, P (ς^{'} ∣ ς, a), \{r_{ι} (ς, a_{ι})\}_{ι \in I}, γ) \end{matrix}

(16)

The methodological core begins by casting the distributed PV inverter coordination problem as a Markov game $M$, where the system state space $S$ encapsulates voltages, frequencies, and load variations, the joint action set $\{A_{ι}\}$ corresponds to inverter-level decisions across all agents $ι \in I$, $P (ς^{'} ∣ ς, a)$ governs the stochastic transitions of the physical system from state $ς$ to successor $ς^{'}$ under joint action $a$, agent-specific reward signals $r_{ι}$ capture localized objectives, and $γ$ represents the temporal discount factor shaping long-term optimization.

\begin{matrix} π_{ι}^{θ_{ι}} (a_{ι} ∣ ς) = \frac{\exp (w_{ι, a_{ι}}^{⊤} σ (H_{ι} ς))}{\sum_{a^{'} \in A_{ι}} \exp (w_{ι, a^{'}}^{⊤} σ (H_{ι} ς))} \end{matrix}

(17)

Policy functions $π_{ι}^{θ_{ι}}$ define the stochastic mapping of each agent $ι$ from the current system state $ς$ to an action $a_{ι}$, parameterized by neural weights $θ_{ι}$. Here, $H_{ι}$ projects state features into latent dimensions, $σ (\cdot)$ denotes nonlinear activations, and $w_{ι}$ encodes trainable coefficients. The softmax structure guarantees exploration while embedding continuous dependence on network dynamics.
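A softmax policy of the kind in (17) can be sketched in a few lines. The example below assumes a discrete action set with one output weight vector per action (so that the logits differ across actions) and a tanh activation; these choices, the function name, and the numerical values are assumptions of the sketch, since the paper does not specify the network architecture.

```python
import math


def softmax_policy(state, h_matrix, action_weights):
    """Action probabilities of a one-hidden-layer softmax policy.

    state          : list of state features
    h_matrix       : rows of the projection H (hidden units x features)
    action_weights : one weight vector per discrete action (assumption)
    """
    hidden = [math.tanh(sum(h * s for h, s in zip(row, state))) for row in h_matrix]
    logits = [sum(w * z for w, z in zip(w_a, hidden)) for w_a in action_weights]
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical two-feature state (voltage, normalized irradiance) and three actions.
h = [[0.5, -0.2], [0.1, 0.3]]
w = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
probs = softmax_policy([1.0, 0.95], h, w)
print(probs)
```

Every action receives a strictly positive probability, which is the sense in which the softmax structure keeps the policy exploratory.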

\begin{matrix} V^{ϕ} (ς) = E_{a \sim π} [\sum_{τ = 0}^{\infty} γ^{τ} (\sum_{ι \in I} r_{ι} (ς_{τ}, a_{ι, τ})) | ς_{0} = ς] \end{matrix}

(18)

The centralized critic function $V^{ϕ} (ς)$, with parameters $ϕ$, provides a system-wide evaluation of state $ς$ under the joint policy $π$. By aggregating discounted future returns across all agents, it allows stability during training and captures collective welfare signals even though execution remains decentralized.

\begin{matrix} A_{ι} (ς, a_{ι}) = Q_{ι} (ς, a_{ι}) - V^{ϕ} (ς), Q_{ι} (ς, a_{ι}) = E [r_{ι} + γ V^{ϕ} (ς^{'})] \end{matrix}

(19)

Advantage estimators $A_{ι}$ measure the marginal value of executing action $a_{ι}$ against the baseline of critic $V^{ϕ} (ς)$. They balance exploration and exploitation by revealing which local deviations improve collective system performance.
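The discounted return that the critic in (18) is trained to predict and the one-step advantage of (19) can both be computed from a recorded trajectory. The sketch below uses a short hypothetical reward sequence and critic outputs; generalized advantage estimation, which many PPO implementations use instead, is deliberately left out.

```python
def discounted_returns(rewards, gamma):
    """Return-to-go R_t = r_t + gamma * R_{t+1}, computed backwards in time."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def td_advantages(rewards, values, gamma):
    """One-step advantage A_t = r_t + gamma * V(s_{t+1}) - V(s_t), as in (19).

    `values` holds one more entry than `rewards`: the last one is the
    bootstrap value of the state reached after the final step.
    """
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


rewards = [1.0, 0.0, 2.0]                 # hypothetical rewards
values = [1.0, 0.5, 2.0, 0.0]             # hypothetical critic outputs
returns = discounted_returns(rewards, gamma=0.5)
advantages = td_advantages(rewards, values, gamma=0.5)
print(returns, advantages)
```

A positive advantage marks an action that did better than the critic expected from that state, and a negative one marks an action that did worse.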

\begin{matrix} L^{CLIP} (θ_{ι}) = E_{τ} [min (ϱ_{ι, τ} (θ_{ι}) A_{ι, τ}, clip (ϱ_{ι, τ} (θ_{ι}), 1 - ϵ, 1 + ϵ) A_{ι, τ})] \end{matrix}

(20)

The clipped surrogate loss $L^{CLIP}$ enforces training stability by preventing policy ratios $ϱ_{ι, τ} (θ_{ι})$, that is, the probability of the sampled action under the updated policy divided by its probability under the policy that collected the data, from drifting excessively. By taking the minimum between unclipped and clipped expressions, one guarantees conservative yet effective gradient steps in highly nonlinear optimization landscapes.
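The effect of the clipping in (20) is easiest to see on a few hand-picked samples. The sketch below evaluates the clipped surrogate for given probability ratios and advantages; the clip range of 0.2 is a commonly used default and an assumption here, as the paper does not report its value.

```python
def clipped_surrogate(ratios, advantages, epsilon=0.2):
    """Mean of min(rho * A, clip(rho, 1 - eps, 1 + eps) * A), as in (20)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(rho * adv, clipped * adv))
    return sum(terms) / len(terms)


# Hypothetical samples: ratio too high with A > 0, ratio too low with A < 0, unchanged.
ratios = [1.5, 0.5, 1.0]
advantages = [1.0, -1.0, 2.0]
surrogate = clipped_surrogate(ratios, advantages)
print(surrogate)
```

In the first sample the gain is capped at 1.2 instead of 1.5, and in the second the penalty is taken at the clipped ratio 0.8, so neither sample rewards moving the policy further than the clip range.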

\begin{matrix} \nabla_{θ_{ι}} J (θ_{ι}) = E_{τ} [\nabla_{θ_{ι}} log π_{ι}^{θ_{ι}} (a_{ι, τ} ∣ ς_{τ}) A_{ι, τ}] \end{matrix}

(21)

Policy gradient updates directly adjust parameters $θ_{ι}$ through the log-likelihood trick, weighted by the advantage signal $A_{ι, τ}$. This mechanism embeds system-level objectives into local neural network updates, aligning distributed agent learning.

\begin{matrix} L^{VF} (ϕ) = \frac{1}{2} E_{τ} [{(V^{ϕ} (ς_{τ}) - {\hat{R}}_{τ})}^{2}] \end{matrix}

(22)

The value function loss $L^{VF}$ regularizes critic accuracy by penalizing squared errors between predicted state values and empirical returns ${\hat{R}}_{τ}$, reinforcing reliable evaluation signals for all agents.

\begin{matrix} L^{ENT} (θ_{ι}) = - β E_{τ} [\sum_{a \in A_{ι}} π_{ι}^{θ_{ι}} (a ∣ ς_{τ}) log π_{ι}^{θ_{ι}} (a ∣ ς_{τ})] \end{matrix}

(23)

Entropy regularization enhances exploration by rewarding stochasticity in policy distributions. The coefficient $β$ balances deterministic convergence against the need to avoid premature exploitation, which is critical in volatile distribution networks.

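\begin{matrix} L^{TOTAL} = \sum_{ι \in I} (- L^{CLIP} (θ_{ι}) - L^{ENT} (θ_{ι}) + ϖ L^{VF} (ϕ)) \end{matrix}

(24)

A composite training loss $L^{TOTAL}$, minimized by gradient descent, integrates clipped policy improvement, entropy-driven exploration, and critic-based regularization. The clipped surrogate (20) and the entropy bonus (23) are to be maximized and therefore enter with a negative sign, while the value loss (22) is to be minimized. The scalar $ϖ$ controls the tradeoff, ensuring that convergence speed and robustness are balanced under centralized training conditions.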
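The way the three ingredients (20), (22), and (23) are combined can be illustrated for a single agent. The sketch below assumes the usual PPO sign convention, in which the surrogate and the entropy bonus are maximized and the value error is minimized, so the quantity handed to a gradient-descent optimizer is their signed sum. The value-loss weight of 0.5 and the numerical inputs are assumptions; the entropy coefficient 0.01 is the one reported for the case study.

```python
import math


def entropy_bonus(probs, beta):
    """beta times the entropy of a categorical policy, i.e. (23) for one state."""
    return -beta * sum(p * math.log(p) for p in probs if p > 0.0)


def value_loss(predicted, targets):
    """Half the mean squared error between critic outputs and returns, as in (22)."""
    errors = [(v - r) ** 2 for v, r in zip(predicted, targets)]
    return 0.5 * sum(errors) / len(errors)


def loss_to_minimize(surrogate, entropy_term, v_loss, value_coef=0.5):
    """Signed combination under the assumed convention (descent on this value)."""
    return -surrogate - entropy_term + value_coef * v_loss


bonus = entropy_bonus([0.25, 0.25, 0.25, 0.25], beta=0.01)   # uniform over 4 actions
v_loss = value_loss(predicted=[1.0, 2.0], targets=[1.5, 1.0])
loss = loss_to_minimize(surrogate=0.8, entropy_term=bonus, v_loss=v_loss)
print(bonus, v_loss, loss)
```

With this convention a larger surrogate or a more spread-out policy lowers the loss, whereas a less accurate critic raises it.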

\begin{matrix} a_{ι, τ} = π_{ι}^{θ_{ι}^{*}} (ς_{ι, τ}), \forall ι \in I, \forall τ \in T \end{matrix}

(25)

Once parameters $θ_{ι}^{*}$ are optimized, decentralized execution becomes feasible: each agent $ι$ independently maps its local observation $ς_{ι, τ}$ into an action $a_{ι, τ}$ through the trained policy, thereby achieving real-time autonomy while still benefiting from prior centralized training.

4. Results

The case study is conducted on a modified active distribution network modeled after the IEEE 33-bus radial test system, which has been widely adopted in distributed generation studies due to its representative scale and feeder structure. The network is configured to include four major clusters of photovoltaic units, each ranging from 2.5 MW to 5 MW in capacity, distributed across buses 6, 13, 18, and 30. Within each cluster, individual inverters are aggregated into subgroups of 0.25 MW rating, yielding a total of 48 autonomous inverter agents. Load data are synthesized from regional residential and commercial consumption patterns, with a peak demand of 3.7 MW and reactive demand of 2.3 MVar distributed across the feeder nodes. Temporal resolution is set at 5-minute intervals to capture short-term irradiance fluctuations, producing 288 time steps per day, while the case study spans a 7-day horizon to evaluate sustained adaptation under variable conditions. Solar irradiance data are drawn from the NREL National Solar Radiation Database, scaled to average midday peaks of 950 W/m² with standard deviation of 210 W/m², capturing both diurnal cycles and stochastic cloud-driven variations.
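The sizing figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the PV-to-peak-load ratio is a derived quantity that the paper does not report and is included purely as an illustration.

```python
MINUTES_PER_DAY = 24 * 60

step_minutes = 5
steps_per_day = MINUTES_PER_DAY // step_minutes   # time steps in one day
horizon_steps = 7 * steps_per_day                 # time steps in the 7-day study

agents = 48
agent_rating_mw = 0.25
installed_pv_mw = agents * agent_rating_mw        # total aggregated inverter rating

peak_load_mw = 3.7
pv_to_peak_load = installed_pv_mw / peak_load_mw  # derived ratio, not from the paper

print(steps_per_day, horizon_steps, installed_pv_mw, round(pv_to_peak_load, 2))
```

The 48 agents of 0.25 MW amount to 12 MW in total, which lies within the 10 MW to 20 MW span implied by four clusters of 2.5 MW to 5 MW each.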

Inverter operational constraints are parameterized to reflect realistic hardware limitations. Each inverter is modeled with an apparent power rating consistent with its nominal capacity, maximum ramp rates of 0.15 pu per minute for active power, and 0.10 pu per minute for reactive power. Droop coefficients are initialized in the range of 0.02–0.05 for active power–frequency control and 0.04–0.08 for reactive power–voltage support, ensuring heterogeneity among agents. Voltage magnitudes are constrained within 0.95–1.05 pu at all nodes, and thermal line limits are imposed at 100 A per branch to prevent overload. Weak-grid conditions are represented by reducing the SCR at bus 18 to 2.8, testing the robustness of learned policies under low system strength. Curtailment is explicitly allowed but penalized in the reward function, with curtailment ratios historically observed in comparable networks ranging from 5% to 12% under midday congestion. To incorporate stochasticity, 500 irradiance scenarios are generated via autoregressive processes fitted to historical time series, with scenario reduction applied to retain 50 representative trajectories for training and evaluation.
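The scenario pipeline described above (autoregressive generation followed by reduction) can be sketched as follows. Everything beyond the counts 500 and 50, the 288 steps per day, and the 950 W/m² midday peak is an assumption: the AR(1) coefficients, the half-sine daylight profile, and in particular the reduction rule, which here simply picks evenly spaced scenarios after ranking by daily energy because the paper does not name its reduction method.

```python
import math
import random


def ar1_path(steps, phi, sigma, rng):
    """First-order autoregressive noise x_t = phi * x_{t-1} + e_t."""
    x, path = 0.0, []
    for _ in range(steps):
        x = phi * x + rng.gauss(0.0, sigma)
        path.append(x)
    return path


def clear_sky(step, steps_per_day=288, peak=950.0):
    """Half-sine daylight profile between 06:00 and 18:00 (assumed shape)."""
    hour = 24.0 * step / steps_per_day
    if hour < 6.0 or hour > 18.0:
        return 0.0
    return peak * math.sin(math.pi * (hour - 6.0) / 12.0)


def irradiance_scenarios(count, steps, rng, phi=0.9, sigma=0.1):
    """Daily irradiance trajectories: clear-sky profile scaled by (1 + AR(1) noise)."""
    scenarios = []
    for _ in range(count):
        noise = ar1_path(steps, phi, sigma, rng)
        scenarios.append([max(0.0, clear_sky(t) * (1.0 + n))
                          for t, n in enumerate(noise)])
    return scenarios


def reduce_by_energy_rank(scenarios, keep):
    """Stand-in reduction: rank by daily energy and keep evenly spaced ranks."""
    ranked = sorted(scenarios, key=sum)
    stride = len(ranked) / keep
    return [ranked[int(i * stride)] for i in range(keep)]


rng = random.Random(0)
generated = irradiance_scenarios(count=500, steps=288, rng=rng)
selected = reduce_by_energy_rank(generated, keep=50)
print(len(generated), len(selected))
```

Distance-based reduction methods would preserve the shape of the trajectories better than this energy ranking; the stand-in is used only to keep the example short.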

All simulations and learning experiments are executed in Python 3.10 using the PyTorch 2.2 framework, with reinforcement learning environments built on OpenAI Gym interfaces extended for power system dynamics. Power flow calculations are performed using the Pandapower 2.11 engine to ensure consistency with AC network physics, while scenario sampling and probability-weighted evaluations are carried out with NumPy and SciPy packages. Training of the MAPPO agents is conducted on a high-performance workstation equipped with an AMD EPYC 7543 CPU (32 cores, 2.8 GHz), 256 GB RAM, and four NVIDIA A100 GPUs with 40 GB memory each. The training process spans 5000 episodes, each consisting of 288 steps, with a batch size of 2048 and learning rate initialized at $3 \times 10^{-4}$. Entropy regularization coefficients are set to 0.01 to encourage exploration, and discount factor $γ$ is fixed at 0.99 to prioritize long-term resilience objectives. Total wall-clock training time for convergence is approximately 36 h, while evaluation runs require less than 20 min per scenario, confirming the framework’s feasibility for large-scale computational studies.
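To give a feel for what these hyperparameters imply, the sketch below collects the stated values and derives the total number of environment steps, the half-life of a discounted reward in wall-clock terms, and the discounted return of a constant unit reward over one 288-step episode. The constant names are illustrative and are not taken from the authors' code.

```python
import math

# Values stated in the text.
EPISODES = 5000
STEPS_PER_EPISODE = 288
STEP_MINUTES = 5
LEARNING_RATE = 3e-4
ENTROPY_COEF = 0.01
GAMMA = 0.99

total_env_steps = EPISODES * STEPS_PER_EPISODE  # 1,440,000

# Number of steps after which a future reward is weighted by one half.
half_life_steps = math.log(0.5) / math.log(GAMMA)
half_life_hours = half_life_steps * STEP_MINUTES / 60


def discounted_return(rewards, gamma=GAMMA):
    """Return sum_t gamma**t * rewards[t], accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


unit_episode_return = discounted_return([1.0] * STEPS_PER_EPISODE)
print(total_env_steps, round(half_life_steps, 1), round(half_life_hours, 2))
```

With 5-minute steps, a discount factor of 0.99 halves the weight of a reward roughly every 69 steps, or a little under six hours. Consequences later in the same day therefore still carry weight in the agents' objective.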

The visualization in Figure 2 illustrates the hourly distributions of irradiance over a full week, where each box represents the statistical spread of solar irradiance in W/m² at a given hour of the day. Across the seven simulated days, peak irradiance values cluster around 900 to 1000 W/m² during midday hours (11:00–14:00), while early morning (06:00–08:00) and late evening (18:00–20:00) exhibit medians below 200 W/m². The interquartile ranges widen significantly during the transition periods of sunrise and sunset, where irradiance variability is most pronounced due to stochastic cloud effects. For example, at 08:00 the 25th percentile is approximately 120 W/m² while the 75th percentile is already near 400 W/m², indicating strong dispersion even in early ramping periods. From a system operation perspective, the boxplots demonstrate not only the expected diurnal pattern but also the uncertainty that reinforcement learning agents must contend with. The whiskers extend close to 1100 W/m² during the brightest midday hours, whereas dips to near-zero irradiance are observed sporadically even in late morning, reflecting the possibility of sudden cloud coverage. Such fluctuations, occurring within the same hour across different days, are a source of short-term imbalance and are highly relevant for inverter curtailment strategies. Over the week, the coefficient of variation during midday hours is observed around 18 percent, while during morning ramp-up it rises to nearly 45 percent, underscoring the asymmetric risk across the diurnal cycle.
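The box-plot quantities and the coefficient of variation quoted above are standard summary statistics. As a minimal sketch, the snippet below computes them with Python's `statistics` module for a hypothetical set of seven 08:00 irradiance readings, one per day. The sample values are invented so that the quartiles land near the 120 and 400 W/m² figures in the text, and they are not the paper's data.

```python
import statistics


def coefficient_of_variation(samples):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(samples) / statistics.mean(samples)


def box_stats(samples):
    """Quartiles and interquartile range as drawn in a box plot."""
    q1, median, q3 = statistics.quantiles(samples, n=4)
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# HYPOTHETICAL 08:00 readings in W/m^2, one per simulated day.
irradiance_0800 = [80, 120, 150, 260, 330, 400, 520]

stats_0800 = box_stats(irradiance_0800)
cv_0800 = coefficient_of_variation(irradiance_0800)
print(stats_0800, round(cv_0800, 2))
```

Note that `statistics.quantiles` uses the "exclusive" method by default, so other tools may report slightly different quartiles for the same small sample.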

In Figure 3, the histogram presents the frequency distribution of photovoltaic curtailment ratios across 500 generated scenarios, highlighting the probability and intensity of curtailed energy under network and operational constraints. The majority of curtailment ratios are concentrated below 10 percent, with the peak of the histogram occurring between 3 and 7 percent. Specifically, the mean curtailment ratio across all samples is 6.1 percent, while the 95th percentile reaches approximately 18 percent, suggesting that extreme curtailment events are relatively rare but still significant in stress scenarios. Only a small fraction of cases exceed 15 percent, confirming that although curtailment is not a dominant outcome, it cannot be ignored in resilience-oriented optimization. From a system-level perspective, the distribution reflects both the physical limits of line capacities and the operational voltage constraints imposed on the feeder. High irradiance conditions combined with weak-grid scenarios at bus 18 are primary drivers of curtailment, and these are well captured by the right tail of the histogram. The concentration of mass at low curtailment values, with a long right tail, indicates that under most scenarios photovoltaic resources are largely utilized, which aligns with empirical observations from feeders of comparable size. For example, in field studies of distribution systems with similar penetration levels, average curtailment typically ranges from 5 to 12 percent under midday congestion, which corresponds closely to the spread observed here.

The results reveal an inherent trade-off between voltage regulation performance and photovoltaic curtailment, which becomes particularly evident under high irradiance and weak-grid conditions. When aggressive reactive power support is prioritized to tightly regulate nodal voltages, localized congestion and voltage rise constraints can necessitate increased active power curtailment. Conversely, strategies that minimize curtailment tend to tolerate larger voltage deviations, especially at electrically remote buses. The proposed MAPPO-based framework effectively navigates this trade-off by learning coordinated control policies that balance voltage deviation penalties against curtailment costs, rather than optimizing either objective in isolation.
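One way to picture how a single reward signal can balance voltage deviation penalties against curtailment costs is a weighted sum. The sketch below is purely illustrative and is not the reward formulated earlier in the paper: the function name, the weights `w_voltage`, `w_violation`, and `w_curtail`, and the quadratic form of the voltage term are all assumptions. Only the 0.95–1.05 pu band is taken from the text.

```python
def step_reward(voltages_pu, p_available_mw, p_injected_mw,
                w_voltage=10.0, w_violation=100.0, w_curtail=1.0,
                v_min=0.95, v_max=1.05):
    """Illustrative reward: negative weighted sum of three penalty terms."""
    # Mean squared deviation from nominal voltage (1.0 pu).
    voltage_dev = sum((v - 1.0) ** 2 for v in voltages_pu) / len(voltages_pu)
    # Total excursion outside the permitted voltage band.
    violation = sum(max(0.0, v_min - v) + max(0.0, v - v_max)
                    for v in voltages_pu)
    # Share of available PV power that was not injected.
    available = sum(p_available_mw)
    curtailed = sum(max(0.0, a - p)
                    for a, p in zip(p_available_mw, p_injected_mw))
    curtail_ratio = curtailed / available if available > 0 else 0.0
    return -(w_voltage * voltage_dev
             + w_violation * violation
             + w_curtail * curtail_ratio)


# Same voltages, with and without 20 percent curtailment at one inverter.
no_curtail = step_reward([1.02, 0.99], [0.25, 0.25], [0.25, 0.25])
with_curtail = step_reward([1.02, 0.99], [0.25, 0.25], [0.25, 0.20])
# A band violation is penalized much more heavily than a small deviation.
violating = step_reward([1.07, 0.99], [0.25, 0.25], [0.25, 0.25])
```

Raising `w_curtail` relative to `w_voltage` pushes a learned policy toward tolerating larger voltage deviations in exchange for higher PV utilization, which is the trade-off discussed above.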

Compared with centralized optimization, MAPPO achieves a more favorable Pareto balance, reducing voltage deviations by over 40% while simultaneously lowering curtailment ratios by nearly 50%. This indicates that the learned policies do not merely shift the burden from one objective to another, but instead exploit coordination among inverter agents to achieve system-level improvements. The results suggest that multi-agent learning enables implicit negotiation among inverter clusters, allowing voltage support responsibilities to be distributed spatially and temporally, thereby alleviating excessive curtailment at individual locations.

A second trade-off is observed between control stability and learning convergence speed. Independent reinforcement learning methods converge more slowly and exhibit higher variance due to non-stationarity, while centralized optimization produces stable but rigid solutions that lack adaptability. MAPPO strikes a balance by leveraging centralized training to stabilize learning dynamics, while decentralized execution preserves responsiveness. As a result, MAPPO converges approximately 50% faster than independent PPO while maintaining stable voltage trajectories under rapidly changing operating conditions.

In Figure 4, the plotted curves represent photovoltaic generation over a seven-day horizon, simulated with 5-min resolution to capture diurnal and stochastic fluctuations. Each day shows the expected sinusoidal shape, with peak outputs approaching 4.5 to 5.0 MW around midday (11:30–13:00) and nearly zero output during nighttime hours. Day-to-day variability is evident: for example, Day 2 exhibits stronger midday fluctuations with dips as low as 2.5 MW during peak hours, likely reflecting short-lived cloud cover events, while Day 5 sustains higher stability with output consistently above 4.0 MW during the midday plateau. Across the week, average daily peak generation is approximately 4.2 MW, with a coefficient of variation of 12 percent, providing a realistic picture of the stochastic solar resource that agents must manage. Temporal resolution at 288 time steps per day allows for fine-grained capture of intraday variability. During morning ramp-up (06:00–09:00), growth rates can exceed 0.7 MW per hour, while evening ramp-down occurs with slightly smaller gradients, averaging around 0.5 MW per hour. Such asymmetry in morning versus evening transitions is consistent with empirical solar data, where rapid irradiance increases after sunrise can drive more volatile generation profiles. Midday plateaus show narrower variance, but even then noise-induced dips can reduce generation by nearly 1.5 MW within a single hour, underscoring the operational stress that short-term fluctuations impose on the distribution feeder. These variations highlight why learning-based optimization must incorporate not just average diurnal cycles but also stochastic disturbances.
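Hourly ramp figures like those quoted for Figure 4 can be extracted from a 5-minute series by differencing samples that are twelve steps apart. The sketch below shows one way to do this. The `profile_mw` values are a hypothetical two-hour midday segment with a cloud-induced dip and are not data from the paper.

```python
def hourly_changes(series_mw, steps_per_hour=12):
    """Change in MW over every sliding one-hour window."""
    return [series_mw[t + steps_per_hour] - series_mw[t]
            for t in range(len(series_mw) - steps_per_hour)]


def ramp_summary(series_mw, steps_per_hour=12):
    """Largest one-hour rise and largest one-hour drop (negative), in MW."""
    changes = hourly_changes(series_mw, steps_per_hour)
    return max(changes), min(changes)


# HYPOTHETICAL midday plateau at 4.4 MW with a short dip to 3.0 MW.
profile_mw = [4.4] * 10 + [3.8, 3.2, 3.0, 3.4, 4.0] + [4.4] * 10

max_rise, max_drop = ramp_summary(profile_mw)
print(round(max_rise, 2), round(max_drop, 2))
```

For this toy profile the largest one-hour rise and drop are both about 1.4 MW in magnitude, comparable to the dips of nearly 1.5 MW within an hour mentioned in the text.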

The centralized mixed-integer optimization method used in this study is intended as a practical engineering benchmark rather than a theoretical global upper bound. Specifically, the centralized formulation is implemented as a static, snapshot-based mixed-integer optimization solved independently at each time step using available system measurements and forecasts. Due to computational complexity and real-time constraints, the centralized optimization does not employ a long-horizon or rolling-horizon model predictive control (MPC) framework, nor does it assume perfect foresight of future irradiance or load trajectories. As a result, while the centralized method produces optimal solutions for each individual snapshot, it does not capture the temporal coupling, inter-temporal trade-offs, or anticipatory behavior that emerge in sequential decision-making under uncertainty. In contrast, the proposed MAPPO-based approach learns a dynamic control policy through repeated interaction with the environment, enabling agents to implicitly account for future consequences of current actions, uncertainty propagation, and coordination across time. Therefore, the observed performance improvements of MAPPO relative to the centralized mixed-integer optimization baseline should be interpreted as improvements over a computationally tractable, real-time implementable centralized control strategy, rather than a violation of the optimality properties of idealized, fully informed centralized optimization. This comparison reflects realistic operational conditions faced by distribution system operators, where limited computation time, forecast uncertainty, and dynamic system evolution preclude the use of fully optimal long-horizon centralized solutions.

The load duration curve in Figure 5 depicts the sorted load demand values over the full sampling horizon, illustrating the proportion of time that demand remains at or above certain thresholds. The curve starts near 4.5 MW at the highest percentiles, gradually tapering down toward 1.5 MW at the lower end. The steep initial decline across the top 10 percent of hours, where demand falls from 4.5 MW to 3.8 MW, shows that peak load periods are relatively rare. Beyond the 50 percent mark, the curve flattens, with demand hovering around 2.7 MW for most of the time horizon. The mean demand across all samples is approximately 3.0 MW, with a standard deviation of 0.5 MW, which aligns with typical medium-voltage feeder systems supplying a mix of residential and commercial customers. The shape of the curve is critical for understanding not just average load but also the extremes that drive system stress. For example, the upper tail indicates that 5 percent of the time, demand exceeds 4.2 MW, stressing both voltage stability and thermal limits. Conversely, the bottom 20 percent of hours fall below 2.2 MW, which may cause under-utilization of distributed PV if generation remains high during those periods. The fact that the middle 60 percent of the curve stays between 2.5 and 3.5 MW suggests that most operational decisions will revolve around this intermediate demand regime. This information provides both baseline context for optimization and empirical justification for including curtailment penalties in the objective function.
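A load duration curve of this kind is obtained by sorting the load samples in descending order and pairing each value with the fraction of time the load is at or above it. The following minimal sketch shows the construction together with a helper for reading off exceedance fractions. The function names and the four-sample demo series are illustrative and are not from the paper.

```python
def load_duration_curve(loads_mw):
    """Return (fraction_of_time, load) points sorted from peak to minimum.

    fraction_of_time is the share of samples at or above that load value.
    """
    ordered = sorted(loads_mw, reverse=True)
    n = len(ordered)
    return [((i + 1) / n, load) for i, load in enumerate(ordered)]


def fraction_above(loads_mw, threshold_mw):
    """Share of samples strictly above a threshold."""
    return sum(1 for load in loads_mw if load > threshold_mw) / len(loads_mw)


# Tiny HYPOTHETICAL series; the paper's curve uses 5-minute samples over 7 days.
demo_loads = [2.7, 4.5, 1.5, 3.0]
curve = load_duration_curve(demo_loads)
share_over_2_8 = fraction_above(demo_loads, 2.8)
print(curve, share_over_2_8)
# prints: [(0.25, 4.5), (0.5, 3.0), (0.75, 2.7), (1.0, 1.5)] 0.5
```

Applied to the full series, `fraction_above(loads, 4.2)` would return the kind of exceedance figure quoted above, such as demand exceeding 4.2 MW 5 percent of the time.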

The time series in Figure 6 illustrates the voltage trajectory at bus 18, chosen deliberately as a weak-grid location with an SCR of only 2.8, under a volatile irradiance day. The centralized optimization approach produces oscillatory behavior with swings from 0.94 pu to 1.06 pu, occasionally breaching the accepted operational limits of 0.95–1.05 pu. In contrast, the MAPPO-based framework keeps voltages tightly regulated, with most values staying between 0.97 pu and 1.03 pu, well inside the permissible envelope. Quantitatively, the root mean square deviation from nominal (1.0 pu) is reduced from 0.031 pu under centralized optimization to 0.015 pu with MAPPO, representing an improvement of more than 50 percent in voltage stability. The robustness of the MAPPO response is especially evident during rapid irradiance ramps, where centralized optimization struggles to track sudden power injections. For instance, around time step 120, centralized control results in a sharp drop to 0.94 pu, while MAPPO limits the dip to only 0.975 pu, a difference of nearly 3.5 percent in relative voltage deviation. Similarly, at time step 200, when irradiance sharply declines, centralized optimization overshoots to 1.06 pu, while MAPPO keeps the maximum value below 1.03 pu. These quantitative improvements are critical in maintaining system integrity under weak-grid conditions, where even small deviations can destabilize inverter clusters or trigger protective disconnections.

While the proposed framework has been validated through extensive high-fidelity simulations, future work will focus on hardware-in-the-loop (HIL) experimental validation to further assess its practical deployability. In a realistic deployment setting, inverter controllers operate under strict timing and communication constraints, where control latency, measurement delays, and asynchronous updates can affect closed-loop performance. Evaluating the proposed MAPPO-based control framework in an HIL environment will enable systematic investigation of these non-ideal factors, including controller execution latency, communication delays between centralized training modules and local inverter controllers, and the impact of discretized control cycles on voltage regulation performance.
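The bus 18 voltage metrics quoted for Figure 6 follow from two simple calculations: the root mean square deviation from 1.0 pu, and a count of samples outside the 0.95–1.05 pu band. The sketch below defines both and checks the relative improvement implied by the reported 0.031 pu and 0.015 pu values. The function names are illustrative, and the two short voltage lists are hypothetical samples that merely include the extreme values mentioned in the text.

```python
import math


def rms_deviation(voltages_pu, nominal=1.0):
    """Root mean square deviation of a voltage series from nominal."""
    return math.sqrt(sum((v - nominal) ** 2 for v in voltages_pu)
                     / len(voltages_pu))


def band_violations(voltages_pu, v_min=0.95, v_max=1.05):
    """Number of samples outside the permitted voltage band."""
    return sum(1 for v in voltages_pu if v < v_min or v > v_max)


def relative_reduction(baseline, improved):
    return 1.0 - improved / baseline


# Relative improvement implied by the RMS values reported in the text.
rms_improvement = relative_reduction(0.031, 0.015)  # about 0.516

# HYPOTHETICAL samples containing the extremes quoted for each method.
centralized_pu = [0.94, 1.00, 1.06, 0.98]
mappo_pu = [0.975, 1.00, 1.03, 0.99]
print(round(rms_improvement, 3),
      band_violations(centralized_pu), band_violations(mappo_pu))
# prints: 0.516 2 0
```

The reported RMS values correspond to a reduction of about 52 percent, consistent with the "more than 50 percent" statement for this bus.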

From an implementation perspective, the decentralized execution nature of the proposed approach is well aligned with practical inverter hardware. Once training is completed offline, each inverter agent executes a lightweight policy inference step locally, resulting in millisecond-level computational overhead that is compatible with typical inverter control cycles. Future HIL studies will quantify the sensitivity of the learned policies to varying control and communication latencies and evaluate robustness under realistic timing constraints. Such validation will provide a critical bridge between simulation-based performance assessment and field deployment in inverter-dominated distribution networks.

The training curves in Figure 7 display the progression of average episodic rewards over 5000 episodes for two approaches: independent PPO agents and the MAPPO framework with a shared global critic. Independent PPO agents exhibit slower convergence, reaching an average reward of only about 120 by episode 5000, with significant oscillations in the learning trajectory. MAPPO, by contrast, achieves an average reward of nearly 160 by the same point, with smoother, more stable growth. The performance gap of roughly 25 to 30 percent in final reward values clearly indicates that the centralized training mechanism accelerates learning and yields more effective policies. A closer examination of the convergence rate reveals that MAPPO surpasses the 100-reward threshold within the first 1500 episodes, while independent PPO requires closer to 3000 episodes to achieve the same benchmark. This halving of training time directly translates to computational efficiency, particularly relevant when training on large networks with dozens of agents and thousands of stochastic scenarios. The reduced variance of the MAPPO curve, with fluctuations constrained within ±10 reward units compared to ±20 for independent PPO, further illustrates the stability benefits of parameter sharing and coordinated gradient updates.

The implications of this convergence behavior extend beyond training efficiency. Faster and more reliable convergence ensures that policies trained under MAPPO are more likely to generalize across unseen scenarios, a critical requirement in power systems where operating conditions vary widely. By stabilizing training, MAPPO reduces the risk of suboptimal or divergent policies, thereby strengthening confidence in its practical deployability. The figure therefore underscores the methodological contribution of combining centralized training with decentralized execution, which is particularly suited to complex, nonlinear systems like distribution grids with heterogeneous inverter agents.
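Convergence statements such as "surpasses the 100-reward threshold within 1500 episodes" depend on how the noisy reward curve is smoothed. The sketch below shows one common convention, the first episode at which a trailing moving average reaches the threshold, and then reproduces the arithmetic behind the reported comparison. The window length and function name are assumptions, since the paper does not state how the threshold crossing was measured.

```python
from collections import deque


def episodes_to_threshold(rewards, threshold, window=50):
    """First 1-based episode whose trailing moving average >= threshold.

    Returns None if the threshold is never reached.
    """
    buf = deque(maxlen=window)
    total = 0.0
    for episode, reward in enumerate(rewards, start=1):
        if len(buf) == window:
            total -= buf[0]  # value about to be evicted by append()
        buf.append(reward)
        total += reward
        if len(buf) == window and total / window >= threshold:
            return episode
    return None


# Toy check on a linearly increasing reward sequence 0, 1, ..., 9.
toy_crossing = episodes_to_threshold(list(range(10)), threshold=5, window=3)

# Arithmetic behind the figures reported in the text.
episode_saving = 1 - 1500 / 3000          # 0.5: half as many episodes
gap_vs_mappo = (160 - 120) / 160          # 0.25 of MAPPO's final reward
gap_vs_ppo = (160 - 120) / 120            # about 0.33 of PPO's final reward
print(toy_crossing, episode_saving, gap_vs_mappo, round(gap_vs_ppo, 2))
# prints: 7 0.5 0.25 0.33
```

The size of the final-reward gap depends on the reference: the 40-unit difference is 25 percent of MAPPO's final reward and about 33 percent of independent PPO's.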

5. Conclusions

This study has proposed a MAPPO-based coordinated control framework for photovoltaic inverter clusters in active distribution networks, distinguishing itself from existing optimization-based and single-agent learning approaches through its ability to explicitly address multi-agent non-stationarity, scalability, and stochastic operating conditions. Unlike conventional centralized optimization methods, which rely on synchronous computation and accurate system models, the proposed framework enables decentralized real-time execution while preserving global coordination through centralized training.

Beyond the quantitative performance improvements, this work fundamentally changes how distribution system operation can be conceived in inverter-dominated networks. Rather than relying on static droop settings, sequential optimization, or centrally coordinated control schemes, the proposed MAPPO-based framework enables adaptive, learning-driven, and inherently coordinated inverter operation that evolves with network conditions. Control decisions are no longer precomputed offline or imposed hierarchically, but instead emerge from continuous interaction between inverter agents and the physical grid, allowing the system to respond proactively to stochastic generation, load variability, and weak-grid conditions. This shift transforms inverter coordination from a rule-based or optimization-centric process into a data-driven operational paradigm that embeds physical feasibility, scalability, and robustness directly into real-time control. In addition, the proposed framework establishes a concrete pathway for integrating advanced inverter functionalities—such as coordinated voltage support, curtailment-aware dispatch, and adaptive mode selection—into future distribution system operation. By demonstrating that multi-agent learning can simultaneously enhance voltage regulation, reduce curtailment, and maintain stable convergence behavior, this work confirms that learning-based coordination is not merely an auxiliary optimization tool, but a core operational mechanism for next-generation active distribution networks.

Compared with centralized mixed-integer optimization, the proposed MAPPO approach reduces the root-mean-square voltage deviation by more than 40% under high photovoltaic penetration and weak-grid conditions, while achieving an average 25% reduction in network losses and nearly 50% reduction in photovoltaic curtailment ratios. In contrast to independent reinforcement learning methods, MAPPO demonstrates significantly faster and more stable convergence, reaching comparable reward levels in approximately 50% fewer training episodes and exhibiting substantially lower variance in policy performance across stochastic irradiance scenarios.

These quantitative improvements indicate that the proposed method not only enhances voltage stability and energy utilization but also provides a more robust and scalable solution for inverter-dominated distribution systems than existing learning-based and optimization-based strategies reported in the literature. By integrating detailed physical constraints with coordinated multi-agent learning, the framework bridges the gap between theoretical reinforcement learning advances and practical power system operation. The results suggest that MAPPO offers a viable pathway for improving hosting capacity, operational resilience, and renewable integration in future active distribution networks.

Author Contributions

Conceptualization, G.W.; methodology, S.W. and P.Y.; software, S.W.; validation, G.W., Y.C. and S.W.; formal analysis, S.S., Y.C. and X.Z.; investigation, Y.C.; resources, P.Y. and X.Z.; data curation, S.S.; writing—original draft, G.W.; visualization, S.S. and P.Y.; supervision, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Science and Technology Project of State Grid Shandong Electric Power Company (Research on System Analysis, Optimal Control, and Planning Layout Technologies for Power Grid-Forming Device Integration Supporting Large Power Grid Security and Stable Operation, No. 52062625000Z).

Data Availability Statement

Data are available upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. System architecture of the MAPPO-based optimization framework.

Figure 2. Daily irradiance distributions across a 7-day horizon.

Figure 3. Distribution of photovoltaic curtailment ratios.

Figure 4. Weekly photovoltaic generation profiles.

Figure 5. Load duration curve.

Figure 6. Resilience of bus voltage under weak-grid conditions (SCR = 2.8).

Figure 7. Policy learning convergence comparison.

Table 1. Decision variables and physical interpretations.

Symbol	Description	Physical Meaning	Controlled by
$P_{g, τ}^{inj}$	Active power injection	Real power output of inverter g at time $τ$	Agent
$Q_{g, τ}^{inj}$	Reactive power injection	Reactive power for voltage regulation	Agent
$V_{n, τ}$	Bus voltage	Voltage magnitude at bus n	Environment
$κ_{g}^{p}$	P–f droop coefficient	Active power sensitivity to frequency deviation	Agent
$κ_{g}^{q}$	Q–V droop coefficient	Reactive power sensitivity to voltage deviation	Agent
$δ_{g, τ}^{mode}$	Operating mode	Grid-forming or grid-following mode	Agent
$ϕ_{g, τ}^{p}$	Active coordination variable	Coordination signal for active power	Agent
$θ_{g, τ}^{q}$	Reactive coordination variable	Coordination signal for reactive power	Agent
$Λ_{g, τ}^{irr}$	Available PV power	Maximum power from irradiance	Environment
$P_{n, τ}^{load}$	Active load	Active power demand at bus n	Environment
$Q_{n, τ}^{load}$	Reactive load	Reactive power demand at bus n	Environment
$Δ P_{g, τ} (ω)$	Power deviation	Stochastic power fluctuation	Environment
$Δ V_{n, τ} (ω)$	Voltage deviation	Voltage fluctuation under uncertainty	Environment

