1. Introduction
The growing global commitment to carbon neutrality and the rapid deployment of distributed renewable energy resources are driving a paradigm shift in the operation of power systems. Among these resources, photovoltaic (PV) generation has experienced the most rapid proliferation due to declining installation costs, policy incentives, and advancements in inverter and control technologies [1,2]. Consequently, modern distribution networks, historically designed for unidirectional power delivery from substations to end users, are now accommodating large-scale, intermittent, and spatially dispersed photovoltaic installations [3,4]. The stochastic and location-dependent nature of PV generation introduces substantial variability and uncertainty in local power flows, leading to frequent bidirectional power transfers within distribution feeders [5,6]. This operational shift has revealed new technical challenges that were not anticipated in the design of conventional distribution systems. Among them, one of the most prominent is reverse power flow (RPF), which arises when local PV output exceeds the downstream power demand and forces surplus power to propagate upstream towards higher voltage levels [7,8]. The occurrence of RPF is particularly pronounced in residential feeders with dense PV deployment, correlated solar irradiance patterns, and limited demand diversity [1,9]. Persistent RPF not only modifies the intended power flow patterns but also undermines operational stability and introduces significant economic implications for distribution network operators.
RPF imposes multifaceted technical and economic impacts on distribution systems. Excessive upstream power injection can lead to voltage rise beyond limits, adversely affecting power quality and potentially damaging customer equipment [4,10]. Transformers and substations exposed to reverse loading are subjected to thermal stress, accelerated aging, and possible overloading [5,11]. Moreover, conventional protection schemes designed under the assumption of unidirectional fault currents may misoperate under reversed flow conditions, posing risks to system reliability and safety [12]. In addition, persistent RPF contributes to increased line losses, complicates voltage regulation, and limits the renewable hosting capacity of distribution feeders, ultimately constraining the broader integration of distributed energy resources [13,14]. These issues motivate the need for advanced RPF mitigation strategies that can ensure safe, reliable, and economically efficient operation of active distribution networks under high PV penetration.
Various mitigation strategies have been proposed in the literature, including PV curtailment, reactive power support from smart inverters, coordinated control of on-load tap changers (OLTCs), and deployment of energy storage systems [8,15,16,17]. While effective in specific cases, these strategies have inherent limitations. Curtailment reduces renewable utilization and economic benefits, reactive power control may elevate network losses and accelerate inverter degradation, and hardware-based solutions such as capacitor banks or STATCOMs demand substantial capital investment [18,19]. An increasingly promising alternative is distribution network reconfiguration (DNR), which dynamically alters the status of sectionalizing and tie switches to reshape the network topology. DNR can redistribute power flows, mitigate congestion, and alleviate RPF without extensive infrastructure upgrades [20]. However, operationalizing DNR in PV-rich networks requires selecting from a large set of feasible switch actions while PV and loads vary rapidly, motivating adaptive, topology-constrained decision-making [21].
In this context, recent advances in artificial intelligence, particularly deep reinforcement learning (DRL), offer a promising and adaptive approach to the complex challenges of DNR. DRL is well suited to high-dimensional state spaces, nonlinear power system dynamics, and sequential decision-making under uncertainty, enabling autonomous learning of control policies without requiring explicit system models [22]. In power systems, DRL has been successfully applied to voltage regulation, optimal power flow, demand response, and DNR [23,24,25,26,27,28,29,30,31]. Recent studies increasingly explore multi-agent [15,32,33], hierarchical [28,34], and safe DRL [31,35] to improve scalability, coordinate distributed decision-making, decompose tasks across temporal and spatial scales, and explicitly account for operational risks and safety constraints. In parallel, graph neural networks (GNNs) are integrated as topology-aware feature extractors, enabling DRL policies and value functions to operate on graph-structured grid representations and generalize more effectively across varying network topologies and operating conditions [21,36].
Despite these advances, relatively few studies have explicitly focused on RPF mitigation. In [37], a double deep Q-network approach is employed to mitigate RPF and reduce transformer load rates by optimizing the operation of battery energy storage systems and electric vehicle charging stations. Ref. [34] introduces a hierarchical DRL framework that enhances local PV absorption while explicitly constraining PV injection to prevent RPF, achieved through dynamic topology reconfiguration. Furthermore, ref. [38] presents a DRL-based method for analyzing the distributed photovoltaic carrying capacity of distribution grids, incorporating transformer RPF as a constraint to ensure system stability. Despite these contributions, the practical deployment of DRL in distribution networks presents several critical challenges. The combinatorial explosion of action spaces resulting from multiple switch operations complicates the learning process and often leads to inefficient exploration. Moreover, maintaining radiality and ensuring load connectivity during reconfiguration is nontrivial, as unrestrained exploration can result in infeasible network topologies. Addressing these challenges requires specialized DRL algorithms that can effectively integrate operational constraints, handle large combinatorial action spaces, and prioritize RPF mitigation as a central control objective.
Motivated by these requirements, this paper proposes an intelligent topology control method based on a multi-discrete Maskable PPO (MPPO) algorithm. The main contributions of this work are summarized as follows:
A reinforcement learning-based topology control method is developed to mitigate RPF and voltage violations in distribution networks with high PV penetration. The method explicitly incorporates practical operational constraints, including network radiality, load connectivity, and line loading limits.
A multi-discrete MPPO algorithm is introduced to handle the combinatorial nature of switch operations in large distribution networks. The multi-discrete action space allows simultaneous consideration of multiple switches and restricts the maximum number of controlled switches, improving exploration efficiency and practical feasibility, while the mask mechanism dynamically adapts to the valid actions available in each scenario, accounting for variations in controllable breakers across different operating conditions.
Extensive validation using one-year historical data from a practical distribution system demonstrates that the proposed method achieves fast convergence, stable training, and effective mitigation of RPF and voltage violations, highlighting its potential for scalable application in PV-rich distribution networks.
2. Proposed Method
2.1. System Overview
The transition towards PV-rich distribution networks requires a shift from static, manually operated systems to dynamic frameworks capable of intelligent reconfiguration. This study considers a representative 10 kV distribution network, illustrated in Figure 1, which captures the operational challenges associated with high PV penetration. The network integrates a large-scale deployment of distributed PV units, in addition to two centralized PV plants, PV1 and PV2. RPF arises from the combined effects of PV1 and PV2, and contributions from the distributed PV units within the 10 kV grid. The resulting upstream power is injected into the two main transformers, T1 (63 MVA) and T2 (31.5 MVA), subjecting them to atypical thermal stress, accelerating insulation aging, and potentially exceeding their rated capacity, which can compromise transformer lifespan and induce voltage rise in the upstream network.
The system architecture is hierarchically organized into distinct operational zones to delineate power flow paths and control responsibilities. The 110 kV main grid forms the high-voltage backbone, while the downstream 10 kV network bifurcates into a PV generation zone and a load zone. The load zone is critically equipped with a reconfigurable mesh of remotely controlled switches, comprising normally open tie switches and normally closed head-end or sectionalizing switches. The dynamic reconfiguration of this network, achieved by altering the open/closed status of these switches, serves as the primary mechanism to redirect power flows, alleviate RPF, and mitigate associated voltage violations and equipment overloads. However, this process is highly constrained: any closure of a tie switch to reroute power must be accompanied by the simultaneous opening of another switch to preserve radiality and avoid unintended loops. Under fluctuating generation and load profiles, the resulting combinatorial decision problem can become challenging and time-sensitive for operators. These practical difficulties motivate the exploration of automated, learning-based approaches that can assist in systematically evaluating feasible topological actions and supporting informed decision-making in distribution network operation.
2.2. Problem Description and Mathematical Modeling
The operational challenge of mitigating RPF in PV-rich distribution networks is formalized as a dynamic topology optimization problem. The primary objective is to minimize the total RPF of primary transformers through adjustment of switch configurations while maintaining secure operation under stochastic generation and load conditions. The optimization objective is mathematically defined as:
$$\min_{\mathbf{s}_t} \sum_{i \in \mathcal{T}} P^{\mathrm{rev}}_{i,t}, \qquad P^{\mathrm{rev}}_{i,t} = \max\left(-P_{i,t},\, 0\right) \tag{1}$$

where $P^{\mathrm{rev}}_{i,t}$ is the RPF of transformer $i$ at time $t$, i.e., the magnitude of active power exported from transformer $i$ to the higher-voltage level, and $P_{i,t}$ denotes the signed active power of transformer $i$ at time $t$, where $P_{i,t} > 0$ indicates power import and $P_{i,t} < 0$ indicates power export. $\mathcal{T}$ is the set of 110 kV transformers. Control is executed through manipulation of the binary switch state vector:

$$\mathbf{s}_t = \left[ s_{1,t}, s_{2,t}, \ldots, s_{N_s,t} \right], \qquad s_{i,t} \in \{0, 1\} \tag{2}$$

where $\mathbf{s}_t$ denotes the switch state vector at time $t$, $s_{i,t}$ represents the status of individual switches (0 for open, 1 for closed), and $N_s$ is the total number of controllable switches.
The optimization is constrained by several critical operational requirements. The radiality constraint ensures that the network maintains a tree structure and that no cycles are allowed:

$$\sum_{l \in \mathcal{E}} \alpha_l = N_b - N_{\mathrm{sub}} \tag{3}$$

together with the requirement that every bus remains connected to a substation, where $\mathcal{G} = (\mathcal{B}, \mathcal{E})$ represents the undirected distribution system topology graph, $\mathcal{B}$ is the bus set and $\mathcal{E}$ is the line set; $\alpha_l \in \{0, 1\}$ denotes the connectivity of line $l \in \mathcal{E}$, where 0 is disconnected and 1 is connected; $N_b$ is the total number of nodes and $N_{\mathrm{sub}}$ represents the number of substations.
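To make the radiality requirement concrete, the following minimal Python sketch checks condition (3) on a candidate topology using the networkx library; the function name and data layout are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the radiality check in (3), assuming a graph built
# from the closed lines only. Names are illustrative, not from the paper.
import networkx as nx

def is_radial(buses, closed_lines, substations):
    """Check tree structure: |closed lines| = N_b - N_sub, no cycles,
    and every bus reachable from some substation."""
    g = nx.Graph()
    g.add_nodes_from(buses)
    g.add_edges_from(closed_lines)          # only lines with alpha_l = 1
    if g.number_of_edges() != len(buses) - len(substations):
        return False                        # edge-count condition of (3)
    if len(nx.cycle_basis(g)) > 0:
        return False                        # no cycles allowed
    # every bus must stay connected to at least one substation
    reachable = set()
    for s in substations:
        reachable |= nx.node_connected_component(g, s)
    return reachable == set(buses)
```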
The power flow constraints govern the physical network behavior:

$$P_i - P^{D}_i = V_i \sum_{j \in \mathcal{B}} V_j \left( G_{ij} \cos\theta_{ij} + B_{ij} \sin\theta_{ij} \right), \qquad Q_i - Q^{D}_i = V_i \sum_{j \in \mathcal{B}} V_j \left( G_{ij} \sin\theta_{ij} - B_{ij} \cos\theta_{ij} \right) \tag{4}$$

where $P_i$ and $Q_i$ represent active and reactive power injections to node $i$, $P^{D}_i$ and $Q^{D}_i$ denote power demands, $V_i$ and $\theta_i$ are bus voltage magnitudes and angles (with $\theta_{ij} = \theta_i - \theta_j$), and $G_{ij}$ and $B_{ij}$ are the conductance and susceptance of line $(i, j)$.
Security constraints include bus voltage limits:

$$V_i^{\min} \le V_{i,t} \le V_i^{\max} \tag{5}$$

where $V_{i,t}$ is the voltage per-unit (p.u.) value of bus $i$ at time $t$; $V_i^{\min}$ and $V_i^{\max}$ denote the required minimum and maximum voltage p.u. values of bus $i$.
Line load rate:

$$\lambda_{i,t} = \frac{\left| P^{L}_{i,t} \right|}{P^{L,\max}_{i}} \le \lambda_i^{\max} \tag{6}$$

where $\lambda_{i,t}$ represents the load rate of line $i$ at time $t$, $P^{L}_{i,t}$ is the active power of line $i$, $P^{L,\max}_{i}$ is the thermal limit of line $i$, and $\lambda_i^{\max}$ is the maximum allowed load rate of line $i$.
Power outage:

$$\sum_{i \in \Omega_{\mathrm{out}}} P^{D}_{i,t} = 0 \tag{7}$$

where $P^{D}_{i,t}$ is the active power of load $i$ at time $t$, and $\Omega_{\mathrm{out}}$ is the set of loads that are out of supply; the constraint requires that no load is disconnected.
The formulation involves discrete switching decisions coupled with nonlinear AC power-flow physics and multiple operational constraints, including radiality, voltage limits, thermal ratings, and load connectivity. Complexity arises from the combinatorial explosion of feasible switch configurations, the non-convexity of network equations, and the stochastic, time-varying behavior of PV generation and loads. These characteristics call for feasibility-aware, time-efficient decision making tailored to large action spaces and uncertainty. Motivated by this, we develop a distributed deep reinforcement learning framework that learns scalable policies for topology control, incorporates constraint handling to maintain operational feasibility, and accounts for variability while targeting RPF mitigation and voltage compliance.
2.3. Problem Formulation as Multi-Armed Bandit
The topology reconfiguration task can be formulated as a contextual decision problem and described concretely by the tuple $(\mathcal{X}, \mathcal{A}, r)$, comprising the state space, the action space, and the reward function. At each decision timestep $t$, the agent observes a state $\mathbf{x}_t \in \mathcal{X}$ from the environment, selects an action $\mathbf{a}_t \in \mathcal{A}$ that induces a new switch configuration $\mathbf{s}_{t+1}$, and receives a scalar reward $r_t$ that quantifies the degree to which the action mitigates reverse power flow while preserving operational security. The following paragraphs provide an explicit, mathematically precise specification of the state space, the action space, and the reward design used in this work.
The state vector $\mathbf{x}_t$ is constructed to give the agent a complete and compact description of the electrical and topological conditions of the distribution network at time $t$, and is defined as follows:

$$\mathbf{x}_t = \left[ \mathbf{s}_t, \mathbf{V}_t, \mathbf{P}^{\mathrm{PV}}_t, \mathbf{Q}^{\mathrm{PV}}_t, \mathbf{P}^{\mathrm{D}}_t, \mathbf{Q}^{\mathrm{D}}_t, \mathbf{P}^{\mathrm{L}}_t, \mathbf{Q}^{\mathrm{L}}_t, \mathbf{P}^{\mathrm{T}}_t, \mathbf{Q}^{\mathrm{T}}_t \right] \tag{8}$$

where $\mathbf{s}_t$ is the binary switch status vector with $s_{i,t} \in \{0, 1\}$, $\mathbf{V}_t$ are bus voltage magnitudes (p.u.), $\mathbf{P}^{\mathrm{PV}}_t$ and $\mathbf{Q}^{\mathrm{PV}}_t$ collect active and reactive power of PV stations, $\mathbf{P}^{\mathrm{D}}_t$ and $\mathbf{Q}^{\mathrm{D}}_t$ are active and reactive load demands, $\mathbf{P}^{\mathrm{L}}_t$ and $\mathbf{Q}^{\mathrm{L}}_t$ are per-line active and reactive flows, and $\mathbf{P}^{\mathrm{T}}_t$ and $\mathbf{Q}^{\mathrm{T}}_t$ denote transformer active and reactive power exchanges at the substations. Min-max normalization is applied to each continuous component prior to use by the learning algorithms to improve numerical conditioning; the binary switch vector is preserved to retain explicit topological information. By design, $\mathbf{x}_t$ embeds both electrical magnitudes and topology; therefore, contextual evaluation of switching actions accounts for their network-wide electrical consequences under stochastic PV and load realizations.
The action space $\mathcal{A}$ is intuitively defined as a multi-binary space:

$$\mathcal{A} = \{0, 1\}^{N_s}, \qquad \mathbf{a}_t = \left[ a_{1,t}, \ldots, a_{N_s,t} \right] \tag{9}$$

where $a_{i,t}$ represents the action of whether to toggle switch $i$; the value 0 means do nothing, and 1 stands for toggling the switch. $\Omega_s$ is the complete set of controllable switches in the system, with $|\Omega_s| = N_s$. The switch state transfer equation is given by:

$$\mathbf{s}_{t+1} = \mathbf{s}_t \oplus \mathbf{a}_t \tag{10}$$

where $\oplus$ denotes bitwise XOR. Alternatively, $a_{i,t}$ can be interpreted as the state of switch configuration $i$, where the value 0 means the switch is set open and 1 represents that the switch is set closed. In this case, $\mathbf{s}_{t+1} = \mathbf{a}_t$ is the transfer equation of the switching state.
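As a small illustration of the two transfer conventions, the snippet below (hypothetical arrays, NumPy semantics) contrasts the XOR toggle update in (10) with the direct set-state alternative.

```python
# A tiny illustration of the two transfer conventions in (10): toggle
# semantics (XOR) versus direct set-state semantics. Purely illustrative.
import numpy as np

s_t = np.array([1, 0, 1, 1], dtype=np.int8)   # current switch states
a_t = np.array([0, 1, 1, 0], dtype=np.int8)   # toggle action

s_next_toggle = s_t ^ a_t                     # XOR: flips switches 2 and 3
s_next_set    = a_t                           # alternative: a_t is the new state
print(s_next_toggle)                          # [1 1 0 1]
```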
The reward function is designed to align directly with the operational objective of mitigating RPF, while simultaneously ensuring voltage and thermal security, maintaining topology radiality, and discouraging load shedding. The instantaneous reward is given by:

$$r_t = \begin{cases} r_{\mathrm{nc}}, & \text{if the power flow does not converge} \\ r_{\mathrm{cyc}}, & \text{if the action creates cycles in the topology} \\ r_{\mathrm{sol}}, & \text{if the RPF is completely solved and all constraints are met} \\ r_{\mathrm{other}}, & \text{otherwise} \end{cases} \tag{11}$$

where $r_{\mathrm{nc}}$ is a constant penalty for unconvergent power flow, $r_{\mathrm{cyc}}$ is a constant penalty for creating cycles in the topology, $r_{\mathrm{sol}}$ is a constant reward for the RPF being completely solved with all constraints met, and $r_{\mathrm{other}}$ is the reward function for other cases, which is defined as:

$$r_{\mathrm{other}} = \omega_1 r_{\mathrm{RPF}} + \omega_2 r_{\mathrm{V}} + \omega_3 r_{\mathrm{line}} + r_{\mathrm{out}} \tag{12}$$

where $r_{\mathrm{RPF}}$ is the reward for the decrease of the RPF ratio, $r_{\mathrm{V}}$ is the reward for the decrease of overvoltage and undervoltage of buses, $r_{\mathrm{line}}$ is the penalty for line overflow, and $r_{\mathrm{out}}$ is the penalty for power outages; reward shaping via the coefficients $\omega_1$, $\omega_2$, and $\omega_3$ is used to align the relative influence of each component with operational priorities and to place heterogeneous magnitudes on comparable scales during training. These weights serve precisely to balance the dimensionless indices against the power outage penalty so that no single component dominates the learning signal. This reward structure ensures that actions that materially reduce transformer reverse flow without inducing voltage or thermal violations yield the highest returns, while actions that eliminate RPF at the expense of violating hard limits or shedding load are disfavored. By embedding these physical penalties into the reward $r_t$, the DRL agent is guided toward policies that trade off RPF mitigation, security constraints, and operational pragmatism in a manner consistent with real-world operational priorities.
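The piecewise structure of (11) and (12) can be sketched as follows; the constants and weights shown are illustrative placeholders (the actual values are listed in the experiment settings), and the solver-result fields are assumed names rather than the paper's interface.

```python
# A minimal sketch of the piecewise reward in (11)-(12), assuming the solver
# exposes convergence, cycle, and violation flags. All names and numbers
# (R_NC, R_CYC, R_SOL, w1..w3) are illustrative placeholders.
R_NC, R_CYC, R_SOL = -10.0, -5.0, 10.0   # assumed constant penalties/reward
w1, w2, w3 = 1.0, 1.0, 1.0               # assumed shaping coefficients

def reward(result):
    if not result.converged:
        return R_NC                       # unconvergent power flow
    if result.has_cycles:
        return R_CYC                      # radiality violated
    if result.rpf_total == 0.0 and result.all_constraints_met:
        return R_SOL                      # RPF fully solved, constraints met
    return (w1 * result.r_rpf             # decrease of the RPF ratio
            + w2 * result.r_voltage       # decrease of over/undervoltage
            + w3 * result.r_line          # line overflow penalty (<= 0)
            + result.r_outage)            # load outage penalty (<= 0)
```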
3. Proposed Method Incorporating Multi-Discrete Maskable PPO
To effectively address the distribution network reconfiguration problem under uncertain renewable generation and demand fluctuations, this study develops a decision-making framework based on the proximal policy optimization (PPO) algorithm enhanced with two key extensions: multi-discrete action representation and maskable action constraints. These modifications are specifically designed to align reinforcement learning with the operational requirements of distribution grids, thereby improving both the efficiency of training and the feasibility of the derived control policies.
3.1. Multi-Discrete Action Space Formulation
In distribution system reconfiguration, the action space exhibits a combinatorial nature because switching operations must be selected from a set of candidate lines under radiality and operational constraints. A naive single-discrete representation, in which every possible switching configuration is enumerated as a distinct action, is computationally prohibitive since the cardinality of the space grows exponentially with the number of controllable switches. The multi-binary formulation in (9) shrinks the action space; however, exploration can be extremely slow when the number of controllable switches is large, and the maximum number of toggled switches is hard to control in this formulation.

To overcome these limitations, this work adopts a multi-discrete action representation tailored to distribution network reconfiguration. The key idea is that the agent is allowed to control at most $N_a$ switches during a single decision epoch, and each control decision is encoded as a discrete variable. Formally, the action is represented as a vector:

$$\mathbf{a}_t = \left[ a_{1,t}, a_{2,t}, \ldots, a_{N_a,t} \right], \qquad a_{k,t} \in \{1, 2, \ldots, N_s + 1\} \tag{13}$$

where each element $a_{k,t}$ takes a value in $\{1, \ldots, N_s + 1\}$; a value $a_{k,t} \le N_s$ denotes the index of the selected switch, while $a_{k,t} = N_s + 1$ corresponds to a “no-operation” action. This formulation yields a multi-discrete space of dimension $N_a$, where the effective size of the space is $(N_s + 1)^{N_a}$.
Compared with alternative formulations, the proposed multi-discrete design offers several advantages. Relative to a single-discrete encoding, it avoids exponential blow-up, since the agent is no longer required to consider the entire set of feasible topologies as atomic actions. In contrast to multi-binary encoding, which treats all switches simultaneously, the multi-discrete representation provides explicit control over the maximum number of switches that can be operated, thereby reducing the exploration burden and aligning more closely with realistic operational practices where only a limited number of switching actions are permissible in a given time frame. Moreover, the inclusion of a “do-nothing” action at each sub-decision grants the policy the flexibility to adaptively determine how many switches to operate, ranging from zero up to $N_a$.
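For concreteness, such a space can be declared directly with Gymnasium's MultiDiscrete type, as in the sketch below; the dimensions shown ($N_s = 350$, $N_a = 10$) are taken from the case study, and the zero-based index convention is an implementation detail rather than the paper's notation.

```python
# A sketch of the multi-discrete action space in (13) using Gymnasium,
# assuming N_s = 350 controllable switches and N_a = 10 action slots.
from gymnasium import spaces

N_S, N_A = 350, 10
# each of the N_a slots selects one of N_s switches or the "no-op" index
action_space = spaces.MultiDiscrete([N_S + 1] * N_A)

a = action_space.sample()   # e.g., an array of 10 indices in [0, 350]
# by the zero-based convention here, index N_S denotes "no operation"
```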
3.2. Maskable Proximal Policy Optimization
Conventional PPO is a policy-gradient algorithm that optimizes a clipped surrogate objective, providing stable policy updates while balancing exploration and exploitation. At decision step $t$, the policy $\pi_\theta$ parameterized by $\theta$ is updated using the surrogate objective introduced in [39]. However, standard PPO assumes a fixed and static action space. In DNR, the feasible action set varies dynamically with the system topology, switch availability, and operational status. For instance, switches that lack remote actuation, are locked for maintenance, are constrained by interlocking rules, or are already isolated should not be treated as controllable actions. If such context-dependent infeasibilities are not explicitly handled, the agent may waste exploration effort on actions that cannot be executed in practice, slowing convergence and reducing policy interpretability.
To overcome these limitations, the MPPO integrates an adaptive masking mechanism into the policy [40]. In this work, the valid action space is determined by the currently controllable switches extracted from the case file. To incorporate the multi-discrete formulation in (13), the mask matrix $M_t$ is constructed as:

$$M_t = \left[ m_{k,j,t} \right] \in \{0, 1\}^{N_a \times (N_s + 1)} \tag{14}$$

where the entry $m_{k,j,t}$ indicates the validity of selecting action $j$ in the $k$-th action dimension. For $j \le N_s$, a value of $m_{k,j,t} = 1$ denotes that switch $j$ is currently controllable, whereas $m_{k,j,t} = 0$ denotes that it is unavailable due to network or operational constraints. All rows of $M_t$ are identical, so that each action dimension shares the same set of admissible switches. The last column $j = N_s + 1$ corresponds to the “no-operation” action and is fixed to 1 for all $k$, since this action is always permissible. This masking strategy dynamically adapts to changes in the set of controllable switches, ensuring that the policy explores only feasible actions without violating the current topological configuration. Consequently, the agent’s exploration is focused on valid operational regions, which enhances convergence stability and training efficiency in the multi-discrete decision space.
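A minimal sketch of this construction is given below, assuming the set of controllable switch indices has been extracted from the case file; all names are illustrative.

```python
# A minimal sketch of the mask matrix M_t in (14): identical rows over the
# N_a action slots, with the last column (the "no-op" index) always valid.
import numpy as np

def build_mask(controllable_switch_ids, n_switches, n_slots):
    row = np.zeros(n_switches + 1, dtype=bool)
    row[list(controllable_switch_ids)] = True   # valid switch indices
    row[-1] = True                              # "no-operation" always allowed
    return np.tile(row, (n_slots, 1))           # shape: (N_a, N_s + 1)

M_t = build_mask({3, 17, 42}, n_switches=350, n_slots=10)
```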
In the MPPO scheme, operational constraints on switch controllability are encoded through the binary action–switch mask $M_t$. At each time step $t$, the policy network outputs, for each action slot $k \in \{1, \ldots, N_a\}$, an unmasked logit vector $\mathbf{z}_{k,t} \in \mathbb{R}^{N_s + 1}$. The mask is then applied at the logit level to obtain the masked logits [40]:

$$\tilde{z}_{k,j,t} = \begin{cases} z_{k,j,t}, & m_{k,j,t} = 1 \\ C, & m_{k,j,t} = 0 \end{cases} \tag{15}$$

where $C$ is a large negative constant; in this work, $|C|$ is chosen large enough that masked actions receive effectively zero probability after the softmax. The categorical policy over the $k$-th slot is defined as:

$$\pi_\theta(a_{k,t} = j \mid \mathbf{x}_t) = \frac{\exp\left( \tilde{z}_{k,j,t} \right)}{\sum_{j'=1}^{N_s + 1} \exp\left( \tilde{z}_{k,j',t} \right)} \tag{16}$$

Therefore, invalid actions receive identically zero probability mass and are never sampled during either data collection or policy evaluation. Policy entropy and log-probabilities are computed with respect to this masked distribution, i.e., over the set of indices $j$ satisfying $m_{k,j,t} = 1$. The overall multi-discrete policy factorizes across slots as:

$$\pi_\theta(\mathbf{a}_t \mid \mathbf{x}_t) = \prod_{k=1}^{N_a} \pi_\theta(a_{k,t} \mid \mathbf{x}_t) \tag{17}$$

while the value function $V_\phi(\mathbf{x}_t)$ is trained on the resulting trajectories and is not directly masked. In practice, $M_t$ can be viewed as a time-varying constraint signal that conditions the policy on the currently admissible actions without altering the critic architecture.
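The masking and factorized sampling in (15)–(17) can be sketched with PyTorch categorical distributions as follows; this is an illustrative implementation under assumed tensor shapes, not the paper's exact code.

```python
# A sketch of the masked per-slot categorical policy in (15)-(17): invalid
# logits are pushed to a large negative constant before the softmax, so
# masked actions get (numerically) zero probability.
import torch
from torch.distributions import Categorical

C = -1e8  # a large negative constant (illustrative value)

def masked_slot_dists(logits, mask):
    """logits, mask: tensors of shape (N_a, N_s + 1); mask is boolean."""
    masked_logits = torch.where(mask, logits, torch.full_like(logits, C))
    return [Categorical(logits=masked_logits[k]) for k in range(logits.shape[0])]

def sample_action(logits, mask):
    dists = masked_slot_dists(logits, mask)
    a = torch.stack([d.sample() for d in dists])               # one index per slot
    logp = torch.stack([d.log_prob(a[k])                       # (17): factorized
                        for k, d in enumerate(dists)]).sum()
    entropy = torch.stack([d.entropy() for d in dists]).sum()  # masked entropy
    return a, logp, entropy
```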
Training follows the standard PPO procedure with clipping, applied to the masked policy. During the policy update, we recompute the masked logits $\tilde{z}_{k,j,t}$, evaluate the new log-probabilities $\log \pi_\theta(\mathbf{a}_t \mid \mathbf{x}_t)$, and form the importance sampling ratio:

$$\rho_t(\theta) = \frac{\pi_\theta(\mathbf{a}_t \mid \mathbf{x}_t)}{\pi_{\theta_{\mathrm{old}}}(\mathbf{a}_t \mid \mathbf{x}_t)} \tag{18}$$

The actor is then updated by maximizing the clipped surrogate objective with entropy regularization:

$$L^{\mathrm{actor}}(\theta) = \mathbb{E}_t \left[ \min\left( \rho_t(\theta) \hat{A}_t,\; \mathrm{clip}\left( \rho_t(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) + \beta\, \mathcal{H}\left( \pi_\theta(\cdot \mid \mathbf{x}_t) \right) \right] \tag{19}$$

where $\epsilon$ is the clipping hyperparameter, $\beta$ is the entropy regularization coefficient, and $\mathcal{H}(\pi_\theta(\cdot \mid \mathbf{x}_t))$ denotes the entropy of the masked policy, which can be implemented as the sum of the entropies of the $N_a$ per-slot categorical distributions over indices $j$ with $m_{k,j,t} = 1$. The critic parameters $\phi$ are updated by minimizing a squared-error loss between $V_\phi(\mathbf{x}_t)$ and the empirical returns. Since invalid actions have zero probability by construction, they never appear in the collected trajectories, and the gradients are concentrated on valid action choices only. This integration of the mask into the PPO framework ensures that the policy optimization respects topology-dependent action constraints while retaining the stability and robustness properties of standard PPO. The detailed pseudo-code of the multi-discrete MPPO is given in Algorithm 1.
| Algorithm 1: Multi-Discrete Maskable PPO (MPPO) |
| | Input: Historical scenarios $\mathcal{D}_{\mathrm{hist}}$; environment with AC solver; policy network $\pi_\theta$; value network $V_\phi$; horizon $T$; mini-batch size $B$; PPO clip $\epsilon$; entropy coefficient $\beta$; GAE parameter $\lambda$; discount factor $\gamma$; PPO epochs $K$; training episodes $N_{\mathrm{ep}}$; other parameters in Table A1. |
| | Output: Trained parameters $\theta$, $\phi$. |
| 1 | Initialize $\theta$, $\phi$; buffer $\mathcal{D} \leftarrow \varnothing$. |
| 2 | for each training iteration do |
| 3 | | | Sample a scenario and reset with its snapshot. |
| 4 | | | for $t = 0$ to $T - 1$ do |
| 5 | | | | | Observe state $\mathbf{x}_t$. |
| 6 | | | | | Construct the binary mask matrix $M_t$. |
| 7 | | | | | for $k = 1$ to $N_a$ do |
| 8 | | | | | | | Compute masked logits $\tilde{\mathbf{z}}_{k,t}$ using (15). |
| 9 | | | | | | | Sample $a_{k,t}$ from the masked categorical distribution (16). |
| 10 | | | | | Assemble $\mathbf{a}_t = [a_{1,t}, \ldots, a_{N_a,t}]$. |
| 11 | | | | | Apply $\mathbf{a}_t$ to $\mathbf{s}_t$; run solver; obtain next state $\mathbf{x}_{t+1}$, convergence flag, done flag $d_t$, and other system information, including transformer powers and bus voltages; compute reward $r_t$ using (11). |
| 12 | | | | | Compute $V_\phi(\mathbf{x}_t)$; compute $\log \pi_\theta(\mathbf{a}_t \mid \mathbf{x}_t)$ using (17). |
| 13 | | | | | Store $(\mathbf{x}_t, \mathbf{a}_t, r_t, \log \pi_\theta(\mathbf{a}_t \mid \mathbf{x}_t), V_\phi(\mathbf{x}_t), M_t, d_t)$ into $\mathcal{D}$. |
| 14 | | | | | if $\mathbf{x}_{t+1}$ is terminal then |
| 15 | | | | | | | break |
| 16 | | | Compute returns $\hat{R}_t$ and advantages $\hat{A}_t$ for all transitions in $\mathcal{D}$ using GAE($\gamma$, $\lambda$). |
| 17 | | | for $e = 1$ to $K$ do |
| 18 | | | | | Sample a mini-batch of size $B$ from $\mathcal{D}$. |
| 19 | | | | | Recompute masked logits and log-probabilities for the mini-batch. |
| 20 | | | | | Compute importance ratios $\rho_t(\theta)$ using (18). |
| 21 | | | | | Compute the masked policy entropy $\mathcal{H}(\pi_\theta(\cdot \mid \mathbf{x}_t))$. |
| 22 | | | | | Form the actor loss $L^{\mathrm{actor}}(\theta)$ as in (19) and the critic loss $L^{\mathrm{critic}}(\phi) = \frac{1}{B} \sum_t \left( V_\phi(\mathbf{x}_t) - \hat{R}_t \right)^2$. |
| 23 | | | | | Update $\theta$ by taking a gradient step on $-L^{\mathrm{actor}}(\theta)$; update $\phi$ by taking a gradient step on $L^{\mathrm{critic}}(\phi)$. |
| 24 | | | Clear buffer $\mathcal{D}$. |
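Lines 17–23 of Algorithm 1 correspond to the standard clipped PPO update; a compact PyTorch sketch of the losses in (18) and (19) is shown below, with tensor names assumed for illustration.

```python
# A compact sketch of the clipped surrogate update in (18)-(19). Tensors
# logp_new/logp_old/adv/values/returns are assumed to come from the buffer.
import torch

def ppo_losses(logp_new, logp_old, adv, values, returns,
               clip_eps=0.2, ent_coef=0.01, entropy=None):
    ratio = torch.exp(logp_new - logp_old)                   # (18)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    actor_loss = -torch.min(surr1, surr2).mean()             # maximize (19)
    if entropy is not None:
        actor_loss = actor_loss - ent_coef * entropy.mean()  # entropy bonus
    critic_loss = torch.nn.functional.mse_loss(values, returns)
    return actor_loss, critic_loss
```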
3.3. Overall Framework
The overall architecture of the proposed DRL-based topology control framework is illustrated in Figure 2. The framework consists of two main stages: (i) offline data analysis and case selection based on historical Supervisory Control and Data Acquisition (SCADA) measurements, and (ii) training of MPPO agents using a commercial AC power flow solver and parallel environments. In the first stage, historical SCADA data are processed by a data analysis and operational risk identification module, which scans one year of recorded operation to identify system snapshots with RPF, overvoltage, or other undesirable operating conditions while ensuring that the underlying network topology is valid. From these filtered snapshots, representative cases with RPF, overvoltage, and valid topology are selected as training and evaluation scenarios. In parallel, a complete list of candidate breakers is constructed, including sectionalizing and tie switches that can participate in reconfiguration while respecting the existing protection and interlocking design. This pre-processing step provides a realistic ensemble of stressed operating points and a well-defined action space for the subsequent learning phase.
During the training stage, a set of parallel environments is instantiated, each coupling the AC power flow solver with security checks for topology radiality, cycle detection, voltage and line loading limits, and load supply. For each historical snapshot, the environment is first initialized and the solver computes the base-case AC power flow, producing the system state and associated diagnostic information under the current topology. Based on this initial solution and switch metadata, a binary mask is constructed to indicate which switches are admissible for reconfiguration in that snapshot, and this mask is provided to the MPPO policy so that infeasible actions receive zero probability. The policy then samples a multi-discrete switching action, which is applied once to the simulator; the solver resolves the AC power flow for the post-switch configuration, checks all security constraints, and the resulting operating condition is used to compute a scalar reward. Each episode thus consists of a single decision step, and independent episodes are generated in parallel across environments using different historical snapshots. Policy networks are shared across all environments to collect these single-step transitions efficiently, while a centralized buffer aggregates samples from all parallel simulations. The value network is updated using the aggregated data, and the policy parameters are optimized using the MPPO objective with entropy regularization. The agent is periodically evaluated on a separate validation set, and the best-performing model is retained. Overall, this design enables a reinforcement learning–based topology control scheme that mitigates RPF and voltage violations under realistic operational constraints, while the multi-discrete MPPO with scenario-dependent action masks handles the combinatorial switch space efficiently and can be trained at scale using parallel simulations.
4. Case Studies
4.1. Data Pre-Processing and Analysis
One-year historical operational data of the distribution system shown in Figure 1 were utilized, covering the period from 1 January 2023 to 31 December 2023. The dataset contains hourly measurements, yielding a total of 8760 samples. Each data sample includes both the electrical and topological states required for the state variable definition in (8). Before model training, the raw data were carefully pre-processed to ensure physical consistency and to extract representative operating conditions. First, all samples exhibiting non-radial topologies were removed to guarantee that the analyzed configurations comply with the radial operation of distribution networks. Second, samples without RPF at the primary transformers were excluded, since this study specifically focuses on the mitigation of RPF of primary transformers. After these filtering steps, 3016 cases remained, each containing one or more overvoltage buses.
Figure 3 provides a statistical overview of the processed dataset. Figure 3a shows the histograms of active power of the two 110 kV main transformers. Transformer 1 operates under reverse power conditions in nearly all cases, with a maximum export of approximately 35 MW. Transformer 2 exhibits reverse power flow in about 70% of the cases, with a maximum magnitude of 38 MW. These results indicate that reverse power flow has become a persistent and significant phenomenon in the studied system. Figure 3b depicts the distribution of overvoltage occurrences among all buses. A total of 36 buses experienced overvoltage events under the initial configurations. Among them, buses 8, 9, 10, 11, 20, and 27 consistently exhibited overvoltage across all samples, as they are directly connected to large-scale PV plant feeders. Since these buses are inherently associated with local generation surplus and are hardly influenced by switching control from the current action space, they were excluded from the optimization objective set. Figure 3c presents a one-month time series of the total reverse power at the substation. Only two days in January 2023 were free from reverse power flow, and the maximum reverse power during the month reached approximately 34 MW. This further confirms the high penetration level of distributed photovoltaic generation in the studied network.
To enhance the training stability of the DRL agent, all continuous variables within the state vector were normalized prior to input to the neural network. Specifically, active and reactive power values of buses, branches, and transformers were scaled using min-max normalization based on their historical extrema, ensuring they lie within the range $[0, 1]$. Bus voltage magnitudes were normalized with respect to their nominal value. Topological indicators, such as switch statuses and connectivity features, were encoded as binary or categorical variables according to their operational meaning. This normalization not only improves numerical conditioning and convergence of gradient-based optimization but also facilitates generalization across different operating conditions and network configurations.
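A minimal sketch of this normalization step is given below, assuming the continuous blocks and their historical extrema are available as NumPy arrays; the function signature is illustrative.

```python
# A minimal sketch of the state normalization described above: min-max
# scaling of continuous blocks from historical extrema, binary switch
# states left untouched. Array names are illustrative.
import numpy as np

def normalize_state(cont, cont_min, cont_max, switch_states, v_pu):
    # min-max scale continuous P/Q quantities into [0, 1]
    scaled = (cont - cont_min) / np.maximum(cont_max - cont_min, 1e-9)
    # voltages are already expressed relative to nominal (p.u.)
    return np.concatenate([switch_states.astype(np.float32),
                           v_pu.astype(np.float32),
                           scaled.astype(np.float32)])
```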
For the reinforcement learning process, the dataset was partitioned into training and testing subsets. A total of 500 randomly selected system states were assigned to the testing set, while the remaining samples were used for training. During both training and evaluation, the agent’s objective was to relieve transformer reverse power flow and bus overvoltage simultaneously, while maintaining power flow solvability, enforcing network radiality, preventing any load disconnection, and ensuring that line loading limits were not violated.
4.2. Experiment Settings
The experiments were conducted on the distribution network illustrated in Figure 1, whose parameters are summarized in Table 1. The system is supplied by two 110 kV/35 kV/10 kV three-winding transformers, rated at 63 MVA and 31.5 MVA, respectively. Two PV plants, rated at 15 MW and 9 MW, are connected to the 35 kV network. The selected main system components used to construct the state space include 21 main feeders, five two-winding transformers, a total of 101 medium-voltage buses, and six 110 kV main grid buses. The controllable switching devices are restricted to 10 kV sectionalizing and tie switches, with a total number of 350. Correspondingly, the system state vector $\mathbf{x}_t$ has a dimension of 607, capturing all electrical and topological variables described in (8), while the action dimension $N_s$ equals 350.
All experiments were conducted under the following operational constraints: the bus voltage magnitude was limited within the range $[V_i^{\min}, V_i^{\max}]$ p.u., and the line loading ratio was constrained by $\lambda_{i,t} \le \lambda_i^{\max}$, as defined in (5) and (6). The reward function incorporated multiple penalty terms to ensure both operational feasibility and learning stability. The detailed parameters are listed in Table 2. Other training parameters, such as network structure, learning rate, batch size, etc., are provided in Appendix A for completeness.
The simulation environment utilizes a commercial solver deployed by Jiangsu Electric Power Company Research Institute, Nanjing, China for real-time security assessment to solve the power flow and validate topological feasibility. All implementations were carried out in Python v3.10, where the PPO and maskable PPO algorithms were implemented based on the Stable-Baselines3 library. To accelerate training, a distributed implementation was deployed on a Linux-based computation server equipped with four NVIDIA RTX 3090 GPUs and an AMD EPYC processor with 384 threads.
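For reference, a hedged sketch of how a maskable agent can be assembled with the sb3-contrib companion package of Stable-Baselines3 is shown below; `ReconfigEnv` is a hypothetical Gymnasium environment wrapping the power flow solver, and for MultiDiscrete spaces sb3-contrib expects the per-slot masks flattened into a single boolean vector.

```python
# A hedged sketch of wiring a maskable agent with sb3-contrib's MaskablePPO
# (the paper builds on the Stable-Baselines3 ecosystem). ReconfigEnv is a
# hypothetical Gymnasium environment exposing the mask matrix M_t.
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

def mask_fn(env):
    # flatten the (N_a, N_s + 1) mask matrix M_t into shape (N_a * (N_s + 1),)
    return env.current_mask_matrix().reshape(-1).astype(bool)

env = ActionMasker(ReconfigEnv(), mask_fn)      # ReconfigEnv: hypothetical
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)
```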
4.3. Experimental Results and Analysis
4.3.1. Comparison Between the Proposed Method and Baselines
To evaluate the effectiveness of the proposed method, comparative experiments were conducted among three reinforcement learning algorithms: A2C, PPO, and the proposed Maskable PPO. The training performance is illustrated in Figure 4, which presents the evolution of both the training and evaluation rewards. As shown in Figure 4a, Maskable PPO demonstrates the fastest convergence rate and the most stable learning process, followed by PPO and A2C. The incorporation of action masking effectively reduces infeasible exploration by dynamically excluding invalid switching operations, thereby accelerating policy learning and improving sample efficiency. Figure 4b further shows that Maskable PPO achieves the highest evaluation reward throughout the training process, indicating its better generalization performance on unseen states. Although the reward difference between PPO and Maskable PPO is moderate, the latter consistently maintains higher performance and stability across the entire training horizon.
The performance of all trained agents was evaluated on a test set of 500 unseen operating scenarios. The evaluation metrics correspond directly to the reward components defined in (12): the reverse power flow reduction reward ($r_{\mathrm{RPF}}$), the bus voltage improvement reward ($r_{\mathrm{V}}$), the line overflow penalty ($r_{\mathrm{line}}$), and the load disconnection penalty ($r_{\mathrm{out}}$). Additional operational indicators, including the absolute reverse power reduction, the number of overvoltage buses, and the number of switching actions executed, were also analyzed. The results are summarized in Table 3.
Several key observations can be drawn from Table 3. Maskable PPO achieves the highest reverse power flow reduction reward $r_{\mathrm{RPF}}$, corresponding to an average decrease of 6.54 MW, which represents an improvement of 28% and 72% compared to PPO and A2C, respectively. This indicates the algorithm’s enhanced capability to identify effective switching actions that reduce reverse power flow while maintaining network feasibility. The line overflow and cycle violation counts remain zero for all methods, confirming that topological and operational constraints were consistently respected during policy execution.

Regarding voltage performance, Maskable PPO also attains the largest improvement in bus voltage compliance, reflected by the highest $r_{\mathrm{V}}$ value (0.067) and the greatest reduction in the number of overvoltage buses (2.64). The slightly higher number of overvoltage cases (162) compared with PPO (149) is attributed to the more aggressive switching strategies adopted to maximize reverse power flow mitigation. Critically, unlike A2C, which frequently caused load disconnections, both PPO and Maskable PPO maintain full power supply across all test scenarios.

The operational efficiency is further reflected in the average number of switching actions per case. Maskable PPO executes approximately seven breaker operations on average, which is practical for real-world network reconfiguration, considering the substantial improvements in both reverse power flow mitigation and voltage regulation. Collectively, these results demonstrate that Maskable PPO provides effective, stable, and feasible topology control for PV-rich distribution networks, balancing reward maximization with practical operational constraints.
4.3.2. Experiment on Different Numbers of Maximum Controlled Switches
To investigate the impact of action space size on learning efficiency and control effectiveness, the multi-discrete Maskable PPO algorithm was evaluated under four different maximum numbers of controllable breakers, $N_a \in \{5, 10, 15, 20\}$. The corresponding training and evaluation progress is illustrated in Figure 5. Limiting the number of controllable switches significantly affects both convergence behavior and policy performance. Specifically, smaller $N_a$ values yield faster and smoother convergence, with $N_a = 5$ exhibiting the most stable training trajectory, followed by $N_a = 10$, 15, and 20. However, policies with moderate control flexibility ($N_a = 10$ and 15) achieve better evaluation performance, indicating an optimal balance between action efficiency and system optimization capability.
The quantitative test results on 500 unseen operating scenarios are summarized in Table 4. When $N_a = 5$, the agent is highly constrained, resulting in limited improvements in reverse power flow reduction and voltage regulation. Although convergence is stable, the small number of switching actions (an average of 4 breakers per case) restricts the ability to relieve system stress, leaving a large number of buses experiencing overvoltage (194 buses). Increasing $N_a$ to 10 or 15 enlarges the feasible action space, enabling more effective exploration and improved control performance. In particular, the best of these settings achieves the highest reverse power flow reduction ratio and the largest average decrease in reverse active power (6.887 MW), while maintaining full compliance with network constraints, including radiality, connectivity, and thermal limits. The corresponding voltage violation penalty is also minimized, and the overvoltage issue is completely solved in about 65% of cases, although 173 cases still contain overvoltage buses. When $N_a$ is further increased to 20, a slight performance degradation is observed. This is likely due to the enlarged action space introducing redundant switching options, increasing exploration difficulty, and slightly hindering convergence, despite the reduction in overvoltage cases.

These results indicate that appropriately limiting the maximum number of controllable switches is crucial for effective and stable learning. In this study, $N_a = 10$ and 15 achieve the best trade-off between convergence speed, training stability, and control performance, allowing the agent to effectively balance reverse power flow mitigation and voltage regulation without compromising system reliability.
4.4. Applying the Trained Agent to the Full Dataset
To further validate the capability and practical effectiveness of the proposed control strategy, the best-trained agent, i.e., the multi-discrete Maskable PPO model with the best-performing maximum number of controllable breakers, was applied to the complete one-year dataset consisting of 3016 operational cases exhibiting reverse power flow (RPF) and overvoltage conditions. The objective of this experiment is to assess whether the trained policy can maintain stable and effective performance under diverse operating conditions that were not directly observed during training, while satisfying all physical and topological constraints.
Figure 6 compares the reverse power flow conditions before and after applying the learned control strategy, depicting the one-year time-series variation of the total reverse power flow aggregated across the main transformers. It confirms the robustness of the proposed method throughout the entire year: the total system RPF after control (orange line) is consistently lower than the initial value (blue line), achieving an average reduction of approximately 25%. This performance demonstrates that the learned control policy generalizes effectively across different seasonal and diurnal variations of load and PV output, enabling the mitigation of RPF in a stable and continuous manner.
The improvement in voltage profiles is further analyzed in Figure 7, which presents the comparison of overvoltage occurrences before and after control. As shown in Figure 7a, in the initial states most operational scenarios suffer from severe voltage violations, with 6 or 12 buses exhibiting voltages beyond the upper limit of 1.1 p.u., and extreme cases reaching up to 21 affected buses. After applying the trained agent, Figure 7b reveals a substantial reduction in the number of overvoltage buses. Approximately 60% of the post-control cases exhibit completely normal voltage profiles with no violations, while the remaining cases show only minor residual overvoltages limited to a few buses, typically 3 or 6. The maximum number of overvoltage buses also decreases from 21 to 12, indicating that the proposed control approach effectively reconfigures the network to alleviate local voltage rise problems caused by high PV generation and reverse flows.
Overall, these results demonstrate that the proposed multi-discrete MPPO method can generalize effectively from training to unseen operational scenarios, maintaining constraint satisfaction and stable performance across an entire year of historical data. The control strategy not only mitigates reverse power flow by around 25% on average but also restores voltage profiles to acceptable levels in the majority of cases, confirming its capability for large-scale, effective reconfiguration of active distribution systems with high renewable penetration.
5. Conclusions
This paper has introduced a reinforcement learning-based method for mitigating RPF and voltage violations in photovoltaic-rich distribution networks. A novel multi-discrete Maskable PPO algorithm was developed to address the combinatorial complexity and strict operational constraints inherent in network reconfiguration. By integrating topology-aware action masking with a multi-discrete action structure, the proposed approach enables the agent to efficiently explore feasible switching operations while ensuring radiality, load connectivity, and adherence to operational limits.
Extensive case studies based on historical operational data from a real distribution system demonstrated the effectiveness and practicality of the proposed method. Comparative evaluations showed that the proposed multi-discrete Maskable PPO outperformed standard PPO and A2C in terms of learning stability, convergence speed, and operational performance. On unseen test scenarios, the trained agent achieved an average RPF reduction of approximately 24.3% and significantly improved voltage profiles, with about 65% of cases restored to normal voltage ranges. Analyses on varying the maximum number of controllable switches highlighted that moderate action limits ($N_a = 10$ and 15) offer an optimal trade-off between convergence efficiency and control effectiveness, confirming the importance of structured action constraints in high-dimensional reconfiguration problems.
Application of the best-trained agent to the full dataset of 3016 operational cases further demonstrated robust generalization across diverse seasonal and diurnal conditions, consistently maintaining constraint-compliant performance. These results establish the proposed multi-discrete Maskable PPO method as a scalable and efficient tool for network reconfiguration, capable of autonomously mitigating RPF and voltage rise phenomena in PV-rich distribution systems.
Future work will focus on two directions. First, topology reconfiguration alone has limited capability to fully mitigate reverse power flow, particularly under scenarios of heavy RPF. To address this limitation, integration with additional control measures, such as large-scale battery energy storage systems or other flexible resources, will be investigated to provide coordinated mitigation strategies. Second, the framework will be extended to real-time control by incorporating forecasted PV generation and load data, enabling proactive adjustment of network configurations under anticipated operating conditions. In addition, more practical constraints will be investigated and integrated. Overall, these directions aim to further enhance the effectiveness, robustness, and practical applicability of the proposed Maskable PPO-based control framework in PV-rich distribution networks.