1. Introduction
The growing global commitment to carbon neutrality and the rapid deployment of distributed renewable energy resources are driving a paradigm shift in the operation of power systems. Among these resources, photovoltaic (PV) generation has experienced the most rapid proliferation due to declining installation costs, policy incentives, and advancements in inverter and control technologies [1,2]. Consequently, modern distribution networks, historically designed for unidirectional power delivery from substations to end users, are now accommodating large-scale, intermittent, and spatially dispersed photovoltaic installations [3,4]. The stochastic and location-dependent nature of PV generation introduces substantial variability and uncertainty in local power flows, leading to frequent bidirectional power transfers within distribution feeders [5,6]. This operational shift has revealed new technical challenges that were not anticipated in the design of conventional distribution systems. Among them, one of the most prominent is reverse power flow (RPF), which arises when local PV output exceeds the downstream power demand and forces surplus power to propagate upstream towards higher voltage levels [7,8]. The occurrence of RPF is particularly pronounced in residential feeders with dense PV deployment, correlated solar irradiance patterns, and limited demand diversity [1,9]. Persistent RPF not only modifies the intended power flow patterns but also undermines operational stability and introduces significant economic implications for distribution network operators.
RPF imposes multifaceted technical and economic impacts on distribution systems. Excessive upstream power injection can lead to voltage rise beyond limits, adversely affecting power quality and potentially damaging customer equipment [4,10]. Transformers and substations exposed to reverse loading are subjected to thermal stress, accelerated aging, and possible overloading [5,11]. Moreover, conventional protection schemes designed under the assumption of unidirectional fault currents may misoperate under reversed flow conditions, posing risks to system reliability and safety [12]. In addition, persistent RPF contributes to increased line losses, complicates voltage regulation, and limits the renewable hosting capacity of distribution feeders, ultimately constraining the broader integration of distributed energy resources [13,14]. These issues motivate the need for advanced RPF mitigation strategies that can ensure safe, reliable, and economically efficient operation of active distribution networks under high PV penetration.
Various mitigation strategies have been proposed in the literature, including PV curtailment, reactive power support from smart inverters, coordinated control of on-load tap changers (OLTCs), and deployment of energy storage systems [8,15,16,17]. While effective in specific cases, these strategies have inherent limitations. Curtailment reduces renewable utilization and economic benefits, reactive power control may elevate network losses and accelerate inverter degradation, and hardware-based solutions such as capacitor banks or STATCOMs demand substantial capital investment [18,19]. An increasingly promising alternative is distribution network reconfiguration (DNR), which dynamically alters the status of sectionalizing and tie switches to reshape the network topology. DNR can redistribute power flows, mitigate congestion, and alleviate RPF without extensive infrastructure upgrades [20]. However, operationalizing DNR in PV-rich networks requires selecting from a large set of feasible switch actions while PV and loads vary rapidly, motivating adaptive, topology-constrained decision-making [21].
In this context, recent advances in artificial intelligence, particularly deep reinforcement learning (DRL), offer a promising and adaptive approach to the complex challenges of DNR. DRL is well suited to high-dimensional state spaces, nonlinear power system dynamics, and sequential decision-making under uncertainty, enabling autonomous learning of control policies without requiring explicit system models [22]. In power systems, DRL has been successfully applied to voltage regulation, optimal power flow, demand response, and DNR [23,24,25,26,27,28,29,30,31]. Recent studies increasingly explore multi-agent [15,32,33], hierarchical [28,34], and safe DRL [31,35] to improve scalability, coordinate distributed decision-making, decompose tasks across temporal and spatial scales, and explicitly account for operational risks and safety constraints. In parallel, graph neural networks (GNNs) are integrated as topology-aware feature extractors, enabling DRL policies and value functions to operate on graph-structured grid representations and generalize more effectively across varying network topologies and operating conditions [21,36].
Despite these advances, relatively few studies have explicitly focused on RPF mitigation. In [37], a double deep Q-network approach is employed to mitigate RPF and reduce transformer load rates by optimizing the operation of battery energy storage systems and electric vehicle charging stations. Ref. [34] introduces a hierarchical DRL framework that enhances local PV absorption while explicitly constraining PV injection to prevent RPF, achieved through dynamic topology reconfiguration. Furthermore, ref. [38] presents a DRL-based method for analyzing the distributed photovoltaic carrying capacity of distribution grids, incorporating transformer RPF as a constraint to ensure system stability. Despite these contributions, the practical deployment of DRL in distribution networks presents several critical challenges. The combinatorial explosion of action spaces resulting from multiple switch operations complicates the learning process and often leads to inefficient exploration. Moreover, maintaining radiality and ensuring load connectivity during reconfiguration is nontrivial, as unrestrained exploration can result in infeasible network topologies. Addressing these challenges requires specialized DRL algorithms that can effectively integrate operational constraints, handle large combinatorial action spaces, and prioritize RPF mitigation as a central control objective.
Motivated by these requirements, this paper proposes an intelligent topology control method based on a multi-discrete Maskable PPO (MPPO) algorithm. The main contributions of this work are summarized as follows:
A reinforcement learning-based topology control method is developed to mitigate RPF and voltage violations in distribution networks with high PV penetration. The method explicitly incorporates practical operational constraints, including network radiality, load connectivity, and line loading limits.
A multi-discrete MPPO algorithm is introduced to handle the combinatorial nature of switch operations in large distribution networks. The multi-discrete action space allows simultaneous consideration of multiple switches and restricts the maximum number of controlled switches, improving exploration efficiency and practical feasibility, while the mask mechanism dynamically adapts to the valid actions available in each scenario, accounting for variations in controllable breakers across different operating conditions.
Extensive validation using one-year historical data from a practical distribution system demonstrates that the proposed method achieves fast convergence, stable training, and effective mitigation of RPF and voltage violations, highlighting its potential for scalable application in PV-rich distribution networks.
2. Proposed Method
2.1. System Overview
The transition towards PV-rich distribution networks requires a shift from static, manually operated systems to dynamic frameworks capable of intelligent reconfiguration. This study considers a representative 10 kV distribution network, illustrated in Figure 1, which captures the operational challenges associated with high PV penetration. The network integrates a large-scale deployment of distributed PV units, in addition to two centralized PV plants, PV1 and PV2. RPF arises from the combined effects of PV1 and PV2, and contributions from the distributed PV units within the 10 kV grid. The resulting upstream power is injected into the two main transformers, T1 (63 MVA) and T2 (31.5 MVA), subjecting them to atypical thermal stress, accelerating insulation aging, and potentially exceeding their rated capacity, which can compromise transformer lifespan and induce voltage rise in the upstream network.
The system architecture is hierarchically organized into distinct operational zones to delineate power flow paths and control responsibilities. The 110 kV main grid forms the high-voltage backbone, while the downstream 10 kV network bifurcates into a PV generation zone and a load zone. The load zone is critically equipped with a reconfigurable mesh of remotely controlled switches, comprising normally open tie switches and normally closed head-end or sectionalizing switches. The dynamic reconfiguration of this network, achieved by altering the open/closed status of these switches, serves as the primary mechanism to redirect power flows, alleviate RPF, and mitigate associated voltage violations and equipment overloads. However, this process is highly constrained: any closure of a tie switch to reroute power must be accompanied by the simultaneous opening of another switch to preserve radiality and avoid unintended loops. Under fluctuating generation and load profiles, the resulting combinatorial decision problem can become challenging and time-sensitive for operators. These practical difficulties motivate the exploration of automated, learning-based approaches that can assist in systematically evaluating feasible topological actions and supporting informed decision-making in distribution network operation.
2.2. Problem Description and Mathematical Modeling
The operational challenge of mitigating RPF in PV-rich distribution networks is formalized as a dynamic topology optimization problem. The primary objective is to minimize the total RPF of primary transformers through adjustment of switch configurations while maintaining secure operation under stochastic generation and load conditions. The optimization objective is mathematically defined as:
$$\min_{\mathbf{s}_t} \sum_{i \in \mathcal{T}} P^{\mathrm{rev}}_{i,t}, \qquad P^{\mathrm{rev}}_{i,t} = \max\left(-P_{i,t},\, 0\right) \tag{1}$$

where $P^{\mathrm{rev}}_{i,t}$ is the RPF of transformer $i$ at time $t$, i.e., the magnitude of active power exported from transformer $i$ to the higher-voltage level, and $P_{i,t}$ denotes the signed active power of transformer $i$ at time $t$, where $P_{i,t} > 0$ indicates power import and $P_{i,t} < 0$ indicates power export. $\mathcal{T}$ is the set of 110 kV transformers. Control is executed through manipulation of the binary switch state vector:

$$\mathbf{s}_t = \left[ s_{1,t}, s_{2,t}, \ldots, s_{N_s,t} \right], \qquad s_{i,t} \in \{0, 1\} \tag{2}$$

where $\mathbf{s}_t$ denotes the switch state vector at time $t$, $s_{i,t}$ represents the status of individual switches (0 for open, 1 for closed), and $N_s$ is the total number of controllable switches.
The optimization is constrained by several critical operational requirements. The radiality constraint ensures that the network maintains a tree structure and that no cycles are allowed:

$$\sum_{l \in \mathcal{E}} \alpha_l = N_b - N_{\mathrm{sub}} \tag{3}$$

together with the requirement that every bus remains connected to a substation, where $\mathcal{G} = (\mathcal{B}, \mathcal{E})$ represents the undirected distribution system topology graph, $\mathcal{B}$ is the bus set and $\mathcal{E}$ is the line set; $\alpha_l \in \{0, 1\}$ denotes the connectivity of line $l \in \mathcal{E}$, where 0 is disconnected and 1 is connected; $N_b$ is the total number of nodes and $N_{\mathrm{sub}}$ represents the number of substations.
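To make the radiality requirement concrete, the following minimal Python sketch checks condition (3) on a candidate topology using the networkx library; the function name and data layout are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the radiality check in (3), assuming a graph built
# from the closed lines only. Names are illustrative, not from the paper.
import networkx as nx

def is_radial(buses, closed_lines, substations):
    """Check tree structure: |closed lines| = N_b - N_sub, no cycles,
    and every bus reachable from some substation."""
    g = nx.Graph()
    g.add_nodes_from(buses)
    g.add_edges_from(closed_lines)          # only lines with alpha_l = 1
    if g.number_of_edges() != len(buses) - len(substations):
        return False                        # edge-count condition of (3)
    if len(nx.cycle_basis(g)) > 0:
        return False                        # no cycles allowed
    # every bus must stay connected to at least one substation
    reachable = set()
    for s in substations:
        reachable |= nx.node_connected_component(g, s)
    return reachable == set(buses)
```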
The power flow constraints govern the physical network behavior:

$$P_i - P^{D}_i = V_i \sum_{j \in \mathcal{B}} V_j \left( G_{ij} \cos\theta_{ij} + B_{ij} \sin\theta_{ij} \right), \qquad Q_i - Q^{D}_i = V_i \sum_{j \in \mathcal{B}} V_j \left( G_{ij} \sin\theta_{ij} - B_{ij} \cos\theta_{ij} \right) \tag{4}$$

where $P_i$ and $Q_i$ represent active and reactive power injections to node $i$, $P^{D}_i$ and $Q^{D}_i$ denote power demands, $V_i$ and $\theta_i$ are bus voltage magnitudes and angles (with $\theta_{ij} = \theta_i - \theta_j$), and $G_{ij}$ and $B_{ij}$ are the conductance and susceptance of line $(i, j)$.
Security constraints include bus voltage limits:

$$V_i^{\min} \le V_{i,t} \le V_i^{\max} \tag{5}$$

where $V_{i,t}$ is the voltage per-unit (p.u.) value of bus $i$ at time $t$; $V_i^{\min}$ and $V_i^{\max}$ denote the required minimum and maximum voltage p.u. values of bus $i$.
Line load rate:

$$\lambda_{i,t} = \frac{\left| P^{L}_{i,t} \right|}{P^{L,\max}_{i}} \le \lambda_i^{\max} \tag{6}$$

where $\lambda_{i,t}$ represents the load rate of line $i$ at time $t$, $P^{L}_{i,t}$ is the active power of line $i$, $P^{L,\max}_{i}$ is the thermal limit of line $i$, and $\lambda_i^{\max}$ is the maximum allowed load rate of line $i$.
Power outage:

$$\sum_{i \in \Omega_{\mathrm{out}}} P^{D}_{i,t} = 0 \tag{7}$$

where $P^{D}_{i,t}$ is the active power of load $i$ at time $t$, and $\Omega_{\mathrm{out}}$ is the set of loads that are out of supply; the constraint requires that no load is disconnected.
The formulation involves discrete switching decisions coupled with nonlinear AC power-flow physics and multiple operational constraints, including radiality, voltage limits, thermal ratings, and load connectivity. Complexity arises from the combinatorial explosion of feasible switch configurations, the non-convexity of network equations, and the stochastic, time-varying behavior of PV generation and loads. These characteristics call for feasibility-aware, time-efficient decision making tailored to large action spaces and uncertainty. Motivated by this, we develop a distributed deep reinforcement learning framework that learns scalable policies for topology control, incorporates constraint handling to maintain operational feasibility, and accounts for variability while targeting RPF mitigation and voltage compliance.
2.3. Problem Formulation as Multi-Armed Bandit
The topology reconfiguration task can be formulated as a contextual decision problem and described concretely by the tuple $(\mathcal{X}, \mathcal{A}, r)$, comprising the state space, the action space, and the reward function. At each decision timestep $t$, the agent observes a state $\mathbf{x}_t \in \mathcal{X}$ from the environment, selects an action $\mathbf{a}_t \in \mathcal{A}$ that induces a new switch configuration $\mathbf{s}_{t+1}$, and receives a scalar reward $r_t$ that quantifies the degree to which the action mitigates reverse power flow while preserving operational security. The following paragraphs provide an explicit, mathematically precise specification of the state space, the action space, and the reward design used in this work.
The state vector $\mathbf{x}_t$ is constructed to give the agent a complete and compact description of the electrical and topological conditions of the distribution network at time $t$, and is defined as follows:

$$\mathbf{x}_t = \left[ \mathbf{s}_t, \mathbf{V}_t, \mathbf{P}^{\mathrm{PV}}_t, \mathbf{Q}^{\mathrm{PV}}_t, \mathbf{P}^{\mathrm{D}}_t, \mathbf{Q}^{\mathrm{D}}_t, \mathbf{P}^{\mathrm{L}}_t, \mathbf{Q}^{\mathrm{L}}_t, \mathbf{P}^{\mathrm{T}}_t, \mathbf{Q}^{\mathrm{T}}_t \right] \tag{8}$$

where $\mathbf{s}_t$ is the binary switch status vector with $s_{i,t} \in \{0, 1\}$, $\mathbf{V}_t$ are bus voltage magnitudes (p.u.), $\mathbf{P}^{\mathrm{PV}}_t$ and $\mathbf{Q}^{\mathrm{PV}}_t$ collect active and reactive power of PV stations, $\mathbf{P}^{\mathrm{D}}_t$ and $\mathbf{Q}^{\mathrm{D}}_t$ are active and reactive load demands, $\mathbf{P}^{\mathrm{L}}_t$ and $\mathbf{Q}^{\mathrm{L}}_t$ are per-line active and reactive flows, and $\mathbf{P}^{\mathrm{T}}_t$ and $\mathbf{Q}^{\mathrm{T}}_t$ denote transformer active and reactive power exchanges at the substations. Min-max normalization is applied to each continuous component prior to use by the learning algorithms to improve numerical conditioning; the binary switch vector is preserved to retain explicit topological information. By design, $\mathbf{x}_t$ embeds both electrical magnitudes and topology; therefore, contextual evaluation of switching actions accounts for their network-wide electrical consequences under stochastic PV and load realizations.
The action space $\mathcal{A}$ is intuitively defined as a multi-binary space:

$$\mathcal{A} = \{0, 1\}^{N_s}, \qquad \mathbf{a}_t = \left[ a_{1,t}, \ldots, a_{N_s,t} \right] \tag{9}$$

where $a_{i,t}$ represents the action of whether to toggle switch $i$; the value 0 means do nothing, and 1 stands for toggling the switch. $\Omega_s$ is the complete set of controllable switches in the system, with $|\Omega_s| = N_s$. The switch state transfer equation is given by:

$$\mathbf{s}_{t+1} = \mathbf{s}_t \oplus \mathbf{a}_t \tag{10}$$

where $\oplus$ denotes bitwise XOR. Alternatively, $a_{i,t}$ can be interpreted as the state of switch configuration $i$, where the value 0 means the switch is set open and 1 represents that the switch is set closed. In this case, $\mathbf{s}_{t+1} = \mathbf{a}_t$ is the transfer equation of the switching state.
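As a small illustration of the two transfer conventions, the snippet below (hypothetical arrays, NumPy semantics) contrasts the XOR toggle update in (10) with the direct set-state alternative.

```python
# A tiny illustration of the two transfer conventions in (10): toggle
# semantics (XOR) versus direct set-state semantics. Purely illustrative.
import numpy as np

s_t = np.array([1, 0, 1, 1], dtype=np.int8)   # current switch states
a_t = np.array([0, 1, 1, 0], dtype=np.int8)   # toggle action

s_next_toggle = s_t ^ a_t                     # XOR: flips switches 2 and 3
s_next_set    = a_t                           # alternative: a_t is the new state
print(s_next_toggle)                          # [1 1 0 1]
```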
The reward function is designed to align directly with the operational objective of mitigating RPF, while simultaneously ensuring voltage and thermal security, maintaining topology radiality, and discouraging load shedding. The instantaneous reward is given by:

$$r_t = \begin{cases} r_{\mathrm{nc}}, & \text{if the power flow does not converge} \\ r_{\mathrm{cyc}}, & \text{if the action creates cycles in the topology} \\ r_{\mathrm{sol}}, & \text{if the RPF is completely solved and all constraints are met} \\ r_{\mathrm{other}}, & \text{otherwise} \end{cases} \tag{11}$$

where $r_{\mathrm{nc}}$ is a constant penalty for unconvergent power flow, $r_{\mathrm{cyc}}$ is a constant penalty for creating cycles in the topology, $r_{\mathrm{sol}}$ is a constant reward for the RPF being completely solved with all constraints met, and $r_{\mathrm{other}}$ is the reward function for other cases, which is defined as:

$$r_{\mathrm{other}} = \omega_1 r_{\mathrm{RPF}} + \omega_2 r_{\mathrm{V}} + \omega_3 r_{\mathrm{line}} + r_{\mathrm{out}} \tag{12}$$

where $r_{\mathrm{RPF}}$ is the reward for the decrease of the RPF ratio, $r_{\mathrm{V}}$ is the reward for the decrease of overvoltage and undervoltage of buses, $r_{\mathrm{line}}$ is the penalty for line overflow, and $r_{\mathrm{out}}$ is the penalty for power outages; reward shaping via the coefficients $\omega_1$, $\omega_2$, and $\omega_3$ is used to align the relative influence of each component with operational priorities and to place heterogeneous magnitudes on comparable scales during training. These weights serve precisely to balance the dimensionless indices against the power outage penalty so that no single component dominates the learning signal. This reward structure ensures that actions that materially reduce transformer reverse flow without inducing voltage or thermal violations yield the highest returns, while actions that eliminate RPF at the expense of violating hard limits or shedding load are disfavored. By embedding these physical penalties into the reward $r_t$, the DRL agent is guided toward policies that trade off RPF mitigation, security constraints, and operational pragmatism in a manner consistent with real-world operational priorities.
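The piecewise structure of (11) and (12) can be sketched as follows; the constants and weights shown are illustrative placeholders (the actual values are listed in the experiment settings), and the solver-result fields are assumed names rather than the paper's interface.

```python
# A minimal sketch of the piecewise reward in (11)-(12), assuming the solver
# exposes convergence, cycle, and violation flags. All names and numbers
# (R_NC, R_CYC, R_SOL, w1..w3) are illustrative placeholders.
R_NC, R_CYC, R_SOL = -10.0, -5.0, 10.0   # assumed constant penalties/reward
w1, w2, w3 = 1.0, 1.0, 1.0               # assumed shaping coefficients

def reward(result):
    if not result.converged:
        return R_NC                       # unconvergent power flow
    if result.has_cycles:
        return R_CYC                      # radiality violated
    if result.rpf_total == 0.0 and result.all_constraints_met:
        return R_SOL                      # RPF fully solved, constraints met
    return (w1 * result.r_rpf             # decrease of the RPF ratio
            + w2 * result.r_voltage       # decrease of over/undervoltage
            + w3 * result.r_line          # line overflow penalty (<= 0)
            + result.r_outage)            # load outage penalty (<= 0)
```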
3. Proposed Method Incorporating Multi-Discrete Maskable PPO
To effectively address the distribution network reconfiguration problem under uncertain renewable generation and demand fluctuations, this study develops a decision-making framework based on the proximal policy optimization (PPO) algorithm enhanced with two key extensions: multi-discrete action representation and maskable action constraints. These modifications are specifically designed to align reinforcement learning with the operational requirements of distribution grids, thereby improving both the efficiency of training and the feasibility of the derived control policies.
3.1. Multi-Discrete Action Space Formulation
In distribution system reconfiguration, the action space exhibits a combinatorial nature because switching operations must be selected from a set of candidate lines under radiality and operational constraints. A naive single-discrete representation, in which every possible switching configuration is enumerated as a distinct action, is computationally prohibitive since the cardinality of the space grows exponentially with the number of controllable switches. The multi-binary formulation in (9) shrinks the action space; however, exploration can be extremely slow when the number of controllable switches is large, and the maximum number of toggled switches is hard to control in this formulation.

To overcome these limitations, this work adopts a multi-discrete action representation tailored to distribution network reconfiguration. The key idea is that the agent is allowed to control at most $N_a$ switches during a single decision epoch, and each control decision is encoded as a discrete variable. Formally, the action is represented as a vector:

$$\mathbf{a}_t = \left[ a_{1,t}, a_{2,t}, \ldots, a_{N_a,t} \right], \qquad a_{k,t} \in \{1, 2, \ldots, N_s + 1\} \tag{13}$$

where each element $a_{k,t}$ takes a value in $\{1, \ldots, N_s + 1\}$; a value $a_{k,t} \le N_s$ denotes the index of the selected switch, while $a_{k,t} = N_s + 1$ corresponds to a “no-operation” action. This formulation yields a multi-discrete space of dimension $N_a$, where the effective size of the space is $(N_s + 1)^{N_a}$.
Compared with alternative formulations, the proposed multi-discrete design offers several advantages. Relative to a single-discrete encoding, it avoids exponential blow-up, since the agent is no longer required to consider the entire set of feasible topologies as atomic actions. In contrast to multi-binary encoding, which treats all switches simultaneously, the multi-discrete representation provides explicit control over the maximum number of switches that can be operated, thereby reducing the exploration burden and aligning more closely with realistic operational practices where only a limited number of switching actions are permissible in a given time frame. Moreover, the inclusion of a “do-nothing” action at each sub-decision grants the policy the flexibility to adaptively determine how many switches to operate, ranging from zero up to $N_a$.
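For concreteness, such a space can be declared directly with Gymnasium's MultiDiscrete type, as in the sketch below; the dimensions shown ($N_s = 350$, $N_a = 10$) are taken from the case study, and the zero-based index convention is an implementation detail rather than the paper's notation.

```python
# A sketch of the multi-discrete action space in (13) using Gymnasium,
# assuming N_s = 350 controllable switches and N_a = 10 action slots.
from gymnasium import spaces

N_S, N_A = 350, 10
# each of the N_a slots selects one of N_s switches or the "no-op" index
action_space = spaces.MultiDiscrete([N_S + 1] * N_A)

a = action_space.sample()   # e.g., an array of 10 indices in [0, 350]
# by the zero-based convention here, index N_S denotes "no operation"
```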
3.2. Maskable Proximal Policy Optimization
Conventional PPO is a policy-gradient algorithm that optimizes a clipped surrogate objective, providing stable policy updates while balancing exploration and exploitation. At decision step $t$, the policy $\pi_\theta$ parameterized by $\theta$ is updated using the surrogate objective introduced in [39]. However, standard PPO assumes a fixed and static action space. In DNR, the feasible action set varies dynamically with the system topology, switch availability, and operational status. For instance, switches that lack remote actuation, are locked for maintenance, are constrained by interlocking rules, or are already isolated should not be treated as controllable actions. If such context-dependent infeasibilities are not explicitly handled, the agent may waste exploration effort on actions that cannot be executed in practice, slowing convergence and reducing policy interpretability.
To overcome these limitations, the MPPO integrates an adaptive masking mechanism into the policy [40]. In this work, the valid action space is determined by the currently controllable switches extracted from the case file. To incorporate the multi-discrete formulation in (13), the mask matrix $M_t$ is constructed as:

$$M_t = \left[ m_{k,j,t} \right] \in \{0, 1\}^{N_a \times (N_s + 1)} \tag{14}$$

where the entry $m_{k,j,t}$ indicates the validity of selecting action $j$ in the $k$-th action dimension. For $j \le N_s$, a value of $m_{k,j,t} = 1$ denotes that switch $j$ is currently controllable, whereas $m_{k,j,t} = 0$ denotes that it is unavailable due to network or operational constraints. All rows of $M_t$ are identical, so that each action dimension shares the same set of admissible switches. The last column $j = N_s + 1$ corresponds to the “no-operation” action and is fixed to 1 for all $k$, since this action is always permissible. This masking strategy dynamically adapts to changes in the set of controllable switches, ensuring that the policy explores only feasible actions without violating the current topological configuration. Consequently, the agent’s exploration is focused on valid operational regions, which enhances convergence stability and training efficiency in the multi-discrete decision space.
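A minimal sketch of this construction is given below, assuming the set of controllable switch indices has been extracted from the case file; all names are illustrative.

```python
# A minimal sketch of the mask matrix M_t in (14): identical rows over the
# N_a action slots, with the last column (the "no-op" index) always valid.
import numpy as np

def build_mask(controllable_switch_ids, n_switches, n_slots):
    row = np.zeros(n_switches + 1, dtype=bool)
    row[list(controllable_switch_ids)] = True   # valid switch indices
    row[-1] = True                              # "no-operation" always allowed
    return np.tile(row, (n_slots, 1))           # shape: (N_a, N_s + 1)

M_t = build_mask({3, 17, 42}, n_switches=350, n_slots=10)
```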
In the MPPO scheme, operational constraints on switch controllability are encoded through the binary action–switch mask $M_t$. At each time step $t$, the policy network outputs, for each action slot $k \in \{1, \ldots, N_a\}$, an unmasked logit vector $\mathbf{z}_{k,t} \in \mathbb{R}^{N_s + 1}$. The mask is then applied at the logit level to obtain the masked logits [40]:

$$\tilde{z}_{k,j,t} = \begin{cases} z_{k,j,t}, & m_{k,j,t} = 1 \\ C, & m_{k,j,t} = 0 \end{cases} \tag{15}$$

where $C$ is a large negative constant; in this work, $|C|$ is chosen large enough that masked actions receive effectively zero probability after the softmax. The categorical policy over the $k$-th slot is defined as:

$$\pi_\theta(a_{k,t} = j \mid \mathbf{x}_t) = \frac{\exp\left( \tilde{z}_{k,j,t} \right)}{\sum_{j'=1}^{N_s + 1} \exp\left( \tilde{z}_{k,j',t} \right)} \tag{16}$$

Therefore, invalid actions receive identically zero probability mass and are never sampled during either data collection or policy evaluation. Policy entropy and log-probabilities are computed with respect to this masked distribution, i.e., over the set of indices $j$ satisfying $m_{k,j,t} = 1$. The overall multi-discrete policy factorizes across slots as:

$$\pi_\theta(\mathbf{a}_t \mid \mathbf{x}_t) = \prod_{k=1}^{N_a} \pi_\theta(a_{k,t} \mid \mathbf{x}_t) \tag{17}$$

while the value function $V_\phi(\mathbf{x}_t)$ is trained on the resulting trajectories and is not directly masked. In practice, $M_t$ can be viewed as a time-varying constraint signal that conditions the policy on the currently admissible actions without altering the critic architecture.
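The masking and factorized sampling in (15)–(17) can be sketched with PyTorch categorical distributions as follows; this is an illustrative implementation under assumed tensor shapes, not the paper's exact code.

```python
# A sketch of the masked per-slot categorical policy in (15)-(17): invalid
# logits are pushed to a large negative constant before the softmax, so
# masked actions get (numerically) zero probability.
import torch
from torch.distributions import Categorical

C = -1e8  # a large negative constant (illustrative value)

def masked_slot_dists(logits, mask):
    """logits, mask: tensors of shape (N_a, N_s + 1); mask is boolean."""
    masked_logits = torch.where(mask, logits, torch.full_like(logits, C))
    return [Categorical(logits=masked_logits[k]) for k in range(logits.shape[0])]

def sample_action(logits, mask):
    dists = masked_slot_dists(logits, mask)
    a = torch.stack([d.sample() for d in dists])               # one index per slot
    logp = torch.stack([d.log_prob(a[k])                       # (17): factorized
                        for k, d in enumerate(dists)]).sum()
    entropy = torch.stack([d.entropy() for d in dists]).sum()  # masked entropy
    return a, logp, entropy
```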
Training follows the standard PPO procedure with clipping, applied to the masked policy. During the policy update, we recompute the masked logits $\tilde{z}_{k,j,t}$, evaluate the new log-probabilities $\log \pi_\theta(\mathbf{a}_t \mid \mathbf{x}_t)$, and form the importance sampling ratio:

$$\rho_t(\theta) = \frac{\pi_\theta(\mathbf{a}_t \mid \mathbf{x}_t)}{\pi_{\theta_{\mathrm{old}}}(\mathbf{a}_t \mid \mathbf{x}_t)} \tag{18}$$

The actor is then updated by maximizing the clipped surrogate objective with entropy regularization:

$$L^{\mathrm{actor}}(\theta) = \mathbb{E}_t \left[ \min\left( \rho_t(\theta) \hat{A}_t,\; \mathrm{clip}\left( \rho_t(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) + \beta\, \mathcal{H}\left( \pi_\theta(\cdot \mid \mathbf{x}_t) \right) \right] \tag{19}$$

where $\epsilon$ is the clipping hyperparameter, $\beta$ is the entropy regularization coefficient, and $\mathcal{H}(\pi_\theta(\cdot \mid \mathbf{x}_t))$ denotes the entropy of the masked policy, which can be implemented as the sum of the entropies of the $N_a$ per-slot categorical distributions over indices $j$ with $m_{k,j,t} = 1$. The critic parameters $\phi$ are updated by minimizing a squared-error loss between $V_\phi(\mathbf{x}_t)$ and the empirical returns. Since invalid actions have zero probability by construction, they never appear in the collected trajectories, and the gradients are concentrated on valid action choices only. This integration of the mask into the PPO framework ensures that the policy optimization respects topology-dependent action constraints while retaining the stability and robustness properties of standard PPO. The detailed pseudo-code of the multi-discrete MPPO is given in Algorithm 1.
| Algorithm 1: Multi-Discrete Maskable PPO (MPPO) |
| | Input: Historical scenarios $\mathcal{D}_{\mathrm{hist}}$; environment with AC solver; policy network $\pi_\theta$; value network $V_\phi$; horizon $T$; mini-batch size $B$; PPO clip $\epsilon$; entropy coefficient $\beta$; GAE parameter $\lambda$; discount factor $\gamma$; PPO epochs $K$; training episodes $N_{\mathrm{ep}}$; other parameters in Table A1. |
| | Output: Trained parameters $\theta$, $\phi$. |
| 1 | Initialize $\theta$, $\phi$; buffer $\mathcal{D} \leftarrow \varnothing$. |
| 2 | for each training iteration do |
| 3 | | | Sample a scenario and reset with its snapshot. |
| 4 | | | for $t = 0$ to $T - 1$ do |
| 5 | | | | | Observe state $\mathbf{x}_t$. |
| 6 | | | | | Construct the binary mask matrix $M_t$. |
| 7 | | | | | for $k = 1$ to $N_a$ do |
| 8 | | | | | | | Compute masked logits $\tilde{\mathbf{z}}_{k,t}$ using (15). |
| 9 | | | | | | | Sample $a_{k,t}$ from the masked categorical distribution (16). |
| 10 | | | | | Assemble $\mathbf{a}_t = [a_{1,t}, \ldots, a_{N_a,t}]$. |
| 11 | | | | | Apply $\mathbf{a}_t$ to $\mathbf{s}_t$; run solver; obtain next state $\mathbf{x}_{t+1}$, convergence flag, done flag $d_t$, and other system information, including transformer powers and bus voltages; compute reward $r_t$ using (11). |
| 12 | | | | | Compute $V_\phi(\mathbf{x}_t)$; compute $\log \pi_\theta(\mathbf{a}_t \mid \mathbf{x}_t)$ using (17). |
| 13 | | | | | Store $(\mathbf{x}_t, \mathbf{a}_t, r_t, \log \pi_\theta(\mathbf{a}_t \mid \mathbf{x}_t), V_\phi(\mathbf{x}_t), M_t, d_t)$ into $\mathcal{D}$. |
| 14 | | | | | if $\mathbf{x}_{t+1}$ is terminal then |
| 15 | | | | | | | break |
| 16 | | | Compute returns $\hat{R}_t$ and advantages $\hat{A}_t$ for all transitions in $\mathcal{D}$ using GAE($\gamma$, $\lambda$). |
| 17 | | | for $e = 1$ to $K$ do |
| 18 | | | | | Sample a mini-batch of size $B$ from $\mathcal{D}$. |
| 19 | | | | | Recompute masked logits and log-probabilities for the mini-batch. |
| 20 | | | | | Compute importance ratios $\rho_t(\theta)$ using (18). |
| 21 | | | | | Compute the masked policy entropy $\mathcal{H}(\pi_\theta(\cdot \mid \mathbf{x}_t))$. |
| 22 | | | | | Form the actor loss $L^{\mathrm{actor}}(\theta)$ as in (19) and the critic loss $L^{\mathrm{critic}}(\phi) = \frac{1}{B} \sum_t \left( V_\phi(\mathbf{x}_t) - \hat{R}_t \right)^2$. |
| 23 | | | | | Update $\theta$ by taking a gradient step on $-L^{\mathrm{actor}}(\theta)$; update $\phi$ by taking a gradient step on $L^{\mathrm{critic}}(\phi)$. |
| 24 | | | Clear buffer $\mathcal{D}$. |
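Lines 17–23 of Algorithm 1 correspond to the standard clipped PPO update; a compact PyTorch sketch of the losses in (18) and (19) is shown below, with tensor names assumed for illustration.

```python
# A compact sketch of the clipped surrogate update in (18)-(19). Tensors
# logp_new/logp_old/adv/values/returns are assumed to come from the buffer.
import torch

def ppo_losses(logp_new, logp_old, adv, values, returns,
               clip_eps=0.2, ent_coef=0.01, entropy=None):
    ratio = torch.exp(logp_new - logp_old)                   # (18)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    actor_loss = -torch.min(surr1, surr2).mean()             # maximize (19)
    if entropy is not None:
        actor_loss = actor_loss - ent_coef * entropy.mean()  # entropy bonus
    critic_loss = torch.nn.functional.mse_loss(values, returns)
    return actor_loss, critic_loss
```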
3.3. Overall Framework
The overall architecture of the proposed DRL-based topology control framework is illustrated in Figure 2. The framework consists of two main stages: (i) offline data analysis and case selection based on historical Supervisory Control and Data Acquisition (SCADA) measurements, and (ii) training of MPPO agents using a commercial AC power flow solver and parallel environments. In the first stage, historical SCADA data are processed by a data analysis and operational risk identification module, which scans one year of recorded operation to identify system snapshots with RPF, overvoltage, or other undesirable operating conditions while ensuring that the underlying network topology is valid. From these filtered snapshots, representative cases with RPF, overvoltage, and valid topology are selected as training and evaluation scenarios. In parallel, a complete list of candidate breakers is constructed, including sectionalizing and tie switches that can participate in reconfiguration while respecting the existing protection and interlocking design. This pre-processing step provides a realistic ensemble of stressed operating points and a well-defined action space for the subsequent learning phase.
During the training stage, a set of parallel environments is instantiated, each coupling the AC power flow solver with security checks for topology radiality, cycle detection, voltage and line loading limits, and load supply. For each historical snapshot, the environment is first initialized and the solver computes the base-case AC power flow, producing the system state and associated diagnostic information under the current topology. Based on this initial solution and switch metadata, a binary mask is constructed to indicate which switches are admissible for reconfiguration in that snapshot, and this mask is provided to the MPPO policy so that infeasible actions receive zero probability. The policy then samples a multi-discrete switching action, which is applied once to the simulator; the solver resolves the AC power flow for the post-switch configuration, checks all security constraints, and the resulting operating condition is used to compute a scalar reward. Each episode thus consists of a single decision step, and independent episodes are generated in parallel across environments using different historical snapshots. Policy networks are shared across all environments to collect these single-step transitions efficiently, while a centralized buffer aggregates samples from all parallel simulations. The value network is updated using the aggregated data, and the policy parameters are optimized using the MPPO objective with entropy regularization. The agent is periodically evaluated on a separate validation set, and the best-performing model is retained. Overall, this design enables a reinforcement learning–based topology control scheme that mitigates RPF and voltage violations under realistic operational constraints, while the multi-discrete MPPO with scenario-dependent action masks handles the combinatorial switch space efficiently and can be trained at scale using parallel simulations.
4. Case Studies
4.1. Data Pre-Processing and Analysis
One-year historical operational data of the distribution system shown in Figure 1 were utilized, covering the period from 1 January 2023 to 31 December 2023. The dataset contains hourly measurements, yielding a total of 8760 samples. Each data sample includes both the electrical and topological states required for the state variable definition in (8). Before model training, the raw data were carefully pre-processed to ensure physical consistency and to extract representative operating conditions. First, all samples exhibiting non-radial topologies were removed to guarantee that the analyzed configurations comply with the radial operation of distribution networks. Second, samples without RPF at the primary transformers were excluded, since this study specifically focuses on the mitigation of RPF of primary transformers. After these filtering steps, 3016 cases remained, each containing one or more overvoltage buses.
Figure 3 provides a statistical overview of the processed dataset. Figure 3a shows the histograms of active power of the two 110 kV main transformers. Transformer 1 operates under reverse power conditions in nearly all cases, with a maximum export of approximately 35 MW. Transformer 2 exhibits reverse power flow in about 70% of the cases, with a maximum magnitude of 38 MW. These results indicate that reverse power flow has become a persistent and significant phenomenon in the studied system. Figure 3b depicts the distribution of overvoltage occurrences among all buses. A total of 36 buses experienced overvoltage events under the initial configurations. Among them, buses 8, 9, 10, 11, 20, and 27 consistently exhibited overvoltage across all samples, as they are directly connected to large-scale PV plant feeders. Since these buses are inherently associated with local generation surplus and are hardly influenced by switching control from the current action space, they were excluded from the optimization objective set. Figure 3c presents a one-month time series of the total reverse power at the substation. Only two days in January 2023 were free from reverse power flow, and the maximum reverse power during the month reached approximately 34 MW. This further confirms the high penetration level of distributed photovoltaic generation in the studied network.
To enhance the training stability of the DRL agent, all continuous variables within the state vector were normalized prior to input to the neural network. Specifically, active and reactive power values of buses, branches, and transformers were scaled using min-max normalization based on their historical extrema, ensuring they lie within the range $[0, 1]$. Bus voltage magnitudes were normalized with respect to their nominal value. Topological indicators, such as switch statuses and connectivity features, were encoded as binary or categorical variables according to their operational meaning. This normalization not only improves numerical conditioning and convergence of gradient-based optimization but also facilitates generalization across different operating conditions and network configurations.
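A minimal sketch of this normalization step is given below, assuming the continuous blocks and their historical extrema are available as NumPy arrays; the function signature is illustrative.

```python
# A minimal sketch of the state normalization described above: min-max
# scaling of continuous blocks from historical extrema, binary switch
# states left untouched. Array names are illustrative.
import numpy as np

def normalize_state(cont, cont_min, cont_max, switch_states, v_pu):
    # min-max scale continuous P/Q quantities into [0, 1]
    scaled = (cont - cont_min) / np.maximum(cont_max - cont_min, 1e-9)
    # voltages are already expressed relative to nominal (p.u.)
    return np.concatenate([switch_states.astype(np.float32),
                           v_pu.astype(np.float32),
                           scaled.astype(np.float32)])
```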
For the reinforcement learning process, the dataset was partitioned into training and testing subsets. A total of 500 randomly selected system states were assigned to the testing set, while the remaining samples were used for training. During both training and evaluation, the agent’s objective was to relieve transformer reverse power flow and bus overvoltage simultaneously, while maintaining power flow solvability, enforcing network radiality, preventing any load disconnection, and ensuring that line loading limits were not violated.
4.2. Experiment Settings
The experiments were conducted on the distribution network illustrated in Figure 1, whose parameters are summarized in Table 1. The system is supplied by two 110 kV/35 kV/10 kV three-winding transformers, rated at 63 MVA and 31.5 MVA, respectively. Two PV plants, rated at 15 MW and 9 MW, are connected to the 35 kV network. The selected main system components used to construct the state space include 21 main feeders, five two-winding transformers, a total of 101 medium-voltage buses, and six 110 kV main grid buses. The controllable switching devices are restricted to 10 kV sectionalizing and tie switches, with a total number of 350. Correspondingly, the system state vector $\mathbf{x}_t$ has a dimension of 607, capturing all electrical and topological variables described in (8), while the action dimension $N_s$ equals 350.
All experiments were conducted under the following operational constraints: the bus voltage magnitude was limited within the range $[V_i^{\min}, V_i^{\max}]$ p.u., and the line loading ratio was constrained by $\lambda_{i,t} \le \lambda_i^{\max}$, as defined in (5) and (6). The reward function incorporated multiple penalty terms to ensure both operational feasibility and learning stability. The detailed parameters are listed in Table 2. Other training parameters, such as network structure, learning rate, batch size, etc., are provided in Appendix A for completeness.
The simulation environment utilizes a commercial solver deployed by Jiangsu Electric Power Company Research Institute, Nanjing, China for real-time security assessment to solve the power flow and validate topological feasibility. All implementations were carried out in Python v3.10, where the PPO and maskable PPO algorithms were implemented based on the Stable-Baselines3 library. To accelerate training, a distributed implementation was deployed on a Linux-based computation server equipped with four NVIDIA RTX 3090 GPUs and an AMD EPYC processor with 384 threads.
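For reference, a hedged sketch of how a maskable agent can be assembled with the sb3-contrib companion package of Stable-Baselines3 is shown below; `ReconfigEnv` is a hypothetical Gymnasium environment wrapping the power flow solver, and for MultiDiscrete spaces sb3-contrib expects the per-slot masks flattened into a single boolean vector.

```python
# A hedged sketch of wiring a maskable agent with sb3-contrib's MaskablePPO
# (the paper builds on the Stable-Baselines3 ecosystem). ReconfigEnv is a
# hypothetical Gymnasium environment exposing the mask matrix M_t.
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

def mask_fn(env):
    # flatten the (N_a, N_s + 1) mask matrix M_t into shape (N_a * (N_s + 1),)
    return env.current_mask_matrix().reshape(-1).astype(bool)

env = ActionMasker(ReconfigEnv(), mask_fn)      # ReconfigEnv: hypothetical
model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)
```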
4.3. Experimental Results and Analysis
4.3.1. Comparison Between the Proposed Method and Baselines
To evaluate the effectiveness of the proposed method, comparative experiments were conducted among three reinforcement learning algorithms: A2C, PPO, and the proposed Maskable PPO. The training performance is illustrated in Figure 4, which presents the evolution of both the training and evaluation rewards. As shown in Figure 4a, Maskable PPO demonstrates the fastest convergence rate and the most stable learning process, followed by PPO and A2C. The incorporation of action masking effectively reduces infeasible exploration by dynamically excluding invalid switching operations, thereby accelerating policy learning and improving sample efficiency. Figure 4b further shows that Maskable PPO achieves the highest evaluation reward throughout the training process, indicating its better generalization performance on unseen states. Although the reward difference between PPO and Maskable PPO is moderate, the latter consistently maintains higher performance and stability across the entire training horizon.
The performance of all trained agents was evaluated on a test set of 500 unseen operating scenarios. The evaluation metrics correspond directly to the reward components defined in (12): the reverse power flow reduction reward ($r_{\mathrm{RPF}}$), the bus voltage improvement reward ($r_{\mathrm{V}}$), the line overflow penalty ($r_{\mathrm{line}}$), and the load disconnection penalty ($r_{\mathrm{out}}$). Additional operational indicators, including the absolute reverse power reduction, the number of overvoltage buses, and the number of switching actions executed, were also analyzed. The results are summarized in Table 3.
Several key observations can be drawn from Table 3. Maskable PPO achieves the highest reverse power flow reduction reward $r_{\mathrm{RPF}}$, corresponding to an average decrease of 6.54 MW, which represents an improvement of 28% and 72% compared to PPO and A2C, respectively. This indicates the algorithm’s enhanced capability to identify effective switching actions that reduce reverse power flow while maintaining network feasibility. The line overflow and cycle violation counts remain zero for all methods, confirming that topological and operational constraints were consistently respected during policy execution.

Regarding voltage performance, Maskable PPO also attains the largest improvement in bus voltage compliance, reflected by the highest $r_{\mathrm{V}}$ value (0.067) and the greatest reduction in the number of overvoltage buses (2.64). The slightly higher number of overvoltage cases (162) compared with PPO (149) is attributed to the more aggressive switching strategies adopted to maximize reverse power flow mitigation. Critically, unlike A2C, which frequently caused load disconnections, both PPO and Maskable PPO maintain full power supply across all test scenarios.

The operational efficiency is further reflected in the average number of switching actions per case. Maskable PPO executes approximately seven breaker operations on average, which is practical for real-world network reconfiguration, considering the substantial improvements in both reverse power flow mitigation and voltage regulation. Collectively, these results demonstrate that Maskable PPO provides effective, stable, and feasible topology control for PV-rich distribution networks, balancing reward maximization with practical operational constraints.
4.3.2. Experiment on Different Numbers of Maximum Controlled Switches
To investigate the impact of action space size on learning efficiency and control effectiveness, the multi-discrete Maskable PPO algorithm was evaluated under four different maximum numbers of controllable breakers, $N_a \in \{5, 10, 15, 20\}$. The corresponding training and evaluation progress is illustrated in Figure 5. Limiting the number of controllable switches significantly affects both convergence behavior and policy performance. Specifically, smaller $N_a$ values yield faster and smoother convergence, with $N_a = 5$ exhibiting the most stable training trajectory, followed by $N_a = 10$, 15, and 20. However, policies with moderate control flexibility ($N_a = 10$ and 15) achieve better evaluation performance, indicating an optimal balance between action efficiency and system optimization capability.
The quantitative test results on 500 unseen operating scenarios are summarized in Table 4. When $N_a = 5$, the agent is highly constrained, resulting in limited improvements in reverse power flow reduction and voltage regulation. Although convergence is stable, the small number of switching actions (an average of 4 breakers per case) restricts the ability to relieve system stress, leaving a large number of buses experiencing overvoltage (194 buses). Increasing $N_a$ to 10 or 15 enlarges the feasible action space, enabling more effective exploration and improved control performance. In particular, the best of these settings achieves the highest reverse power flow reduction ratio and the largest average decrease in reverse active power (6.887 MW), while maintaining full compliance with network constraints, including radiality, connectivity, and thermal limits. The corresponding voltage violation penalty is also minimized, and the overvoltage issue is completely solved in about 65% of cases, although 173 cases still contain overvoltage buses. When $N_a$ is further increased to 20, a slight performance degradation is observed. This is likely due to the enlarged action space introducing redundant switching options, increasing exploration difficulty, and slightly hindering convergence, despite the reduction in overvoltage cases.

These results indicate that appropriately limiting the maximum number of controllable switches is crucial for effective and stable learning. In this study, $N_a = 10$ and 15 achieve the best trade-off between convergence speed, training stability, and control performance, allowing the agent to effectively balance reverse power flow mitigation and voltage regulation without compromising system reliability.
4.4. Applying the Trained Agent to the Full Dataset
To further validate the capability and practical effectiveness of the proposed control strategy, the best-trained agent, i.e., the multi-discrete Maskable PPO model with the best-performing maximum number of controllable breakers, was applied to the complete one-year dataset consisting of 3016 operational cases exhibiting reverse power flow (RPF) and overvoltage conditions. The objective of this experiment is to assess whether the trained policy can maintain stable and effective performance under diverse operating conditions that were not directly observed during training, while satisfying all physical and topological constraints.
Figure 6 compares the reverse power flow conditions before and after applying the learned control strategy, depicting the one-year time-series variation of the total reverse power flow aggregated across the main transformers. It confirms the robustness of the proposed method throughout the entire year: the total system RPF after control (orange line) is consistently lower than the initial value (blue line), achieving an average reduction of approximately 25%. This performance demonstrates that the learned control policy generalizes effectively across different seasonal and diurnal variations of load and PV output, enabling the mitigation of RPF in a stable and continuous manner.
The improvement in voltage profiles is further analyzed in Figure 7, which presents the comparison of overvoltage occurrences before and after control. As shown in Figure 7a, in the initial states most operational scenarios suffer from severe voltage violations, with 6 or 12 buses exhibiting voltages beyond the upper limit of 1.1 p.u., and extreme cases reaching up to 21 affected buses. After applying the trained agent, Figure 7b reveals a substantial reduction in the number of overvoltage buses. Approximately 60% of the post-control cases exhibit completely normal voltage profiles with no violations, while the remaining cases show only minor residual overvoltages limited to a few buses, typically 3 or 6. The maximum number of overvoltage buses also decreases from 21 to 12, indicating that the proposed control approach effectively reconfigures the network to alleviate local voltage rise problems caused by high PV generation and reverse flows.
Overall, these results demonstrate that the proposed multi-discrete MPPO method can generalize effectively from training to unseen operational scenarios, maintaining constraint satisfaction and stable performance across an entire year of historical data. The control strategy not only mitigates reverse power flow by around 25% on average but also restores voltage profiles to acceptable levels in the majority of cases, confirming its capability for large-scale, effective reconfiguration of active distribution systems with high renewable penetration.
5. Conclusions
This paper has introduced a reinforcement learning-based method for mitigating RPF and voltage violations in photovoltaic-rich distribution networks. A novel multi-discrete Maskable PPO algorithm was developed to address the combinatorial complexity and strict operational constraints inherent in network reconfiguration. By integrating topology-aware action masking with a multi-discrete action structure, the proposed approach enables the agent to efficiently explore feasible switching operations while ensuring radiality, load connectivity, and adherence to operational limits.
Extensive case studies based on historical operational data from a real distribution system demonstrated the effectiveness and practicality of the proposed method. Comparative evaluations showed that the proposed multi-discrete Maskable PPO outperformed standard PPO and A2C in terms of learning stability, convergence speed, and operational performance. On unseen test scenarios, the trained agent achieved an average RPF reduction of approximately 24.3% and significantly improved voltage profiles, with about 65% of cases restored to normal voltage ranges. Analyses on varying the maximum number of controllable switches highlighted that moderate action limits ($N_a = 10$ and 15) offer an optimal trade-off between convergence efficiency and control effectiveness, confirming the importance of structured action constraints in high-dimensional reconfiguration problems.
Application of the best-trained agent to the full dataset of 3016 operational cases further demonstrated robust generalization across diverse seasonal and diurnal conditions, consistently maintaining constraint-compliant performance. These results establish the proposed multi-discrete Maskable PPO method as a scalable and efficient tool for network reconfiguration, capable of autonomously mitigating RPF and voltage rise phenomena in PV-rich distribution systems.
Future work will focus on two directions. First, topology reconfiguration alone has limited capability to fully mitigate reverse power flow, particularly under scenarios of heavy RPF. To address this limitation, integration with additional control measures, such as large-scale battery energy storage systems or other flexible resources, will be investigated to provide coordinated mitigation strategies. Second, the framework will be extended to real-time control by incorporating forecasted PV generation and load data, enabling proactive adjustment of network configurations under anticipated operating conditions. In addition, more practical constraints will be investigated and integrated. Overall, these directions aim to further enhance the effectiveness, robustness, and practical applicability of the proposed Maskable PPO-based control framework in PV-rich distribution networks.