Article

LLM-Based Dynamic Distribution Network Reconfiguration with Distributed Photovoltaics

School of Electrical Engineering, Chongqing University, Chongqing 400044, China
* Author to whom correspondence should be addressed.
Electronics 2026, 15(3), 566; https://doi.org/10.3390/electronics15030566
Submission received: 5 January 2026 / Revised: 24 January 2026 / Accepted: 26 January 2026 / Published: 28 January 2026

Abstract

To achieve carbon neutrality goals, large amounts of renewable energy sources (RESs) are being integrated into power systems. In particular, high penetration of distributed photovoltaic (PV) generation makes distribution networks highly stochastic, calling for dynamic distribution network reconfiguration (DNR). Existing DNR approaches can be broadly categorized into model-driven optimization-based methods, which typically formulate DNR as mixed-integer programs, and learning-based methods, with deep reinforcement learning (DRL) being a representative paradigm for fast online decision-making. However, existing methods commonly struggle to account for human factors, i.e., the time-varying preferences of distribution network operators, in the online decision loop. To this end, this paper proposes a natural language-driven, human-in-the-loop DNR framework, which combines a DRL base policy for hour-level dynamic reconfiguration with a large language model (LLM)-based instruction supervision layer. Based on this human-in-the-loop framework, operator commands issued in natural language are translated into online adjustments of safety-screened DRL switching actions. Therefore, the framework retains the fast, model-free decision capability of DRL while providing an explicit and interpretable interface for incorporating temporary and context-dependent operator requirements without retraining. Case studies on IEEE 16-bus and 33-bus distribution networks show that the proposed framework reduces network losses, improves voltage profiles, and limits switching operations. It also achieves markedly higher compliance with operator instructions than a conventional model-based method and a pure DRL baseline. These results highlight a viable path to embedding natural language guidance into the data-driven operation of active distribution networks.

1. Introduction

To achieve carbon neutrality goals, power systems are integrating large amounts of renewable energy sources (RESs), particularly distributed photovoltaic (PV) generation. As a result, distribution networks are becoming the primary interface for accommodating these resources. With the rapid proliferation of distributed RES such as rooftop PV and small-scale wind units, distribution networks are evolving from passive, radially operated systems to active, stochastic infrastructures. High RES penetration leads to pronounced variability of net loads, frequent power flow reversals in feeders, and tighter voltage and thermal margins [1]. Under such conditions, a single static network configuration is rarely adequate for an entire day or season, and there is a growing need to adapt the feeder topology in response to changing operating conditions. Distribution network reconfiguration (DNR) adjusts the open/closed status of sectionalizing and tie switches. It has long been recognized as an effective operational measure for active distribution networks. Specifically, DNR can reduce active power losses, improve voltage profiles, relieve line overloads, and enhance reliability and resilience indices [2,3,4]. Beyond traditional single-period planning, dynamic DNR extends the concept to multi-period horizons by explicitly considering time-varying loads, renewable generation, and switch operation costs. Under uncertainties in distributed generation (DG) and demand, it can further improve loss reduction and constraint management [5,6]. For emerging active distribution grids with high shares of PV and wind, there is a growing need for hour-level, or even finer-granularity, dynamic reconfiguration that can continuously adapt the feeder topology to changing operating conditions.
Most classical DNR formulations are model-based optimization problems. Typical approaches rely on mixed-integer linear or nonlinear programming (MILP/MINLP) models or multi-objective mathematical programming, in which binary variables represent switch statuses and continuous variables represent node voltages and branch currents [2,3,4]. Exact or relaxed formulations are often solved using commercial solvers. For nonconvex, large-scale instances that challenge off-the-shelf solvers, meta-heuristic algorithms (e.g., genetic algorithms, particle swarm optimization, and hybrid evolutionary schemes) are widely used. More recent dynamic DNR studies formulate multi-period models that incorporate stochastic variations of DG and loads [5], further enriching the portfolio of model-based approaches for active distribution networks. However, these methods rely heavily on accurate network models and must be re-solved whenever operating scenarios or objective weights change. As a result, computation times can become prohibitive for large networks or fine decision intervals. In addition, the objectives and constraints are hard-coded in the mathematical formulation, making it difficult to rapidly incorporate ad hoc operator preferences or temporary operating policies in day-to-day operation.
To address the scalability and adaptivity challenges of model-based methods, a growing body of work has explored data-driven approaches for DNR. For example, convolutional neural networks have been used to approximate the mapping from load patterns to near-optimal network topologies and thereby accelerate reconfiguration [6]. More recently, deep reinforcement learning (DRL) has been introduced as a flexible framework for DNR. In this setting, the distribution network and a power-flow solver are embedded in an environment. The system state encodes network measurements and operating conditions, while the action corresponds to a switch configuration or a sequence of switching operations [7,8]. Representative studies show that DRL can learn effective dynamic reconfiguration strategies from historical or simulated operational data. Once trained, the resulting policy supports near real-time decision-making and substantially reduces the computational burden in online deployment [7,8,9,10,11]. Compared with repeatedly solving mixed-integer programs, DRL-based DNR offers two main advantages. Once the policy is trained offline, online decision-making reduces to a simple forward pass through a neural network, which is fast enough to support frequent or near real-time reconfiguration in large-scale feeders. At the same time, DRL is inherently data-driven and model-free, so it can in principle accommodate model inaccuracies and unmodeled dynamics as long as the training data sufficiently covers the relevant operating conditions. These properties have motivated a variety of DRL frameworks and training schemes for DNR, but the common goal is to provide a scalable, real-time decision mechanism under uncertainty.
However, current DRL-based DNR approaches still face important limitations for practical operation. Because system objectives and operator preferences are hard-coded in the reward and environment design, adjusting the trade-off among loss reduction, voltage quality, and switching actions after convergence usually requires retraining or fine-tuning. Such changes are rarely achievable via simple online adjustment. Existing frameworks also lack a natural interface for operator instructions in plain language (e.g., temporarily avoiding a specific switch or prioritizing loss reduction in a given period), making it difficult to reflect temporary, context-dependent preferences. In addition, end-to-end DRL policies are often treated as black-box controllers. Although safe RL techniques for power systems have been actively studied [12,13,14], they generally require safety constraints to be specified a priori during training and do not readily accommodate new or urgent constraints at run time. As a result, existing DRL-based DNR solutions still struggle to reconcile real-time performance with the level of human-in-the-loop controllability expected in critical operational environments.
Recent advances in large language models (LLMs) have opened new possibilities for integrating natural language instructions into power system optimization and control workflows. Recent studies report that GPT and other LLM agents can supervise DRL-based real-time OPF under linguistic stipulations, and they can generate a reward-function code for integrated energy system scheduling [15,16]. This allows multi-objective behavior to be shaped directly from the textual descriptions of operator goals. Other works have proposed LLM-based agents for advanced power dispatch and secure operation. In addition, survey papers have outlined broader opportunities for LLMs in situational awareness, decision support, and human–AI collaboration in energy systems [17,18,19,20]. These efforts indicate that LLMs can interpret domain-specific natural language and influence optimization or control logic. However, most existing approaches incorporate language at the design or training stage. They do not provide an online supervision mechanism for DRL-based DNR policies in active distribution networks. This gap motivates the LLM-based, human-in-the-loop DNR framework proposed in this paper.
Motivated by the above observations, this paper adopts a human-in-the-loop DNR paradigm that couples a DRL base policy with an LLM-based instruction supervision layer. A DRL agent is first trained offline to learn an hour-level dynamic reconfiguration policy for active distribution networks with high RES penetration, providing fast, fully automated decisions under stochastic PV and load profiles when no explicit instructions are given. On top of this, we introduce an LLM as a lightweight supervisory layer. It interprets operator commands in natural language and then adjusts or selects the DRL-generated switching actions accordingly. In this way, real-time reconfiguration can respect operational constraints while aligning with temporary, context-dependent operator preferences.
Building on the above idea, this work makes the following contributions:
  • An LLM-based human-machine collaborative framework for dynamic distribution network reconfiguration is proposed. The framework integrates a DRL-based base policy with an LLM-based instruction-supervision layer. It allows the operator to inject diverse, temporary natural language instructions (e.g., limiting switching operations within specific time windows or avoiding particular feeders) into the DNR process without re-modeling or retraining the control policy.
  • We design a two-stage decision chain that first uses the DRL policy to generate candidate reconfiguration actions, and it then lets the LLM audit and, when necessary, modify the remaining safe actions in light of the operator’s instructions. In this framework, the LLM operates only on safety-filtered candidates, so all executed actions satisfy network constraints while being aligned with natural language instructions.
  • Hour-level dynamic DNR simulations are conducted on an IEEE 16-bus distribution network and an IEEE 33-bus distribution network with high PV penetration and stochastic uncertainties. The proposed DRL and LLM instruction-supervised framework is compared with a conventional model-based DNR method and a pure DRL-based DNR baseline. The results demonstrate that the proposed method simultaneously improves loss reduction and voltage quality, controls the number of switching operations, and significantly enhances instruction satisfaction, illustrating the practical value of LLM-based human–AI collaboration in active distribution network operation.

2. Framework Description

This section introduces the proposed framework for LLM-supervised dynamic DNR. The proposed method adopts a hierarchical human–AI collaborative framework for dynamic DNR, as illustrated in Figure 1. A DRL agent trained offline serves as the base decision-making module and continuously outputs candidate DNR actions according to the current environment state. An upper-layer LLM agent receives both the candidate actions and the operator’s natural language commands, evaluates their consistency with the current system condition, and revises the actions when necessary. The final revised actions are implemented in the distribution network, whose response is fed back to the DRL agent as the new environment state. This framework realizes a natural language-driven, instruction-aware DNR scheme while preserving the fast online decision capability of DRL.

2.1. DRL-Based Base Policy Layer

In the lower decision layer, as shown in the upper-left part of Figure 1, the agent interacts with the distribution network environment in a Markov decision process manner. At each decision interval, the environment summarizes the current operating condition of the network (including loading level, RES generation, switch statuses, and voltage information) into an environment state and provides it to the DRL agent. Based on this state, the DRL agent outputs a DNR action, which specifies a switching operation pattern to reconfigure the network topology.
Once training is completed offline, the DRL agent acts as a fast base policy in online operation. In the absence of explicit instructions from the operator, the output of the DRL agent is directly used as the reconfiguration action, and the whole framework degenerates to a fully autonomous DRL-based dynamic DNR scheme.

2.2. LLM-Based Instruction Supervision Layer

The upper decision layer introduces an LLM agent to supervise and adapt the DNR actions according to operator preferences expressed in natural language, as depicted on the right-hand side of Figure 1. The operator issues temporary operating requirements in plain language, such as “avoid switching S5 today,” “minimize losses in the afternoon peak,” or “keep feeder F2 energized for maintenance.” These natural language commands are processed by the LLM agent, which acts as an instruction interpreter and decision supervisor.
At each decision step, the LLM agent receives three types of information as follows:
  • The action proposed by the DRL agent.
  • System information summarizing the current operating condition of the distribution network.
  • The operator’s natural language instructions.
By jointly considering these inputs, the LLM evaluates whether the action is compatible with the current instructions and, if necessary, modifies the action to generate a revised action. In this way, high-level human preferences and context-dependent policies can be injected into the DNR process without re-designing the reward function or retraining the DRL agent.

2.3. Distribution Network Environment and Information Flow

The bottom block in Figure 1 represents the physical distribution network environment, which includes the radial feeder topology, sectionalizing and tie switches, distributed generation units, and time-varying load and RES profiles. The environment receives the revised action from the decision-making layer and updates the network topology accordingly. Power-flow calculations are then performed to obtain bus voltages, branch currents, losses, and other relevant quantities.
These quantities serve two purposes. First, they are fed back to the DRL agent in an aggregated form as the next environment state, thereby closing the DRL interaction loop. Second, they are organized into system information and provided to the LLM agent, enabling the LLM to interpret operator instructions in the context of the current operating condition and to assess the impact of alternative reconfiguration actions.

2.4. Operating Modes

The framework naturally supports the following two operating modes:
  • Autonomous mode: When no natural language instruction is issued, the DRL agent operates as a conventional dynamic DNR controller. Its output is applied to the distribution network (after basic feasibility checks embedded in the environment), and the LLM agent is effectively bypassed.
  • Instruction-driven mode: When the operator issues temporary natural language instructions, the LLM agent is activated. The LLM evaluates and revises the DRL-generated action based on both system information and the instructions, and the revised action is implemented. In this mode, human operators can flexibly impose short-term preferences or operational policies without changing the DRL model.
Through this layered design, the proposed framework combines the scalability and speed of DRL with the flexibility and interpretability of natural language interaction provided by LLMs. The detailed formulations of the DRL-based DNR model and the LLM-based instruction supervision mechanism are presented in Section 3 and Section 4, respectively.

3. Formulation and DRL Solution

In this section, the dynamic DNR problem under high PV penetration is formally formulated, and its solution through DRL is established.

3.1. Dynamic DNR Objective and Operational Constraints

We consider a radial medium-voltage distribution network operated over a finite horizon $t = 1, \ldots, T$ with time step $\Delta t$. At each time step, the operator determines the switch configuration $z_t$ and other controllable resources to balance multiple technical and economic objectives.

3.1.1. Dynamic Multi-Term Objective

J = \sum_{t=1}^{T} \left( C^{\mathrm{sub}} P_t^{\mathrm{sub}} \Delta t + w_{\mathrm{loss}} P_t^{\mathrm{loss}} + w_{\mathrm{volt}} \Phi^{\mathrm{volt}}(V_t) + w_{\mathrm{sw}} \Phi^{\mathrm{sw}}(z_t, z_{t-1}) + w_{\mathrm{topo}} \Phi^{\mathrm{topo}}(z_t, z^{\mathrm{ref}}) + w_{\mathrm{over}} \sum_{(i,j) \in \mathcal{L}} \xi_{ij,t} + w_{\mathrm{shed}} \sum_{i \in \mathcal{N}} P_{i,t}^{\mathrm{shed}} \right)  (1)
where $J$ denotes the total dynamic DNR objective over the horizon; $t = 1, \ldots, T$ is the time index with horizon length $T$; $\Delta t$ is the time step length; $C^{\mathrm{sub}}$ is the unit energy price for active power purchased from the upstream grid; $P_t^{\mathrm{sub}}$ is the active power purchased from the upstream grid at time $t$; $w_{\mathrm{loss}}$, $w_{\mathrm{volt}}$, $w_{\mathrm{sw}}$, $w_{\mathrm{topo}}$, $w_{\mathrm{over}}$, and $w_{\mathrm{shed}}$ are nonnegative weights for network loss, voltage deviation, switching cost, topology deviation, branch overload, and load shedding, respectively; $\Phi^{\mathrm{volt}}$, $\Phi^{\mathrm{sw}}$, and $\Phi^{\mathrm{topo}}$ denote the corresponding performance indices defined below; $\xi_{ij,t}$ is the overload-related slack of branch $(i,j)$ at time $t$, bounded in (15); and $P_{i,t}^{\mathrm{shed}}$ is the load shedding at bus $i$ and time $t$.
The performance indices are defined as follows.
\Phi^{\mathrm{volt}}(V_t) = \sum_{i \in \mathcal{N}} \left| V_{i,t} - V_i^{\mathrm{ref}} \right|  (2)
\Phi^{\mathrm{sw}}(z_t, z_{t-1}) = \sum_{(i,j) \in \mathcal{S}} c_{ij} \left| z_{ij,t} - z_{ij,t-1} \right|  (3)
\Phi^{\mathrm{topo}}(z_t, z^{\mathrm{ref}}) = \sum_{(i,j) \in \mathcal{L}} d_{ij} \left| z_{ij,t} - z_{ij}^{\mathrm{ref}} \right|  (4)
P_t^{\mathrm{loss}} = \sum_{(i,j) \in \mathcal{L}} r_{ij} I_{ij,t}^{2}  (5)
where $V_{i,t}$ is the voltage magnitude at bus $i$ and time $t$; $V_i^{\mathrm{ref}}$ is the reference voltage magnitude; $z_{ij,t}$ and $z_{ij,t-1}$ denote the energized status of branch $(i,j)$ at times $t$ and $t-1$, respectively; $c_{ij}$ is the switching cost coefficient for branch $(i,j)$; $z_{ij}^{\mathrm{ref}}$ represents the nominal topology; $d_{ij}$ is the topology deviation penalty coefficient; $r_{ij}$ is the resistance of branch $(i,j)$; and $I_{ij,t}$ denotes the current magnitude on branch $(i,j)$ at time $t$.
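As a concrete illustration, the per-step indices in (2)–(5) reduce to a few vectorized operations. The following Python sketch assumes NumPy arrays indexed consistently by bus and branch; the function and argument names are illustrative and are not taken from the reported implementation.

```python
import numpy as np

def step_indices(V, V_ref, z, z_prev, z_nom, c_sw, d_topo, r, I):
    """Evaluate the per-step performance indices in Eqs. (2)-(5).

    V, V_ref        : bus voltage magnitudes and references (p.u.)
    z, z_prev, z_nom: current, previous, and nominal branch statuses (0/1)
    c_sw, d_topo    : switching and topology-deviation cost coefficients
    r, I            : branch resistances and current magnitudes
    """
    phi_volt = np.abs(V - V_ref).sum()              # Eq. (2): voltage deviation
    phi_sw   = (c_sw * np.abs(z - z_prev)).sum()    # Eq. (3): switching cost
    phi_topo = (d_topo * np.abs(z - z_nom)).sum()   # Eq. (4): topology deviation
    p_loss   = (r * I**2).sum()                     # Eq. (5): ohmic network loss
    return phi_volt, phi_sw, phi_topo, p_loss
```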

3.1.2. Power Balance and DistFlow Model

P_{i,t}^{G} + P_{i,t}^{\mathrm{RES}} - \left( P_{i,t}^{L} - P_{i,t}^{\mathrm{shed}} \right) = \sum_{j:(i,j) \in \mathcal{L}} P_{ij,t}  (6)
Q_{i,t}^{G} + Q_{i,t}^{\mathrm{RES}} - Q_{i,t}^{L} = \sum_{j:(i,j) \in \mathcal{L}} Q_{ij,t}  (7)
where $P_{i,t}^{G}$ and $Q_{i,t}^{G}$ denote the active and reactive power outputs of controllable generation at bus $i$, $P_{i,t}^{\mathrm{RES}}$ and $Q_{i,t}^{\mathrm{RES}}$ are the active and reactive injections from RES, $P_{i,t}^{L}$ and $Q_{i,t}^{L}$ represent active and reactive load demands, $P_{i,t}^{\mathrm{shed}}$ is the involuntary load shedding, and $P_{ij,t}$ and $Q_{ij,t}$ denote the active and reactive power flows on branch $(i,j)$ at time $t$.
A radial DistFlow representation is adopted for the distribution network. For each oriented branch $(i,j)$,
P_{ij,t} = \sum_{k \in C(j)} P_{jk,t} - \left[ P_{j,t}^{G} + P_{j,t}^{\mathrm{RES}} - \left( P_{j,t}^{L} - P_{j,t}^{\mathrm{shed}} \right) \right] + r_{ij} I_{ij,t}^{2}  (8)
Q_{ij,t} = \sum_{k \in C(j)} Q_{jk,t} - \left( Q_{j,t}^{G} + Q_{j,t}^{\mathrm{RES}} - Q_{j,t}^{L} \right) + x_{ij} I_{ij,t}^{2}  (9)
V_{j,t}^{2} = V_{i,t}^{2} - 2 \left( r_{ij} P_{ij,t} + x_{ij} Q_{ij,t} \right) + \left( r_{ij}^{2} + x_{ij}^{2} \right) I_{ij,t}^{2}  (10)
I_{ij,t}^{2} = \frac{P_{ij,t}^{2} + Q_{ij,t}^{2}}{V_{i,t}^{2}}  (11)
where $P_{ij,t}$ and $Q_{ij,t}$ denote the active and reactive power flows on branch $(i,j)$; $P_{jk,t}$ and $Q_{jk,t}$ are the flows on downstream branches $(j,k)$, with $C(j)$ the set of child buses of bus $j$; $P_{j,t}^{G}$, $P_{j,t}^{\mathrm{RES}}$, $P_{j,t}^{L}$, and $P_{j,t}^{\mathrm{shed}}$ represent the active power of controllable generation, RES injection, load demand, and load shedding at bus $j$, respectively, with analogous meanings for the reactive power terms; $r_{ij}$ and $x_{ij}$ denote the resistance and reactance of branch $(i,j)$; $I_{ij,t}$ is the branch current magnitude; and $V_{i,t}$ and $V_{j,t}$ are the sending-end and receiving-end voltage magnitudes.
Equations (8)–(11) compactly describe the nonlinear relationships between nodal injections, branch flows, voltages, and currents in the radial network.
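To make the recursion explicit, the following minimal sketch applies (10) and (11) to a single oriented branch, assuming the sending-end voltage and the branch flows measured at the sending end are known (as in one step of a backward/forward sweep); the variable names are illustrative.

```python
import math

def branch_forward_update(V_i, P_ij, Q_ij, r_ij, x_ij):
    """Propagate voltage along one branch using Eqs. (10)-(11).

    Given the sending-end voltage magnitude V_i (p.u.) and the branch flows
    P_ij, Q_ij at the sending end, return the squared branch current and the
    receiving-end voltage magnitude.
    """
    I_sq = (P_ij**2 + Q_ij**2) / V_i**2                       # Eq. (11)
    V_j_sq = (V_i**2
              - 2.0 * (r_ij * P_ij + x_ij * Q_ij)
              + (r_ij**2 + x_ij**2) * I_sq)                   # Eq. (10)
    return I_sq, math.sqrt(V_j_sq)
```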

3.1.3. Operational Limits

Voltage magnitudes, branch loadings, and flexibility resources are constrained as
V_i^{\min} \le V_{i,t} \le V_i^{\max}, \quad \forall i \in \mathcal{N}, \ \forall t  (12)
P_{ij,t}^{2} + Q_{ij,t}^{2} \le \left( S_{ij}^{\max} \right)^{2}, \quad \forall (i,j) \in \mathcal{L}, \ \forall t  (13)
0 \le P_{i,t}^{\mathrm{shed}} \le P_{i,t}^{L}, \quad \forall i \in \mathcal{N}, \ \forall t  (14)
0 \le \xi_{ij,t} \le \xi_{ij}^{\max}, \quad \forall (i,j) \in \mathcal{L}, \ \forall t  (15)
P_{i,t}^{G,\min} \le P_{i,t}^{G} \le P_{i,t}^{G,\max}, \quad Q_{i,t}^{G,\min} \le Q_{i,t}^{G} \le Q_{i,t}^{G,\max}, \quad \forall i \in \mathcal{N}, \ \forall t  (16)
To avoid overly frequent topology changes, both per-step and cumulative switching limits are imposed as follows:
\sum_{(i,j) \in \mathcal{S}} \left| z_{ij,t} - z_{ij,t-1} \right| \le N_{\mathrm{step}}^{\max}, \quad \forall t  (17)
\sum_{t=1}^{T} \left| z_{ij,t} - z_{ij,t-1} \right| \le N_{ij}^{\max}, \quad \forall (i,j) \in \mathcal{S}  (18)
where $z_{ij,t}$ and $z_{ij,t-1}$ denote the energized status of branch $(i,j)$ at times $t$ and $t-1$, $N_{\mathrm{step}}^{\max}$ is the maximum allowable per-step switching count, and $N_{ij}^{\max}$ is the maximum cumulative number of switching operations permitted for branch $(i,j)$ over the entire horizon.

3.1.4. Radial Topology Constraints

The network is required to operate radially at each time step, i.e., the energized branches $\{ (i,j) \in \mathcal{L} : z_{ij,t} = 1 \}$ form a spanning tree rooted at the substation. A single-commodity flow formulation is adopted to enforce radiality. For each time $t$,
\sum_{j:(i,j) \in \mathcal{L}} F_{ij,t} + D_i = \sum_{k:(k,i) \in \mathcal{L}} F_{ki,t}, \quad \forall i \in \mathcal{N} \setminus \{0\}, \ \forall t  (19)
\left| F_{ij,t} \right| \le M z_{ij,t}, \quad \forall (i,j) \in \mathcal{L}, \ \forall t  (20)
\sum_{(i,j) \in \mathcal{L}} z_{ij,t} = |\mathcal{N}| - 1, \quad \forall t  (21)
where $F_{ij,t}$ denotes the fictitious flow on branch $(i,j)$ used to enforce radiality, $D_i$ is the fictitious nodal demand associated with bus $i$, $M$ is a sufficiently large constant, $z_{ij,t}$ is the energized status of branch $(i,j)$, and the condition $|\mathcal{N}| - 1$ in (21) ensures that the energized branches form a spanning tree connecting all buses to the substation.
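For reference, the spanning-tree condition expressed by (19)–(21) can be verified directly on the energized subgraph: a connected graph over all buses with exactly $|\mathcal{N}| - 1$ energized branches is a tree. The sketch below assumes the networkx package and a dictionary of branch statuses; it is an illustrative check, not the paper's implementation.

```python
import networkx as nx

def is_radial(buses, branches, z):
    """Check the radiality condition enforced by Eqs. (19)-(21).

    buses    : iterable of bus indices (including the substation)
    branches : list of (i, j) tuples for all branches
    z        : dict mapping (i, j) -> 0/1 energized status
    """
    g = nx.Graph()
    g.add_nodes_from(buses)
    g.add_edges_from(b for b in branches if z[b] == 1)
    n_closed = sum(z[b] for b in branches)
    # Connected with |N| - 1 energized branches <=> spanning tree
    return (n_closed == len(g.nodes) - 1) and nx.is_connected(g)
```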
Collecting (1)–(21), the dynamic DNR problem is formulated as a time-coupled mixed-integer nonlinear program with detailed operational and radiality constraints. In the next subsection, this problem is reformulated as an MDP suitable for a DRL-based solution.

3.2. MDP Formulation for DRL-Based Dynamic DNR

All symbols used in this subsection are summarized in the Nomenclature. To enable a data-driven solution that can adapt to stochastic operating conditions and changing operator preferences, the dynamic DNR problem defined by (1)–(21) is cast as a Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{P}, r, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, $\mathcal{P}$ is the (unknown) state transition kernel, $r$ is the reward function, and $\gamma \in (0,1)$ is the discount factor.

3.2.1. State Space

The state $s_t \in \mathcal{S}$ aggregates the information required for making reconfiguration decisions at time $t$ as follows:
s_t = \left[ p_t^{L}, q_t^{L}, p_t^{\mathrm{RES}}, q_t^{\mathrm{RES}}, V_t, I_t, z_t, t \right]  (22)
where $p_t^{L}$ and $q_t^{L}$ denote the vectors of active and reactive load demands at time $t$, $p_t^{\mathrm{RES}}$ and $q_t^{\mathrm{RES}}$ are the vectors of RES injections, $V_t$ and $I_t$ represent the voltage magnitudes and selected branch currents, $z_t$ is the vector of switch statuses, and $t$ encodes the time index within the operating horizon.
This design ensures that $s_t$ contains both exogenous information (loads and RES profiles) and endogenous network conditions (voltages, currents, and topology).

3.2.2. Action Space and Topology Projection

The agent operates in a multi-binary action space that directly represents the on/off states of controllable switches. At each time step, the action is a binary vector shown in (23).
a_t = \left[ a_{ij,t} \right]_{(i,j) \in \mathcal{S}} \in \{0,1\}^{|\mathcal{S}|}  (23)
where $a_{ij,t} = 1$ and $a_{ij,t} = 0$ correspond to closing and opening the switch on branch $(i,j)$, respectively. The switch status vector is obtained directly from the action,
z_t = a_t  (24)
while the per-step and cumulative switching constraints in (17)–(18) are checked using $z_t$ and $z_{t-1}$. If $a_t$ violates the operability constraints (17)–(18) or the radiality constraints (19)–(21), the environment performs a topology repair by setting $z_t \leftarrow z_{t-1}$ to keep the simulation well-posed. This repair is not a cost-free transition. It is treated as an invalid action and triggers the violation penalty $\Psi^{\mathrm{viol}}(s_t, z_t)$ in (26), which provides strong negative feedback to suppress such actions during learning.
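A minimal sketch of this projection step is given below. It assumes the action and the previous topology are represented as dictionaries keyed by branch; the helper `is_radial_fn` and the other identifiers are hypothetical and only illustrate the repair logic described above.

```python
def apply_action(a_t, z_prev, max_step_switch, is_radial_fn):
    """Map a sampled multi-binary action to an executable topology (Sec. 3.2.2).

    If the proposed statuses violate the per-step switching limit (17) or the
    radiality conditions (19)-(21), the previous topology is kept and the
    transition is flagged so the violation penalty in Eq. (26) applies.
    """
    z_t = dict(a_t)                                    # Eq. (24): z_t = a_t
    n_switched = sum(abs(z_t[b] - z_prev[b]) for b in z_t)
    feasible = (n_switched <= max_step_switch) and is_radial_fn(z_t)
    if not feasible:
        z_t = dict(z_prev)                             # topology repair: revert
    return z_t, (not feasible)                         # flag triggers the penalty
```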

3.2.3. State Transition

Given the current state $s_t$ and the sampled action $a_t$, the environment performs the following:
  • Determine $z_t$ from $a_t$ according to (24), and, if necessary, project infeasible choices back to $z_{t-1}$ to respect (17)–(21).
  • Solve the distribution power-flow model under $z_t$ and the realized load and RES profiles at time $t$, enforcing (6)–(16) and (19)–(21), and obtain $P_t^{\mathrm{loss}}$, $P_t^{\mathrm{sub}}$, $V_t$, and $I_t$, as well as curtailment and shedding levels.
  • Sample the next-step exogenous profiles $p_{t+1}^{L}$, $q_{t+1}^{L}$, $p_{t+1}^{\mathrm{RES}}$, $q_{t+1}^{\mathrm{RES}}$ from the scenario model.
  • Construct the next state as (25).
s_{t+1} = \left[ p_{t+1}^{L}, q_{t+1}^{L}, p_{t+1}^{\mathrm{RES}}, q_{t+1}^{\mathrm{RES}}, V_{t+1}, I_{t+1}, z_{t+1}, \tau_{t+1} \right]  (25)
with $z_{t+1} = z_t$ and $\tau_{t+1}$ updated according to the decision interval. The transition kernel $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is thus accessed only through simulation and is not required in closed form.

3.2.4. Reward Design

The immediate reward is defined as the negative of the instantaneous DNR cost in (1), augmented by a penalty for infeasible operation as follows:
r_t = \begin{cases} - \left( C^{\mathrm{sub}} P_t^{\mathrm{sub}} \Delta t + w_{\mathrm{loss}} P_t^{\mathrm{loss}} + w_{\mathrm{volt}} \Phi^{\mathrm{volt}}(V_t) + w_{\mathrm{sw}} \Phi^{\mathrm{sw}}(z_t, z_{t-1}) + w_{\mathrm{topo}} \Phi^{\mathrm{topo}}(z_t, z^{\mathrm{ref}}) + w_{\mathrm{curt}} \sum_{i \in \mathcal{N}} P_{i,t}^{\mathrm{curt}} \right) - \Psi^{\mathrm{viol}}(s_t, z_t), & \sum_{i \in \mathcal{N}} P_{i,t}^{\mathrm{shed}} = 0, \\ - \left( C^{\mathrm{sub}} P_t^{\mathrm{sub}} \Delta t + w_{\mathrm{loss}} P_t^{\mathrm{loss}} + w_{\mathrm{volt}} \Phi^{\mathrm{volt}}(V_t) + w_{\mathrm{sw}} \Phi^{\mathrm{sw}}(z_t, z_{t-1}) + w_{\mathrm{topo}} \Phi^{\mathrm{topo}}(z_t, z^{\mathrm{ref}}) + w_{\mathrm{curt}} \sum_{i \in \mathcal{N}} P_{i,t}^{\mathrm{curt}} + w_{\mathrm{shed}} \sum_{i \in \mathcal{N}} P_{i,t}^{\mathrm{shed}} \right) - \Psi^{\mathrm{viol}}(s_t, z_t), & \sum_{i \in \mathcal{N}} P_{i,t}^{\mathrm{shed}} > 0, \end{cases}  (26)
where $P_{i,t}^{\mathrm{curt}}$ denotes the RES curtailment at bus $i$ and $w_{\mathrm{curt}}$ is its nonnegative weight. Under normal operation, $\sum_{i \in \mathcal{N}} P_{i,t}^{\mathrm{shed}} = 0$ and the shedding penalty is identically zero. Load shedding is only activated as a last resort when operational constraints leave no feasible operating point without involuntary curtailment.
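The reward in (26) can be computed directly from the quantities returned by the power-flow evaluation. The sketch below is illustrative; the weight dictionary and argument names are assumptions rather than the paper's actual implementation.

```python
def step_reward(p_sub, p_loss, phi_volt, phi_sw, phi_topo,
                p_curt_total, p_shed_total, psi_viol, c_sub, dt, w):
    """Immediate reward of Eq. (26): negative operating cost minus penalties.

    `w` is a dict of nonnegative weights; `psi_viol` is the violation penalty
    triggered by infeasible (repaired) actions.
    """
    cost = (c_sub * p_sub * dt
            + w["loss"] * p_loss
            + w["volt"] * phi_volt
            + w["sw"]   * phi_sw
            + w["topo"] * phi_topo
            + w["curt"] * p_curt_total)
    if p_shed_total > 0:                 # load shedding only as a last resort
        cost += w["shed"] * p_shed_total
    return -cost - psi_viol
```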

3.2.5. Episodic Return

Each training episode corresponds to one operating horizon (e.g., a 24-h day). The SAC agent searches for a stochastic policy $\pi(a_t \mid s_t)$ that maximizes the expected discounted return.
J(\pi) = \mathbb{E}_{\pi} \left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right]  (27)
where $J(\pi)$ denotes the expected discounted return under policy $\pi$, $\gamma$ is the discount factor, and the expectation is taken over the stochastic state transitions induced by the environment and the policy.

3.3. Soft Actor-Critic for Dynamic DNR

Based on the MDP formulation in Section 3.2, a soft actor-critic (SAC) agent is adopted as the base policy to solve the dynamic DNR problem with a multi-binary action space. The SAC agent maintains a stochastic policy $\pi_{\theta}(a_t \mid s_t)$ and two soft Q-functions $Q_{\phi_1}(s,a)$ and $Q_{\phi_2}(s,a)$, parameterized by neural networks. Given state $s_t$, the policy network outputs, for each controllable switch, the probability of being closed; the multi-binary action $a_t \in \{0,1\}^{|\mathcal{S}|}$ is obtained by sampling from this stochastic policy. For a mini-batch of transitions $(s_t, a_t, r_t, s_{t+1})$ sampled from the replay buffer, the soft Bellman target is shown in (28).
y_t = r_t + \gamma \sum_{a' \in \mathcal{A}} \pi_{\theta}(a' \mid s_{t+1}) \left[ \min_{k=1,2} Q_{\phi_k}(s_{t+1}, a') - \alpha \log \pi_{\theta}(a' \mid s_{t+1}) \right]  (28)
The policy parameters are updated by minimizing $J_{\pi}(\theta)$.
J_{\pi}(\theta) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \sum_{a \in \mathcal{A}} \pi_{\theta}(a \mid s) \left( \alpha \log \pi_{\theta}(a \mid s) - \min_{k=1,2} Q_{\phi_k}(s, a) \right) \right]  (29)
where the inner summations implement the entropy regularization characteristic of SAC. The temperature α is tuned automatically toward a prescribed target entropy.
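For illustration, the target in (28) and the policy objective in (29) can be written compactly when the (safety-screened) candidate actions are enumerable, so that the expectations over $\mathcal{A}$ are computed exactly rather than sampled. The PyTorch sketch below makes this assumption; the tensor shapes and function names are illustrative.

```python
import torch

def soft_bellman_target(r, q1_next, q2_next, pi_logits_next, gamma, alpha):
    """Soft Bellman target of Eq. (28) for an enumerable discrete action set.

    q1_next, q2_next : target Q-values at s_{t+1}, shape [batch, |A|]
    pi_logits_next   : policy logits at s_{t+1}, shape [batch, |A|]
    r                : immediate rewards, shape [batch]
    """
    probs = torch.softmax(pi_logits_next, dim=-1)
    log_probs = torch.log_softmax(pi_logits_next, dim=-1)
    q_min = torch.min(q1_next, q2_next)
    return r + gamma * (probs * (q_min - alpha * log_probs)).sum(dim=-1)

def policy_loss(q1, q2, pi_logits, alpha):
    """Policy objective of Eq. (29): entropy-regularized soft Q maximization."""
    probs = torch.softmax(pi_logits, dim=-1)
    log_probs = torch.log_softmax(pi_logits, dim=-1)
    q_min = torch.min(q1, q2)
    return (probs * (alpha * log_probs - q_min)).sum(dim=-1).mean()
```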

4. LLM-Based Natural Language Instruction Supervision

This section describes the LLM-based supervision layer that enables a natural language-driven, human-in-the-loop distribution network reconfiguration on top of the DRL framework. Figure 2 illustrates that the distribution network provides numerical state information to the DRL agent and a compact textual system-condition summary to the LLM agent, while the operator issues high-level orders in natural language. The DRL agent proposes an initial switching plan, which the LLM evaluates and revises when needed. The finalized DNR action remains consistent with both physical network constraints and current operator preferences.

4.1. Information Flow and Module Roles

At each decision interval, the environment collects real-time measurements and topology data (branch, bus, and PV information) and constructs two outputs. The first is a numerical state vector, which is fed to the SAC-based DRL agent. Using this state, the DRL agent produces an unrevised DNR action, namely, a set of switch open/close commands corresponding to a candidate radial topology. The second is a concise system-condition summary, which captures salient features of the present operating point in textual form, such as overloaded lines, voltage violations, high-PV periods, or critical feeders.
In parallel, the operator may issue a natural language order, for example, “reduce losses,” “avoid feeder S12–S14,” or “limit switching around the main substation,” as indicated on the right of Figure 2. These orders are optional and may change over time; when no order is active, the unrevised DNR action is directly sent back to the distribution network after conventional feasibility checks. When an order is active, the unrevised action, the system-condition summary, and the most recent order are passed to the LLM agent, which acts as a supervisory gate between the DRL agent and the physical network.
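One possible way to assemble the textual system-condition summary is sketched below. The thresholds, field names, and wording are illustrative assumptions, since the exact summary format is not specified here; the intent is only to show how salient deviations can be verbalized compactly.

```python
def build_condition_summary(bus_v, line_loading, pv_output,
                            v_limits=(0.95, 1.05), loading_limit=1.0):
    """Compose a compact textual system-condition summary for the LLM agent."""
    parts = []
    low = sorted(b for b, v in bus_v.items() if v < v_limits[0])
    high = sorted(b for b, v in bus_v.items() if v > v_limits[1])
    if low:
        parts.append(f"Undervoltage buses: {low}")
    if high:
        parts.append(f"Overvoltage buses: {high}")
    overloaded = sorted(l for l, x in line_loading.items() if x > loading_limit)
    if overloaded:
        parts.append(f"Overloaded branches: {overloaded}")
    parts.append(f"Aggregate PV output: {sum(pv_output.values()):.2f} MW")
    return "; ".join(parts) if parts else "No violations; normal operation."
```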

4.2. Prompt Design and Instruction Understanding

The LLM agent is implemented as a frozen, general-purpose foundation model queried through carefully designed prompts. To obtain reproducible behavior and make the workflow inspectable, the interaction with the LLM is standardized by a fixed prompt template shown in the lower part of Figure 2.
Given this prompt, the LLM performs three functions. First, it interprets the operator’s instruction in the context of the current system conditions. Second, it assesses whether the DRL-generated switching plan aligns with the interpreted intent. Third, when necessary, it proposes a revised plan that better reflects the instruction. Because only the prompt is engineered and the underlying model parameters remain fixed, new instruction patterns or operational policies can be incorporated by modifying the textual template without retraining the DRL agent.
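The exact template shown in Figure 2 is not reproduced here; the following Python sketch illustrates one plausible structure for such a fixed template, with the JSON output schema and field names assumed for the sake of the example.

```python
PROMPT_TEMPLATE = """You are a distribution-network operation assistant.
System condition: {condition_summary}
Operator instruction: {instruction}
Proposed switching plan (DRL): {drl_action}

Tasks:
1. Judge whether the proposed plan complies with the instruction.
2. If it does not, propose a revised plan that changes as few switches as
   possible while respecting the instruction.
Respond ONLY with JSON:
{{"compliant": true|false,
  "revised_action": {{"<switch_id>": 0|1, ...}},
  "reason": "<one sentence>"}}"""

def build_prompt(condition_summary, instruction, drl_action):
    """Fill the fixed template with the current decision-step inputs."""
    return PROMPT_TEMPLATE.format(condition_summary=condition_summary,
                                  instruction=instruction,
                                  drl_action=drl_action)
```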
In practical operation, operator instructions may be vague, contradictory, or erroneous. Therefore, the supervisory gate first parses the natural language instruction into structured intents and checks whether the instruction is actionable. If the instruction is insufficiently specified, internally contradictory, or cannot be grounded to valid system entities, the framework does not trigger action revision and returns an operator-facing message requesting clarification or correction.
To make the above conflict-handling procedure intuitive, we provide illustrative examples under three representative instruction types, namely, vague, contradictory, and erroneous instructions. In all three cases, the operator receives an explanatory message indicating why the instruction is not actionable and how it can be refined into an executable form.
For a vague instruction, consider “Improve the network performance.” Since no actionable target is specified, the LLM treats the instruction as insufficiently defined and requests concrete objectives and scope, such as a target area, a voltage requirement, or an allowable switching budget. For a contradictory instruction, consider “Improve voltage quality at remote buses while keeping all switch statuses unchanged.” The LLM recognizes the internal inconsistency for a reconfiguration task and requests either relaxing the prohibition on switching changes or specifying a permissible set of controllable switches. For an erroneous instruction, consider “Open the switch on feeder F12” when feeder F12 is not defined in the current network model. In this case, the instruction cannot be grounded to valid system entities, and the LLM requests a valid device identifier or a selection from the available switch list.

4.3. LLM-Supervised Action Revision

At each decision interval $t$, the LLM receives the following three inputs: the operator instruction $O_t$, the textual system-condition summary, and the unrevised DRL switching plan $a_t^{\mathrm{DRL}}$. The LLM then produces (i) a compliance judgement indicating whether $a_t^{\mathrm{DRL}}$ is consistent with $O_t$, and (ii) when non-compliant, an optional revised switching plan $a_t^{\mathrm{LLM}}$.
If the compliance judgement is affirmative, the framework directly executes $a_t^{\mathrm{DRL}}$ after standard electrical feasibility checks. If the judgement is negative, the LLM proposes $a_t^{\mathrm{LLM}}$ by modifying $a_t^{\mathrm{DRL}}$ according to the operator intent. In this step, the LLM is instructed to revise the DRL plan through rule-bounded local modifications that preserve the action syntax assumed by the DRL layer, rather than synthesizing an unconstrained switching plan from scratch.
Before execution, any proposed revision $a_t^{\mathrm{LLM}}$ is subjected to deterministic validation. First, the output is parsed under a predefined structured schema. Second, operability is checked to ensure that all referenced switches exist in the network model and are controllable and available at time $t$. Third, radiality is verified using deterministic graph-theoretic tests to ensure that the resulting energized topology is connected and cycle-free. If any validation step fails, the revision is discarded and the framework falls back to $a_t^{\mathrm{DRL}}$ for the current interval. If validation passes, the revised action is further evaluated by the same power-flow-based feasibility checks as in the autonomous DRL mode. If the revised action is infeasible, it is rejected and the framework again falls back to $a_t^{\mathrm{DRL}}$. Optionally, even when feasible, a revision can be rejected if it leads to substantial deterioration in the operational objective relative to $a_t^{\mathrm{DRL}}$ at the same time step (e.g., a pronounced increase in network loss).
Algorithm 1 summarizes the above deterministic validation and fallback policy for LLM-revised actions.
Algorithm 1: Deterministic safety filtering for LLM-revised actions
Input: System-condition summary at time $t$; operator instruction $O_t$; DRL action $a_t^{\mathrm{DRL}}$; network graph data and switch metadata.
Output: Executed action $a_t$.
 1. Query the LLM with $O_t$, the system-condition summary, and $a_t^{\mathrm{DRL}}$. Obtain a compliance judgement and an optional revision $a_t^{\mathrm{LLM}}$.
 2. If compliant, set $a_t \leftarrow a_t^{\mathrm{DRL}}$.
 3. Otherwise, validate the revised action $a_t^{\mathrm{LLM}}$:
    3.1 Parse $a_t^{\mathrm{LLM}}$. If the format is invalid, set $a_t \leftarrow a_t^{\mathrm{DRL}}$.
    3.2 Check basic operability (all referenced switches are controllable and available at time $t$). If failed, set $a_t \leftarrow a_t^{\mathrm{DRL}}$.
    3.3 Check radiality consistency using a deterministic graph test (cycle-free and connected topology after applying $a_t^{\mathrm{LLM}}$). If failed, set $a_t \leftarrow a_t^{\mathrm{DRL}}$.
    3.4 If all checks pass, set $a_t \leftarrow a_t^{\mathrm{LLM}}$.
 4. Execute $a_t$ after standard electrical feasibility checks.
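A compact sketch of Algorithm 1 is given below. It assumes the LLM reply is requested in JSON and reuses a radiality helper such as the one sketched after (21); all identifiers are illustrative rather than taken from the reported code.

```python
import json

def filter_llm_revision(llm_reply, a_drl, controllable_switches, is_radial_fn):
    """Deterministic safety filtering of Algorithm 1 (sketch).

    Returns (executed_action, fallback_flag); falls back to the DRL action
    whenever parsing, operability, or radiality checks fail.
    """
    try:                                              # Step 3.1: schema parsing
        reply = json.loads(llm_reply)
    except (ValueError, TypeError):
        return a_drl, True
    if not isinstance(reply, dict):
        return a_drl, True
    if reply.get("compliant", False):
        return a_drl, False                           # Step 2: DRL plan accepted
    revision = reply.get("revised_action", {})
    if not revision or any(s not in controllable_switches for s in revision):
        return a_drl, True                            # Step 3.2: operability check
    a_rev = dict(a_drl)
    a_rev.update({s: int(v) for s, v in revision.items()})
    if not is_radial_fn(a_rev):                       # Step 3.3: radiality check
        return a_drl, True
    return a_rev, False                               # Step 3.4: accept revision
```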
This framework preserves the advantages of the SAC-based training in Section 3: the DRL agent continues to optimize long-term performance under stochastic RES and load conditions, whereas the LLM layer provides a flexible, interpretable interface through which operators can inject high-level preferences at run time. As a result, the proposed framework achieves natural language-driven, human-in-the-loop dynamic DNR without re-optimizing the underlying DRL policy whenever operator requirements change.

4.4. Quantitative Evaluation of Instruction Compliance

Operator instructions are issued in natural language and may vary in specificity, scope, and internal structure. To enable reproducible quantitative analysis, instruction compliance is evaluated using deterministic metrics computed from the executed switching action and the resulting network operating point.

4.4.1. Instruction Grounding and Clause Representation

At each decision interval $t$, an active instruction $O_t$ is mapped to a set of verifiable clauses $\mathcal{K}_t = \{ k_1, \ldots, k_{K_t} \}$. Each clause corresponds to one requirement that can be evaluated from (i) the executed switching action $a_t$ and/or (ii) the resulting network state (voltages, line loadings, losses) obtained from the feasibility evaluation. The clause representation supports both single and compound instructions by decomposing them into atomic, independently evaluable components.
Because natural language orders are open-ended, some inputs may be insufficiently specified, internally inconsistent, or not grounded to valid system entities. Such orders are treated as non-actionable for quantitative scoring at time $t$. We define an actionable indicator $\delta_t^{\mathrm{act}} \in \{0,1\}$, where $\delta_t^{\mathrm{act}} = 1$ if $O_t$ can be grounded and mapped to a non-empty clause set $\mathcal{K}_t$ and $\delta_t^{\mathrm{act}} = 0$ otherwise. Over an evaluation window $W$, the actionable rate is defined as
r^{\mathrm{act}}(W) = \frac{1}{|W|} \sum_{t \in W} \delta_t^{\mathrm{act}}  (30)
This separates the coverage of actionable instructions from the satisfaction level computed for actionable clauses.

4.4.2. Clause-Level Satisfaction with Partial and Conditional Fulfilment

For each actionable order, a per-step satisfaction score $c_k(t) \in [0,1]$ is computed deterministically for every clause $k \in \mathcal{K}_t$. The window-averaged satisfaction of clause $k$ over $W$ is
C_k(W) = \frac{1}{|W|} \sum_{t \in W} c_k(t)  (31)
The following clause types are used.
(i) Prohibition clause. A prohibition clause requires that specified components or regions are not involved in the executed switching action or energized topology. Let $\nu_k(t) \in \{0,1\}$ denote whether the prohibition is violated at time $t$. The satisfaction is
c_k(t) = 1 - \nu_k(t)  (32)
(ii) Switching-budget clause. A switching-budget clause limits the number of switching operations within a specified scope. Let $n_k(t)$ be the number of switching operations affecting that scope at time $t$, and let $n_k^{\max}$ be the budget. A graded satisfaction score is
c_k(t) = 1 - \min \left\{ \max \left\{ \frac{n_k(t) - n_k^{\max}}{\max \{ 1, n_k^{\max} \}}, 0 \right\}, 1 \right\}  (33)
This yields full satisfaction when the budget is met and partial satisfaction under moderate exceedance.
(iii) Performance clause. A performance clause targets operational quality over a specified set. For a voltage-quality clause defined on a bus set $\mathcal{B}_k$, the average deviation is
d_k(t) = \frac{1}{|\mathcal{B}_k|} \sum_{i \in \mathcal{B}_k} \left| V_i(t) - 1 \right|  (34)
The corresponding satisfaction is mapped as
c_k(t) = 1 - \min \left\{ \max \left\{ \frac{d_k(t)}{\Delta V_k}, 0 \right\}, 1 \right\}  (35)
where $\Delta V_k > 0$ is a normalization constant that specifies the deviation scale regarded as fully unsatisfactory for this clause. Analogous definitions can be applied to overload-mitigation clauses using normalized loading exceedance on a specified branch set.
(iv) Conditional clause. A conditional clause is binding only when a triggering condition holds. Let $\chi_k(t) \in \{0,1\}$ denote whether the clause is active at time $t$. When $\chi_k(t) = 0$, the clause is treated as non-binding and $c_k(t) = 1$. When $\chi_k(t) = 1$, $c_k(t)$ is computed using the corresponding prohibition, budget, or performance definition above.
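The clause scores defined above reduce to a few lines of deterministic code. The following sketch implements (32)–(35) and the conditional gating; the function and argument names are chosen for illustration only.

```python
def prohibition_score(violated):
    """Eq. (32): 1 if the prohibited component is untouched, else 0."""
    return 0.0 if violated else 1.0

def budget_score(n_ops, n_max):
    """Eq. (33): graded satisfaction of a switching budget."""
    excess = max(n_ops - n_max, 0) / max(1, n_max)
    return 1.0 - min(excess, 1.0)

def voltage_score(voltages, delta_v):
    """Eqs. (34)-(35): normalized voltage-quality satisfaction on a bus set."""
    d = sum(abs(v - 1.0) for v in voltages) / len(voltages)
    return 1.0 - min(max(d / delta_v, 0.0), 1.0)

def conditional_score(active, base_score):
    """Conditional clause: non-binding steps count as fully satisfied."""
    return base_score if active else 1.0
```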

4.4.3. Compound Instructions and Priority-Aware Aggregation

Compound orders are represented by multiple clauses with different priorities. We partition clauses into priority tiers $\mathcal{T}_1 \succ \mathcal{T}_2 \succ \mathcal{T}_3$, where $\mathcal{T}_1$ contains hard prohibitions, $\mathcal{T}_2$ contains conditional or safety-relevant operational rules, and $\mathcal{T}_3$ contains soft performance preferences. The tier-wise satisfaction over $W$ is
C_{\mathcal{T}_j}(W) = \frac{1}{|\mathcal{T}_j|} \sum_{k \in \mathcal{T}_j} C_k(W), \quad j \in \{1,2,3\}  (36)
For interpretability, we report tier-wise compliance as the vector
\mathbf{C}(W) = \left[ C_{\mathcal{T}_1}(W), \ C_{\mathcal{T}_2}(W), \ C_{\mathcal{T}_3}(W) \right]  (37)
When a single scalar score is required, a tier-dominant scalarization is used.
C^{\mathrm{inst}}(W) = C_{\mathcal{T}_1}(W) + \epsilon C_{\mathcal{T}_2}(W) + \epsilon^{2} C_{\mathcal{T}_3}(W), \quad 0 < \epsilon < 1  (38)
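The tier-wise aggregation in (36)–(38) is likewise deterministic. The short sketch below assumes the window-averaged clause scores have already been grouped by priority tier; the names are illustrative.

```python
def tier_satisfaction(clause_scores_by_tier):
    """Eq. (36): average window satisfaction per priority tier."""
    return {tier: sum(scores) / len(scores)
            for tier, scores in clause_scores_by_tier.items() if scores}

def tier_dominant_scalar(c_t1, c_t2, c_t3, eps=0.1):
    """Eq. (38): scalar compliance score dominated by higher-priority tiers."""
    return c_t1 + eps * c_t2 + eps**2 * c_t3
```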

4.4.4. Fallback Rate for Auditable Robustness

In addition to satisfaction, the robustness of instruction handling is characterized by a fallback indicator $\delta_t^{\mathrm{fb}} \in \{0,1\}$, where $\delta_t^{\mathrm{fb}} = 1$ if the LLM-proposed revision is rejected by deterministic validation and the executed action equals the native DRL action, i.e., $a_t = a_t^{\mathrm{DRL}}$. Over $W$, the fallback rate is
r^{\mathrm{fb}}(W) = \frac{1}{|W|} \sum_{t \in W} \delta_t^{\mathrm{fb}}  (39)
Together, $r^{\mathrm{act}}(W)$, $\mathbf{C}(W)$, and $r^{\mathrm{fb}}(W)$ provide a reproducible characterization of instruction compliance, including partial fulfilment, conditional rules, and priority handling, while keeping non-actionable inputs explicit through the actionable rate.

5. Case Study

This section presents numerical case studies on two standard radial distribution feeders to evaluate the proposed natural language-driven dynamic DNR framework. The IEEE 16-bus system is first used to analyze the basic behavior of the SAC-based dynamic DNR and to illustrate the impact of natural language instructions on switching decisions. The IEEE 33-bus system is then adopted as a larger-scale testbed to further assess scalability and instruction-awareness under higher PV penetration and more switching options.
We organize the case studies following the two-layer structure of the proposed framework. First, we benchmark candidate DNR solvers in the autonomous mode without operator instructions to select a high-quality DRL algorithm for online operation. Second, we evaluate the instruction-aware online revision layer built on top of the selected algorithm. The LLM revises the DRL-proposed switching action at run time without retraining.

5.1. Case Study on the IEEE 16-Bus Distribution System

5.1.1. System Description and Simulation Settings

The first case study is carried out on the IEEE 16-bus radial distribution system, whose single-line diagram is shown in Figure 3. The locations of sectionalizing switches, tie switches, and distributed PV units are indicated in the figure. PV units are installed at several load buses in order to generate reverse power flows and local congestion under high PV output. The main electrical characteristics of the feeder, including the number of buses and branches, base voltage and power, number and capacity of PV units, PV-connected bus indices, and tie-switch indices, are summarized in Table 1.
The time-varying load, PV generation, and electricity price follow typical 24 h profiles, as depicted in Figure 4. The base load presents an evening peak, while the PV generation is concentrated around midday, resulting in pronounced variations of the net load and potential power flow reversals in downstream branches. The electricity price adopts a time-of-use pattern with higher prices during peak hours and lower prices during off-peak periods.
The main hyperparameters of the SAC agent and the configuration of the LLM are listed in Table 2.

5.1.2. Training Performance of DRL Agents on the IEEE 16-Bus Feeder

Before comparing the proposed SAC-based dynamic DNR with the conventional model-based scheme, we first examine the training behavior of different DRL agents on the IEEE 16-bus feeder. The following four methods are reported in Figure 5: (i) a simple baseline obtained from a traditional optimizer (Benchmark); (ii) a rule-based heuristic algorithm; (iii) a standard entropy-free actor-critic agent (AC); and (iv) the proposed soft actor-critic agent (SAC) with entropy regularization and automatic temperature tuning.
Figure 5 shows that the SAC curve increases rapidly during the first few hundred episodes and then gradually saturates, reaching a stable plateau after around 1000 episodes without noticeable performance degradation. The dashed lines represent the non-learning references (Heuristic and Benchmark), which remain constant because they do not involve parameter updates. This indicates that the SAC agent can reliably learn a stable and high-quality dynamic DNR policy on the IEEE 16-bus feeder, providing a sound basis for the subsequent case studies.

5.1.3. Baseline Comparison Without Natural Language Instructions

In the baseline case without natural language instructions, we compare the following four schemes on the IEEE 16-bus feeder: the conventional model-based DNR, a rule-based heuristic algorithm, a standard entropy-free AC agent, and the proposed SAC-based dynamic DNR. The numerical results averaged over all testing days are summarized in Table 3.
Table 3 illustrates that all schemes operate without load shedding or branch overloading, while model-based DNR provides the best benchmark performance at the cost of high computational burden. The heuristic algorithm offers a rule-based reference and achieves better performance than the entropy-free AC agent, but it remains noticeably inferior to SAC in terms of cost, loss, and voltage profile. The AC agent improves upon a fixed topology but still suffers from noticeably higher cost, loss, and voltage deviation than the benchmark. The SAC-based scheme achieves performance very close to model-based DNR and is consistently better than AC across all objective-related metrics, with much lower online computation; therefore, it is adopted as the base DRL policy for subsequent instruction-aware operation.

5.1.4. Instruction-Aware Operation with SAC and LLM

To illustrate instruction-aware operation on the IEEE 16-bus feeder, we consider a typical maintenance instruction “From 10:00 to 16:00 today, avoid using the tie line between buses 5 and 11.”
The left panel of Figure 6 shows a typical unrevised action generated by the SAC-based DRL agent under this instruction. In order to reduce losses and relieve the loading of the left feeder, the agent closes the tie line between buses 5 and 11 and opens one of the original radial branches on the left side, so that part of the load supplied by the S4-S5-S6 corridor is transferred to the feeder that passes through bus 11. Electrically, this action is beneficial, but it energizes line (5,11) during the maintenance window and therefore violates the operator’s requirement.
In the proposed SAC-LLM framework, the same DRL action is first produced and then passed to the LLM supervision layer together with the active instruction. The LLM interprets “avoid using the line between buses 5 and 11” as a hard constraint and searches, within the safety-filtered action set, for an alternative topology that keeps line (5,11) open while still alleviating the critical loading. As illustrated in the right panel of Figure 6, the revised action keeps line (5,11) open and instead reconfigures the feeder by closing the path from bus 6 to bus 7 and onward to bus 16 and bus 15, while opening one branch on the original right feeder to maintain radial operation. In other words, the LLM replaces the SAC action “use the 5–11 tie” with a different power-flow transfer path that bypasses the out-of-service line but achieves a similar relief of branch loading.
To quantify the extent to which the executed switching actions satisfy the operator’s natural language instruction, we evaluate the instruction over the maintenance window $W_{16}$ = 10:00–16:00 using the clause-based metrics defined in Section 4.4. This instruction is mapped to a Tier-1 prohibition clause on line (5,11). Accordingly, we report (i) the Tier-1 prohibition satisfaction $C_{\mathcal{T}_1}(W_{16})$ and (ii) the total violation hours of line (5,11) within $W_{16}$, both computed from the executed topology and the post-check outputs. We additionally include a language-free supervisory baseline, denoted as SAC and Rule Supervisor, which enforces the same prohibition clause via deterministic post-processing on top of the SAC action, followed by the same feasibility checks and fallback policy. This baseline isolates the effect of adding a generic rule-based override layer from that of language-based instruction interpretation.
Table 4 reports the results for three schemes, namely, SAC, SAC and Rule Supervisor, and SAC and LLM. For each scheme, two categories of metrics are presented. The first category summarizes the conventional system metrics, including purchasing cost, energy loss, voltage deviation, switching operations, and computation time. The second category summarizes instruction compliance over the maintenance window $W_{16}$ = 10:00–16:00, including the actionable rate $r^{\mathrm{act}}(W_{16})$, the violation hours of line (5,11), the Tier-1 prohibition compliance $C_{\mathcal{T}_1}(W_{16})$, and the fallback rate $r^{\mathrm{fb}}(W_{16})$. As shown in Table 4, both SAC and Rule Supervisor and SAC and LLM eliminate the maintenance-window usage of line (5,11), improving $C_{\mathcal{T}_1}(W_{16})$ from 14.3% under pure SAC to 100%, with only marginal changes in purchasing cost, energy loss, and voltage deviation, while the daily number of switching operations remains unchanged in this case study.
Beyond the numerical results, we also clarify the deployment and maintenance overhead of the two supervisory designs. SAC and Rule Supervisor is intrinsically engineering-driven: when operator requirements change or new instruction patterns are introduced, engineers must re-specify the requirements in a machine-interpretable form and then implement and test new rule logic and mappings in code. In contrast, SAC and LLM provides an instruction-to-action interface that is primarily configured at the prompt level. Many new instruction phrasings can be accommodated by updating the prompt template, examples, and structured-output constraints, without re-engineering the supervisory logic, thereby substantially reducing instruction-specific programming and integration effort in practice.
The added computation time of the LLM supervisory layer is modest and remains far below that of model-based DNR, further supporting the practicality of instruction-aware operation.

5.2. Case Study on the IEEE 33-Bus Distribution System

The second case study is carried out on the IEEE 33-bus radial distribution system. The system topology is shown in Figure 7. The same 24 h load, PV, and price profiles as in the 16-bus case are adopted, scaled to the peak demand of the 33-bus feeder. The DNR problem is formulated with the same MDP structure and reward design; only the network parameters and load levels are updated.
As summarized in Table 5, model-based DNR provides the best benchmark performance in terms of daily purchasing cost, energy loss, and voltage profile, but it requires solving mixed-integer problems every hour. The AC agent achieves only moderate improvements and still exhibits noticeably higher cost, loss, and voltage deviations than the benchmark. The SAC-based scheme attains operating performance very close to the model-based DNR and clearly better than AC across all objective-related metrics, while using substantially less online computation time.
For the IEEE 33-bus system, we focus on an instruction that reflects the operator’s preference for voltage quality and reliability on the remote long lateral from bus 24 to buses 29–33 as follows: “During the evening peak (18:00–22:00), give higher priority to voltage quality at the remote buses 29–33. Try to keep their voltages as close to 1.0 p.u. as possible and avoid unnecessary reconfiguration of the long lateral from bus 24 through buses 25 and 29 to buses 30–33, unless there is an overload or a voltage violation.”
Under the pure SAC policy, the agent optimizes the global objective and may still transfer additional load to the 24–25–29–30–33 path during the evening peak. For example, at 20:00, it may propose to close the tie line (25,29) and open branch (28,29), slightly reducing total losses but further depressing the voltages at buses 29–33 and contradicting the operator’s preference on the remote lateral.
In the SAC-LLM framework, the same SAC action is first generated and then checked against the active instruction. The LLM recognizes that this reconfiguration directly affects the 24–25–29–30–33 lateral and is not triggered by any overload or constraint violation; therefore, it vetoes the proposal. The final action keeps line (25,29) open and line (28,29) closed, i.e., the remote lateral remains in its original radial configuration, as shown in the lower panel of Figure 8. In other words, whenever SAC attempts to perform non-essential switching on the remote lateral, the LLM cancels the reconfiguration and preserves the existing topology.
To quantify the extent to which the executed actions satisfy this compound natural language instruction over $W_{33}$ = 18:00–22:00, we decompose the instruction into two verifiable clauses, as defined in Section 4.4: (i) a Tier-2 conditional clause that discourages switching operations on the 24–25–29–30–33 lateral unless an overload or a voltage-violation trigger is present, and (ii) a Tier-3 voltage-quality clause on buses 29–33 quantified by a normalized voltage-deviation score. We report the corresponding satisfaction scores $C_{\mathcal{T}_2}(W_{33})$ and $C_{\mathcal{T}_3}(W_{33})$, together with the number of non-essential lateral switching hours within $W_{33}$, all computed from the executed switching actions and the resulting operating points.
The daily impact of this instruction is summarized in Table 6. Compared with pure SAC, SAC and LLM significantly improves the voltage profile at buses 29–33 and greatly increases the instruction satisfaction rate, while only slightly increasing daily purchasing cost and energy loss. Meanwhile, the satisfaction rows in Table 6 show that SAC-LLM achieves higher clause-level satisfaction for both the conditional lateral rule and the voltage-quality preference during the instruction window.

6. Discussion

6.1. On Stochastic Operating Conditions and Distribution Shift

This work focuses on the hour-level dynamic distribution network reconfiguration under normal operating conditions with stochastic PV and load variations. Although Figure 4 presents representative 24 h trajectories for visualization, the PV generation and load demands used in both training and evaluation are not fixed deterministic curves. Instead, they are generated as episode-wise stochastic realizations within predefined variability ranges, so the SAC agent is trained and tested on a family of time-varying net load conditions rather than a single nominal profile.
Nevertheless, practical operation may still involve out-of-distribution (OOD) conditions caused by imperfect PV information, such as forecast or estimation errors, as well as abrupt PV ramps induced by cloud transients. To explicitly evaluate robustness beyond typical fluctuations, we add an additional stress test with two representative scenarios. The first scenario considers a PV forecast-error case in which the PV information available to the policy deviates from the realized PV trajectory. The second scenario considers a cloud-transient ramp case in which the PV output experiences an abrupt daytime drop over a short window. Figure 9 illustrates the construction of these two OOD PV events, and Table 7 summarizes the resulting operating performance on the IEEE 33-bus system. Compared with the normal setting, both OOD cases lead to moderate increases in purchasing cost and energy loss, while voltage deviation becomes more pronounced under the cloud-transient ramp. Importantly, the policy retains the same daily switching budget in the forecast-error case and requires only one additional switching operation in the ramp case, with computation time remaining at the sub-second level.

6.2. Reproducibility of the LLM-Based Supervisory Layer

To quantify the run-to-run variability introduced by the LLM-based supervisory layer, we conduct a short repeatability test under strictly identical inputs. In our implementation, each episode corresponds to one operating day and consists of 24 decision steps (hour-level reconfiguration). At each hour, the DRL agent outputs a single baseline switching action from a discrete action space, and the LLM acts as an instruction-aware supervisor that may revise this baseline action if needed.
We construct the repeatability test using one representative test episode from the instruction-driven case study and include all 24 hourly steps. For each hour, we fix (i) the numerical system-condition summary, (ii) the operator instruction, and (iii) the DRL base action. We then repeat the LLM inference 20 times per step while keeping the LLM configuration unchanged. Each LLM output is parsed into a revised action and validated by deterministic feasibility checks; if the output is invalid or infeasible, the final executed action deterministically falls back to the DRL baseline action.
Run-to-run variability is quantified using the same deterministic compliance metrics defined in Section 4.4, rather than any LLM-based scoring. Concretely, for each repeated run we compute the actionable rate r a c t ( W 33 ) , the tier-level satisfaction scores C T 2 ( W 33 ) and C T 3 ( W 33 ) , and the fallback rate r f b ( W 33 ) over the instruction window W 33 = 18:00–22:00. In addition, we report the same system-level outcomes as in Table 6; the results are summarized in Table 8. For the SAC–LLM method, each metric is reported as the mean ± standard deviation across the 20 repeated LLM runs under identical inputs, while the pure SAC baseline remains deterministic for the selected episode. This repeatability analysis directly measures the variance induced by the supervisory layer and confirms that the reported compliance improvements are reproducible under the fixed configuration.
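Procedurally, the test reduces to repeated LLM calls under frozen inputs with deterministic parsing, feasibility checking, and fallback. The sketch below outlines this loop; query_llm, parse_action, and is_feasible are placeholders for components whose interfaces are not specified here, and the statistics helper simply aggregates any metric across runs.

```python
import numpy as np

N_REPEATS = 20

def repeatability_test(hours, base_action, query_llm, parse_action, is_feasible):
    """Repeat LLM inference under identical inputs and collect the executed actions."""
    runs = []
    for _ in range(N_REPEATS):
        executed = {}
        for t in hours:
            raw = query_llm(t)                       # fixed summary + instruction + base action
            action = parse_action(raw)
            if action is None or not is_feasible(t, action):
                action = base_action[t]              # deterministic fallback to the DRL baseline
            executed[t] = action
        runs.append(executed)
    return runs

def summarize(metric_fn, runs):
    """Mean and standard deviation of a compliance or system-level metric across runs."""
    values = np.array([metric_fn(run) for run in runs])
    return values.mean(), values.std()
```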

6.3. Scalability Considerations

The current implementation feeds the LLM a system-condition summary containing Branch, Bus, and Topology data together with the DRL-proposed switching plan to support instruction-aware supervision. This design is effective for feeder-level systems with tens to roughly a hundred buses. However, directly extending the same full-state prompt serialization to thousand-bus models would significantly increase prompt length and inference overhead, which is beyond the scope of this study. Two practical directions to enable large-scale deployment are described below.
(1) Token-efficient full-state serialization
To control prompt length without losing system visibility, we represent the system-condition summary using a compact, fixed-schema tabular format: components are indexed by integer IDs, and numerical values are quantized to a fixed precision. To ensure that the LLM can reliably parse and reason over such simplified tables, the same table schema can be incorporated into the model training pipeline, either during pretraining on synthetic power-system tables or during supervised fine-tuning on schema-consistent prompt–response pairs. This reduces the need for verbose textual descriptions at inference time.
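As an illustration, such a fixed-schema summary could be emitted as short pipe-delimited records keyed by integer IDs. The record tags, column layout, and function name below are assumptions for the sketch, not the schema actually deployed.

```python
def serialize_state(branches, buses, open_switches, plan, ndigits=3):
    """Compact fixed-schema system summary: integer IDs, quantized values, no free text."""
    lines = ["BR|id|from|to|p_mw|q_mvar|loading_pct"]
    lines += [f"BR|{b['id']}|{b['f']}|{b['t']}|{round(b['p'], ndigits)}|"
              f"{round(b['q'], ndigits)}|{round(b['loading'], 1)}" for b in branches]
    lines.append("BUS|id|v_pu")
    lines += [f"BUS|{n['id']}|{round(n['v'], ndigits)}" for n in buses]
    lines.append("TOPO|open:" + ",".join(f"{i}-{j}" for i, j in open_switches))
    lines.append("PLAN|close:" + ",".join(f"{i}-{j}" for i, j in plan.get("close", []))
                 + "|open:" + ",".join(f"{i}-{j}" for i, j in plan.get("open", [])))
    return "\n".join(lines)

# Example record set for a one-branch, one-bus snippet of the feeder.
summary = serialize_state(
    branches=[{"id": 1, "f": 1, "t": 2, "p": 3.9172, "q": 2.4351, "loading": 61.2}],
    buses=[{"id": 2, "v": 0.99703}],
    open_switches=[(25, 29)],
    plan={"close": [(25, 29)], "open": [(28, 29)]},
)
```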
(2) Partitioned multi-agent deployment
When the feeder becomes very large, prompt length can be bounded by partitioning the network into multiple zones defined by feeder sections or tie-switch boundaries. Each zone hosts a local agent that ingests only the full-state tables within its zone, while a lightweight coordination layer resolves cross-zone actions and maintains global consistency such as radiality, so the per-agent prompt length scales with zone size rather than with total system size.
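A minimal sketch of the coordination-layer consistency check is given below, assuming that each zone agent returns its proposed set of closed branches and that global radiality is equivalent to the merged closed branches forming a spanning tree; networkx is used here only for brevity, and the zone-labeling convention is an assumption.

```python
import networkx as nx

def partition_by_zone(buses, zone_of):
    """Group bus IDs by zone label (zone_of: bus_id -> zone name)."""
    zones = {}
    for bus in buses:
        zones.setdefault(zone_of[bus], set()).add(bus)
    return zones

def globally_radial(buses, closed_branches):
    """After merging per-zone proposals, the closed branches must connect all buses
    and contain exactly |buses| - 1 edges, i.e., form a spanning tree."""
    g = nx.Graph()
    g.add_nodes_from(buses)
    g.add_edges_from(closed_branches)
    return nx.is_connected(g) and g.number_of_edges() == len(buses) - 1
```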

7. Conclusions

This paper proposed an LLM-based, human-in-the-loop framework for dynamic distribution network reconfiguration in active feeders with high PV penetration. A SAC agent was designed as a fast base policy for hour-level DNR. An LLM-based supervision layer interprets operator instructions in natural language and selectively revises the SAC-proposed switching action after safety checking, thereby integrating temporary operator preferences without retraining.
Case studies on IEEE 16-bus and IEEE 33-bus distribution systems demonstrated that the SAC base policy achieves operating performance close to a conventional model-based DNR benchmark while drastically reducing online computation time. When natural language instructions are activated, the LLM supervision layer markedly improves instruction compliance with only marginal changes in purchasing cost, energy loss, voltage profile, and switching frequency. These results indicate that LLMs can serve as an interpretable and lightweight supervisory interface for DRL-based DNR. This enables instruction-aware, human–AI collaborative operation in future active distribution networks.
Future work will extend the framework in two directions. First, we will scale it to larger real feeders with richer operational rules, heterogeneous switch types, and multi-operator instruction patterns. Second, we will expand its scope from normal-operation DNR to contingency-driven reconfiguration and service restoration, where faults and partial observability introduce additional constraints and demand tighter coordination between safety assurance and instruction compliance.

Author Contributions

Conceptualization, H.Z. (Hanxin Zhang); methodology, H.Z. (Hanxin Zhang); software, H.Z. (Hao Zhou); validation, H.Z. (Hanxin Zhang) and H.Z. (Hao Zhou); formal analysis, H.Z. (Hao Zhou); investigation, H.Z. (Hanxin Zhang); resources, H.Z. (Hao Zhou); data curation, H.Z. (Hanxin Zhang); writing—original draft preparation, H.Z. (Hanxin Zhang); writing—review and editing, H.Z. (Hao Zhou); visualization, H.Z. (Hanxin Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

Indices and Sets
i , j Indices of buses
t Time index within the operating horizon
N Set of buses
L Set of candidate branches
S Set of branches equipped with remotely controllable switches, S L
C ( j ) Set of buses directly downstream of bus j in the radial tree
T Number of decision steps in the operating horizon
S M D P State space of the MDP formulation
A Action space of the DRL agent
D Replay buffer storing transitions
B k Bus set associated with performance clause k
Parameters
Δ t Time step length
P i , t L , Q i , t L Active and reactive demand at bus i and time t
P i , t R E S , Q i , t R E S Active and reactive power of RESs at bus i and time t
V i r e f Reference voltage magnitude at bus i
V i m i n , V i m a x Lower and upper bounds of voltage magnitude at bus i
S i j m a x Apparent power rating (thermal limit) of branch ( i , j )
r i j , x i j Resistance and reactance of branch ( i , j )
C s u b Energy price for active power purchased from the upstream grid
w l o s s , w v o l t , w s w Weights of loss, voltage deviation, switching cost
w t o p o , w o v e r , w s h e d Weights of topology deviation, branch overload, and load shedding
d i j Penalty coefficient for topology deviation on branch ( i , j )
c i j Operation cost coefficient of the controllable switch on branch ( i , j )
P i , t G , m i n , P i , t G , m a x , Q i , t G , m i n , Q i , t G , m a x Minimum and maximum active and reactive power outputs of controllable generation at bus i and time t
N S T E P M A X Maximum allowed number of switching operations
N i j m a x Maximum cumulative number of switching operations for branch ( i , j )
γ Discount factor in the SAC updates
α Temperature parameter in SAC that balances expected return and policy entropy
τ S A C Temperature factor of SAC
τ L L M LLM sampling temperature
top-p LLM nucleus sampling parameter
Variables
P t s u b Active power purchased from the upstream grid at time t
P t l o s s Total active power loss of the distribution network at time t
P i , t G , Q i , t G Active and reactive power output of controllable generation at bus i and time t
P i , t s h e d Active load shedding at bus i and time t
P i j , t , Q i j , t Active and reactive power flow on branch i , j at time t
V i , t Voltage magnitude at bus i and time t
z i j , t Binary energization status of branch ( i , j ) at time t
p t L , q t L Vectors collecting nodal active and reactive demands at time t
p t R E S , q t R E S Vectors collecting nodal RES injections at time t
I t , V t Vector of selected branch current and voltage magnitudes at time t
Φ t volt Voltage deviation index
Φ t sw Switching cost index
Φ t topo Topology deviation index
ξ i j , t Branch-level overload/violation-related penalty
s t MDP state at time t
a t Multi-binary action vector at time t
r t Immediate reward at time t
J ( π ) Expected discounted return associated with policy π
π θ ( a s ) Stochastic policy of the SAC agent with parameters θ
Q ϕ 1 ( s , a ) , Q ϕ 2 ( s , a ) Soft Q-functions of the SAC agent
θ Parameter vector of the policy network
J π ( θ ) Policy objective used to update θ
O t Active natural language instruction at time t
T 1 , T 2 , T 3 Priority tiers of clauses (hard prohibitions/conditional or safety-relevant rules/soft preferences)
δ t a c t Actionable indicator
r a c t ( W ) Actionable rate over window W
c k ( t ) Per-step satisfaction score of clause k
C k ( W ) Window-level satisfaction score of clause k over window W
v k ( t ) Violation indicator for the prohibition clause
n k ( t ) Switching count and budget for clause k
d k ( t ) Deviation metric for the performance clause
χ k ( t ) Trigger indicator for the conditional clause
C ( W ) Tier-wise compliance vector
C T i ( W ) Tier-wise satisfaction score over window W
δ t f b Fallback indicator
r f b ( W ) Fallback rate over window W

Figure 1. Overall framework of the proposed natural language-driven dynamic DNR scheme.
Figure 2. LLM-based instruction supervision layer for natural language-driven dynamic DNR.
Figure 3. IEEE 16-bus system with distributed PV.
Figure 4. Time-varying operating profiles of the distribution system: (a) Electricity price (blue, left y-axis) and load demand (orange, right y-axis) over a typical 24 h cycle; (b) PV generation profile (blue) normalized to its rated capacity.
Figure 5. Training curves of the DRL agents in the IEEE 16-bus feeder.
Figure 6. Example of natural language instruction-aware topology adjustment on the IEEE 16-bus system.
Figure 7. IEEE 33-bus system with distributed PV.
Figure 8. Example of natural language instruction-aware topology adjustment on the IEEE 33-bus system.
Figure 9. OOD PV event design: forecast error and cloud-transient ramp.
Table 1. Parameters of IEEE 16-bus system with distributed PV.
Parameter | Value
Number of buses | 16
Number of branches | 16
Base voltage | 12.66 kV
Base power | 10 MW
Sectionalizing switch number | 13
Tie switch number | 4
Number of PV units | 6
Buses with PV units | 4, 6, 8, 11, 13, 15
Table 2. Parameters of the algorithm.
Parameter | Value
Actor Learning Rate | 1 × 10−4
Critic Learning Rate | 1 × 10−4
Alpha Learning Rate | 1 × 10−4
Episode Number | 2000
Batch Size | 64
Buffer Size | 10,000
Minimal Size | 200
Discount Factor γ | 0.99
Temperature Factor τ S A C | 0.05
Target Entropy | −1.0
LLM Model | Moonshot-v1-32k
Context Window | 32k
Top-p | 0.7
Temperature τ L L M | 0.3
Table 3. Performance of different algorithms on the IEEE 16-bus system.
Metric | Model-Based | Heuristic | AC | SAC
Daily purchasing cost (10³ $) | 18.72 | 19.21 | 20.05 | 18.79
Daily energy loss (MWh) | 9.50 | 10.65 | 11.95 | 9.64
Avg. voltage deviation (p.u.) | 0.0287 | 0.0412 | 0.0605 | 0.0293
Max. voltage deviation (p.u.) | 0.0450 | 0.0635 | 0.0803 | 0.0474
Daily switching operations | 10 | 12 | 14 | 12
Computation time per day (s) | 12.4 | 15.7 | 0.18 | 0.19
Table 4. Impact of instruction on the IEEE 16-bus system.
Metric | SAC | SAC and Rule Supervisor | SAC and LLM
Conventional system metrics:
Daily purchasing cost (10³ $) | 18.79 | 18.83 | 18.83
Daily energy loss (MWh) | 9.64 | 9.88 | 9.72
Load shedding (MWh) | 0.00 | 0.00 | 0.00
Branch overloading cost (10³ $) | 0.00 | 0.00 | 0.00
Avg. voltage deviation (p.u.) | 0.0293 | 0.0321 | 0.0295
Max. voltage deviation (p.u.) | 0.0474 | 0.0498 | 0.0478
Daily switching operations | 12 | 12 | 12
Computation time per day (s) | 0.09 | 0.26 | 0.30
Natural language instruction satisfaction metrics:
Actionable rate r a c t ( W 16 ) | - | 1 | 1
Violation hours of line (5,11) within W 16 (h) | 6 | 0 | 0
Tier-1 prohibition compliance C T 1 ( W 16 ) (%) | 14.3 | 100 | 100
Fallback rate r f b ( W 16 ) (%) | - | 0 | 0
Table 5. Performance of different algorithms on the IEEE 33-bus system.
Metric | Model-Based | AC | SAC
Daily purchasing cost (10³ $) | 2.74 | 2.82 | 2.76
Daily energy loss (MWh) | 4.80 | 5.20 | 4.86
Avg. voltage deviation (p.u.) | 0.024 | 0.027 | 0.025
Max. voltage deviation (p.u.) | 0.066 | 0.072 | 0.068
Daily switching operations | 10 | 9 | 8
Computation time per day (s) | 42.7 | 0.62 | 0.58
Table 6. Impact of instruction on the IEEE 33-bus system.
Metric | SAC | SAC and LLM
Conventional system metrics:
Daily purchasing cost (10³ $) | 2.76 | 2.80
Daily energy loss (MWh) | 4.86 | 4.98
Avg. voltage deviation (p.u.) | 0.025 | 0.024
Min. voltage at buses 29–33 (p.u.) | 0.936 | 0.953
Hours with V < 0.95 p.u. at buses 29–33 | 7 | 2
Daily switching operations | 8 | 9
Computation time per day (s) | 0.6 | 2.4
Natural language instruction satisfaction metrics:
Actionable rate r a c t ( W 33 ) | - | 1
Triggered hours within W 33 (overload or voltage violation) (h) | 1 | 1
Non-essential lateral switching hours within W 33 (h) | 2 | 0
Tier-2 conditional lateral compliance C T 2 ( W 33 ) (%) | 60 | 100
Tier-3 voltage-quality compliance C T 3 ( W 33 ) (%) | 72.0 | 80.0
Fallback rate r f b ( W 33 ) (%) | - | 0
Table 7. Robustness stress test of the SAC policy under OOD PV events on the IEEE 33-bus system.
Metric | Normal | PV Forecast Error | Cloud-Transient
Daily purchasing cost (10³ $) | 2.76 | 2.82 | 2.91
Daily energy loss (MWh) | 4.86 | 4.93 | 4.98
Avg. voltage deviation (p.u.) | 0.025 | 0.031 | 0.044
Max. voltage deviation (p.u.) | 0.068 | 0.082 | 0.095
Daily switching operations | 8 | 8 | 9
Computation time per day (s) | 0.58 | 0.66 | 0.61
Table 8. Daily impact and repeatability of instruction handling.
Metric | Repeatability Result (Mean ± Std)
Conventional system metrics:
Daily purchasing cost (10³ $) | 2.64 ± 0.12
Daily energy loss (MWh) | 4.80 ± 0.07
Avg. voltage deviation (p.u.) | 0.041 ± 0.008
Min. voltage at buses 29–33 (p.u.) | 0.953 ± 0.012
Hours with V < 0.95 p.u. at buses 29–33 | 2.0 ± 0.0
Daily switching operations | 6.3 ± 2.5
Computation time per day (s) | 2.404 ± 0.115
Natural language instruction satisfaction metrics:
Actionable rate r a c t ( W 33 ) | 1 ± 0.0
Triggered hours within W 33 (overload or voltage violation) (h) | 0 ± 0.0
Non-essential lateral switching hours within W 33 (h) | 0 ± 0.0
Tier-2 conditional lateral compliance C T 2 ( W 33 ) (%) | 100 ± 0.0
Tier-3 voltage-quality compliance C T 3 ( W 33 ) (%) | 91.7 ± 0.0
Fallback rate r f b ( W 33 ) (%) | 0 ± 0.0