An End-to-End Hierarchical Intelligent Inference Model for Collaborative Operation of Grid Switches

Zhao, Mingrui; Chen, Tie; Yuan, Jiaxin; Jiang, Yuting; Ren, Junlin

doi:10.3390/en18246574


by Mingrui Zhao ^1,2, Tie Chen ^1,2,*, Jiaxin Yuan ^3, Yuting Jiang ^1,2 and Junlin Ren ^1,2

^1 College of Electrical and New Energy, China Three Gorges University, Yichang 443002, China

^2 Hubei Provincial Key Laboratory for Operation and Control of Cascaded Hydropower Station, China Three Gorges University, Yichang 443002, China

^3 School of Electrical Engineering and Automation, Wuhan University, Wuhan 430072, China

^* Author to whom correspondence should be addressed.

Energies 2025, 18(24), 6574; https://doi.org/10.3390/en18246574

Submission received: 31 October 2025 / Revised: 2 December 2025 / Accepted: 10 December 2025 / Published: 16 December 2025


Abstract

To address the issue of heavy reliance on manual intervention in substation maintenance tasks, this paper proposes an end-to-end hierarchical intelligent inference method for collaborative operation of grid switches. The method constructs a unified knowledge environment that can simultaneously describe the operational characteristics of both the power grid and the substation, and combines Dueling Double Deep Q-Network (D3QN) with Multi-Task Dueling Double Deep Q-Network (MT-D3QN) algorithms for interactive training, achieving hierarchical inference. The upper layer uses bays as the base nodes to reflect the power flow, designing a reward and penalty function under N-1 power flow constraints and ring-current impact constraints, optimizing the load transfer plan for the power outages caused by maintenance tasks. The lower layer uses switches as the base nodes to reflect the main wiring status of the substation, introduces a multi-task learning mechanism for parallel training of bays with the same tasks, designs the reward and penalty function according to the five protection rules, and optimizes the switching operations within the bay. The experimental results show that the trained model can quickly deduce the switching operation sequence for different maintenance tasks.

Keywords: substation maintenance; transfer supply; switching operations; deep reinforcement learning

1. Introduction

In power grid operation, to ensure continuity of supply to users, improve operational efficiency, and reduce operational risks, it is essential to first transfer the load of equipment to be repaired to other operational equipment through load transfer operations, and then carry out the repair by disconnecting the equipment for maintenance. Intelligent grid operations [1,2,3,4] can automatically generate load transfer and maintenance operation plans, providing technical support for the smart operation of the grid. Both load transfer and maintenance involve the optimization of switch operation sequences. Load transfer identifies the optimal power supply path at the grid level, while maintenance focuses on optimizing the switch operation sequence within the substation. In practical applications, numerous independent operational tasks are generated, which are then translated into operational steps. This entire process is complex and inefficient, containing many redundant steps, and requires significant manual correction. Currently, no technology exists that can directly generate complete load transfer and maintenance operation plans. Research aimed at achieving end-to-end intelligent reasoning is of great significance for improving the operational and maintenance efficiency of power grids.

Although load transfer and switching operations essentially involve the same sequence of switch actions, there are significant differences in operational objectives and safety constraints, making it difficult to solve them simultaneously. Load transfer optimizes the sequence of switch operations in the grid topology based on safety constraints such as the N-1 power flow constraint and ring-current impact constraints [5,6,7,8]. These switches are combinations of switches in the substation’s main wiring. Substation switching operations, on the other hand, optimize the sequence of switch operations in the substation’s main wiring while ensuring operational safety. To address this issue, references [9,10] propose a hierarchical decomposition method for solving the problem, determining the grid topology based on power flow constraints and then optimizing the operation sequence based on this topology. However, this method requires simplifying the topology during the upper-level optimization according to the task, neglecting the power flow in some branches, which may lead to potential risks of overlimit conditions in certain branches. During the lower-level inference, both operational safety and local power flow constraints must be considered simultaneously, resulting in low solution efficiency in complex scenarios, and it does not incorporate the isolation switch requirements. This paper considers the relationship between the grid topology and the substation’s main wiring and constructs a two-layer grid model with bays as intermediate nodes. The upper layer optimizes the load transfer path, satisfying the N-1 power flow and ring-current impact constraints, and infers feasible bay operation steps; the lower layer focuses on the switches within the bays, ensuring operational safety while inferring switch operation steps.

Traditional optimization methods struggle to balance solution speed and solution quality. Methods such as multistage optimization [11], nonlinear programming [12], and dynamic programming [13] face significantly increased computational complexity when dealing with large-scale systems, often falling into the trap of combinatorial explosion. Heuristic algorithms [14,15] improve computational efficiency by introducing topological simplification rules that compress the solution space, but the quality of the solutions is highly dependent on the empirical soundness and completeness of these rules. Metaheuristic algorithms, such as genetic algorithms [16], tabu search [17], and particle swarm optimization [18], use iterative searches to approach the optimal solution and possess global search capabilities. However, these methods require careful parameter tuning, are prone to premature convergence, and may struggle to reach the global optimum.

Deep reinforcement learning (DRL) combines the feature extraction advantages of deep learning [19] with the dynamic decision-making capabilities of reinforcement learning [20], offering significant advantages when addressing decision-making tasks with high-dimensional state spaces and complex action sequences. It has been widely applied in fields such as energy management [21] and operational control [22] within power systems. Deep reinforcement learning methods, when optimizing multiple objects, consume substantial computational resources and face convergence difficulties, making it challenging to implement optimization strategies tailored to different objects [23]. Reference [24] proposed hierarchical reinforcement learning, which decomposes complex problems into different subproblems, thereby significantly reducing the solution dimensionality. However, due to the adoption of a synchronous training mechanism between the upper and lower layers, the system is strongly coupled and exhibits low optimization efficiency. Reference [25] assigns different objectives to the upper and lower layers; after the upper-layer training is completed, the learned knowledge can guide the lower-layer policy, achieving more efficient learning.

The state and action spaces of load transfer and switching operations differ significantly, making it difficult to solve them directly through hierarchical knowledge transfer. This paper considers constructing a unified knowledge reasoning environment that integrates the power grid topology with the substation primary configuration to enable hierarchical reasoning. Traditional adjacency matrices have low storage efficiency when handling large-scale sparse topologies and typically require multiple matrices to separately represent topological structures and attribute information, increasing storage and computational overhead [26]. Although graph neural network frameworks such as PyTorch-Geometric 2.0.3 can model complex power grid relationships, frequent reconstruction of nodes and edges is required during dynamic topology changes, and multi-attribute information needs additional encoding, resulting in low state-update efficiency [27]. The Neo4j graph database [28,29] possesses a natural structural similarity to power grids, supports multi-attribute storage for individual nodes, and offers excellent dynamic update and query capabilities. By dynamically representing the topology and power-flow characteristics of the grid and substations through entities, relationships, and attributes, a dynamically updatable load-transfer-switching knowledge base environment can be established, which automatically realizes the conversion between action space and state space for load transfer and switching, thereby supporting hierarchical reasoning.

Multi-task learning improves learning efficiency by sharing knowledge across various related tasks [30]. Since the composition and operational rules of switches are the same within a switch group, a multi-task learning mechanism is introduced in the lower-layer model to share policies among tasks, enabling the transfer and reuse of operational knowledge.

Based on the research approach described above, this paper proposes an end-to-end hierarchical intelligent reasoning model for coordinated switch operations in power grids. Through the collaborative optimization of the upper-layer and lower-layer models, the model infers the switch operation steps for load transfer and switching operations. The main innovations are as follows:

(1): This paper constructs a two-layer power grid graph model with bays as intermediate nodes: The upper layer uses bays as nodes to represent the power grid topology and power flow distribution, while the lower layer uses switchgear as nodes to construct the substation busbar layout, achieving the transformation from grid-level topology to substation-level busbar configuration, thus creating a unified knowledge reasoning environment.
(2): A new hierarchical reinforcement learning framework is designed, which automatically transforms the state and action spaces between the upper and lower layers, decoupling the complex constraints of load transfer and switching operations. The upper layer builds the power grid load transfer optimization model, while the lower layer generates the substation switching operation sequences, effectively reducing the problem-solving dimension.
(3): A multi-task learning mechanism is introduced, treating the switching operations of different bays as parallel sub-tasks. This enables secondary decomposition of the solution space and facilitates knowledge transfer and reuse across tasks, thereby improving the efficiency of model training.

2. Structure of the Hierarchical Optimization Model

Figure 1 illustrates the end-to-end “Load Transfer—Switch Operation” hierarchical intelligent reasoning model architecture proposed in this paper. The architecture consists of the Neo4j graph model, power flow calculation model, D3QN agent, and MT-D3QN agent. The Neo4j model represents the power grid topology and main busbar, embedding the power flow calculation module within it. The upper-layer D3QN agent takes the topology and power flow states output by the graph model as input to construct the state-action space for load transfer. Using a value function separation architecture, it evaluates and selects the optimal load transfer path, while also outputting the optimized load transfer bay and maintenance bay. The lower-layer MT-D3QN agent receives the optimization results from the upper-layer model, constructing a multi-task parallel operation state-action space within the bay to optimize the switch operation sequence for bay isolation. All tasks share the underlying network weights and optimize the switch operations within the bay through policy iteration. Finally, the generated switch operation sequence must undergo limit checks for node voltage deviation, line load, transformer load, and loop impact, as well as the five protection safety validations for switching operations. Once all safety requirements are confirmed to be met, the final operation sequence is output.

3. Neo4j Graph Model Based on Bay Topology Integration

The Neo4j graph model is built to map electrical topology and power flow data into model parameters. When a switch state changes, the parameters in the Neo4j graph model reflect the resulting change in power flow status.

The Neo4j graph model uses entities, relationships, and attributes to represent devices, topology, and power flow parameters, respectively.

Entities $E = \{E_{1}, E_{2}\}$ represent bay entities and the equipment entities contained within the bays. Bays are divided into switch bays and component bays. A switch bay consists of circuit breakers, isolators, and grounding switches, while a component bay includes transformers, busbars, and transmission lines. When searching with bay nodes as the object, the number of devices to be searched is significantly reduced, improving search efficiency.

Relationships $R = \{R_{1}, R_{2}, R_{3}\}$ indicate the connections between entities. $R_{1}$ represents the affiliation between bay entities and their internal equipment entities. $R_{2}$ and $R_{3}$ are used to describe the connections and disconnections between different equipment entities, and are determined by the switching status (open or closed) of the switches connecting the equipment entities.

Attributes $S = \{S_{1}, S_{2}, S_{3}, S_{4}, S_{5}\}$ describe the real-time status information of entities, including bay status, switch status, busbar voltage deviation, transformer load rate, line load rate, and closing current.

The changes in switch states lead to variations in the relationships between entities. The power flow is recalculated in real-time through the power flow calculation module, and the parameters in the Neo4j graph model are updated accordingly. The mapping relationship between the power flow calculation model and the graph model is shown in Figure 2.
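As a rough illustration of the entity, relationship, and attribute scheme, the following Python sketch mimics the graph in memory rather than in Neo4j. The class names, the `BELONGS_TO`, `CONNECTED`, and `DISCONNECTED` labels (standing in for $R_{1}$, $R_{2}$, and $R_{3}$), and the `set_switch` helper are assumptions made for illustration; the power flow recalculation is represented only by a callback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Entity:
    name: str
    kind: str  # "bay" or "equipment" (E1 / E2 in the text)
    attrs: Dict[str, object] = field(default_factory=dict)


class GridGraph:
    """Tiny in-memory stand-in for the graph model (illustrative only)."""

    def __init__(self, on_topology_change: Callable[["GridGraph"], None]):
        self.entities: Dict[str, Entity] = {}
        # (a, b) -> relationship label; hypothetical labels for R1, R2, R3
        self.relations: Dict[Tuple[str, str], str] = {}
        # switch name -> the pair of equipment entities it links
        self.switch_links: Dict[str, Tuple[str, str]] = {}
        self.on_topology_change = on_topology_change

    def add_entity(self, name, kind, **attrs):
        self.entities[name] = Entity(name, kind, dict(attrs))

    def add_membership(self, bay, equipment):
        self.relations[(bay, equipment)] = "BELONGS_TO"  # R1

    def add_switch_link(self, switch, a, b):
        self.switch_links[switch] = (a, b)
        self._refresh_link(switch)

    def _refresh_link(self, switch):
        closed = self.entities[switch].attrs["closed"]
        label = "CONNECTED" if closed else "DISCONNECTED"  # R2 / R3
        self.relations[self.switch_links[switch]] = label

    def set_switch(self, switch, closed):
        """Change a switch state, update the relationship, then notify the
        (stubbed) power flow module so attributes can be refreshed."""
        self.entities[switch].attrs["closed"] = closed
        self._refresh_link(switch)
        self.on_topology_change(self)
```

In a real deployment the relationship update would be a graph database write and the callback would rerun the power flow calculation; neither is shown here.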

The state of the bay is defined as either normal or abnormal. The normal state includes the operating state $σ_{0}$, hot standby state $σ_{1}$, cold standby state $σ_{2}$, and maintenance state $σ_{3}$. The abnormal state consists of the transition state $σ_{4}$, which occurs during a transition between neighboring normal states, and the degraded state $σ_{5}$, in which the grounding switch has not been opened after closing. The switch bay state in the Neo4j graph model is illustrated in Figure 3.
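The six bay states can be sketched as an enumeration with a classifier that derives the state from switch positions. The mapping rules below follow common substation practice and are an assumption; the source defines the states but does not spell out the exact switch combinations, and `classify_bay` is a hypothetical helper.

```python
from enum import Enum


class BayState(Enum):
    OPERATING = 0     # sigma_0
    HOT_STANDBY = 1   # sigma_1
    COLD_STANDBY = 2  # sigma_2
    MAINTENANCE = 3   # sigma_3
    TRANSITION = 4    # sigma_4
    DEGRADED = 5      # sigma_5


def classify_bay(breaker_closed, isolators_closed, grounding_closed):
    """Map switch positions to a bay state (assumed rules, for illustration).

    isolators_closed is a list of booleans, one per isolator in the bay.
    """
    all_iso = all(isolators_closed)
    no_iso = not any(isolators_closed)
    # Assumed reading of the degraded state: the grounding switch is still
    # closed while the breaker or any isolator is closed.
    if grounding_closed and (breaker_closed or not no_iso):
        return BayState.DEGRADED
    if grounding_closed:
        return BayState.MAINTENANCE
    if breaker_closed and all_iso:
        return BayState.OPERATING
    if not breaker_closed and all_iso:
        return BayState.HOT_STANDBY
    if not breaker_closed and no_iso:
        return BayState.COLD_STANDBY
    # Any remaining combination lies between two neighbouring normal states.
    return BayState.TRANSITION
```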

4. “Load Transfer—Switching Operation” Hierarchical Optimization Model

First, identify the load area for maintenance outage in the Neo4j graph model, and simultaneously search for the transfer supply space and maintenance space. The transfer supply space is based on the load area as the starting point, including all transfer supply channels capable of supplying power to the outage area. The maintenance space starts from the maintenance equipment, including all switches that can isolate the maintenance equipment for an outage. The transfer supply space excludes any switch bays that overlap with the maintenance space.

The action space of the load-transfer model consists of operable bays within the load-transfer space, and the agent’s reward function is constructed based on N-1 power-flow constraints and loop-closing inrush current constraints. The agent selects actions using an ε-greedy strategy. After each action is selected, the agent sends the bay operation command to the graph model, which immediately updates the network topology and calls the power-flow calculation module to evaluate the current state. The agent progressively optimizes its policy based on real-time power-flow feedback to infer the optimal load-transfer bay sequence.

The switching model takes the load-transfer bay sequence and maintenance bays as input. Using the graph model, the switches within each bay of the primary wiring are identified, converting bays into switching tasks. The action space of the switching model consists of operable switches within each bay, and the reward function is designed according to the “Five-Prevention” rules. The MT-D3QN architecture is employed to optimize multiple switching tasks with identical target states in parallel. Within the switching action space, the agent selects actions via an ε-greedy strategy, dynamically updating the primary wiring topology and bay states in the graph model, and adjusting its policy based on state feedback to generate the optimal switching operation sequence. Finally, the generated operation sequence undergoes power-flow and switching safety verification, and the final switching operation sequence is output only after all constraints are satisfied. The process is shown in Figure 4.

4.1. Transfer Supply Space and Maintenance Space

The search process is as follows:

(1): In the graph model, retrieve the detailed information of the maintenance equipment and input this information into the power flow calculation model. By simulating the scenario of isolating the maintenance equipment, perform a power flow analysis of the grid, calculating system parameters such as current, voltage, and power. Identify the outage load areas caused by the equipment shutdown, and feed the outage load area information back into the graph model.
(2): In the graph model, perform a path search from the outage load area to other areas, recording the switch bays along the path as the transfer supply space.
(3): In the graph model, use the maintenance equipment as the starting point to search for the power supply path connected to non-switching equipment, and record the switch bays along the path as the maintenance space. In this process, set $A$ as the collection of the bays in both the transfer-supply and maintenance spaces:

$A = \{a_{1}, a_{2}, \dots, a_{n}\}$

(1)

In the above formula, $a_{i}$ denotes the $i$-th switch bay and $n$ is the number of switch bays in the path.
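The search steps above can be approximated with a breadth-first traversal over an adjacency map. This is only a sketch: the adjacency dictionary stands in for graph database queries, and the `stop_nodes` argument (nodes at which the search stops expanding, such as non-switching equipment or neighbouring supply areas) is an assumed simplification of the termination rule.

```python
from collections import deque


def collect_switch_bays(adjacency, start, switch_bays, stop_nodes):
    """Breadth-first search from `start`, recording switch bays on the way.

    Nodes in `stop_nodes` are visited but not expanded further.
    """
    seen = {start}
    found = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            if nxt in switch_bays:
                found.append(nxt)
            if nxt not in stop_nodes:
                queue.append(nxt)
    return found


def build_bay_set(adjacency, outage_area, maintenance_equipment,
                  switch_bays, transfer_stops, maintenance_stops):
    """Return (transfer_space, maintenance_space, A) following steps (2)-(3)."""
    maintenance_space = collect_switch_bays(
        adjacency, maintenance_equipment, switch_bays, maintenance_stops)
    candidates = collect_switch_bays(
        adjacency, outage_area, switch_bays, transfer_stops)
    # Bays that also belong to the maintenance space are excluded.
    transfer_space = [b for b in candidates if b not in maintenance_space]
    return transfer_space, maintenance_space, transfer_space + maintenance_space
```

For a toy network in which transformer `T1` is under maintenance and feeds `load_1` through `bay_2`, the call returns `bay_3` as the only transfer bay when `bay_3` links `load_1` to another busbar.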

4.2. Transfer-Supply Reasoning Model Based on D3QN

4.2.1. D3QN Algorithm

The D3QN algorithm integrates Dueling DQN [31] and Double DQN [32]. Dueling DQN decomposes the Q-value estimation into two components: the state value function $V(s, ω, α)$ and the advantage function $A(s, a, ω, β)$, enabling the agent to more effectively assess the importance of each action in a given state. The Q-value is computed using the following equation:

$Q(s, a) = V(s, ω, α) + A(s, a, ω, β)$

(2)
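A minimal sketch of the aggregation in Eq. (2), using plain Python lists in place of network outputs. The second function shows the mean-subtracted variant used in the original Dueling DQN formulation; this paper's Eq. (2) does not state it, so it is included only for comparison.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) as written in Eq. (2)."""
    return [state_value + adv for adv in advantages]


def dueling_q_values_centered(state_value, advantages):
    """Variant that subtracts the mean advantage so that V and A are
    uniquely determined (not part of Eq. (2) as printed)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]
```

Both variants rank the actions identically in a given state, because they differ only by a constant that is shared across actions.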

Double DQN uses the online network to select actions and employs the target network to estimate the target Q-value associated with the chosen action, thereby reducing the overestimation bias. The Q-value is calculated using the following equation:

$Q_{online}(s_{t}, a_{t}) \leftarrow Q_{online}(s_{t}, a_{t}) + α [r + γ \max Q_{target}(s_{t+1}, \arg\max Q_{online}(s_{t+1}, a_{t+1})) - Q_{online}(s_{t}, a_{t})]$

(3)

where $Q_{online}(s_{t}, a_{t})$ represents the Q-value of the current state $s_{t}$ and action $a_{t}$ in the online network; $r$ represents the immediate reward at the current state; $γ$ represents the discount factor; $α$ represents the learning rate; $\max Q_{target}(s_{t+1}, a_{t+1})$ represents the maximum target Q-value of action $a_{t+1}$ that can be executed in the next state $s_{t+1}$, computed by the target network; and $\arg\max Q_{online}(s_{t+1}, a_{t+1})$ represents the action with the largest Q-value selected by the online network in the next state $s_{t+1}$.
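The separation between action selection and action evaluation in Eq. (3) can be sketched as follows. Q-values are passed as plain lists indexed by action; the `done` flag for terminal transitions is an added assumption that does not appear in the equation.

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """r + gamma * Q_target(s', argmax_a Q_online(s', a))."""
    if done:
        # Assumed convention: no bootstrapping from a terminal state.
        return reward
    # The online network chooses the action ...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ... and the target network evaluates it.
    return reward + gamma * q_target_next[best_action]


def q_update(q_sa, target, alpha):
    """Eq. (3) applied to a single stored Q-value."""
    return q_sa + alpha * (target - q_sa)
```

Note that the action with the highest target-network value is not necessarily the one evaluated; that decoupling is what reduces the overestimation bias.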

In the initial stage, D3QN uses the ε-greedy strategy to explore the optimal actions, and in the later stage, it relies on the model’s predicted optimal actions. Each time the agent interacts with the environment, it generates transition samples and stores them in the experience replay pool. During training, the model randomly samples data from the experience pool and updates the network parameters by computing the loss function to minimize the difference between the predicted and actual values. The learning framework of D3QN is shown in Figure 5.

The loss function $L_{D3QN}$ is given by the following formula:

$L_{D3QN} = E [(r_{t} + γ \max Q_{target}(s_{t+1}, \arg\max Q_{online}(s_{t+1}, a_{t+1})) - Q_{online}(s_{t}, a_{t}))^{2}]$

(4)
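A sketch of how Eq. (4) might be evaluated on a minibatch drawn from the experience replay pool. The `ReplayBuffer` class, the transition layout `(s, a, r, s_next)`, and the use of callables `q_online` and `q_target` that return a list of Q-values per state are all assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience pool; the oldest transitions are dropped."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def push(self, transition):
        self.data.append(transition)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.data), batch_size)


def d3qn_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over a batch of (s, a, r, s_next) tuples."""
    total = 0.0
    for s, a, r, s_next in batch:
        online_next = q_online(s_next)
        a_star = max(range(len(online_next)), key=online_next.__getitem__)
        y = r + gamma * q_target(s_next)[a_star]
        total += (y - q_online(s)[a]) ** 2
    return total / len(batch)
```

The expectation in Eq. (4) is approximated here by the batch average.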

4.2.2. Dynamic Adjustment Mechanism of Reward Weights

In Reinforcement Learning (RL), the design of rewards and the dynamic adjustment of reward weights are key factors that influence learning effectiveness and task execution efficiency. In complex task environments, automatically adjusting the weights of reward terms can help the system optimize its learning strategy based on actual conditions, improving learning efficiency and the stability of task execution.

The Q-value is composed of the weighted sum of multiple reward terms. Therefore, the impact of each reward term on the Q-value can be expressed through the gradient of the Q-value with respect to the reward term weight. Using the chain rule, the gradient of the loss function with respect to the reward weight can be calculated, as follows:

\frac{\partial L}{\partial w_{i}} = \frac{\partial L}{\partial Q} \cdot \frac{\partial Q}{\partial w_{i}}

(5)

Based on the calculated gradient, the reward weights are automatically updated using gradient descent:

w_{i}(t + 1) = w_{i}(t) - α \frac{\partial L}{\partial w_{i}}

(6)
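The chain rule in Eq. (5) can be made concrete under a simplifying assumption: if the Q-value is modeled directly as $Q = \sum_{i} w_{i} r_{i}$ and the loss as $(y - Q)^{2}$ for a target $y$, then $\partial L / \partial Q = -2 (y - Q)$ and $\partial Q / \partial w_{i} = r_{i}$. The sketch below uses that linear model, which is an assumption rather than the paper's network, and applies the update in the descent direction (the gradient is subtracted).

```python
def reward_weight_gradients(weights, reward_terms, target):
    """dL/dw_i = dL/dQ * dQ/dw_i for Q = sum(w_i * r_i), L = (target - Q)^2."""
    q = sum(w * r for w, r in zip(weights, reward_terms))
    dL_dq = -2.0 * (target - q)
    return [dL_dq * r for r in reward_terms]  # dQ/dw_i = r_i


def update_reward_weights(weights, reward_terms, target, lr):
    """One gradient-descent step on the reward weights."""
    grads = reward_weight_gradients(weights, reward_terms, target)
    return [w - lr * g for w, g in zip(weights, grads)]
```

With weights `[1.0, 1.0]`, reward terms `[1.0, 2.0]`, and target `5.0`, the gradients are `[-4.0, -8.0]`, so a descent step increases both weights and moves the weighted sum toward the target.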

4.2.3. Transfer-Supply State Space

The model uses the following as state space information for transfer supply optimization: load area outage state $Z_{outage}$, network topology $G_{topology}$, the closing impulse current $I_{M}$, closing steady-state current $I_{m}$, line load rate $L_{load}$, node voltage deviation rate $V_{node}$, transformer load rate $T_{load}$, and the working state of the switch bays $S_{d}$.

4.2.4. Reward and Penalty Function

This section will comprehensively consider the grid security constraints and operational efficiency, and design the reward and penalty function based on the state space information.

(1): Setting node voltage deviation penalty $P_{volt}$:

$P_{volt} = \begin{cases} - k_{z} \times \left| \frac{|V_{node}| - |V_{\max}|}{|V_{\max}|} \right|, & |V_{node}| > |V_{\max}| \\ 0, & \text{else} \end{cases}$

(7)

where $k_{z}$ is the deviation coefficient in the load transfer optimization model, used to adjust the sensitivity of the reward and penalty to deviations from the target limit. The parameter $k_{z}$ plays a key role in balancing rewards and penalties, thereby guiding the agent’s learning process. The maximum allowable node voltage deviation $|V_{\max}|$ is 5, and $V_{node}$ represents the actual voltage deviation at the node. When a node’s voltage deviation exceeds the limit, the agent receives a negative penalty, which increases with the magnitude of the deviation to prevent excessively high voltage deviations.

(2): Setting line overload penalty $P_{line}$:

$P_{line} = \begin{cases} - k_{z} \times \frac{|L_{load} - L_{\max}|}{L_{\max}}, & L_{load} > L_{\max} \\ 0, & \text{else} \end{cases}$

(8)

where $L_{\max}$ is the maximum line load rate of 100%, and $L_{load}$ represents the actual load rate of the line. When the line load rate exceeds the limit, a negative penalty is applied, which increases with the degree of overload to mitigate the risk of line overloading.

(3): Setting closing impulse current penalty $P_{close}$:

$P_{close} = \begin{cases} - k_{z} \times \left| \frac{|I_{M}| - |I_{act.Ⅰ}|}{|I_{act.Ⅰ}|} \right|, & I_{M} > I_{act.Ⅰ} \\ - k_{z} \times \left| \frac{|I_{m}| - |I_{act.Ⅲ}|}{|I_{act.Ⅲ}|} \right|, & I_{m} > I_{act.Ⅲ} \\ 0, & \text{else} \end{cases}$

(9)

where $I_{M}$ represents the actual closing impulse current of the system, $I_{m}$ represents the actual closing steady-state current, $I_{act.Ⅰ}$ denotes the setting value of the instantaneous current-breaking protection, and $I_{act.Ⅲ}$ denotes the setting value of the overcurrent protection. When either the impulse current or the steady-state current exceeds its corresponding protection setting, a negative penalty is applied to prevent excessive current surges during the closing operation.

(4): Setting transformer overload penalty $P_{transformer}$:

$P_{transformer} = \begin{cases} - k_{z} \times \frac{|T_{load} - T_{\max}|}{T_{\max}}, & T_{load} > T_{\max} \\ 0, & \text{else} \end{cases}$

(10)

where $T_{load}$ represents the actual loading rate of the transformer, and $T_{\max}$ is the maximum loading rate, set to 100%. When the transformer loading rate exceeds the limit, a negative penalty is applied, increasing with the degree of overload to prevent transformer overloading.

(5): Setting repeated action penalty $P_{bay}$:

$P_{bay} = \begin{cases} - k_{z} \times \frac{N_{R}}{N_{bays}}, & \text{repeated action} \\ 0, & \text{else} \end{cases}$

(11)

where $N_{R}$ represents the number of repeated actions for a bay, and $N_{bays}$ denotes the total number of bays in the action space. When a bay switches from a non-operating state to an operating state but is switched back to a non-operating state due to repeated actions, an invalid action is generated. A larger number of repetitions results in a greater penalty, preventing ineffective operations during the reasoning process.

(6): The primary objective of load-transfer optimization is to ensure that no load area is deenergized and that all safety constraints are satisfied. The secondary objective is to minimize the number of bay operations to improve system operating efficiency. The operation efficiency reward $R_{z}$ is defined as follows:

$R_{z} = - k_{z} \times \frac{M_{bays}}{N_{bays}} + C_{z}$

(12)

where $M_{bays}$ represents the number of operated bays, and $C_{z}$ is the initial reward for load transfer. The initial reward is set to 10 to ensure that the agent prioritizes the high reward associated with reaching the target state, thereby avoiding inefficient policies driven by small immediate rewards.
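The penalty and reward terms in Eqs. (7) to (12) share a common shape, which the following sketch expresses as small Python functions. Function and parameter names are hypothetical, and how the terms are combined into a single reward is not shown because it depends on the dynamically adjusted weights described in Section 4.2.2.

```python
def excess_penalty(value, limit, k):
    """Shared shape of Eqs. (7), (8), (10): -k * |value - limit| / limit
    when the limit is exceeded, otherwise 0. For Eq. (7), pass magnitudes."""
    return -k * abs(value - limit) / limit if value > limit else 0.0


def closing_current_penalty(i_impulse, i_steady, i_set_inst, i_set_oc, k):
    """Eq. (9): the impulse current case is tested first, then the
    steady-state current case."""
    if i_impulse > i_set_inst:
        return -k * abs(abs(i_impulse) - abs(i_set_inst)) / abs(i_set_inst)
    if i_steady > i_set_oc:
        return -k * abs(abs(i_steady) - abs(i_set_oc)) / abs(i_set_oc)
    return 0.0


def repeated_bay_penalty(n_repeated, n_bays, k):
    """Eq. (11): grows with the number of repeated actions on a bay."""
    return -k * n_repeated / n_bays if n_repeated > 0 else 0.0


def efficiency_reward(n_operated, n_bays, k, c_init=10.0):
    """Eq. (12): initial reward minus a cost per operated bay."""
    return -k * n_operated / n_bays + c_init
```

For example, a line loaded at 120% against a 100% limit with `k = 1.0` yields a penalty of magnitude 0.2, while any load at or below the limit yields zero.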

4.3. Switching Operation Reasoning Model Based on MT-D3QN

4.3.1. MT-D3QN Algorithm

Switching operations involve multiple tasks, but single-task reinforcement learning suffers from low training efficiency and limited knowledge reuse. This paper proposes the MT-D3QN algorithm to achieve collaborative reasoning for multiple switching tasks. MT-D3QN builds upon D3QN by embedding task feature encoding, which transforms task information into low-dimensional embedding vectors. These vectors are concatenated with state features to form the input, helping the model distinguish between different tasks. Given the maximum number of bays $B$ and the embedding dimension $d_{e}$, an embedding matrix $E \in ℝ^{B \times d_{e}}$ is constructed to represent the mapping between task IDs and embedding vectors. The specific form is as follows:

$E = [e_{0} e_{1} \dots e_{B-1}]$

(13)

Here, $e_{b} = E[b, :] \in ℝ^{d_{e}}$ is the embedding vector corresponding to task ID $b$, and $b \in \{0, 1, \dots, B-1\}$ is the task identifier, representing the index of the embedding vector corresponding to each task.
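The lookup $e_{b} = E[b, :]$ and its concatenation with the state features can be sketched with nested lists. The random initialization and the helper names are assumptions; in a trained model the rows of $E$ would be learned parameters.

```python
import random


def make_embedding_matrix(num_tasks, dim, seed=0):
    """B x d_e matrix of small random values (placeholder for learned rows)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_tasks)]


def network_input(state_features, task_id, embedding):
    """Concatenate the state features with the task embedding e_b = E[b, :]."""
    return list(state_features) + embedding[task_id]
```

With `B = 3` and `d_e = 4`, a two-element state vector produces a six-element network input whose last four entries are the row of `E` selected by the task ID.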

Due to significant differences in learning difficulty, safety priority, and convergence speed across different switching tasks, if a fixed weight distribution is used to allocate training resources, it may result in high-reward tasks excessively dominating the training or low-reward but critical tasks being marginalized. An adaptive weight mechanism based on cumulative task rewards is introduced to dynamically allocate training resources according to the real-time performance of each task, ensuring that all tasks can converge efficiently and that critical operations are not overlooked. The task weight is calculated as follows:

$w_{b} = \frac{e^{α \cdot R_{b}}}{\sum_{j=1}^{B} e^{α \cdot R_{j}}}$

(14)

Here, $w_{b}$ represents the weight of task ID $b$; $R_{b}$ represents the cumulative reward of the task, which is used to measure the task’s performance (the higher the cumulative reward, the greater the task’s importance); and $α$ represents the temperature parameter, which controls how sensitive the weight is to the reward.
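Eq. (14) is a softmax over the scaled cumulative rewards. A minimal sketch follows; subtracting the maximum before exponentiating is a standard numerical-stability step that leaves the result unchanged and is not part of the equation as printed.

```python
import math


def task_weights(cumulative_rewards, alpha):
    """Softmax of alpha * R_b across tasks, as in Eq. (14)."""
    scaled = [alpha * r for r in cumulative_rewards]
    m = max(scaled)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Equal cumulative rewards give equal weights, and a larger $α$ concentrates more weight on the task with the highest cumulative reward.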

An overall multi-task loss function $L_{MTD3QN}$ is introduced, which includes the losses of all tasks and a regularization term, as follows:

$L_{MTD3QN} = \sum_{b=1}^{B} w_{b} L_{D3QN}^{(b)} + λ_{reg} ‖θ‖_{2}^{2}$

(15)

Here, $L_{D3QN}^{(b)}$ represents the D3QN loss function for task ID $b$; $B$ represents the total number of tasks; $λ_{reg}$ is the regularization coefficient, which controls the complexity of the model; and $‖θ‖_{2}^{2}$ represents the squared L2 norm of the model parameters, which penalizes excessively large parameter values to prevent overfitting.
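The combination in Eq. (15) reduces to a weighted sum plus a squared-norm penalty. In this sketch the per-task losses and the flattened parameter list are passed in as plain numbers, which is an illustrative simplification.

```python
def multitask_loss(task_losses, weights, params, lambda_reg):
    """Weighted sum of per-task D3QN losses plus lambda * ||theta||_2^2."""
    weighted = sum(w * loss for w, loss in zip(weights, task_losses))
    l2_squared = sum(p * p for p in params)
    return weighted + lambda_reg * l2_squared
```

For per-task losses `[2.0, 4.0]`, weights `[0.5, 0.5]`, parameters `[1.0, 2.0]`, and `lambda_reg = 0.1`, the weighted loss is 3.0 and the regularization term is 0.5.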

The model is shown in Figure 6. It first receives the transfer-supply bay and maintenance bay lists output by the upper-level D3QN and splits them into two parallel sub-tasks: transfer supply and maintenance, assigning a unique ID to each task. Then, each task undergoes feature encoding, integrating information such as the number of switch bays, initial state, and target state, to generate low-dimensional embedding vectors, enabling the model to precisely distinguish the operational logic of different tasks. Next, a shared-parameter strategy is used to construct the network structure, where the parameters of the fully connected (FC) layers are shared to extract common logic for the “Five Preventions” rule and general knowledge, such as switch operation timing constraints, facilitating cross-task knowledge transfer. At the same time, independent output layer parameters are optimized to account for differences in the action spaces of each task, ensuring that actions align accurately with the task goals. Finally, a multi-task total loss function is constructed, integrating the D3QN losses of each task and the L2 regularization term. Weighted resource allocation balances training resources among tasks and prevents overfitting. The synergistic effect of task encoding and shared-specific parameters enables the model to accurately predict Q-values and select optimal actions, effectively accelerating convergence and improving training efficiency, thereby meeting the safety and efficiency requirements of multi-bay parallel switching operations.

4.3.2. Switching Operation Space

Based on the transfer bay set and maintenance bay set inferred by the upper-layer D3QN, we traverse the bays in both sets. Using the graph model, the switching devices contained in each bay can be identified, and all switches within the same bay are treated as a single action space. The objective of the switching operation optimization is to infer, within each action space, an action sequence that switches the transfer bays to the operating state and the maintenance bays to the maintenance state. Different action spaces correspond to different target states. Action spaces with identical target states can be regarded as similar tasks, from which the transfer switching space and the maintenance switching space can be constructed. Both the transfer-supply switching operation space and the maintenance switching operation space are uniformly represented as $T_{x}$:

$T_{x} = \sum_{i = 1}^{n} \{t_{i, 1}, t_{i, 2}, \dots, t_{i, m_{i}}\}$

(16)

Here, $x = 1$ represents the transfer-supply switching operation space, $x = 2$ represents the maintenance switching operation space, $t_{i, 1}, \dots, t_{i, m_{i}}$ represent the switch device numbers within the $i$-th bay, $m_{i}$ represents the number of devices in the $i$-th bay, and $n$ represents the total number of task bays.
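As a rough illustration of how the switching operation space could be assembled, the sketch below groups switch numbers by bay. The in-memory dictionary stands in for the graph-model query, and the bay-to-switch grouping is inferred from the switch numbering of the WB33 case (Table 3); both the helper name and the grouping are assumptions, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical stand-in for the graph-model query: bay name -> switch numbers.
# The grouping is an assumption inferred from the WB33 switch numbering.
BAY_SWITCHES: Dict[str, List[str]] = {
    "1052bay": ["1052", "10521", "10522", "105211", "105212"],
    "1054bay": ["1054", "10541", "10542", "105411", "105412"],
    "1037bay": ["1037", "10371", "10372", "103711", "103712"],
    "1031bay": ["1031", "10311", "10312", "103111", "103112"],
}


def build_operation_space(task_bays: List[str],
                          bay_switches: Dict[str, List[str]]) -> List[List[str]]:
    """Return one action space (list of switch numbers) per task bay."""
    return [list(bay_switches[bay]) for bay in task_bays]


transfer_space = build_operation_space(["1052bay", "1054bay"], BAY_SWITCHES)     # x = 1
maintenance_space = build_operation_space(["1037bay", "1031bay"], BAY_SWITCHES)  # x = 2

n = len(transfer_space)                    # total number of task bays
m = [len(bay) for bay in transfer_space]   # number of devices in each bay
```

Keeping one list per bay, rather than a single flat list, preserves the rule that all switches within the same bay form a single action space.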

4.3.3. State Space

The state space of the switching operation reasoning model is defined as follows:

$S = \{S_{d}, x_{1}, x_{2}, \dots, x_{m}\}$

(17)

Here, $S_{d}$ represents the operating state of the task bay, which takes one of six values: operating state $σ_{0}$, hot standby state $σ_{1}$, cold standby state $σ_{2}$, maintenance state $σ_{3}$, transition state $σ_{4}$, and degraded state $σ_{5}$; $x_{m}$ denotes the state of the switch device, which includes the type of the switch and its open or closed status; and $m$ represents the switch number of the current operation.
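One possible way to encode this state as a numeric vector is sketched below. The six bay states follow the definitions above; the switch-type categories and the flat integer layout are hypothetical choices made for illustration, since the section does not specify the encoding.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class BayState(IntEnum):
    OPERATING = 0      # sigma_0
    HOT_STANDBY = 1    # sigma_1
    COLD_STANDBY = 2   # sigma_2
    MAINTENANCE = 3    # sigma_3
    TRANSITION = 4     # sigma_4
    DEGRADED = 5       # sigma_5


class SwitchType(IntEnum):
    # Hypothetical device categories; the paper only says "type of the switch".
    BREAKER = 0
    DISCONNECTOR = 1
    EARTHING_SWITCH = 2


@dataclass(frozen=True)
class SwitchState:
    kind: SwitchType
    closed: bool


def encode_state(bay_state: BayState, switches: List[SwitchState]) -> List[int]:
    """Flatten S = {S_d, x_1, ..., x_m} into a list of integers."""
    vector = [int(bay_state)]
    for switch in switches:
        vector.extend([int(switch.kind), int(switch.closed)])
    return vector


example_state = encode_state(
    BayState.COLD_STANDBY,
    [SwitchState(SwitchType.BREAKER, False), SwitchState(SwitchType.DISCONNECTOR, False)],
)
```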

4.3.4. Reward and Penalty Function

The reward and penalty function is set to guide the model toward the correct state transition direction and strategy reasoning, preventing bays from incorrectly transitioning into a degraded state.

(1): Setting penalty $P_{switch}$ for repeated switch operation:

$P_{switch} = \begin{cases} - k_{d} \times \frac{N_{r}}{N_{d}}, & \text{repeated action} \\ 0, & \text{otherwise} \end{cases}$

(18)

where $k_{d}$ is the deviation coefficient of the switch operation optimization model, which determines the sensitivity of the rewards and penalties to deviations from the target limits. $N_{r}$ represents the number of repeated actions, and $N_{d}$ denotes the total number of controllable switches. The greater the number of repetitions, the larger the penalty, thereby preventing invalid switch operations during the reasoning process.
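The repeated-operation penalty of Equation (18) can be written as a short Python function. This is a sketch only: the function name is hypothetical, and the default $k_{d} = 5$ is taken from the value selected in the sensitivity analysis of Section 5.1.3.

```python
def switch_penalty(n_repeated: int, n_switches: int, k_d: float = 5.0) -> float:
    """Penalty for repeated switch operations, following Equation (18).

    n_repeated: number of repeated actions (N_r)
    n_switches: total number of controllable switches (N_d)
    """
    if n_switches <= 0:
        raise ValueError("n_switches must be positive")
    if n_repeated <= 0:
        return 0.0  # no repeated action, no penalty
    return -k_d * n_repeated / n_switches


penalty_two_repeats = switch_penalty(n_repeated=2, n_switches=10)
penalty_no_repeats = switch_penalty(n_repeated=0, n_switches=10)
```

With ten controllable switches, each additional repeated action lowers the reward by 0.5, so the penalty grows linearly with the number of repetitions.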

(2): The reward function is designed based on the proximity between the target state and the current device state. The state priorities are set in the following order: operating state $σ_{0}$, hot standby state $σ_{1}$, cold standby state $σ_{2}$, maintenance state $σ_{3}$, and degraded state $σ_{5}$. The state reward function $R_{2}$ is defined as follows:

$R_{2} = \begin{cases} - k_{d} \times \left| \frac{D(S_{d}) - D(S_{aim})}{D(S_{aim})} \right| + C_{d}, & S_{d} \in \{σ_{0}, σ_{1}, σ_{2}, σ_{3}, σ_{5}\} \\ 0, & S_{d} = σ_{4} \end{cases}$

(19)

where $D(S_{aim})$ represents the priority of the target state, and $D(S_{d})$ represents the priority of the current switch state. $C_{d}$ is the initial reward for the switch operation. Setting $C_{d}$ to 10 allows the agent to prioritize the high reward of the target state during decision-making, preventing it from being trapped in non-target state strategies due to small immediate rewards.
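A worked sketch of the state reward in Equation (19) follows. The numeric priority values are not given in the text, so the mapping below (1 to 5 in the stated priority order, starting at 1 to keep the denominator non-zero) is a hypothetical choice; $C_{d} = 10$ comes from the text and $k_{d} = 5$ from Section 5.1.3.

```python
# Hypothetical priority values D(.), following the stated order of states.
PRIORITY = {
    "operating": 1,      # sigma_0
    "hot_standby": 2,    # sigma_1
    "cold_standby": 3,   # sigma_2
    "maintenance": 4,    # sigma_3
    "degraded": 5,       # sigma_5
}


def state_reward(current: str, target: str, k_d: float = 5.0, c_d: float = 10.0) -> float:
    """State reward R_2: zero in the transition state, otherwise C_d minus a
    penalty proportional to the relative priority gap to the target state."""
    if current == "transition":  # sigma_4
        return 0.0
    d_cur, d_aim = PRIORITY[current], PRIORITY[target]
    return -k_d * abs((d_cur - d_aim) / d_aim) + c_d


reward_at_target = state_reward("maintenance", "maintenance")  # no gap: full C_d
reward_one_step = state_reward("cold_standby", "maintenance")  # -5 * 0.25 + 10
reward_transition = state_reward("transition", "maintenance")
```

Under these assumed priorities, the reward is largest when the bay reaches its target state and shrinks as the current state moves further from it.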

5. Case Study Analysis

To verify the effectiveness of the method, two case studies were constructed for validation.

5.1. Case Study 1: Maintenance Verification of the WB33 Busbar at the 35 kV Substation

This case study involves two 35 kV substations. Maintenance is performed on the WB33 busbar at Substation 1, and the transfer supply and switching operation schemes are inferred, as shown in Figure 7. The core electrical parameters are listed in Table 1. The main equipment of the substation is in the operating state, and the ring network switch bays between the load areas are in cold standby mode, with the grid operating in a radial configuration.

5.1.1. Experimental Environment and Algorithm Parameter Configuration

In this study, Python 3.9 was used as the primary development language. The Py2neo 2021.2.4 library was employed to establish connections with the Neo4j 5.3.0 graph database. The TensorFlow 2.0.0 open-source framework was utilized to construct a two-layer deep reinforcement learning architecture, consisting of the upper-layer transfer optimization network (D3QN) and the lower-layer switching operation optimization network (MT-D3QN). The hyperparameter settings for deep reinforcement learning are shown in Table 2.

5.1.2. Experimental Results and Performance Analysis

Figure 8 illustrates the cumulative reward curves for the three stages: transfer path optimization, transfer switching operation reasoning, and maintenance switching operation reasoning. The results demonstrate that by adopting the hierarchical framework, the dimensionality of the action space in each stage is significantly reduced, enabling the model to converge rapidly to the maximum reward within 200 iterations. This indicates that the proposed method possesses a highly efficient policy learning capability and can quickly adapt to the operational constraints of maintenance scenarios.

Figure 9 presents the loss function variation curves for each optimization stage. As the number of training iterations increases, the error between the target Q-value and the estimated Q-value decreases monotonically and eventually stabilizes, confirming the training stability and reliability of the proposed model.

Table 3 presents the optimization results for each stage of the WB33 busbar maintenance. Figure 7 compares the state of the WB33 busbar before and after the maintenance operation. The entire operation sequence contains no redundant actions and requires no manual intervention. After verification against on-site operation regulations, the sequence strictly complies with the “Five-Prevention” (anti-misoperation) rules, and the operational logic fully conforms to the power grid maintenance standards.

To analyze the impact of cross-bay transfer learning on training effectiveness, we selected the transfer-switching operation space and maintenance-switching operation space in the WB33 scenario for optimization. A comparative analysis was conducted between the training processes with and without the introduction of a multi-task learning mechanism. The analysis results are shown in Figure 10. Under the same operation space, the model with the multi-task learning mechanism accumulates rewards faster, and its training convergence speed is significantly better than that of the model without the multi-task learning mechanism.

5.1.3. Sensitivity Analysis of the Reward Function

To explore the impact of coefficient adjustments in the reward function on strategy behavior, we conducted a systematic sensitivity analysis of load transfer and switch operation optimization. The analysis focused on two trade-offs: the relationship between load transfer success rate (non-outage areas) and flow limit violation rate during load transfer training, and the relationship between switch operation success rate (bay switched to the target state) and switch operation violation rate (bay switched to the degraded state) during switch operation training. The experiment selected three equipment maintenance scenarios: WB33, T32, and QF1036, and systematically adjusted the load transfer deviation coefficient $k_{z}$ and switch operation deviation coefficient $k_{d}$, while keeping the initial reward constant, to observe how the agent balances rewards and penalties under different coefficients.

The analysis results are shown in Figure 11. As the deviation coefficients increase, the agent’s behavior exhibits a clear and expected pattern. In load transfer optimization, increasing the deviation coefficient reduces the flow limit violation rate, but at the cost of a decreased load transfer success rate. In switch operation optimization, increasing the deviation coefficient reduces the switch operation violation rate, but the switch operation success rate also decreases accordingly. This indicates that the reward function coefficients directly influence the strategy’s operational tendencies. It is worth noting that under different maintenance scenarios and coefficient settings, the strategy performs consistently, with the overall violation rate and success rate fluctuating within a limited range, demonstrating the robustness of the strategy. In this study, the deviation coefficients $k_{z} = 3$ and $k_{d} = 5$ were selected, as they maintain relatively high load transfer and switch operation success rates while keeping the flow limit violation rate and switch operation violation rate at low levels.

5.2. Case Study 2: Random Task Maintenance Test

To further verify the generalization ability and convergence performance of the proposed hierarchical reinforcement learning model under different maintenance tasks, a 220 kV power grid was selected as the test object, as shown in Figure 12. The relevant electrical parameters are listed in Table 4. Multiple types of tasks, including transformer, circuit breaker, and line maintenance, were randomly generated for testing. In the figure, all main equipment is in the operating state, the ring network switch bays between the feeders are in cold standby mode, and the grid operates in a radial configuration.

5.2.1. Performance Analysis

As shown in Figure 13, under random tasks, the average reward curves of the hierarchical reinforcement learning model across multiple training runs exhibit an overall upward trend. The shaded area represents the range of mean ± 1 standard deviation. It can be observed that as the number of training steps increases, the model maintains a stable convergence trend across different task switches. The gradual convergence of the standard deviation indicates that the model performs consistently under different random initializations, demonstrating good training stability. As training progresses, the average rewards for each task gradually increase and stabilize, indicating that the model can continuously learn effective strategies under task disturbances. These results show that the proposed method possesses strong robustness and policy adaptability across different main busbar configurations and multiple types of maintenance tasks.
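The mean ± 1 standard deviation band described above can be computed per training step across runs, as in the following sketch. The reward values are made-up placeholders, not data from the paper, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev
from typing import List, Tuple


def reward_band(runs: List[List[float]]) -> List[Tuple[float, float, float]]:
    """For each training step, return (mean - std, mean, mean + std) across runs."""
    band = []
    for step_rewards in zip(*runs):  # one tuple of rewards per training step
        m = mean(step_rewards)
        s = stdev(step_rewards)      # sample standard deviation (assumed)
        band.append((m - s, m, m + s))
    return band


# Hypothetical reward curves from three training runs with different seeds.
runs = [
    [1.0, 5.0, 9.0],
    [3.0, 5.0, 9.0],
    [2.0, 5.0, 9.0],
]
band = reward_band(runs)
```

A band that narrows over the training steps, as in this toy data, corresponds to the gradual convergence of the standard deviation noted in the text.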

5.2.2. Metrics and Comparison

The proposed method is compared with MILP, DQN, and D3QN through random tests involving X transformers, X lines, and X circuit breakers. The evaluation metrics include transfer success rate, maintenance success rate, power flow violation rate, switching misoperation rate, and inference time. The definitions of these metrics are as follows:

Transfer Success Rate: The proportion of load outages caused by equipment shutdown during maintenance that can be successfully restored via transfer paths.

Maintenance Success Rate: The proportion of maintenance tasks for which the model correctly identifies the electrical association of the maintenance equipment and generates an operation sequence that safely achieves isolation.

Power Flow Violation Rate: The proportion of loop current, node voltage deviation, transformer load, and line load exceeding their rated limits in the power flow calculation after executing the inferred operations.

Switching Misoperation Rate: The proportion of switching steps in the derived operation sequence that violate the “Five-Prevention” interlocking rules or cause unintended power interruptions.

Inference Time: The time required for the model to complete the reasoning for a single task.
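The rate-type metrics above are all proportions over a batch of test tasks. The sketch below shows the arithmetic for two of them on a hypothetical batch of 15 tasks; the record structure, field names, and batch size are assumptions for illustration and are not taken from the experiments.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskResult:
    transfer_ok: bool      # outage load restored via a transfer path
    maintenance_ok: bool   # equipment safely isolated by the inferred sequence


def rate(flags: Iterable[bool]) -> float:
    """Percentage of True entries in a sequence of per-task flags."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags)


# Hypothetical batch: 13 of 15 transfers succeed, all maintenance tasks succeed.
results = [TaskResult(transfer_ok=(i < 13), maintenance_ok=True) for i in range(15)]

transfer_success_rate = round(rate(r.transfer_ok for r in results), 2)
maintenance_success_rate = round(rate(r.maintenance_ok for r in results), 2)
```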

According to the baseline comparison results in Table 5, MILP achieves a success rate of 71.17% for both load transfer and maintenance tasks. However, due to the computational complexity of solving large-scale Mixed Integer Linear Programming problems, its average inference time reaches 1680.46 s, and in many complex scenarios, a solution cannot be obtained. Although DQN has higher inference efficiency (831.07 s), its task success rate is lower: 46.67% for load transfer, 66.67% for maintenance, and its flow limit violation rate and misoperation rate are 53.33% and 33.33%, respectively. D3QN shows an improvement in performance, with success rates for load transfer and maintenance reaching 73.33% and 86.67%, respectively, while its flow limit violation rate and misoperation rate decrease to 26.67% and 13.33%, with an average inference time reduced to 660.87 s. The hierarchical multi-task framework, D3QN + MT-D3QN, performs the best in terms of reliability and inference efficiency: the load transfer success rate increases to 86.67%, the maintenance success rate reaches 100%, the misoperation rate drops to 0, the flow limit violation rate is reduced to 13.33%, and the average inference time is further shortened to 309.47 s, fully demonstrating the advantages of the hierarchical multi-task strategy in complex switch operation tasks.

To further evaluate the optimality of the solutions provided by different algorithms, this study selected three maintenance scenarios corresponding to devices T33, L1, and QF123 for comparison experiments. Table 6 lists the sequence inference results for different algorithms in each scenario, while Table 7 provides the evaluation data for each algorithm, including the average absolute voltage deviation, average line load rate, maximum line load rate, average transformer load rate, maximum transformer load rate, total number of switch operations, bay degraded-state rate, and loop impact limit violation rate, which are used for the quantitative evaluation of algorithm performance. All metrics were standardized (with smaller values indicating better performance), and the standardized results are visualized in Figure 14.
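The standardization method is not specified in the text. As one plausible reading, the sketch below applies min-max scaling per metric, so that the best (smallest) value maps to 0 and the worst to 1; this choice is an assumption, and the input values are the voltage deviation figures for the T33 scenario in Table 7.

```python
from typing import Dict


def min_max_normalize(values: Dict[str, float]) -> Dict[str, float]:
    """Scale one metric across algorithms to [0, 1]; smaller remains better."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:  # identical values (e.g., all zeros) carry no ranking information
        return {name: 0.0 for name in values}
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}


# Voltage deviation (MAE) for the T33 maintenance scenario, from Table 7.
voltage_deviation = {
    "MILP": 0.823,
    "DQN": 0.915,
    "D3QN": 0.847,
    "D3QN + MT-D3QN": 0.798,
}
normalized = min_max_normalize(voltage_deviation)
```

Other schemes, such as z-score standardization, would preserve the same ordering of algorithms within each metric.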

From the figure, it can be observed that in the three maintenance scenarios, D3QN + MT-D3QN achieves the smallest values in most metrics, demonstrating the most conservative and stable performance. D3QN exceeds D3QN + MT-D3QN in transformer load and maximum line load rate but still outperforms MILP and DQN overall. MILP is close to D3QN in transformer and line load metrics but slightly higher in voltage deviation and switch operation count. DQN generally has higher values across all metrics. A comprehensive analysis indicates that, in complex maintenance scenarios, the proposed algorithm outperforms other algorithms, yielding better inference results.

6. Conclusions

This paper proposes an end-to-end hierarchical intelligent reasoning model for coordinated switching operations in power grids. By treating bays as intermediate nodes, the model simplifies the power system topology and constructs a graph model based on Neo4j that is applicable to different main busbar configurations, while embedding a power flow calculation module. The graph model integrates the D3QN and MT-D3QN algorithms to form a hierarchical reinforcement learning framework. In the upper layer, bays serve as the lowest-level nodes; the D3QN model leverages power flow calculations and network topology changes within the graph to ensure that supply paths comply with the N-1 security criterion and to optimize transfer paths. In the lower layer, switching devices are treated as the lowest-level nodes; the MT-D3QN algorithm optimizes the switching operation sequence under the constraints of the Five-Prevention rules. Through task interaction between the upper and lower layers, the model achieves collaborative optimization of transfer and switching operations.

Author Contributions

Methodology, T.C. and M.Z.; Software, J.Y.; Validation, Y.J. and J.R.; Writing—original draft, M.Z.; writing—review and editing, T.C. and J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (51907104) and the Opening Fund of Hubei Province Key Laboratory of Operation and Control of Cascade Hydropower Station (2019KJX08).

Data Availability Statement

The original contributions presented in the study are included in the article and further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

Neo4j	Graph database
DQN	Deep Q-Network
D3QN	Dueling Double Deep Q-Network
MT-D3QN	Multi-Task Dueling Double Deep Q-Network
MILP	Mixed-Integer Linear Programming

Figure 1. Hierarchical Optimization Model Structure.

Figure 2. Neo4j Graph Model Based on Bay Topology Integration.

Figure 3. Switch Bay Status in the Neo4j Graph Model.

Figure 4. Model Optimization Process.

Figure 5. D3QN Learning Framework.

Figure 6. MT-D3QN Learning Framework.

Figure 7. Main Busbar Diagram of a 35 kV Substation in a Power Grid. (a) WB33 busbar maintenance preoperative; (b) WB33 busbar maintenance postoperative.

Figure 8. Cumulative Reward Curves at Each Optimization Stage. (a) transfer path; (b) transfer switching operation; (c) maintenance switching operation.

Figure 9. Loss Function Variation Curves at Each Optimization Stage. (a) transfer path; (b) transfer switching operation; (c) maintenance switching operation.

Figure 10. Average reward variation curves under different training mechanisms. (a) Transfer-switching operation optimization; (b) Maintenance-switching operation optimization.

Figure 11. Sensitivity Analysis of Load Transfer Optimization and Switch Operation Optimization Stages for Three Scenarios. (a) Load Transfer Optimization; (b) Switch Operation Optimization.

Figure 12. Main Busbar Diagram of a 220 kV Substation in a Power Grid.

Figure 13. Average Reward Variation Curves for Random Maintenance Tasks. (a) Transformer Maintenance; (b) Line Maintenance; (c) Circuit Breaker Maintenance.

Figure 14. Visualization of Evaluation Results for Each Algorithm in Three Scenarios. (a) T33 Maintenance; (b) L1 Maintenance; (c) QF123 Maintenance.

Table 1. Substation Data.

Substation Number	Main Transformer	Voltage Ratio/(kV/kV)	Capacity/MVA	Area Number	Area Load/MW
Substation 1	T31	35/10	8	Area1	4.2
Substation 1	T32	35/10	8	Area2	3.5
Substation 2	T33	35/10	8	Area3	3.9
Substation 2	T34	35/10	8	Area4	3.0

Table 2. Hyperparameter Settings.

Parameter	D3QN	MT-D3QN
Network Architecture	256 × 1024 × 512	128 × 1024 × 512
Experience Replay Capacity	10,000	10,000
Batch Size	64	128
Learning Rate	0.005	0.009
Exploration Rate	0.9	0.9
Discount Factor	0.9	0.95
Minimum Exploration Rate	0.01	0.01
Update Frequency	50	50

Table 3. Optimization Results of the WB33 Busbar Maintenance Scenario.

Scenario	Optimization Stage	Result
WB33 Busbar Maintenance	Transfer Space	1052bay, 1054bay
	Maintenance Space	1037bay, 1031bay
	Transfer Switching Space	1052, 10521, 10522, 105211, 105212, 1054, 10541, 10542, 105411, 105412
	Maintenance Switching Space	1037, 10371, 10372, 103711, 103712, 1031, 10311, 10312, 103111, 103112
	Switching Operation Sequence	Close 10522, 10521, 1052, 10541, 10542, 1054; Open 1031, 1037, 10312, 10372, 10311, 10371; Close 103111, 103711, 103112, 103712

Table 4. Substation Data.

Substation Number	Main Transformer	Voltage Ratio/(kV/kV)	Capacity/MVA	Area Number	Area Load/MW
Substation 1	T11	220/110	70	Area1 Area2	14.00 15.40
	T12	110/10	31.5
	T13	110/10	31.5
Substation 2	T21	220/110	70	Area3 Area4	12.80 12.60
	T22	110/10	31.5
	T23	110/10	31.5
Substation 3	T31	220/110	70	Area5 Area6	15.10 15.70
	T32	110/10	31.5
	T33	110/10	31.5

Table 5. Baseline Comparison of Random Experiments.

Algorithm	Transfer Success Rate/%	Maintenance Success Rate/%	Power Flow Violation Rate/%	Switching Misoperation Rate/%	Average Reward	Average Inference Time/s
MILP	71.17	71.17	/	/	/	1680.46
DQN	46.67	66.67	53.33	33.33	9.49	831.07
D3QN	73.33	86.67	26.67	13.33	12.03	660.87
D3QN + MT-D3QN	86.67	100	13.33	0	13.71	309.47

Table 6. Sequence Inference Results of Each Algorithm in Different Scenarios.

Scenario	Algorithm	Inference Results
T33 Maintenance	DQN	Close 1043, 1047, 10431, 10471, 10432, 10472; Open 1032, 138, 10321, 1381, 10322, 1382; Close 103211, 13811, 103212, 13812;
	D3QN	Close 1043, 1046, 10431, 10461, 10432, 10462; Open 1032, 138, 10321, 1381, 10322, 1382; Close 103211, 13811, 103212, 13812;
	MILP	Close 1045, 1042, 1046, 10451, 10422, 10461, 10452, 10421, 10462; Open 1032, 138, 10321, 1381, 10322, 1382; Close 103211, 13811, 103212, 13812;
	D3QN + MT-D3QN	Close 1050, 1046, 10501, 10461, 10502, 10462; Open 1032, 138, 10321, 1381, 10322, 1382; Close 103211, 13811, 103212, 13812;
L1 Maintenance	DQN	Close 1043, 1049, 10431, 10491, 10432, 10492; Open 111, 213, 1111, 2131, 1112, 2132; Close 1113, 11111, 21311, 11112, 21312;
	D3QN	Close 1042, 1044, 10421, 10441, 10422, 10442; Open 111, 213, 1111, 2131, 1112, 2132; Close 11111, 21311, 11112, 21312;
	MILP	Close 1043, 1045, 1048, 10432, 10452, 10481, 10432, 10451, 10482; Open 111, 213, 1111, 2131, 1112, 2132; Close 1113, 11111, 21311, 11112, 21312;
	D3QN + MT-D3QN	Close 1045, 1044, 10451, 10441, 10452, 10442; Open 111, 213, 1111, 2131, 1112, 2132; Close 11111, 21311, 11112, 21312;
QF123 Maintenance	DQN	Close 1041, 1044, 10411, 10441, 10412, 10442; Open 123, 1231, 1233; Close 12311, 12312;
	D3QN	Close 1043, 1044, 1045, 10431, 10441, 10451, 10432, 10442, 10452; Open 123, 1231, 1233; Close 12311, 12312;
	MILP	Close 1043, 1048, 1044, 10431, 10481, 10441, 10432, 10482, 10442; Open 123, 1231, 1233; Close 12311, 12312;
	D3QN + MT-D3QN	Close 1047, 1045, 10471, 10451, 10472, 10452; Open 123, 1231, 1233; Close 12311, 12312;

Table 7. Quantitative Evaluation of Each Algorithm in Different Scenarios.

Scenario	Algorithm	Voltage Deviation (MAE)	Average Line Load Rate (%)	Line Max Load (%)	Average Transformer Load Rate (%)	Transformer Max Load (%)	Total Switch Operation	Degraded State Ratio (%)	Loop Impact Limit Violation Rate (%)
T33 Maintenance	MILP	0.823	40.392	41.378	64.113	78.123	18	0	0
	DQN	0.915	45.426	48.435	67.517	86.752	15	0	0
	D3QN	0.847	40.1	40.241	65.376	80.537	15	0	0
	D3QN + MT-D3QN	0.798	37.142	38.232	62.896	72.154	15	0	0
L1 Maintenance	MILP	1.642	42.392	46.978	64.113	83.123	20	0	0
	DQN	1.897	47.426	49.456	69.517	88.723	20	25	50
	D3QN	1.786	43.213	47.251	65.376	85.557	18	0	0
	D3QN + MT-D3QN	1.498	39.542	42.331	61.853	77.256	18	0	0
QF123 Maintenance	MILP	1.598	17.026	46.978	63.113	83.123	20	0	0
	DQN	1.783	17.047	49.456	71.517	90.723	20	0	50
	D3QN	1.598	17.028	47.251	65.376	85.557	18	0	0
	D3QN + MT-D3QN	1.498	17.013	42.331	62.896	77.762	18	0	0

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
