Power System Operation Mode Calculation Based on Improved Deep Reinforcement Learning

Yu, Ziyang; Zhou, Bowen; Yang, Dongsheng; Wu, Weirong; Lv, Chen; Cui, Yong

doi:10.3390/math12010134

Open AccessArticle

Power System Operation Mode Calculation Based on Improved Deep Reinforcement Learning

by

Ziyang Yu

¹,

Bowen Zhou

^1,*

,

Dongsheng Yang

¹,

Weirong Wu

¹,

Chen Lv

² and

Yong Cui

³

¹

College of Information Science and Engineering, Northeastern University, Shenyang 110004, China

²

China Electric Power Research Institute, Beijing 100192, China

³

State Grid Shanghai Municipal Electric Power Company, Shanghai 201507, China

^*

Author to whom correspondence should be addressed.

Mathematics 2024, 12(1), 134; https://doi.org/10.3390/math12010134

Submission received: 30 November 2023 / Revised: 29 December 2023 / Accepted: 29 December 2023 / Published: 30 December 2023

(This article belongs to the Section E2: Control Theory and Mechanics)

Download

Browse Figures

Versions Notes

Abstract

:

Power system operation mode calculation (OMC) is the basis for unit commitment, scheduling arrangement, and stability analyses. In dispatch centers at all levels, OMC is usually realized by manually adjusting the parameters of power system components. In a new-type power system scenario, a large number of new energy sources lead to a significant increase in the complexity and uncertainty of a system structure, thus further increasing the workload and difficulty of manual adjustment. Therefore, improving efficiency and quality is of particular importance for power system OMC. This paper first considers generator power adjustment and line switching, and it then models the power flow adjustment process in OMC as a Markov decision process. Afterward, an improved deep Q-network (improved DQN) method is proposed for OMC. A state space, action space, and reward function that conform to the rules of the power system are designed. In addition, the action mapping strategy for generator power adjustment is improved to reduce the number of action adjustments and to speed up the network training process. Finally, 14 load levels under normal and N-1 fault conditions are designed. The experimental results on an IEEE-118 bus system show that the proposed method can effectively generate the operation mode under a given load level, and that it has good robustness.

Keywords:

deep reinforcement learning; DQN; operation mode calculation; power flow convergence; power system

MSC:

68T07; 68T20

1. Introduction

Operation mode calculation (OMC) can provide the operation boundaries for a power system, which is the overall guidance scheme for ensuring the safe and stable operation of a power system. It is also a theoretical basis for dispatchers to evaluate the real-time operation status of a power system [1]. Currently, power system OMC is mainly obtained manually. Large power systems are usually divided into small partitions according to geographical regions. Firstly, dispatchers adjust the operation mode of the electrical devices to achieve a target power flow state convergence in each partition. Finally, according to the preset boundary conditions, all the partitions are combined to realize the power flow convergence of the whole power system [2]. Specifically, the purpose of OMC is to balance the generation and load consumption of the whole power system. Factors such as seasonal changes, equipment maintenance, and emergencies will lead to different load levels. Therefore, it is necessary to formulate operation modes under different load levels to guide generation. The goal of OMC is to ensure the safe and stable operation of the power system.

With large-scale new energy and long-distance, large-capacity transmission integration in power systems, the scale and complexity of power systems continues to increase, and power system OMC thus faces severe challenges [3]. Differences in the models and scopes of applications among power system data make power flow convergence difficult and thus require extensive manual intervention. To solve such problems, Jabr [4] proposed a polyhedral formulation and loop elimination constraint method for piecewise linear programming. Vaishya et al. [5] proposed a method to find a feasible solution to a power flow equation based on piecewise linear programming. However, when the number of buses in a power system is too large, the piecewise linear programming algorithm will lead to non-convergence in the power flow calculation. Farivar et al. [6] constructed a convex relaxation model suitable for power flow calculation. Fan et al. [7] extended the expression of variable space to polynomial functions and established a linearized model of power flow for the optimal selection of variable space. With the large-scale construction of microgrids, it is difficult for traditional power flow calculation methods to consider the hierarchical control effects in microgrids [8]. Ren et al. [9] proposed a general power flow calculation model that can incorporate a hierarchical control scheme into power flow calculations. Guo et al. [10] proposed a method with strong adaptability for power flow nonlinearity in order to improve the calculation accuracy of data-driven power flows under the high penetration rate of renewable power distribution generation. Based on the Koopman operator theory, the nonlinear relationship in power flow calculations was transformed into linear mapping in high-dimensional state space, which could significantly improve calculation accuracy. Huang et al. [11] proposed a multi-area dynamic optimal power flow model that considers the collaborative optimization of energy reserves, intrinsic three-phase unbalanced networks, and dual-control time scales. Smadi et al. [12] modeled the power conversion process by setting up a state estimator and proposed an HVDC hybrid two-step state estimation model.

The above methods have made great contributions to OMC from different angles. However, they are all based on the improvement of the power flow calculation method and traditional power system mechanism modeling. The operation mode of a large power grid involves the calculation of high-dimensional nonlinear systems, and the data model for this still needs to be built. In actual OMC, manual decision making is also needed to adjust power flow, thereby resulting in low efficiency, large errors, and heavy dependence on expert experience.

The emergence of deep reinforcement learning provides an opportunity to solve the above problems, and related studies have also gradually applied it to power systems [13,14]. In microgrid power management, Zhang et al. [15] proposed a distributed consensus-based optimization method to train the policy function of supervised multi-agents, converted operating constraints into training gradients, and achieved the optimal power management of connected microgrids in power distribution systems. Ye et al. [16] proposed an interior-point policy optimization (IPO) method. The time–space-related features were extracted from the operating state of a microgrid to further enhance the generalization ability of IPO. However, OMC needs to consider equipment maintenance, limit conditions, N-1 failure, and other such situations, as well as providing different grid scheduling schemes. Differently from microgrids, taking the optimal decision is disastrous for the computation of large-scale power systems. Therefore, in the actual OMC of power grids, the dispatchers often give many feasible solutions first, and they then find the relative optimal solution through an evaluation system. In addition, a great number of important and interesting research results have been published in photovoltaic power forecasting [17], power generation control [18,19,20], reactive power optimization [21], frequency control [22,23], load control [24], and economic dispatches [25]. However, the above studies were all based on the data of one or several operation modes for power system component control and decision making, and there has been less research on OMC itself. The slow training speed, low decision-making efficiency, and poor adjustment accuracy limit the application of the deep reinforcement learning method in OMC.

The application of deep reinforcement learning in large-scale power system OMC needs to solve the following three problems: firstly, the system state space and action space are discrete. As such, the component parameters to be adjusted should all be discrete variables with only a few adjustment states. Secondly, the power system has thousands of buses. Thus, the state space and action space are high-dimensional. Finally, the model needs to have a certain generalization ability and applicability, as well as having the capacity to adapt to power systems with a different number of buses. Deep Q-network (DQN) has shown good performance in solving high-dimensional discrete state space and action space decision-making problems. At the same time, DQN uses neural networks to fit the Q value function; as such, they have strong applicability to hyperparameters and can easily solve the above three problems. Another problem caused by high-dimensional decision space is the dramatic increase in the amount of computing. Therefore, we improved the DQN by redesigning the system state space, action space, reward function, and action mapping strategy to reduce the consumption of computing and speed up the network training process.

The contributions of this paper are summarized as follows:

(1): The power flow adjustment problem in OMC is expressed as a Markov decision process (MDP), in which the generator power adjustment and line switching in power systems are considered. The state space, action space, and reward function are designed to conform to the rules of a power system, and the minimum adjustment value of the generator output power is set as 5% of the upper limit;
(2): An improved deep Q-network (improved DQN) method is proposed and introduced to solve the MDP problem, which improves the action mapping strategy for generator power adjustment to reduce the number of adjustments and speed up the DQN training process;
(3): OMC experiments with a power system with eight basic load levels and six N-1 faults are designed; the simulation verification is realized in an IEEE-118 bus system and the robustness of the algorithm after generator fault disconnection is verified.

The rest of this paper is organized as follows: Section 2 presents the modeling process for the OMC problem; Section 3 presents the improved deep reinforcement learning algorithm and training process; Section 4 presents the simulation results and related discussions for an IEEE-118 bus system; and finally, Section 5 presents the conclusions and future work.

2. Problem Formulation

The power flow convergence adjustment process in OMC can be modeled with an MDP. The dispatchers can be regarded as agents that make decisions. By making decisions, they continuously fix the boundary conditions and adjust the operation status of the power system components to achieve a target power flow state. The power system simulator can be regarded as an environment. The agent interacts with the environment via a series of observations, actions, and immediate rewards. The Markov decision-making results are evaluated by cumulative rewards. The larger the cumulative rewards, the better the results [26].

2.1. Introduction of the MDP

An MDP is 5-tuple

(S, A, P_{r}, R, γ)

, where

S

is the system environment state space;

s_{t}

is the system state at time step t;

A

is the action space;

a_{t}

is the agent action at time step t;

P_{r}

is the transition probability;

P_{r} (s_{t + 1} | s_{t}, a_{t})

is the probability of transferring to state

s_{t + 1}

after taking action

a_{t}

in state

s_{t}

;

R

is the reward function;

r_{t}

is the reward value obtained after taking action

a_{t}

in state

s_{t}

; and

γ

is the discount factor (

0 \leq γ \leq 1

), which is used to weigh the impact of the immediate reward value and the future reward value on the decision-making process.

Q (s, a) = E [r_{t} + γ r_{t + 1} + γ^{2} r_{t + 2} + \dots | s_{t} = s, a_{t} = a, π], s \in S, a \in A .

(1)

Here, E is the average value of cumulative rewards; the larger

γ

means that the future reward value will have a greater impact on

Q (s, a)

;

γ = 1

means that the future reward value has the same impact on

Q (s, a)

as the immediate reward value;

γ = 0

means that only the immediate reward value will affect

Q (s, a)

; and

π

represents the action execution of the agent strategy, which is the mapping relationship between system state

s_{t}

and action

a_{t}

. The optimal strategy

μ^{*}

is calculated to maximize the Q value of action

a_{t}

at each moment. The formula is defined as

μ^{*} = \max Q_{μ} (s_{t}, a_{t}),

(2)

where

Q_{μ} (s_{t}, a_{t})

is the Q value of strategy

μ

after taking action

a_{t}

from state

s_{t}

.

2.2. Power Flow Convergence MDP Model

The key to transforming the power flow adjustment process in OMC into an MDP for a solution is to define the expressions of state space S, action space A, and reward function R, which are shown as follows.

(1): State space S: The state space at time step t is defined as

$s_{t} = [p_{1}, p_{2}, \dots, p_{m}, l_{1}, l_{2}, \dots, l_{n}, n_{1}, n_{2}, \dots n_{k}],$

(3)

where $p_{i}$ is the active power of the generator i; $l_{i}$ is the line status of the number i; m is the total number of adjustable generators; n is the total number of lines; and $n_{1}, n_{2}, \dots, n_{k}$ is a set of binary codes to represent the number of operating modes at different load levels. For example, k = 4 represents a total of 16 load levels; $n_{1}, n_{2}, n_{3}, n_{4} = 0000$ represents the operation mode of the first target load level, and $n_{1}, n_{2}, n_{3}, n_{4} = 1111$ represents the operation mode of the 16th target load level. In order to simplify the model, we discretized the value of the active generator power into 21 types. The maximum active power $p_{i \max}$ of generator i was taken as 1.0. Thus, the allowed value of $p_{i}$ was $[0, 0.05, 0.1, \dots, 1.0]$ . The line status is a binary variable representing the connection ( $l_{i} = 1$ ) and disconnection ( $l_{i} = 0$ ) of line i.
(2): Action space A: The action space A is discrete and associated with positive integers, defined as $A = {1, 2, \dots, m}$ , where m is the number of adjustable generators. As is shown in Figure 1 (taking the IEEE-118 bus system as an example), $A = {1, 2, \dots, 54}$ represents the respective serial numbers of the adjustable generators. $a_{t} = i$ represents the power adjustment of generator i at time step t. For different sizes of grids, t takes on different values. The action adjustment strategy of generator i is shown in Section 3.2.
(3): State transition: The state transition formula of the system is defined as

$s_{t + 1} = f (s_{t}, a_{t}) .$

(4)

The state transition between the system states $s_{t}$ and $s_{t + 1}$ at adjacent moments is determined by the transfer function f and the action $a_{t}$ . Since the power flow calculation function f remains unchanged, the state transition result is only determined by $a_{t}$ .
(4): Reward function R: Start by defining two indicators in the OMC power flow convergence problem as follows: (a) the convergence of power flow calculation, denoted by $c o n d . 1$ ; and (b) the output power of the slack bus generator, which does not exceed the limit and is denoted by $c o n d . 2$ . The reward function is thus defined as

$r_{t} = \{\begin{cases} 0, c o n d . 1 & c o n d . 2 \\ - 1, others \end{cases} .$

(5)

After executing

a_{t}

,

r_{t} = 0

means that the power flow calculation converges and the output power of the slack bus generator does not exceed the limit, otherwise it is −1. Therefore, the fewer the number of adjustment steps, the greater the cumulative reward.

3. The Improved DQN Model for OMC

3.1. Introduction of DQN

DQN is improved on the basis of Q-learning. Q-learning calculates the Q value of each action under strategy

μ

to form a Q value table, and it then selects the action with the largest Q value to execute. However, as the scale of the state space and action space increases, the scale of the Q value table will increase exponentially. Calculating the Q value of each action separately will lead to an excessively long algorithm running time. DQN combines Q-learning with neural networks and then uses neural networks to estimate the Q value function, which is shown as

Q (s, a, w) = \arg \max_{a} Q (s_{t}, a_{t}),

(6)

where w is the weight of the neural network and the Q value function is estimated by function approximation. DQN adopts the

ε - g r e e d y

strategy for action selection, and usually selects the action output with the largest Q value. However, there is a certain probability (

1 - ε

) of a random selection of actions for environment exploration. DQN introduces the online Q-network and the target Q-network. The DQN structure diagram for OMC is shown in Figure 2.

At the beginning of the training, the power system element parameters and target load level information are input into the online Q-network, and the parameters of the online Q-network and the target Q-network are set the same, i.e., as

θ

and

θ^{'}

, respectively. The online Q-network calculates the agent state

s_{t}

in real time and outputs action

a_{t}

; the target Q-network calculates the Q value of state

s_{t + 1}

at time t + 1 after taking action

a_{t}

; the loss function of (7) is used to calculate the gradient between the online Q-network and the target Q-network; and, in order to prevent overfitting, the online Q-network parameter is copied to target the Q-network at every C time step. The reply memory is responsible for storing the experience tuples

(s_{t}, a_{t}, r_{t}, s_{t + 1})

of the power system operation mode. When the reply memory is full, the experience tuples at the earliest moment are removed. Finally, the DQN network outputs the trained online Q-network parameter

θ

and a series of adjustment actions

a_{t}

to complete the OMC at the target load level.

L (θ) = r_{j} + γ Q (s_{j}, \arg \max_{a_{j}} (Q (s_{j}, a_{j}; θ_{j}^{'}) - Q (s_{j - 1}, a_{j - 1}; θ_{j - 1})))

(7)

3.2. Improved Mapping Strategy

The usual action-executing mapping strategy is to correspond action

a_{t}

with the state space S one by one. However, there are many generators in the power system, and each additional generator will increase the state space exponentially. When the generator power and load are unbalanced, it is obvious that the power flow of the grid will not converge and that all the previous states are “useless”. We need to quickly skip these states and focus on the subsequent adjustments for power flow convergence. As shown in Figure 3, in order to improve training efficiency, an improved mapping strategy was designed in this paper, which works as follows:

Set

P_{G}

as the sum of the current active power of the generators without a slack bus generator;

P_{L}

is the sum of the current active power of all loads;

P_{S lack_\max}

and

P_{S lack_\min}

are the maximum and minimum active power of the slack bus generator, respectively; K is the network power loss rate;

P_{i}

is the active power of generator i;

P_{i \max}

and

P_{i \min}

are the maximum and minimum active power of generator I, respectively; and the minimum adjustment threshold

λ

is

0.05 P_{i \max}

.

Strategy 1: When the power system is operating normally, it is only necessary to adjust the active power of the generator to match the current load level. Scenario 1: When

P_{G} + P_{S l a c k_\max} \leq P_{L} (1 + K)

,

a_{t} = i

, set

p_{i} = 0.5 P_{i \max}

if

p_{i} \geq 0.5 P_{i \max}

, then set

p_{i} = p_{i} + λ

until

p_{i} = P_{i \max}

. In this scenario, the total active power of the generators is insufficient, and the active power of the generators needs to be increased to meet the requirements of the power flow convergence. Scenario 2: When

P_{G} + P_{S l a c k_\min} \geq P_{L} (1 + K)

and

a_{t} = i

, set

p_{i} = 0.5 P_{i \max}

if

p_{i} \leq 0.5 P_{i \max}

, and set

p_{i} = p_{i} - λ

until

p_{i} = P_{i \min}

. In this scenario, the total active power of the generator is too large, and the active power of the generator needs to be reduced. Scenario 3: Except for Scenario 1 and Scenario 2, the active power of the generator matches the load level, and only the output of the slack bus generator needs to be adjusted.

Strategy 2: When an N-1 fault occurs on line i, set

l_{i} = 0

in the state space S. If the generator bus is disconnected by the fault line, the number of the disconnected generators is deleted in action space A, and then Strategy 1 is executed.

In the actual power flow adjustment process of a large power system, when the generator output is insufficient/surplus, then the estimated value of the difference is generally calculated while selecting the proportion of generator start-up in the corresponding area. In this paper, in order to simplify the model, this estimated value was set at

0.5 P_{i \max}

and the difference was then fine-tuned by power flow calculations that corresponded to the process of

p_{i} \pm 0.05 * P_{i \max}

.

3.3. Training of the Improved DQN

In this paper, an improved DQN method for intelligent power system OMC was established based on the original DQN [27]. At the beginning of training, the reinforcement learning parameters and the current power flow state are initialized. The power system operation mode information and target load level information were input into the improved DQN. In line 7, the algorithm has a

1 - ε

probability to input

s_{t}

into the estimated Q-network, and it uses the improved mapping strategy to determine the action

a_{t} = \arg \max_{a} (Q (s, a))

. Line 8 calculates the power flow state after each iteration by calling MATPOWER [28]. Furthermore, it saves the action and reward values, and then inputs them into reply memory D. MATPOWER is an m-file package based on MATLAB, which is used as power flow calculation simulation software. Lines 10–12 select the minibatch data from D, and they use (7) to calculate the gradient between the target Q-network and the online Q-network. In lines 14–15, if

c o n d . 1

and

c o n d . 2

are satisfied, it proves that the load level has been adjusted. Then, the power flow state is reset and the target load level is randomly initialized. Finally, the estimated Q-network parameters are output to complete the intelligent power system OMC. See Algorithm 1.

Algorithm 1: Training of the improved DQN method
Input: All target load levels.
Output: Trained parameters $θ$ .
1	Initialize reply memory D to capacity N.
2	Initialize online Q-network with random weight $θ$ , set target Q-network weight $θ^{'} = θ$ .
3	For episode = 1, M do
4	Initialize $s_{1}$ of power system with generator, load and bus data.
5	For t = 1, T do
6	With probability $ε$ select a random action $a_{t}$ .
7	Otherwise select $a_{t} = \arg \max_{a} Q (s_{t}, a; θ)$ .
8	Execute action $a_{t}$ in MATPOWER and observe reward $r_{t}$ and state $s_{t + 1}$ .
9	Store transition $a_{t} = \arg \max_{a} Q (s_{t}, a; θ)$ in D.
10	Sample random minibatch of transitions $(s_{j}, a_{j}, r_{j}, s_{j + 1})$ from D.
11	Set $y_{j} = \{\begin{cases} r_{j} if episode terminates at step j + 1 \\ r_{j} + γ \max_{a^{'}} Q (s_{j + 1}, a^{'}; θ^{'}) otherwise \end{cases}$ .
12	Perform a gradient descent step on ${(y_{j} - Q (s_{j}, a_{j}; θ))}^{2}$ with the parameters $θ$ .
13	Every C step, reset $θ^{'} = θ$ .
14	If $c o n d . 1 & c o n d . 2$
15	Reset the power flow state and randomly initialize the target load level.
16	End If
17	End For
18	End For

4. Experimental Verification and Analyses

4.1. Experimental Setup

In this paper, the proposed method was verified in an IEEE-118 bus system [29], which contains 54 generators (1 slack bus generator) and 186 lines. The IEEE-118 bus file (case118.m) in MATPOWER 7.1 (https://matpower.org/download/ (accessed on 10 October 2023)) was used as the initial data. Two experiments were considered.

Experiment 1 set eight load levels under normal line connections as the OMC target. The initial load level was 4242 MW, and the eight load levels were 1.0 (initial data), 0.6, 0.8, 1.2, 1.4, 1.6, 1.8, and 2.0 times the initial load level, respectively.

Experiment 2 simulated the situation when N-1 faults occur on the section lines. Three lines (line 23–24, line 23–25, and line 24–70) were respectively tripped off near the section lines of the system. Then, 1.0 (initial load level) and 1.4 times the load level after the lines were respectively tripped off were set as the OMC target.

The serial number of the slack bus generator was 30; we defined

η_{i}

to describe the generator turn-on ratio. The formula for this is as follows:

η_{i} = \frac{P_{i}}{P_{i \max}} \times 100 %

(8)

where

P_{i}

is the output active power of generator i and

P_{i \max}

is the maximum active power of generator i.

The network structure settings of the online Q-network and target Q-network are shown in Table 1. The size of the input layer was a 242 × 1 × 1 × minibatch. The size of system state space S was 242, which was composed of 54 generators, 186 lines, and 8 load levels (binary coding). The output layer dimension was 54 and represented the Q value corresponding to each possible generator action.

The hardware environment used for the experiments was an Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz and the GPU was an RTX 2060 6G (Legion Y7000P2020H laptop made in China). The software environment was the MATLAB and MATPOWER 7.1 package. The AC power flow model and Newton–Raphson method were used in the experiments. The remaining hyperparameters of the improved DQN were set as follows: a network loss rate of K = 0.05, a greedy coefficient of

ε = 0.95

, a learning rate of

α = 0.0005

, a reward attenuation factor of

γ = 0.9

, a minibatch of 64, a C of 100, a D of 5000, episodes of 3000, and a T of 2000.

4.2. Training Process

The DQN method [27], the IDRL method [30], and the improved DQN method proposed in this paper were used for network training, and the average cumulative reward value change curve is shown in Figure 4.

Figure 4a shows the average cumulative reward value change curve of Experiment 1. The cumulative reward values of DQN, IDRL, and the improved DQN reached convergence at 600 episodes, 540 episodes, and 470 episodes, respectively. The DQN method randomly selects 1 of the 21 states for state transition ([0, 0.05,…, 1.0]). Therefore, the convergence of the reward value is the slowest, the convergence of the reward values is the smallest, and the training effect is the worst. The IDRL method selects only one of the two states [0, 1] for state transition, which corresponds to the full on/full off operation of the generator; as such, the initial reward value produced the fastest accumulation speed, and the fluctuation after convergence was smaller than in the DQN. The improved DQN method performed state transition operations through the improved mapping strategy proposed in this paper. The convergence of the improved DQN training reward value was the fastest, the average cumulative reward value after convergence was the largest, and the fluctuation was the least. Figure 4b shows the average cumulative reward value change curve of Experiment 2. The convergence episodes of the DQN, IDRL, and improved DQN were 650, 630, and 540, respectively. The average cumulative reward value and training speed of the improved DQN were also the best. When combining the two experiments, it can be concluded that the proposed improved DQN has a better training effect.

When comparing Experiment 1 and Experiment 2, it can be seen that the episodes and average reward values of the three methods in Experiment 1 were larger than those in Experiment 2. Our explanation for this phenomenon is as follows: in Experiment 2, the key line of the IEEE-118 bus system section tripped, which could have resulted in the power flow convergence action being unable to achieve the original effect, thus resulting in a slower network training. At the same time, the number of power flow convergences in an episode was reduced, thus making the average cumulative reward value smaller.

4.3. Analysis of Experimental Results

Figure 5 shows the OMC results of Experiment 1. As shown in Figure 5a, 19 generators (including the slack bus generator) were running in the initial state, and the values of

η_{i}

were all less than 80%. There is no doubt that if the generators remained operating in their initial state, the power flow would not converge for the next 7 load levels. Figure 5b–h show the generator adjustment results for the power flow convergence under the remaining load levels. It can be seen that the number of operating generators rose from 17 at the 0.6 times load level to 53 at the 2.0 times load level. At the 0.6 times load level,

η_{i}

was less than 50%. At the 2.0 times load level,

η_{i}

was generally more than 50%, and 18 of the generators would have reached their upper limits.

Figure 6 shows the OMC results of Experiment 2. It can be seen that, when the N-1 faults occurred on the three key lines, there was almost no difference in the start-up of the generators under the 1.0 times load level, thus indicating that the initial load level of the IEEE-118 bus system was robust and that the safety and stability margin was set as very large. However, at the load level of 1.4 times, after tripping off line 23–24, line 23–25, and line 24–70, the number of operating generators was 42, 40, and 38, respectively. The reason for the phenomenon is that line 23–24 is the key section tie line of the IEEE-118 bus system [31]. After line 23–24 was tripped off, the power flow convergence of the system deteriorated, and more generators needed to be switched on to compensate for the power demand. Line 23–25 and line 24–70 are the key lines near the key sections; they have less impact on the system power flow convergence than line 23–24.

From Figure 5 and Figure 6, it can be concluded that the model proposed in this paper can better realize power flow convergence under different load levels, and it can complete OMC under normal conditions and N-1 faults.

Table 2 shows the comparison of the sum of generator output power (excluding the slack bus generator), the output power of the slack bus generator, the number m of the operating generator, and the network loss rate K of the 14 operating modes. It can be seen that the number of operating generators m is positively correlated with the load level. The maximum K value is 3.21% and the minimum K value is 1.97%, all of which are less than the K = 0.05 setting that was present during the improved DQN training. This shows that the hyperparameter setting was valid. The network loss rates of the 14 operating modes were all low, which proves that the proposed method had a good effect on the power flow adjustment. Furthermore,

P_{S l a c k}

approaches

P_{S l a c k_\min}

at load levels of less than 1.0 and approaches

P_{S l a c k_\max}

at load levels of more than 1.0. The reason for this is that when the load level is greater than 1.0, decreasing

P_{S l a c k}

requires additional actions to turn up the other generators. And, at load levels that are less than 1.0, increasing

P_{S l a c k}

requires additional actions to turn down the other generators. According to the improved mapping strategy and reward mechanism proposed in this paper, the improved DQN will always provide as few generator adjustment actions as possible in each episode to obtain a higher reward. Thus,

P_{S l a c k}

approaches the two limits as mentioned above.

Figure 7a shows the operating state of the generator under normal conditions and after the failure of the No. 37 generator under the IDRL method at a 0.6 times load level. The state of the IDRL method generator was only fully on and off. Therefore, it can be seen that the number of generators turned on was only 8. Except for the No. 30 slack bus generator, the output power of other generators was at the upper limit. As shown in the red dotted line box, when the No. 37 generator failed, the output power became 0. In order to compensate for its power, the

η_{30}

of the slack bus generator was 114.07%, which exceeded the

P_{Slack_\max}

. The power flow calculation did not converge, thus resulting in power system collapse.

Figure 7b shows the improved DQN method proposed in this paper under normal conditions and after the failure of the No. 37 generator at a 0.6 times load level. In normal conditions, 17 of the generators were turned on. The

η_{i}

of all the generators was less than 50%, which reduced the dependence on a single generator’s output power. Therefore, when the No. 37 generator failed, the slack bus generator could compensate for its power particularly well, and the

η_{30}

was only 62.35%. The power flow calculation was still convergent and the power system was still running normally, which verifies the robustness of the proposed improved DQN method.

5. Conclusions

In this paper, power system OMC was modeled as a Markov decision process. The state space, action space, and reward function were designed to conform to the rules of a power system. Following this, a method based on an improved deep reinforcement learning network was introduced to solve the MDP problem. In addition, the action for an improved mapping strategy was proposed to speed up the training process of the network. Finally, experimental verification was realized on an IEEE-118 bus system, and a total of 14 OMC experiments with different load levels under normal conditions and N-1 faults were designed. Experiments were conducted that verified that the proposed method can automatically adjust the output power of the generators and output the operation mode with power flow convergence. The robustness of the output operation mode was found to be better, and it effectively reduced the manual adjustment workload during OMC and its dependence on experts.

In future work, more electrical component action mapping strategies will be considered to realize the effective application of the proposed method in real power systems.

Author Contributions

Conceptualization, Z.Y. and B.Z.; methodology, Z.Y. and B.Z.; software, Z.Y.; validation, Z.Y. and W.W.; investigation, D.Y., C.L. and Y.C.; resources, B.Z.; data curation, Z.Y.; writing—original draft preparation, Z.Y.; writing—review and editing, D.Y., W.W. and B.Z.; supervision, C.L. and Y.C.; project administration, D.Y.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Project of State Grid: Research on artificial intelligence analysis technology of available transmission capacity (ATC) of key section under multiple power grid operation modes (5100-202255020A-1-1-ZN).

Data Availability Statement

We have described the data used and the acquisition methods in detail in Section 4.1.

Conflicts of Interest

The authors declare no conflicts of interest. There are no conflicts of interest with State Grid Shanghai Municipal Electric Power Company.

References

Zhang, H.T.; Sun, W.G.; Chen, Z.Y.; Meng, H.F.; Chen, G.R. Backwards square completion MPC solution for real-time economic dispatch in power networks. IET Control Theory Appl. 2019, 13, 2940–2947. [Google Scholar] [CrossRef]
Xu, H.T.; Yu, Z.H.; Zheng, Q.P.; Hou, J.X.; Wei, Y.W.; Zhang, Z.J. Deep Reinforcement Learning-Based Tie-Line Power Adjustment Method for Power System Operation State Calculation. IEEE Access 2019, 7, 156160–156174. [Google Scholar] [CrossRef]
Gong, C.; Cheng, R.; Ma, W. Available transfer capacity model and algorithm of the power grid with STATCOM installation. J. Comput. Methods Sci. Eng. 2021, 21, 185–196. [Google Scholar] [CrossRef]
Jabr, R.A. Polyhedral Formulations and Loop Elimination Constraints for Distribution Network Expansion Planning. IEEE Trans. Power Syst. 2013, 28, 1888–1897. [Google Scholar] [CrossRef]
Vaishya, S.R.; Sarkar, V. Accurate loss modelling in the DCOPF calculation for power markets via static piecewise linear loss approximation based upon line loading classification. Electr. Power Syst. Res. 2019, 170, 150–157. [Google Scholar] [CrossRef]
Farivar, M.; Low, S.H. Branch Flow Model: Relaxations and Convexification-Part I. IEEE Trans. Power Syst. 2013, 28, 2554–2564. [Google Scholar] [CrossRef]
Fan, Z.X.; Yang, Z.F.; Yu, J.; Xie, K.G.; Yang, G.F. Minimize Linearization Error of Power Flow Model Based on Optimal Selection of Variable Space. IEEE Trans. Power Syst. 2021, 36, 1130–1140. [Google Scholar] [CrossRef]
Shaaban, M.F.; Saber, A.; Ammar, M.E.; Zeineldin, H.H. A multi-objective planning approach for optimal DG allocation for droop based microgrids. Electr. Power Syst. Res. 2021, 200, 107474. [Google Scholar] [CrossRef]
Ren, L.Y.; Zhang, P. Generalized Microgrid Power Flow. IEEE Trans. Smart Grid 2018, 9, 3911–3913. [Google Scholar] [CrossRef]
Guo, L.; Zhang, Y.X.; Li, X.L.; Wang, Z.G.; Liu, Y.X.; Bai, L.Q.; Wang, C.S. Data-Driven Power Flow Calculation Method: A Lifting Dimension Linear Regression Approach. IEEE Trans. Power Syst. 2022, 37, 1798–1808. [Google Scholar] [CrossRef]
Huang, W.J.; Zheng, W.Y.; Hill, D.J. Distributionally Robust Optimal Power Flow in Multi-Microgrids with Decomposition and Guaranteed Convergence. IEEE Trans. Smart Grid 2021, 12, 43–55. [Google Scholar] [CrossRef]
Smadi, A.A.; Johnson, B.K.; Lei, H.T.; Aljahrine, A.A. A Lnified Hybrid State Estimation Approach for VSC HVDC lines Embedded in Ac Power Grid. In Proceedings of the IEEE-Power-and-Energy-Society General Meeting (PESGM), Orlando, FL, USA, 16–20 July 2023. [Google Scholar]
Khodayar, M.; Liu, G.Y.; Wang, J.H.; Khodayar, M.E. Deep learning in power systems research: A review. CSEE J. Power Energy Syst. 2021, 7, 209–220. [Google Scholar] [CrossRef]
Zhang, Z.D.; Zhang, D.X.; Qiu, R.C. Deep Reinforcement Learning for Power System Applications: An Overview. CSEE J. Power Energy Syst. 2020, 6, 213–225. [Google Scholar] [CrossRef]
Zhang, Q.Z.; Dehghanpour, K.; Wang, Z.Y.; Qiu, F.; Zhao, D.B. Multi-Agent Safe Policy Learning for Power Management of Networked Microgrids. IEEE Trans. Smart Grid 2021, 12, 1048–1062. [Google Scholar] [CrossRef]
Ye, Y.J.; Wang, H.R.; Chen, P.L.; Tang, Y.; Strbac, G.R. Safe Deep Reinforcement Learning for Microgrid Energy Management in Distribution Networks with Leveraged SpatialTemporal Perception. IEEE Trans. Smart Grid 2023, 14, 3759–3775. [Google Scholar] [CrossRef]
Massaoudi, M.; Chihi, I.; Abu-Rub, H.; Refaat, S.S.; Oueslati, F.S. Convergence of Photovoltaic Power Forecasting and Deep Learning: State-of-Art Review. IEEE Access 2021, 9, 136593–136615. [Google Scholar] [CrossRef]
Yin, L.F.; Zhao, L.L.; Yu, T.; Zhang, X.S. Deep Forest Reinforcement Learning for Preventive Strategy Considering Automatic Generation Control in Large-Scale Interconnected Power Systems. Appl. Sci. 2018, 8, 2185. [Google Scholar] [CrossRef]
Yin, L.F.; Yu, T.; Zhou, L. Design of a Novel Smart Generation Controller Based on Deep Q Learning for Large-Scale Interconnected Power System. J. Energy Eng. 2018, 144, 04018033. [Google Scholar] [CrossRef]
Xi, L.; Chen, J.F.; Huang, Y.H.; Xu, Y.C.; Liu, L.; Zhou, Y.M.; Li, Y.D. Smart generation control based on multi-agent reinforcement learning with the idea of the time tunnel. Energy 2018, 153, 977–987. [Google Scholar] [CrossRef]
Ali, M.; Mujeeb, A.; Ullah, H.; Zeb, S. Reactive Power Optimization Using Feed Forward Neural Deep Reinforcement Learning Method: (Deep Reinforcement Learning DQN algorithm). In Proceedings of the 2020 Asia Energy and Electrical Engineering Symposium (AEEES), Chengdu, China, 29–31 May 2020; pp. 497–501. [Google Scholar]
Yan, Z.M.; Xu, Y. Data-Driven Load Frequency Control for Stochastic Power Systems: A Deep Reinforcement Learning Method with Continuous Action Search. IEEE Trans. Power Syst. 2019, 34, 1653–1656. [Google Scholar] [CrossRef]
Yan, Z.M.; Xu, Y. A Multi-Agent Deep Reinforcement Learning Method for Cooperative Load Frequency Control of a Multi-Area Power System. IEEE Trans. Power Syst. 2020, 35, 4599–4608. [Google Scholar] [CrossRef]
Claessens, B.J.; Vrancx, P.; Ruelens, F. Convolutional Neural Networks for Automatic State-Time Feature Extraction in Reinforcement Learning Applied to Residential Load Control. IEEE Trans. Smart Grid 2018, 9, 3259–3269. [Google Scholar] [CrossRef]
Lin, L.; Guan, X.; Peng, Y.; Wang, N.; Maharjan, S.; Ohtsuki, T. Deep Reinforcement Learning for Economic Dispatch of Virtual Power Plant in Internet of Energy. IEEE Internet Things J. 2020, 7, 6288–6301. [Google Scholar] [CrossRef]
Nian, R.; Liu, J.F.; Huang, B. A review On reinforcement learning: Introduction and applications in industrial process control. Comput. Chem. Eng. 2020, 139, 106886. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
Zimmerman, R.D.; Murillo-Sánchez, C.E.; Thomas, R.J. MATPOWER: Steady-State Operations, Planning, and Analysis Tools for Power Systems Research and Education. IEEE Trans. Power Syst. 2011, 26, 12–19. [Google Scholar] [CrossRef]
IEEE 118-Bus System; Illinois Center for a Smarter Electric Grid (ICSEG): Champaign, IL, USA, 2022.
Xu, H.; Yu, Z.; Zheng, Q.; Hou, J.; Wei, Y. Improved deep reinforcement learning based convergence adjustment method for power flow calculation. In Proceedings of the 16th IET International Conference on AC and DC Power Transmission (ACDC 2020), Virtual, 2–3 July 2020; pp. 1898–1903. [Google Scholar]
Jie, W.; Ming, D.; Lei, S.; Liubing, W. An improved clustering algorithm for searching critical transmission sections. Electr. Power China 2022, 55, 86–94. [Google Scholar]

Figure 1. Action space diagram of the IEEE-118 bus system.

Figure 2. DQN structure for OMC.

Figure 3. Process description of the improved mapping strategy.

Figure 4. (a) The average cumulative reward value change curve of the three methods in Experiment 1; (b) the average cumulative reward value change curve of the three methods in Experiment 2.

Figure 5. Operation states of the generator in Experiment 1.

Figure 6. Operation states of the generator in Experiment 2.

Figure 7. Robustness comparison between the IDRL method and the improved DQN method. (a) Generator operation state before and after faults with the IDRL method. (b) Generator operation state before and after faults with the improved DQN method.

Table 1. The online Q-network and target Q-network structure parameters.

Layer	Input	Output	Activation
Conv1	242 × 1 × 1 × minibatch	128 × 128 × 128 × 32	ReLu
Conv2	128 × 128 × 128 × 32	256 × 256 × 256 × 64	ReLu
Conv3	256 × 256 × 256 × 64	128 × 128 × 128 × 128	ReLu
Conv4	128 × 128 × 128 × 128	64 × 64 × 64 × 54	ReLu
FC	64 × 64 × 64 × 54	54	Linear

Table 2. Comparison of the 14 operation modes.

	Load Level	Tripped-Off Line	$P_{G}$ (MW)	$P_{S l a c k}$ (MW)	m	K
Operation Mode 1	1.0	None	3861.0	513.9	19	3.03%
Operation Mode 2	0.6	None	2456.5	49.9	17	1.97%
Operation Mode 3	0.8	None	3413.1	87.3	18	3.05%
Operation Mode 4	1.2	None	4455.6	802.1	27	3.18%
Operation Mode 5	1.4	None	5252.8	797.4	38	2.83%
Operation Mode 6	1.6	None	6150.6	780.2	47	2.94%
Operation Mode 7	1.8	None	6959.9	802.3	53	2.79%
Operation Mode 8	2.0	None	7843.8	783.7	53	2.82%
Operation Mode 9	1.0	23–24	3861.0	514.2	19	3.04%
Operation Mode 10	1.4	23–24	5349.4	752.9	42	2.68%
Operation Mode 11	1.0	23–25	3861.0	521.8	19	3.21%
Operation Mode 12	1.4	23–25	5322.6	787.0	40	2.80%
Operation Mode 13	1.0	24–70	3861.0	513.8	19	3.04%
Operation Mode 14	1.4	24–70	5302.1	803.3	38	2.72%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yu, Z.; Zhou, B.; Yang, D.; Wu, W.; Lv, C.; Cui, Y. Power System Operation Mode Calculation Based on Improved Deep Reinforcement Learning. Mathematics 2024, 12, 134. https://doi.org/10.3390/math12010134

AMA Style

Yu Z, Zhou B, Yang D, Wu W, Lv C, Cui Y. Power System Operation Mode Calculation Based on Improved Deep Reinforcement Learning. Mathematics. 2024; 12(1):134. https://doi.org/10.3390/math12010134

Chicago/Turabian Style

Yu, Ziyang, Bowen Zhou, Dongsheng Yang, Weirong Wu, Chen Lv, and Yong Cui. 2024. "Power System Operation Mode Calculation Based on Improved Deep Reinforcement Learning" Mathematics 12, no. 1: 134. https://doi.org/10.3390/math12010134

APA Style

Yu, Z., Zhou, B., Yang, D., Wu, W., Lv, C., & Cui, Y. (2024). Power System Operation Mode Calculation Based on Improved Deep Reinforcement Learning. Mathematics, 12(1), 134. https://doi.org/10.3390/math12010134

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Power System Operation Mode Calculation Based on Improved Deep Reinforcement Learning

Abstract

1. Introduction

2. Problem Formulation

2.1. Introduction of the MDP

2.2. Power Flow Convergence MDP Model

3. The Improved DQN Model for OMC

3.1. Introduction of DQN

3.2. Improved Mapping Strategy

3.3. Training of the Improved DQN

4. Experimental Verification and Analyses

4.1. Experimental Setup

4.2. Training Process

4.3. Analysis of Experimental Results

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI