Control Method of Buses and Lines Using Reinforcement Learning for Short Circuit Current Reduction

Abstract: This paper proposes a reinforcement learning-based approach that optimises bus and line control methods to solve the problem of short circuit currents in power systems. Expansion of power grids leads to concentrated power output and more lines for large-scale transmission, thereby increasing short circuit currents. The short circuit currents must be managed systematically by controlling the buses and lines, such as separating, merging, and moving a bus, line, or transformer. However, there are countless possible control schemes in an actual grid. Moreover, to ensure compliance with power system reliability standards, no bus should exceed breaker capacity nor should lines or transformers be overloaded. For this reason, examining and selecting a plan requires extensive time and effort. To solve these problems, this paper introduces reinforcement learning to optimise control methods. By providing appropriate rewards for each control action, a policy is set, and the optimal control method is obtained through a maximising value method. In addition, a technique is presented that systematically defines the bus and line separation measures, limits the range of measures to those with actual power grid applicability, and reduces the optimisation time while increasing the convergence probability and enabling use in actual power grid operation. In the future, this technique will contribute significantly to establishing power grid operation plans based on short circuit currents.


Introduction
The regional imbalance between generation and load will inevitably intensify in South Korea because generation is more concentrated in the non-metropolitan areas and load is more concentrated in the metropolitan areas. Generation is planned to increase from 8.2 to 18 GW in the east coast region, from 8 to 15 GW in the west coast region, and from 6.5 to 10 GW in the south coast region [1]. However, because of the continued concentration in the metropolitan area, load also continues to concentrate there.
Alleviating this imbalance requires large-scale, long-distance transmission, necessitating the installation of additional power facilities such as transmission lines and transformers. The stability of the power grid increases with the number of power facilities, owing to a decrease in the transmission impedance of the grid. Furthermore, transmission loss decreases alongside transmission impedance, thereby enhancing the voltage stability and transient stability of the grid.
Conversely, when the transmission impedance decreases, the short circuit current magnitude increases. This is a natural phenomenon that occurs as the power grid expands, power output becomes concentrated, and cities become overpopulated. Left unchecked, the short circuit current magnitude will increase to unforeseen levels, with the currently available breakers being unable to block such currents. Indeed, when the Korea Electric Power Corporation (KEPCO) grids are completely integrated without bus separation or line separation, the operating short circuit current of 154 kV buses increases to 187 kA, and the short circuit current of 345 kV buses increases to 86 kA [2]. The maximum current of breakers developed thus far is 80 kA; as such, they are insufficient to prevent the short circuit currents developed at either 345 kV or 154 kV. Moreover, even if capable breakers are developed, they would be so large that installation would be impossible.
For this reason, there is no alternative to systematically managing the short circuit currents of power grids. There are various techniques for reducing short circuit current, among which the traditional and basic methods are bus separation and line separation [3][4][5]. These methods allow existing facilities to be used and only require changing operation methods. This enables flexibility in terms of grid operation because it can be employed immediately in the current situation without additional costs. However, while using these methods to reinforce the power grid, the facilities installed in the grid are not used in isolation. As a result, overload can occur in nearby lines and transformers, and the stability of the grid can be degraded. Nevertheless, these are the most widely used methods to reduce short circuit current, as they require effectively no additional time or expense. Related research has been conducted on methods of separating the grid using the "splitting" function, which cuts a grid connected in ring-bus form into its optimal form [6][7][8]. Developed a very long time ago, it is the most economical approach for separating short circuit current and takes into account the number of breakers required to divide the grid. However, this measure was primarily chosen for economic feasibility and does not account for the reduction of short circuit current or the overload of nearby lines and transformers. Moreover, rather than use computational techniques, such as determining the impedance resulting when splitting the grid, it presents only the results of repeated performance.
Another traditional approach for reducing short circuit current is to install a current-limiting reactor, which forcibly increases the line impedance. The current-limiting reactor forces the inductor (L) component in the transmission line to increase the impedance of the grid and reduce the short circuit current. Increasing the impedance of a line through which a large amount of short circuit current is expected to flow reduces the inflow of short circuit current and spreads it to other lines with relatively low impedance. Related research on this technique is ongoing. These studies generally investigate how much capacity can be installed at various locations to obtain the optimal effect [9][10][11][12][13][14][15][16][17]. The installation of a current-limiting reactor is practiced widely owing to its relatively low cost and short construction period. However, because impedance is input in series, it has an operational disadvantage in that the transmission line must be cut for input and removal. Furthermore, the stability of the grid is reduced owing to the increased impedance.
Various special facilities enabling the reduction of short circuit current have recently emerged. The most representative of these is the back-to-back (BTB) system using High Voltage Direct Current (HVDC). As the short circuit current of AC lines generally does not pass through DC lines, the BTB system is installed on a line with significant short circuit current flow to reduce the short circuit current [18][19][20][21]. BTB is a very powerful method for reducing short circuit current and does not reduce the stability of the grid. This is because the short circuit current theoretically never passes through the installation side. However, BTB is very expensive and has a long construction timeline. The characteristics of each fault current reduction method introduced above are summarized in Table 1. This study applies reinforcement learning (RL) [22][23][24][25][26][27][28][29][30][31][32][33][34] to conduct bus and line separation more systematically; these are the most widely used techniques for grid operation as they can be performed immediately and without additional cost. Because there are many buses and lines in a grid, there are numerous ways to reduce short circuit current. As the effects of bus separation vary in how the transformers and lines are separated and connected afterward, there exist even more measures for reducing short circuit current. Therefore, in carrying out bus separation and line separation, only practically possible measures should be selected [35]. These many bus and line separation measures cause difficulty in determining the optimal method for reducing short circuit current, and they also consume large amounts of computation time. Accordingly, rather than simply configuring groups of all bus and line separation measures, this study determines the overload of the transformers and lines and selects only measures that effectively reduce short circuit current. 
The selected measures can be applied in actual power grids by configuring the policy for use in RL. How the action and reward are set is critical to obtaining the optimal result in any control task [36]. This method can accurately determine the optimal solution for reducing short circuit current, and it also substantially reduces the computation time.
The remainder of this paper is organised as follows: Section 2 defines the necessary elements for RL configuration. Section 3 defines the simulation conditions and discusses the results. Section 4 provides concluding remarks.

Materials and Methods
RL is an area of machine learning concerned with how software agents ought to take actions in an environment to maximise the notion of cumulative reward. RL differs from supervised learning in that it does not need the presentation of labelled input and output pairs nor the explicit correction of sub-optimal actions. Instead the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) [22]. The environment is typically stated in the form of a Markov decision process (MDP), because many RL algorithms for this context utilise dynamic programming techniques [23]. The main difference between the classical dynamic programming methods and RL algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible.
In the typical framing of an RL scenario, an agent performs actions in an environment that are interpreted into a reward and a representation of the state and fed back into the agent. This is illustrated in Figure 1. Here, the agent is the learner and the decision-maker. The environment is the system that it interacts with, comprising everything outside the agent. The action set comprises all the possible moves that the agent can make. The state is the current situation returned by the environment. The reward is an immediate return sent back from the environment to evaluate the last action by the agent.
At every time step t, the agent executes action a_t, receives state s_t, and receives scalar reward r_t. The environment receives action a_t, emits state s_{t+1}, and emits reward r_{t+1}. Each action influences the agent's future state. Success is measured by a scalar reward signal. RL selects actions to maximise future reward. The strategy, or policy (π), is employed by the agent to determine the next action based on the current state. A policy, written π(s, a), describes a way of acting. It is a function that takes in a state and an action and returns the probability of taking that action in that state.
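In standard RL notation, the policy described above is commonly written as shown here. The symbols a_t and s_t follow the text; the exact form printed in the original equation is assumed rather than recovered.

```latex
\pi(s, a) = P(a_t = a \mid s_t = s)
```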
where P is the probability of doing action a in state s. The value function V π (s) is defined as the expected return starting with state s, i.e., s 0 = s, and successively following policy π. Hence, the value function estimates how good it is to be in a given state.
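A standard form of the value function consistent with this definition is sketched here as an assumed reconstruction, with the return R written as the discounted sum of immediate rewards:

```latex
V^{\pi}(s) = E[R \mid s_0 = s] = E\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right]
```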
where E is the expected (future) cumulative reward, R is the sum of future discounted rewards, r_t is the immediate reward, and γ is the discount factor (less than 1: as a particular state becomes older, its effect on later states becomes progressively smaller, so its effect is discounted). The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s. The Q-value function estimates the adequacy of a state-action pair. The optimal Q-value function Q*(s, a) is the maximum expected cumulative reward achievable from a given state-action pair. The concept of optimal Q-value function is shown in Figure 2.
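In the same notation, the Q-value function and its optimum described above are usually written as follows. This is a sketch of the standard definitions under the assumption that the paper follows them, not a recovery of the printed equations.

```latex
Q^{\pi}(s, a) = E[R \mid s_0 = s,\ a_0 = a], \qquad
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)
```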
To obtain the optimal Q-value, RL breaks the decision problem into smaller sub-problems. Bellman's principle of optimality describes how to do this [23].
The principle of optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
The Bellman equation is classified as a functional equation, because solving it means finding the unknown function V, which is the value function. Recall that the value function describes the best possible value of the objective, as a function of the state s. By calculating the value function, the function that describes the optimal action as a function of the state is also found; this is called the policy function. Equations (5)-(8) express this process.
$$V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, q^{\pi}(s, a)$$
Sustainability 2020, 12, 9333
By expressing the above equations as a recursive function and calculating it repeatedly, the optimal value can be obtained.
Figure 2. The concept of optimal Q-value function.
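The repeated recursive calculation of the Bellman equation can be illustrated with value iteration on a toy problem. The following is a minimal sketch: the two-state MDP, its rewards, and the names `P`, `q_value`, and `value_iteration` are hypothetical and are not taken from the paper.

```python
# Hypothetical toy MDP: P[state][action] = list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount factor, must be below 1


def q_value(P, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-8):
    """Apply the Bellman optimality backup repeatedly until the values settle."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The greedy policy picks, in each state, the action with the largest Q-value.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


V, policy = value_iteration(P, GAMMA)
print(V, policy)
```

In this toy case the values can be checked by hand: staying in state 1 earns 2 per step, so V(1) = 2 / (1 - 0.9) = 20, and V(0) = 1 + 0.9 × 20 = 19.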
In power system operation, the agent is the system operator, and the environment is the power system. There are many actions to reduce the short circuit current, but in this study, the separation or merging of buses and lines and the movement of transformers are defined as actions. The quantities in the environment that change in response, referred to here as agents, can be classified into three categories: short circuit current, overload, and radial system. For each of these three agents, the state after an action is executed can be classified into three categories: decrease in number, increase in number, and no change.

Action
Before proceeding with the plan to reduce short circuit current using RL, it is necessary to define the action. This study uses bus and line separation actions to reduce short circuit current. The subjects for analysis are the 154 kV buses and transmission lines, where excessive short circuit current in the KEPCO grid is greatest. As of 2020, there were 1213 buses at 154 kV and 2784 transmission lines (including transformers) in the KEPCO grid. Selecting an action that separates all of these would result in too many possible actions. For bus separation in particular, an additional option involves choices for dividing and connecting the transformer and line after bus separation; this would result in even more actions.
To obtain good results, it is necessary to define the action plan clearly. For the bus separation action, the choice should be limited to actions that effectively reduce the short circuit current while considering the transformer's capacity and spare capacity. For the line separation action, the choice should be limited to actions that divide the grid into independent grids without making them radial and do not cause overload. Using these conditions, the action set is selected. The bus separation action and line separation action are discussed separately below [36].
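The selection conditions above can be sketched as a simple filter over candidate actions. This is an illustrative sketch only: the `CandidateAction` fields, the threshold parameter, and the sample candidates are hypothetical, and in practice each flag would come from a short circuit and power flow study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateAction:
    name: str
    kind: str                 # "bus" or "line"
    scc_reduction_ka: float   # estimated short circuit current reduction (hypothetical input)
    causes_overload: bool     # True if a line or transformer would be overloaded
    makes_radial: bool        # True if the action leaves part of the grid radial


def select_action_set(candidates, min_reduction_ka=0.0):
    """Keep only candidates that satisfy the conditions described in the text."""
    selected = []
    for c in candidates:
        if c.causes_overload:
            continue  # overload is not acceptable for either action type
        if c.kind == "line" and c.makes_radial:
            continue  # line separation must not create a radial grid
        if c.scc_reduction_ka <= min_reduction_ka:
            continue  # the action must actually reduce short circuit current
        selected.append(c)
    return selected


candidates = [
    CandidateAction("split_bus_A", "bus", 6.0, False, False),
    CandidateAction("open_line_B", "line", 3.0, False, True),
    CandidateAction("open_line_C", "line", 2.5, True, False),
    CandidateAction("open_line_D", "line", 1.5, False, False),
]
action_set = select_action_set(candidates)
print([c.name for c in action_set])
```

Filtering before learning keeps the action space small, which is the stated reason for limiting the measures to those that are practically applicable.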

Bus Separation Action
Before establishing the action for short circuit current reduction, it is necessary to examine why the short circuit current of the bus is large. As the short circuit current magnitude is inversely proportional to the impedance magnitude, the maximum short circuit current generally occurs at the voltage level for which there are the most transmission lines. In the KEPCO grid, the largest short circuit current occurs at the 154 kV voltage level [36]. Examining the inflow path of short circuit current in 154 kV buses shows that the short circuit current flowing through the transformers from the higher 345 kV voltage level is the largest. To reduce this, it is important to separate the 154 kV buses so that the connected 345/154 kV transformers are separated. For example, if there are a total of four 345/154 kV transformers, the 154 kV bus is separated into a first and a second bus, and two transformers are connected to each to reduce the short circuit current flowing into each 154 kV bus.
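The four-transformer example can be checked with a simplified calculation. The sketch below assumes identical transformers, a hypothetical per-transformer impedance of 8 Ω referred to the 154 kV side, and no other source of fault current; the upstream source impedance is neglected, so real reductions would be smaller than this idealised figure.

```python
import math


def three_phase_fault_current_ka(v_line_kv, z_ohm):
    """Symmetrical three-phase fault current in kA (kV divided by ohms gives kA)."""
    return v_line_kv / (math.sqrt(3) * z_ohm)


def parallel_impedance(z_each_ohm, n):
    """Equivalent impedance of n identical branches in parallel."""
    return z_each_ohm / n


V_KV = 154.0
Z_TRANSFORMER_OHM = 8.0  # hypothetical value, for illustration only

# Merged bus: all four transformers feed the fault in parallel.
i_merged = three_phase_fault_current_ka(V_KV, parallel_impedance(Z_TRANSFORMER_OHM, 4))
# Separated bus: each 154 kV bus is fed by only two transformers.
i_split = three_phase_fault_current_ka(V_KV, parallel_impedance(Z_TRANSFORMER_OHM, 2))

print(f"merged: {i_merged:.2f} kA, split: {i_split:.2f} kA")
```

Under these assumptions, halving the number of parallel transformers doubles the equivalent impedance and therefore halves the transformer contribution to the fault current at each bus.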
Attention should be paid to whether the transformers are overloaded when they are separated in this manner. As the transformers are split between the buses, the total power that each group can carry must be sufficient to cover even a single transformer failure. This is specified in the Korean power grid reliability standards, which state that there must be no problems in operation with an N-1 failure. To satisfy this condition, the transformers must have spare capacity relative to the existing power consumption. If there is spare capacity, then the two transformers can be split while also independently configuring the 154 kV voltage level, resulting in a much greater reduction in short circuit current. However, if there is no spare capacity, transformer failure may result in a violation of the reliability standards. Therefore, the rear end must be connected with a 154 kV level transmission line to eliminate any potential problems. This is illustrated in Figures 3 and 4.
In Figure 3, the transformer has spare capacity. As such, the 154 kV substation can be independently configured to block the additional inflow path of the short circuit current, thereby substantially reducing the short circuit current. In addition, grid stability can be maintained even if a transformer fails, thus complying with the reliability standards. Therefore, if the transformer has spare capacity, an additional line separation action is implemented in a suitable location to prevent the connection of the lower end, and the action to reduce the short circuit current of the grid is selected.
In Figure 4, the transformer does not have spare capacity. Hence, configuring the 154 kV substation independently would violate the reliability standards if a single transformer failed. The lower end should, therefore, be linked to maintain grid stability. When bus separation is implemented in this way without additional line separation, separating the transformers still reduces the short circuit current considerably, although the amount of reduction is limited. Accordingly, this is utilised for the bus separation action.
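The spare-capacity test that decides between the two bus separation variants can be sketched as an N-1 check. The function names, the MVA ratings, and the loads below are hypothetical; the check simply asks whether each transformer group can still carry its load after losing its largest unit.

```python
def survives_n_minus_1(ratings_mva, load_mva):
    """True if the group can serve its load after losing its largest transformer."""
    if not ratings_mva:
        return False
    return sum(ratings_mva) - max(ratings_mva) >= load_mva


def choose_bus_separation(group_a, group_b, load_a_mva, load_b_mva):
    """Pick the separation variant based on spare capacity in both groups."""
    if survives_n_minus_1(group_a, load_a_mva) and survives_n_minus_1(group_b, load_b_mva):
        # Spare capacity: the 154 kV side can be configured independently,
        # with additional line separation at the lower end.
        return "independent"
    # No spare capacity: keep the lower end linked by a 154 kV line.
    return "tie_line"


# Two groups of two 500 MVA transformers each (hypothetical ratings and loads).
print(choose_bus_separation([500, 500], [500, 500], 400, 400))
print(choose_bus_separation([500, 500], [500, 500], 600, 400))
```

With a 400 MVA load, one remaining 500 MVA transformer is enough, so independent configuration is allowed; with 600 MVA it is not, and the tie line is kept.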

Line Separation Action
Although line separation is an effective means for reducing the short circuit current of the corresponding bus, separating the line may also result in degraded stability. Moreover, transformer overload and line overload may also occur, which would violate reliability standards. Line separation should, therefore, be minimised. An action used in conjunction with the transformer separation action is effective in this case [36]. If there are no problems with the transformer capacity, then an action is selected that separates, at the lower end, the line of the bus where the short circuit current is greatest and configures the grid independently; this maximises the reduction in short circuit current. The line separation action set is then built on this basis. This concept is shown in Figure 5.

Figure 6 shows the transformer movement action used when the capacity of the transformer is insufficient. In that case, the transformer is moved to one side while performing bus separation, the lines are separated at the lower end, and the grid is configured independently without violating the power grid reliability standards.
Figure 6. Transformer move action when the transformer capacity is insufficient.

Bus/Line Merging Action
The bus or line merging action is simpler than the separation action. It is only necessary to reconnect the separated bus or the open line. If this action is performed, the overload is generally reduced, but conversely, the short circuit current increases. Figure 7 shows the overall concept of separation, including the separation action for the transformers and lines. By performing separation as shown in Figure 7, grid stability in compliance with reliability standards can be obtained along with a substantial reduction in short circuit current. The buses and lines that can be separated from the grid are identified on this basis, and the action set is built from them.

Agents and State
In terms of reducing short circuit current, the agents can be classified into three categories: short circuit current violation (SCCV), overflow violation (OFV), and radial grid violation (RGV). Each item is treated as a violation, and the number of violations is counted. The closer to zero each count is, the closer the environment is to optimal. Therefore, the state after taking any action can be classified into three categories according to whether the number of violations increases, decreases, or remains the same. Figure 8 shows the degree of reward according to state change for each agent. For each agent, it is best to move in the direction of a decreasing number of violations. Therefore, when an action decreases the number of violations, the reward is a positive value. If the number does not change after taking the action, the reward is zero. If an action instead increases the number of violations, the reward is a negative value. By varying the degree of reward, the policy is configured so that the RL can take more effective actions.

Agents and State
In terms of reducing short circuit current, the agents can be classified into three categories: Short circuit current violation (SCCV), overflow violation (OFV), and radial grid violation (RGV). Each item is considered a violation and the number is considered. The closer to zero each item is, the more optimal is the environment. Therefore, the state after taking any action can be classified into three categories according to whether the number of violations increases, decreases, or remains the same. Figure 8 shows the degree of reward according to state change for each agent. In each agent, it is best to go in the direction of decreasing number of violations. Therefore, when an action is performed Sustainability 2020, 12, 9333 9 of 18 in the direction that decreases each number, the reward is provided as a positive value. If the number does not change after taking the action, the reward is given as zero. In addition, if one action instead increases the number of violations, the reward is provided as a negative number. By varying the degree of reward, the policy is configured so that the RL can take more effective actions.
In terms of reducing short circuit current, the agents can be classified into three categories: Short circuit current violation (SCCV), overflow violation (OFV), and radial grid violation (RGV). Each item is considered a violation and the number is considered. The closer to zero each item is, the more optimal is the environment. Therefore, the state after taking any action can be classified into three categories according to whether the number of violations increases, decreases, or remains the same. Figure 8 shows the degree of reward according to state change for each agent. In each agent, it is best to go in the direction of decreasing number of violations. Therefore, when an action is performed in the direction that decreases each number, the reward is provided as a positive value. If the number does not change after taking the action, the reward is given as zero. In addition, if one action instead increases the number of violations, the reward is provided as a negative number. By varying the degree of reward, the policy is configured so that the RL can take more effective actions. The degree of reward for each agent can be the same, but it can also be varied so that more reinforced results can be obtained for the desired environment. As shown in Figure 9, it is possible to The degree of reward for each agent can be the same, but it can also be varied so that more reinforced results can be obtained for the desired environment. As shown in Figure 9, it is possible to lead the direction of the result to the other side by changing the reward according to each agent without setting the degree of reward to a fixed value, or to reinforce the result in the direction the agent wants.
As described above, if the degree of reward is classified into nine types (three agents, each with three possible state changes) and the values differ for each agent, the reward for each action, state, and environment can be varied, and the result can be reinforced in a desired direction.
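The reward scheme described above can be illustrated with a short sketch. It is not the authors' implementation: the reward magnitudes in `REWARD_TABLE` and the names `state_change` and `agent_rewards` are hypothetical, chosen only to show how nine reward values (three agents, three state changes) could be looked up from the violation counts before and after an action.

```python
from typing import Dict

AGENTS = ("SCCV", "OFV", "RGV")

# Hypothetical reward magnitudes (assumption): one value per agent and
# per state change, giving nine reward types in total.
REWARD_TABLE: Dict[str, Dict[str, float]] = {
    "SCCV": {"decrease": 2.0, "same": 0.0, "increase": -2.0},
    "OFV": {"decrease": 1.0, "same": 0.0, "increase": -1.0},
    "RGV": {"decrease": 1.0, "same": 0.0, "increase": -3.0},
}


def state_change(before: int, after: int) -> str:
    """Classify how a violation count changed after an action."""
    if after < before:
        return "decrease"
    if after > before:
        return "increase"
    return "same"


def agent_rewards(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Return one reward per agent: positive, zero, or negative."""
    return {k: REWARD_TABLE[k][state_change(before[k], after[k])] for k in AGENTS}


if __name__ == "__main__":
    before = {"SCCV": 5, "OFV": 2, "RGV": 0}
    after = {"SCCV": 3, "OFV": 2, "RGV": 1}
    print(agent_rewards(before, after))
```

Keeping the magnitudes in a table separate from the sign logic makes it possible to weight one agent more heavily than the others without touching the rest of the code.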

Deep Q-Network Algorithm
Before explaining the deep Q-network, Q-learning is described. Each agent repeats the following steps to maximise the Q-value.

• Choose an action a to perform in the current state, s.
• Perform the action a, receive the reward r, and observe the new state s′.
• Update Q(s, a) towards r + γ max_{a′} Q(s′, a′), where γ is the discount factor.

Selecting the action that maximises the Q-value is called greedy action selection. However, if only the action that maximises the Q-value is ever taken, the opportunity to learn about various environmental changes is lost, and the optimisation can become stuck in a local optimum. Therefore, to obtain the global optimum, a non-greedy action must sometimes be selected; in other words, an action that gives a smaller reward is occasionally chosen instead of the action that gives the maximum reward. This is called exploration, and the method of exploring with a certain probability (ε) is called the ε-greedy method. With the ε-greedy method, non-greedy actions are explored with probability ε so that various situations can be learned. It is very important to maintain a balance between greedy and non-greedy actions to obtain optimal results.
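As an illustration of Q-learning with ε-greedy selection, the following minimal sketch uses a toy five-state chain rather than a power grid. The environment (`step`), the learning rate `alpha`, the discount factor `gamma`, and all function names are assumptions made for the example; the update rule is the standard tabular form, which may differ in detail from the one used in this study.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon; otherwise pick a greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties between equally good actions at random.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def step(state, action, n_states):
    """Toy chain: action 1 moves right, action 0 moves left.

    Reaching the right-most state gives reward 1 and ends the episode.
    """
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = nxt == n_states - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(n_states=5, episodes=300, alpha=0.5, gamma=0.9,
               epsilon=0.2, seed=0):
    """Tabular Q-learning; returns the learned Q-table."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(q[s], epsilon, rng)
            s2, r, done = step(s, a, n_states)
            # Target: immediate reward plus discounted best future value.
            target = r if done else r + gamma * max(q[s2])
            q[s][a] += alpha * (target - q[s][a])
            s = s2
    return q


if __name__ == "__main__":
    table = q_learning()
    print([row.index(max(row)) for row in table[:-1]])
```

Setting `epsilon` to zero makes the agent purely greedy, while larger values trade short-term reward for broader exploration.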
The deep Q-network (DQN) combines Q-learning with a neural network. To formulate the algorithm, the Q-value function is first approximated by a Q-network with parameters θ:

$$Q(s, a; \theta) \approx Q^{*}(s, a)$$

If the objective function is defined so as to reduce the mean-square error (MSE) between the current Q-value and the target Q-value, the equation is as follows:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right]$$

where U(D) denotes uniform sampling from the replay memory D, γ is the discount factor, θ is the parameter of the neural network, and θ− is the old parameter.
Using the above equation, the gradient descent method is applied to find the optimum value. The replay memory D implements experience replay: it stores the agent's experiences as a dataset from which samples can be drawn without regard to the order in which they occurred. If the reward r_t is received after selecting action a_t in state s_t and the new state s_{t+1} is observed, the transition (s_t, a_t, r_t, s_{t+1}) is stored in replay memory D. The stored transitions are used to minimise the MSE, and in some cases only a minibatch of them is used in each update to improve the speed and the optimisation result. The process is expressed in Algorithm 1.
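The replay memory and the MSE objective can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the class `ReplayMemory`, the function `mse_loss`, and the callables `q` and `q_target` (standing in for the networks with parameters θ and θ−) are hypothetical, and the discount factor `gamma` is an assumed hyperparameter.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of transitions (s, a, r, s_next)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        """Draw a uniform random minibatch without replacement."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def mse_loss(batch, q, q_target, actions, gamma):
    """Mean-square error between target Q-values and current Q-values.

    q(s, a) plays the role of the current network and q_target(s, a)
    the role of the network with the old parameters.
    """
    total = 0.0
    for s, a, r, s_next in batch:
        y = r + gamma * max(q_target(s_next, b) for b in actions)
        total += (y - q(s, a)) ** 2
    return total / len(batch)


if __name__ == "__main__":
    memory = ReplayMemory(capacity=100, seed=0)
    memory.store(0, 1, 1.0, 1)
    batch = memory.sample(1)
    print(mse_loss(batch, lambda s, a: 0.0, lambda s, a: 1.0, [0, 1], 0.5))
```

In a full DQN the loss would be differentiated with respect to θ by an automatic differentiation library; the sketch only evaluates it.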

Algorithm 1 Deep Q-Network algorithm
Initialise replay memory D to capacity N
Initialise action-value function Q with parameters θ
Initialise target action-value function Q̂ with parameters θ− = θ
For episode = 1 to num_episodes do
    For t = 1 to T do
        With probability ε select a random action a_t; otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t in the emulator and observe reward r_t and next state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in D
        Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        Perform a gradient descent step on L_j(θ) with respect to the network parameters θ
    End For
End For
As there are three types of agents in this study, the state and reward resulting from an action are also divided into three types. Every action changes the environment and affects each agent differently. Therefore, when one action is executed, a separate pair of state and reward is returned for each agent. This concept is shown in Figure 10.
Accordingly, the objective function is also transformed: the final objective function is the sum of the optimisation functions of the individual agents.
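One plausible way to write such a summed objective is given below. It is a sketch under assumptions rather than the equation used in this study: it assumes that the three agents share the action a and the network parameters θ and θ−, that agent k observes its own state s^k and reward r^k, and that γ is a discount factor. The superscript k is notation introduced only for this illustration.

$$L(\theta) = \sum_{k \in \{\mathrm{SCCV},\, \mathrm{OFV},\, \mathrm{RGV}\}} \mathbb{E}_{(s^{k}, a, r^{k}, s'^{k}) \sim U(D)}\left[\left(r^{k} + \gamma \max_{a'} Q(s'^{k}, a'; \theta^{-}) - Q(s^{k}, a; \theta)\right)^{2}\right]$$

With this form, increasing the reward magnitudes of one agent increases the weight of its term in the sum, which is consistent with varying the degree of reward per agent.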


Applied System and Simulation Conditions
The 8th power grid supply and demand plan was used for grid analysis to minimise the sites exceeding the fault capacity, and the power grid data from 2018 to 2025 were used to verify the effectiveness of the proposed method. In the grid data, minimal bus separation and line separation should be performed in advance for 345 kV buses to ensure that there is no excessive short circuit current for these buses. For the 154 kV buses, both bus separation and line separation were removed, and an integrated grid was created. Additionally, for short circuit current analysis, all generators in the power grid were turned on, and sub-transient impedance was used for generator impedance to generate the maximum short circuit current.

To yield meaningful optimisation results from various simulations that can be applied to actual power grids, the range for each variable must be set. First, the number of buses with short circuit current that exceeds the breaker capacity must be zero to comply with the reliability standards. Second, the number of overloaded lines and transformers must also be zero to comply with the reliability standards. Third, radial systems should not be created after line separation or bus separation. As each of these three conditions is implemented as an agent, the solution moves in the optimal direction as RL is executed.
The numbers of separable buses and lines used for each year are shown in Table 2. The numbers of merged buses and lines and the number of transformers to be moved can be considered part of the separation plan, so their values are not set separately.

Table 2. Number of separable buses and lines for each year.
Year    Separable buses    Separable lines
2018    1210               2780
2019    1210               2782
2020    1213               2784
2021    1216               2788
2022    1217               2792
2023    1220               2801
2024    1222               2805
2025    1225               2813

To verify the effect, the results obtained in this study were compared with those generated manually when making the 8th power grid supply and demand plan [2].

Simulation Results
Simulations were performed for the eight years from 2018 to 2025, and the purpose of RL was to maximise the reward, that is, to make the number of violations (SCCV, OFV, and RGV) of each agent zero.
In practice, in a situation where all violations are zero, the number of bus separations and the number of line separations should be minimised. Therefore, even if the reward continues to increase, the best result for system operation is to adopt the bus separation and line separation actions at the moment when the violations become zero rather than selecting a later value.
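The selection rule described above, adopting the actions at the first moment all violations reach zero, can be sketched as follows. The function name `first_feasible_step` and the representation of the history as a list of (SCCV, OFV, RGV) counts per step are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def first_feasible_step(violations: List[Tuple[int, int, int]]) -> Optional[int]:
    """Return the index of the first step at which all counts are zero.

    Each entry holds the (SCCV, OFV, RGV) violation counts observed
    after the corresponding action. Returns None if no step is feasible.
    """
    for step, counts in enumerate(violations):
        if all(c == 0 for c in counts):
            return step
    return None


if __name__ == "__main__":
    # Hypothetical violation history for one simulated year.
    history = [(3, 1, 0), (1, 0, 0), (0, 0, 0), (0, 0, 0)]
    print(first_feasible_step(history))
```

The separation actions taken up to and including the returned step would then form the adopted plan; later steps are ignored even if they earned further reward.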
As shown in Figures 11-18, the numbers of violations could be reduced to zero in all graphs with the RL. Table 3 shows the final number of bus separation actions and line separation actions obtained for each year. To evaluate the results of the RL, the numbers of bus and line separations applied in the 8th basic plan were taken for comparison [2]. In the 8th basic plan, all bus and line separations were determined manually and heuristically. With the RL, all violations were reduced to zero while the numbers of separated buses and lines were lower than in the existing results. This can be regarded as achieving the highest stability without violating the reliability standards.
There are numerous methods for reducing short circuit current and reducing the number of buses exceeding the breaker capacity. To solve this problem, many simulations must be performed manually. However, as shown in this study, by defining the problem in an RL form that can be applied to actual grids, optimal results can be obtained via the deep Q-network. In the future, RL can be used to create effective short circuit current reduction methods systematically for the 9th power grid supply and demand plan, and it can continue to be utilised thereafter.