Deep Reinforcement Learning-Based Approach for Autonomous Power Flow Control Using Only Topology Changes

Abstract: With the increasing complexity of power system structures and the growing penetration of renewable energy, driven primarily by the need for decarbonization, power system operation and control are becoming increasingly challenging. These changes result in an enormous increase in system complexity, in which the number of active control points in the grid is too high to be managed manually; this provides an opportunity for the application of artificial intelligence technology in the power system. For power flow control, many studies have focused on using generation redispatching, load shedding, or demand-side management flexibilities. This paper presents a novel reinforcement learning (RL)-based approach for the secure operation of a power system via autonomous topology changes, considering various constraints. The proposed agent learns to master power flow control from scratch, purely from data. It can make autonomous topology changes according to current system conditions to support grid operators in taking effective preventive control actions. A state-of-the-art RL algorithm, namely dueling double deep Q-network with prioritized replay, is adopted to train an effective agent that achieves the desired performance. The IEEE 14-bus system is selected to demonstrate the effectiveness and promising performance of the proposed agent, which controls the power network for up to a month with only nine actions affecting the substation configuration.


Introduction
Electricity is the driving force of the modern world, and concerns about continued supply and efficient use are becoming more important. Power systems are among the most complex systems ever designed by humans [1], and they require careful planning and operation to ensure economic and reliable service. In recent years, the power system has evolved and faced changes. As the demand for electrical energy continues to increase, the power system expands and becomes more complex. Low-carbon policy and market deregulation have led to significant integration of renewable energy resources, with very ambitious targets for the incorporation of more renewable generation in the future. Transmission lines and transformers are often congested, which could lead to network splitting and blackouts. These changes pose challenges to the traditional way of power system control and provide an opportunity for the application of artificial intelligence technology in the power system.
In line with the increasing concerns regarding power systems, a number of review works in the literature have presented detailed descriptions of the problems and notable solutions, including reinforcement learning approaches [2][3][4][5]. They survey the application of RL in various fields, such as energy management, demand response, electricity markets, operational control, and cyber-security. The most recent review of reinforcement learning for selected key applications in power systems is presented in [6]. The authors comprehensively analyze the research status of reinforcement learning in the power system field. Studies [2][3][4][5][6] arrive at the conclusion that RL is a promising approach for these applications.
In the RL framework, at each timestep t the agent receives a state s_t from the set S and selects an action a_t from the set A. In response to the selected action, the environment returns a reward r_t = r(s_t, a_t) and produces the succeeding state s_{t+1}. The discounted return is defined by G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}, and the agent seeks to maximize its expected value. The discount factor γ is a constant between 0 and 1 that determines the relative value of delayed versus immediate rewards.
In terms of the future rewards that can be expected, value functions estimate how good it is for an agent to be in a given state. For an agent behaving according to a stochastic policy π, the state-value function v_π(s) and the action-value function q_π(s, a) are defined as follows:
v_π(s) = E_π[G_t | S_t = s] = E_π[R_{t+1} + γR_{t+2} + γ^2 R_{t+3} + ... | S_t = s] (1)
q_π(s, a) = E_π[G_t | S_t = s, A_t = a] (2)
The optimal action-value function is defined as q*(s, a) = max_π q_π(s, a), for all s ∈ S, a ∈ A(s). It obeys an important identity known as the Bellman optimality equation:
q*(s, a) = E[R_{t+1} + γ max_{a'} q*(S_{t+1}, a')], for all s ∈ S, a ∈ A(s)
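The discounted return defined above can be computed by folding rewards from the back; this is a minimal Python sketch with a hypothetical finite reward sequence, not data from the paper:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1}, folded from the back: g = r + gamma * g."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical rewards over four timesteps: a delayed reward of 10 is
# discounted by gamma^3 from the agent's point of view at t = 0.
print(discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9))  # 1 + 0.9^3 * 10 ≈ 8.29
```

A smaller γ makes the agent myopic (immediate rewards dominate), while γ close to 1 makes delayed rewards nearly as valuable as immediate ones.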

Deep Q Networks (DQN)
The basic idea behind many reinforcement learning algorithms is to estimate the action-value function (Q-function) by using the Bellman equation as an iterative update. Q-learning [22,23] is an off-policy temporal-difference control algorithm defined by:
Q(S_t, A_t) ← Q(S_t, A_t) + α[R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]
where α represents the learning rate. Such value iteration algorithms converge to the optimal action-value function independent of the policy being followed [22]. In practice, it is common to use a function approximator, such as a neural network, to estimate the action-value function: q̂(s, a, w) ≈ q_π(s, a). The weights w of the neural network function approximator represent the mapping from states to Q-values. A Q-network can be trained by the following weight update:
w_{t+1} = w_t + α[R + γ max_a q̂(S_{t+1}, a, w_t) − q̂(S_t, A_t, w_t)] ∇q̂(S_t, A_t, w_t) (5)
where w_t is the vector of the network's weights, A_t is the action selected at timestep t, and S_t and S_{t+1} are, respectively, the preprocessed observation inputs to the network at timesteps t and t + 1. The gradient in (5) is computed by backpropagation. DQN is selected as the fundamental DRL algorithm for training an agent for power system control. However, several limitations of this algorithm are known; a comprehensive study of DQN extensions is given in [24]. To overcome the overestimation bias of the algorithm, double DQN was proposed [25]. To generalize learning across actions and to improve policy evaluation in the presence of many similar-valued actions, without imposing any change to the underlying reinforcement learning algorithm, dueling DQN was presented [26]. Prioritized experience replay lets RL agents remember and reuse past experiences, replaying important transitions more frequently and therefore learning more efficiently [27]. Thus, dueling double DQN with prioritized replay is selected as the baseline model in this work.
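The tabular form of the Q-learning update above can be sketched in a few lines of Python on a toy chain MDP; the five-state environment here is hypothetical and stands in for any small discrete task, not the power grid:

```python
import random

# Toy chain MDP: states 0..4, actions 0 = left, 1 = right;
# reaching state 4 yields reward 1 and ends the episode.
N_STATES = 5
ACTIONS = (0, 1)
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.3

Q = [[0.0, 0.0] for _ in range(N_STATES)]

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s2 == N_STATES - 1
    return s2, (1.0 if done else 0.0), done

random.seed(0)
for _ in range(2000):                      # training episodes
    s, done = 0, False
    while not done:
        if random.random() < EPS:          # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

greedy = [max(ACTIONS, key=lambda x: Q[s][x]) for s in range(N_STATES - 1)]
print(greedy)  # the learned greedy policy moves right toward the reward: [1, 1, 1, 1]
```

DQN replaces the table `Q` with a neural network and the per-transition update with a gradient step on a minibatch sampled from a replay buffer, but the target term is the same.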

Test System Description
The modified IEEE 14-bus test system shown in Figure 1 was chosen to assess the performance of the proposed RL agents for safe power network management in significantly disturbed operating conditions. Its model includes 14 buses (blue circles), 20 branches, 11 loads (orange circles), and 6 generators (green circles). Generation includes hydro, nuclear, thermal, wind, and two solar power plants to represent the current energy mix. The power grid model is available under the name "l2rpn_case14_sandbox" in Python open-source module Grid2Op [28]. The module also comes with a dataset representing a realistic time series of operating conditions. The dataset for the IEEE 14-bus test system contains 1004 monthly scenarios wherein each represents 28 continuous days in 5-min time intervals. Each scenario includes pre-defined load variations and generation schedules, shown in Figure 2, which are representative of the French grid [21]. The distribution of injections was restricted to be representative of the winter months, over which peak loads are observed [21].

Objective and Conditions of the Power System Control
The main objective is to create an agent that can operate the power grid successfully in as many scenarios as possible using only topology adjustment actions. As in the real world, the agent must respect several operational constraints to ensure that the power grid operates properly. A power system blackout occurs if any of the following hard constraints is violated:
1. No generator or load may be disconnected;
2. No electrical islands may be formed by topology changes;
3. AC power flow must converge.
When the power flow in a line increases above its thermal limit, the line becomes overloaded. It can stay overloaded for 10 min (two timesteps) before it is disconnected. If the overload is too high (above 200% of the thermal limit), the line is disconnected immediately. This can lead to a cascading failure if other lines become overloaded due to power flow redistribution. The thermal limits of the lines are shown in Table 1. Lines disconnected due to overload can be reconnected after 50 min (10 timesteps). The agent's task is to manage congestion in the power system and prevent cascading failures and blackouts. For this purpose, we analyze the use of grid topology reconfiguration actions only.
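The overload rules above (two-timestep tolerance, immediate tripping above 200% of the thermal limit) can be captured in a small helper; this is a hedged pure-Python sketch of the described behavior, not Grid2Op code:

```python
def line_trip_step(loadings, tolerance_steps=2, hard_limit=2.0):
    """Given per-timestep loadings of one line (1.0 = thermal limit),
    return the timestep index at which the line trips, or None.

    A loading at or above 200% of the limit trips the line immediately;
    otherwise the line may stay overloaded for `tolerance_steps`
    consecutive timesteps (10 min at 5-min resolution) before tripping.
    """
    overloaded_for = 0
    for t, rho in enumerate(loadings):
        if rho >= hard_limit:              # >= 200%: immediate disconnection
            return t
        if rho > 1.0:                      # soft overload: count consecutive steps
            overloaded_for += 1
            if overloaded_for > tolerance_steps:
                return t
        else:
            overloaded_for = 0             # overload cleared, counter resets
    return None
```

For example, a line at 110% loading for three consecutive timesteps trips on the third one, while an overload that clears within two timesteps leaves the line connected.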
Available topological actions in the simulator are:
• Reconnecting/disconnecting a line;
• Changing the substation configuration.
Only one substation can be modified per timestep, and every substation can be split into two bus sections. A 'cooldown time' of 15 min must be respected before a switched line or node can be acted on again.

Analysis of Scenarios
The dataset for the test system contains 1004 monthly scenarios. A load flow is computed for each scenario with the grid topology fixed to the base topology and with the power system control constraints mentioned in Section 3.1. Figure 3 shows, for intervals of timesteps, the number of scenarios with the corresponding number of timesteps before blackout. These results serve as a baseline for the evaluation of the RL agents, since they demonstrate how long the power network can be maintained without any interventions/actions.
The mean number of timesteps in this analysis is 1089, or less than 4 days. The minimal number of timesteps is 3 (15 min), which occurred in nine scenarios. The maximal number of timesteps is 8064, corresponding to a successfully managed grid through the entire month; this is recorded in three scenarios.
To analyze all available data within each scenario (i.e., to go through all of the timesteps in the available dataset), the rules of power system control are set using 'soft' parameters. Parameters set in this way ensure that lines are not disconnected due to overflow. A branch is considered overloaded when its loading is at least 95% of the thermal limit. The highest number of overloads is recorded on branches 9 and 17, while branches 4 and 7 are overloaded less frequently. The average number of overloads on branch 9 per month is 421 timesteps, slightly more than 35 h (about one and a half days). The average number of overloads on branch 17 is 394 timesteps, or slightly less than 33 h. The maximal amount of overload and the percentage of overloaded timesteps in each scenario are shown in Figure 4.

Reinforcement Learning Model
This study intends to apply a state-of-the-art RL approach in an easily applicable way that can serve as a baseline for applying the algorithm to more complex tasks. This paper considers RL algorithms from the standard RL framework RLlib [29], which makes the results easily reproducible and verifiable. We use carefully selected actions (based on expert knowledge), a reduced observation space, and a simple reward.


Description of Frameworks and Tools Used for This Research
This section describes the tools and frameworks used for this research. Grid2Op (grid to operate) is an open-source Python framework used primarily as a testbed platform for sequential decision making in the world of power systems [28]. The simulator can simulate the operation of a power network of any size and characteristics in discrete timesteps. It can simulate cascading outages, in which overloaded branches are switched off and the calculation continues with the following timestep's operating conditions. Grid2Op provides datasets for several test networks of different sizes and complexities. Grid2Op comes with a machine learning environment that has all the necessary elements for reinforcement learning and is compatible with the OpenAI Gym framework [30].
OpenAI Gym is an open-source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between environments and algorithms [30]. It provides an easy API for implementing custom environments. The purpose of converting the Grid2Op environment into a Gym environment is to make it possible to use it with standard RL frameworks.
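The reset/step contract that such a Gym conversion exposes can be illustrated with a minimal, dependency-free skeleton; all names and dynamics here are illustrative placeholders, not the Grid2Op or Gym API:

```python
class MiniGridEnv:
    """Toy Gym-style environment skeleton illustrating the reset/step API
    an RL agent interacts with; observations and dynamics are placeholders."""

    def __init__(self, horizon=8064):       # 28 days at 5-min timesteps
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self._observe()

    def step(self, action):
        self.t += 1
        obs = self._observe()
        reward = 1.0                        # constant reward per surviving step
        done = self.t >= self.horizon       # or a blackout, in the real simulator
        info = {}
        return obs, reward, done, info

    def _observe(self):
        return [float(self.t)]              # placeholder observation vector

env = MiniGridEnv(horizon=3)
obs = env.reset()
total, done = 0.0, False
while not done:
    obs, r, done, info = env.step(0)        # a "do nothing" action
    total += r
print(total)  # 3.0
```

Any agent written against this loop works unchanged once the placeholder environment is swapped for the wrapped Grid2Op one.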
Ray is an open-source framework that provides a simple API for building distributed applications and it is a widely used platform for reinforcement learning [31]. Ray can greatly speed up training and make it easier to begin with deep reinforcement learning. Ray is packaged with libraries for accelerating machine learning workloads. A wide range of state-of-the-art algorithms are available through RLlib and can be easily accessed and modified. Ray provides foundations for parallelism and scalability which are simple to use and allow Python programs to scale anywhere from a personal computer to a large cluster.
RL applications can be quite computationally and memory intensive and often need to scale out onto a cluster. A cluster allows using significant computational resources in demanding data processing. Due to that, the learning process for this research was carried out on the high-performance computing (HPC) cluster "Isabella".

Observation Space, Action Space, and Reward
The main components of the agent-environment interaction in reinforcement learning are the observation space, the action space, and the reward signal. All three must be designed carefully to make the agent work, and many analyses were performed before the results became noticeable and worth mentioning. The analysis in this paper shows that proper action selection was crucial in reaching the main goal, which is secure power system control.
In this study, only topology actions are considered. The goal is to operate the power network for as long as possible, and for that reason it is necessary to keep the power flow in lines at a desirable level so that they do not become overloaded. Due to this requirement, actions changing the status of power lines are not considered, because it is preferable to keep them all connected, as they initially are. Actions regarding busbar splitting are carefully considered. For the RL agent, we focus on the final substation configurations. The number of actions depends on the number of elements connected to the substation and can be calculated as 2^(n−1), where n is the total number of elements connected to the substation. The number of illegal actions, such as disconnecting a load or generator, can be calculated as 2^B − 1, where B is the number of generators and loads at the substation. The analysis of busbar configurations is in Table 2. When illegal actions are excluded, the overall number of busbar-splitting actions for the observed test network is 179 (including the do-nothing action). This number of actions is too large for this power network: based on expert knowledge, the network does not have such a large number of topologies that satisfy the control requirements. Further action selection is based on the basic n − 1 security principle in network planning, which states that if a component fails or is shut down in a network operating at the maximum forecast levels of transmission and supply, network security must still be guaranteed. This leads to the following constraints:

1. The minimum number of elements that must be connected to each substation is 2;
2. At least 2 of the elements connected to a substation must be lines.
With the two above criteria, the number of potential substations for splitting is halved. Since only one of these topology actions can be chosen at each timestep, the final selection considers only actions that lead to desirable power network topologies. The depth of changes, i.e., the number of substations split at the same time, is selected to be ≤3 for this power network. Thus, only three substations are considered for the final selection of potential topologies. In addition to the selected actions, the action space includes actions that restore the initial topology of substations (every bus element connected to the same bus section) and a do-nothing action, resulting in a final action space with only nine actions, shown in Table 3.
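The configuration counts used above (2^(n−1) configurations per substation, 2^B − 1 of them illegal) can be checked with a short helper; the substation here is described only by its element counts and is hypothetical:

```python
def busbar_split_actions(n_elements, n_injections):
    """Count two-bus configurations of a substation with n_elements
    connected objects, and how many remain legal after removing the
    configurations that isolate a load or generator.

    Total configurations: 2**(n-1), since swapping the two bus
    sections gives an equivalent topology. Illegal configurations:
    2**B - 1, where B is the number of loads and generators.
    """
    total = 2 ** (n_elements - 1)
    illegal = 2 ** n_injections - 1
    return total, total - illegal

# e.g. a hypothetical substation with 6 elements, 2 of them injections:
total, legal = busbar_split_actions(6, 2)
print(total, legal)  # 32 29
```

Summing such counts over all substations, then filtering by the two criteria above, is what reduces the raw action space to the nine actions of Table 3.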
An observation is a description of the power grid as perceived by the agent. The observation space with all available observations at the current timestep has 368 features, as shown in Table 4. We reduced the observation space so that it contains the voltages and currents at both ends of the branches, the capacity of each power line, and the topology vector, which, for each object (load, generator, line end), gives the bus to which the object is connected in its substation. The reduced observation space contains 157 features; the selected attributes are bolded in Table 4.

The input layer of the neural network matches the size of the observation space (157). The size of the output layer corresponds to the size of the action space (9). There are also two hidden layers, each with 128 neurons.
Finally, we used a simple reward function that just counts the number of timesteps the agent has successfully managed to perform. It adds a constant reward for each timestep successfully handled.
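The 157-128-128-9 architecture described above can be sketched as a plain feed-forward pass in NumPy; the random weights are stand-ins for the trained Q-network, so this only illustrates the layer shapes, not the RLlib model:

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, HIDDEN, N_ACTIONS = 157, 128, 9

# Hypothetical weights; in training these are learned by DQN.
W1 = rng.normal(scale=0.1, size=(OBS_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
W3 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))

def q_values(obs):
    """Forward pass: observation vector -> one Q-value per action."""
    h1 = np.maximum(obs @ W1, 0.0)   # hidden layer 1 (ReLU)
    h2 = np.maximum(h1 @ W2, 0.0)    # hidden layer 2 (ReLU)
    return h2 @ W3                   # linear output head

obs = rng.normal(size=OBS_DIM)       # stand-in for the reduced observation
q = q_values(obs)
greedy_action = int(np.argmax(q))    # index into the nine-action space
print(q.shape)  # (9,)
```

The greedy action index maps directly onto Table 3, e.g. index 8 would be the do-nothing action under the numbering used there.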

Algorithm Selection and Hyperparameters Used
Because both the selected observation space and the action space are discrete, and because of the algorithm's simplicity, DQN and its derivatives are selected for analysis. The Deep Q-Network algorithm and its derivatives used in this research are from the standard RL library RLlib. They implement all of the DQN improvements evaluated in [24]. The DQN framework is shown in Figure 5. Among the DQN derivatives, the best-performing algorithm is dueling double DQN with prioritized replay, and it is used in this research.

The hyperparameters used for training the RL agent are the default parameters set in Ray RLlib for the DQN algorithm, except for those tuned as listed in Table 5.
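As an illustration only, a DQN configuration dictionary in the style of older RLlib releases might look as follows; the key names follow common RLlib conventions and every value is a placeholder, not a tuned hyperparameter from Table 5:

```python
# Hypothetical RLlib-style DQN configuration (placeholder values; the
# actual tuned hyperparameters are the ones listed in Table 5).
dqn_config = {
    "dueling": True,                 # dueling architecture [26]
    "double_q": True,                # double Q-learning [25]
    "prioritized_replay": True,      # prioritized experience replay [27]
    "lr": 5e-4,
    "gamma": 0.99,
    "train_batch_size": 32,
    "hiddens": [128, 128],           # two hidden layers of 128 neurons
}
```

Flipping `dueling`, `double_q`, or `prioritized_replay` individually is how the DQN-derivative comparison in the training phase can be reproduced.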

Results
In this section, the performance of the proposed approach is analyzed. First, the training phase and results for several agents are presented. Second, the evaluation of the agent on 100 unseen scenarios is presented. The testing process is described to demonstrate that using the proposed RL agent to control the power network leads to greatly improved performance. A comparison with similar agents from the literature illustrates the benefits of the proposed approach. Third, the application effects of the proposed agent are shown. Fourth, the limitations of the presented algorithm are discussed.

Training Phase
An episode is a sequence of observations, actions, and rewards from an initial state to a terminal state, either succeeding or failing to manage the grid for the full duration. For the training phase, we used a set of 800 scenarios available from the dataset for the adopted IEEE 14-bus network used in [28]. Mean episode length is tracked during training, since the main purpose of the agent is to manage the power network for as long as possible. During training, early stopping is used to end training once the agent's performance is satisfactory.
Because both the selected observation and action space types are discrete, and for algorithm simplicity, DQN and its derivatives are selected for analysis. The DQN algorithms considered in this paper are those available in the RLlib package: DQN, double DQN, dueling DQN, dueling double DQN, and dueling double DQN with prioritized replay. The training was conducted with 1000 episodes and the results are shown in Figure 6. According to the results, DDDQN with prioritized replay has the best performance, so further analyses were conducted with that agent.
Figure 7 shows the training progress of the agent for several observation spaces. The yellow curve represents the final agent whose performance will be evaluated in unseen scenarios. Compared to the other agents, the selected agent can control the power network for a longer period (timesteps) in the earlier phases of the training process. The agents' observed metric is monotonically increasing, indicating that the agent is learning. We made our agent act only in hazardous situations, when the power flow of a given line is larger than the threshold (in this case 95% of the thermal limit) and the topology of the power grid is not optimal. In this case, training took 1096 episodes.

Testing Phase
To evaluate the agent, we used 100 scenarios that were not used in the training phase to see how the agent would perform in unseen scenarios. Evaluation results of the proposed RL agent are compared to the do-nothing agent. The result of the test is in Figure 8. Results for the do-nothing agent illustrate that the plugged-in renewable energy sources will lead to heavy overload in the power system if no control measure is taken. In that case, the system experiences a blackout after around 1528 timesteps on average (about 5 days). Using the proposed RL agent to control the power network leads to greatly improved performance. For all scenarios, the proposed RL agent automatically operates the power grid for longer than a day (288 timesteps) without expert help. For only 11 scenarios, it manages the power network for less than a week; and for 62 scenarios, it operates the grid successfully through the entire month (8064, the maximum number of timesteps). On average, it controls the power network for 6574 timesteps, or almost 23 days.
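The timestep arithmetic above follows from the 5-minute scenario resolution and can be checked directly:

```python
STEPS_PER_DAY = 24 * 60 // 5     # 288 five-minute timesteps per day
MONTH_STEPS = 28 * STEPS_PER_DAY # maximum episode length (a 28-day scenario)

print(MONTH_STEPS)                        # 8064
print(round(6574 / STEPS_PER_DAY, 1))     # 22.8 days, proposed agent average
print(round(1528 / STEPS_PER_DAY, 1))     # 5.3 days, do-nothing agent average
```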

For a fair comparison, our agent is compared with agents from the literature that aim to control the IEEE 14-bus power network with only topological actions and/or use the same RL algorithm (dueling double DQN). The details of the observed agents from the literature and the results of our agent are summarized in Table 6. Besides a much longer period of managing the power network, our approach is much simpler since it does not require additional learning techniques, and the observation and action spaces are reduced. The proposed DRL agent managed the grid by determining the optimal power network configuration and setting it at the beginning of the episode to prevent cascading failures and blackouts in the smart grid. The agent learned that the initial topology is not optimal and that the optimal topology can be set by a combination of actions 3 and 5.
Setting the optimal topology at the beginning of the episode made power system control much easier, since the sequence of decision-making actions required in disturbed situations is much shorter. This approach belongs to system-level control and considers only modifying the substation configuration. The control is designed to work in normal operating states of the system and applies control actions to control flows over transmission lines.

Performance of the Proposed Method
To fully evaluate the performance of the proposed DRL agent, we selected four scenarios with a different share of renewables shown in Figure 9. Cases 2 and 3 are similar in energy profiles, but case 3 is interesting because a blackout occurs in the network in the third timestep if no actions are taken. Case 4 is interesting due to the larger share of energy from renewable resources, where wind and solar energy made up almost 20% of total production.

To demonstrate the application effects of DRL, we investigate agent control performance for each case. Since the objective of the agent is power flow control to prevent cascading failures and blackouts, we focus on the maximum line capacity usage, which we record for the selected cases. A comparison is made between the do-nothing agent and our proposed RL agent. As shown in Figure 10, in all cases the agent controlled the network longer and kept the maximum line capacity usage ratio lower than the do-nothing agent. Case 3 was very challenging, wherein overload occurred at the beginning of the scenario and the agent had to respond quickly to eliminate the overload and prevent a blackout. This demonstrates the adaptability of our proposed control method.
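The maximum line capacity usage metric, together with the 95% hazard threshold mentioned earlier, can be sketched as a simple per-line loading ratio. The flow and limit values below are illustrative, not taken from the test cases:

```python
def max_capacity_usage(flows_mw, thermal_limits_mw):
    """Return the maximum line loading ratio rho = |flow| / thermal limit
    across all lines. The agent intervenes only when rho exceeds the
    0.95 hazard threshold (and the topology is not already optimal)."""
    return max(abs(f) / lim for f, lim in zip(flows_mw, thermal_limits_mw))

flows = [45.0, 80.0, 33.0]    # illustrative per-line active power flows (MW)
limits = [60.0, 82.0, 70.0]   # illustrative thermal limits (MW)

rho = max_capacity_usage(flows, limits)
print(rho > 0.95)             # True: a line is near its limit, agent acts
```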

Limitations of the Work
The main limitation of the proposed method stems from the RL algorithm used in this research: DQN algorithms support only discrete actions. This means that future extensions with continuous actions, such as generator redispatch, should be carried out carefully. To alleviate this problem, continuous actions must be converted into discrete ones, for example by applying techniques such as binning or clustering, which group continuous values into a finite set of discrete bins.
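The binning approach mentioned above can be sketched as a uniform discretization of a continuous control range. The redispatch range and the number of bins are illustrative assumptions, not values from the paper:

```python
def discretize(value, low, high, n_bins):
    """Map a continuous control (e.g. a redispatch setpoint in MW) to one of
    n_bins discrete action indices so a DQN-style agent can select it.
    Uses uniform bins; range and bin count here are illustrative."""
    value = min(max(value, low), high)       # clamp to the valid range
    step = (high - low) / n_bins
    return min(int((value - low) / step), n_bins - 1)

# e.g. redispatch between -50 MW and +50 MW split into 11 discrete levels
print(discretize(0.0, -50.0, 50.0, 11))      # 5, the middle (no-change) bin
```

Non-uniform bins (finer near zero, coarser at the extremes) or clustering of historically used setpoints are common refinements of this basic scheme.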


Conclusions
All tools and frameworks used in this research are open-source, and the algorithm is used as it is in the standard RL library without further improvements in the learning process. This is in contrast to previous studies where more complicated RL algorithms are employed and additional techniques such as supervised learning, imitation learning, or guided exploration are needed.
Besides the highlighted simplicity and reproducibility, we empirically demonstrated that the presented method significantly outperforms similar agents available in the literature. This work shows the possibility of an intelligent agent that automatically operates the power grid without expert help, using only a less costly control method: substation reconfiguration. To the best of the authors' knowledge, an agent controlling a power network for up to an entire month with similar methods has not been reported in the literature.
Future work can include the application of RL agents on larger and more heavily constrained networks than the IEEE 14-bus test system, with varied renewable energy penetration. Incorporation of other control variables (such as line switching, transformer tap control, and generator dispatch) in the RL agent formulation is also recommended. RL algorithms other than DQN should also be explored to improve the overall performance and the computational speed of training. Aside from secure power system management, power system losses should be considered as a further improvement of the proposed method.