Deep-Reinforcement-Learning-Based Two-Timescale Voltage Control for Distribution Systems

Abstract: Because of the high penetration of renewable energy and the installation of new control devices, modern distribution networks face voltage regulation challenges. Recently, the rapid development of artificial intelligence technology has introduced new solutions for high-dimensional, dynamic optimal control problems. In this paper, a deep reinforcement learning (DRL) method is proposed to solve the two-timescale optimal voltage control problem. All control variables are assigned to different agents: the discrete variables are solved by a deep Q network (DQN) agent, while the continuous variables are solved by a deep deterministic policy gradient (DDPG) agent. All agents are trained simultaneously with a specially designed reward aimed at minimizing the long-term average voltage deviation. A case study is carried out on a modified IEEE 123-bus system, and the results demonstrate that the proposed algorithm achieves similar or even better performance than a model-based optimal control scheme, with high computational efficiency and competitive potential for online application.


Background and Motivation
The high penetration of distributed generation (DG), such as photovoltaic (PV) units, has confronted distribution networks with the problem of voltage regulation. Usually, the voltage profiles in distribution networks are regulated by slow regulation devices (e.g., on-load tap changers (OLTCs) and shunt capacitors) and fast regulation devices (e.g., PV inverters and static var compensators (SVCs)). While these regulators all adjust the distribution of reactive power in the grid, real power flow can also affect the nodal voltages in distribution networks [1,2]. Thus, both the real and reactive power control of different devices should be taken into account in order to mitigate possible voltage violations.
The lack of measurement systems (e.g., supervisory control and data acquisition (SCADA) and phasor measurement units (PMUs)) in traditional distribution networks leads to insufficient measurement of network information, so voltage control methods generally adopt model-based regulation, which relies heavily on a precise physical model. In essence, voltage control through real and reactive power optimization is a highly nonlinear programming problem with abundant variables and massive constraints. Solving such problems with mathematical programming methods (e.g., the second-order cone relaxation technique [3] and duality theory [4]) is often limited by the number of variables, and may even fail when the scale of the distribution network is too large. Therefore, heuristic algorithms that are less dependent on the model, such as particle swarm optimization (PSO) [5] and the genetic algorithm (GA) [6], are applied to solve these problems. However, these algorithms suffer from high randomness and long search times, easily fall into local optima, and as a result cannot meet the requirement of real-time voltage control on a fast timescale. In addition, in both the mathematical programming methods and the heuristic algorithms, each optimization run is independent of the others; if the actual operating condition (e.g., DG outputs) changes slightly, the previous optimization results often cannot be reused to achieve a rapid solution.
Recently, artificial intelligence (AI) technology has developed rapidly and been successfully applied in different fields. Therefore, many scholars are interested in exploring its application in power systems. Among AI technologies, deep reinforcement learning (DRL), as a branch of reinforcement learning (RL) theory, employs the "trial and error" mechanism to interact with the dynamic environment to find the optimal policy for the agent [7]. It has great advantages for solving complex multivariable problems, and has already been employed in various power system optimization problems, such as electricity market planning [8], multi-agent equilibrium games [9], battery energy arbitrage [10], and scheduling the charging of electric vehicles [11].
The expansion of the coverage of SCADA and PMUs [12], as well as the construction of internet of things technologies, has provided an effective way for the advanced applications of voltage control based on data-driven and model-free methods. For example, the authors in [13] employ Q-learning to solve a reactive power optimization problem with the discrete control variables being the transformer tap ratio and shunt capacitor. However, the Q-learning method easily falls into the curse of dimensionality since it applies a table to store the corresponding action-value function, and it is only suitable for problems where the action space and state space are both discrete. Inspired by the strong exploration capacity of neural networks (NNs) over high-dimensional search spaces, the deep Q network (DQN) employs an NN to approximate the action-value function to deal with continuous state domains. In [14], a DQN is used to control shunt capacitors. In [15], a double DQN is applied to achieve the optimal control of thermostatically controlled loads (TCLs) to provide voltage-control-based ancillary services. In [16], a multi-agent DQN-based algorithm is put forward to control switchable capacitors, voltage regulators, and smart inverters, where the continuous variables of voltage regulators and smart inverters are discretized. Further, to deal with problems with continuous state and action spaces, deep deterministic policy gradient (DDPG) was put forward, where two NNs are used to approximate the policy function and the action-value function. In [17], DDPG is applied to learn the control policy of generators in order to regulate all bus voltages into a predefined safe zone. In [18], a voltage-sensitivity-based DDPG method is proposed to realize the local control of PV inverters, and with a specifically designed reward, the goal of minimizing the network power loss and ensuring safe operation in the grid can be achieved.
In [19], a novel adaptive volt-var control algorithm based on DDPG is proposed to realize the real-time control of smart inverters in reactive power management. However, the existing voltage control methods using DRL only focus on the reactive power control, and cannot deal with the discrete and continuous control variables simultaneously.

Novelty and Contribution
To overcome the above limitations, a DRL method combining DQN with DDPG is proposed in this paper to deal with discrete and continuous control variables simultaneously. In this method, the discrete variables are solved by a DQN agent and the continuous variables are solved by a DDPG agent. Considering the response time and control cost of different equipment, a two-timescale voltage control problem is put forward, where the capacitor configuration is decided on the long timescale and the outputs of PV inverters and energy storage batteries are adjusted on the short timescale. The problem is further cast as a Markov decision process (MDP) and solved by the proposed DRL method. A reward is specially designed to achieve the goal of minimizing the long-term average voltage deviation. Both the DQN agent and the DDPG agent share the same environment (i.e., the distribution network), and are trained by their interaction with the environment.
The contributions of this article are outlined as follows.
(1) Multiple types of control equipment, including capacitors, energy storage batteries, and PV inverters, are considered, and a two-timescale voltage control problem is formulated to take the control requirements of these devices into account.
(2) The control variables are assigned to different agents according to their properties, and these agents share the same environment and are trained simultaneously.
(3) A DRL method is proposed to solve the optimal voltage control problem, where the discrete variables are solved using a DQN agent and the continuous variables are solved using a DDPG agent; this method can realize real-time control.

Voltage Problem Formulation
In this section, the two-timescale voltage control problem is formulated, where the control devices, including capacitors, batteries, and PV inverters, are considered.

System Description
In this voltage control problem, the real power regulation is achieved by adjusting the charging/discharging power of the batteries, and the reactive power regulation is realized by adjusting the on/off states of the capacitors and the outputs of the PV inverters. Taking the response time and control cost of different devices into account, the total control process can be divided into long timescale control and short timescale control. Specifically, the entire time period can be divided into N_T intervals, and each interval can be further divided into N_t slots. On the long timescale, the capacitors' configurations are decided at the beginning of each interval T, and on the short timescale the outputs of the PV inverters and batteries are adjusted at the beginning of each slot t.
(1) Capacitor modeling

During each interval T, the reactive power supported by the capacitor installed at bus i, Q_Cap,i(T, t), can be expressed as a function of the binary control variable a_cap,i(T) ∈ {0, 1}, which indicates the off/on status of the capacitor; that is,

Q_Cap,i(T, t) = a_cap,i(T) · Q_Cap,i^rated,

where Q_Cap,i^rated is the rated reactive power of the capacitor. When a_cap,i(T) = 1, the capacitor is connected to the grid and provides reactive power Q_Cap,i^rated; when a_cap,i(T) = 0, the capacitor is disconnected from the grid.
(2) Battery modeling

During every slot t, the state variable (state of charge) of the battery installed at bus i is denoted as SOC_i(T, t), which is subject to upper and lower boundaries [20]; that is,

SOC_i,min ≤ SOC_i(T, t) ≤ SOC_i,max.

The charging/discharging power of the battery, P_batt,i(T, t), can be expressed as

P_batt,i(T, t) = a_batt,i(T, t) · P_batt,i^max,

where a_batt,i(T, t) ∈ [−1, 1] is the control variable; when a_batt,i(T, t) is positive, the battery is charging, and when it is negative, the battery is discharging. The new state of charge after taking the control action can be represented as

SOC_i(T, t + 1) = SOC_i(T, t) + P_batt,i(T, t) · Δt,

where Δt is the slot duration.
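As an illustration, the battery model above can be sketched in a few lines of Python; the function name, units (SOC in kWh, power in kW, time in hours), and default values are illustrative, not part of the original formulation:

```python
def battery_step(soc, a, p_max=100.0, dt=5 / 60):
    """One short-timescale battery update (illustrative names/units).

    soc: state of charge in kWh; a in [-1, 1], positive = charging;
    p_max: rated charge/discharge power in kW; dt: slot length in hours.
    """
    p = a * p_max            # charging/discharging power P_batt = a * P_max
    return p, soc + p * dt   # energy-based state-of-charge update
```

The SOC boundary constraint is enforced separately when the action is executed (see the forced-constraint output described in Section 3).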

(3) PV inverter modeling

Suppose all PV units in the grid are equipped with a smart inverter. In every slot t, the active power provided by the PV unit installed at bus i is known as P_PV,i(T, t) and its apparent power rating is S_PV,i^rated. Then, the reactive power provided by the smart inverter, Q_PV,i(T, t), can be expressed as

Q_PV,i(T, t) = a_pv,i(T, t) · sqrt( (S_PV,i^rated)² − (P_PV,i(T, t))² ),

where a_pv,i(T, t) ∈ [−1, 1] is the control variable of the PV inverter.
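The remaining reactive capacity of a smart inverter is bounded by its apparent power rating; a minimal Python sketch (illustrative names) follows:

```python
import math

def pv_inverter_q(a, p_pv, s_rated):
    """Reactive power of a smart PV inverter.

    a in [-1, 1]: control variable; p_pv: current active output;
    s_rated: apparent power rating. The available reactive capacity is
    sqrt(S_rated^2 - P_PV^2), i.e., it shrinks as active output grows.
    """
    q_max = math.sqrt(max(s_rated ** 2 - p_pv ** 2, 0.0))
    return a * q_max
```

For example, an inverter rated at 5 kVA producing 3 kW of active power can supply up to 4 kvar of reactive power.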

Two-Timescale Voltage Control Model Formulation
In this paper, a radial distribution network with N + 1 buses is considered, where Bus 0 is the root, representing the point of common coupling. The voltage magnitude, active power, and reactive power are all converted to per unit (p.u.). The objective of the voltage control problem is to minimize the long-term average voltage deviation by configuring the on/off status of the capacitors at every interval T on the long timescale and adjusting the charging/discharging power of the batteries and the outputs of the PV inverters at every slot t on the short timescale. Then, the two-timescale voltage control problem based on the power flow equations can be formulated as follows: where U_j(T, t) is the voltage magnitude of bus j; I_ij(T, t) is the current amplitude of segment (i, j); r_ij and x_ij are the resistance and reactance of segment (i, j), respectively; P_ij(T, t) and Q_ij(T, t) are the active and reactive power flowing from bus i to bus j, respectively; ψ(j) is the parent bus set of bus j, from which power flows to bus j; and φ(j) is the child bus set of bus j, to which power flows from bus j. It can be observed that the optimization problem (9) involves many control variables, including the continuous variables of the batteries and PV inverters and the discrete variables of the capacitors, which makes problem (9) non-convex and generally NP-hard. When the grid is large and thus involves many control variables, traditional model-based methods may obtain suboptimal solutions, consume excessive time, or even fail to find a solution. Additionally, this is a multi-stage planning problem, where the decisions of each type of controller are not made at the same stage. To overcome these difficulties, a model-free method based on deep reinforcement learning is introduced to solve the problem, which is detailed in Section 3.
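Given the variables defined above, the objective and constraints take the standard DistFlow branch-flow form; the following is a reconstruction consistent with those definitions (the net injections p_j and q_j, collecting load, PV, battery, and capacitor power at bus j, are assumed notation):

```latex
\begin{aligned}
\min \quad & \frac{1}{N_T N_t}\sum_{T=1}^{N_T}\sum_{t=1}^{N_t}\sum_{j=1}^{N}\bigl|U_j(T,t)-1\bigr| \\
\text{s.t.}\quad
& \sum_{i\in\psi(j)}\!\bigl(P_{ij}(T,t)-r_{ij}I_{ij}^2(T,t)\bigr)+p_j(T,t)=\sum_{k\in\varphi(j)}P_{jk}(T,t), \\
& \sum_{i\in\psi(j)}\!\bigl(Q_{ij}(T,t)-x_{ij}I_{ij}^2(T,t)\bigr)+q_j(T,t)=\sum_{k\in\varphi(j)}Q_{jk}(T,t), \\
& U_j^2(T,t)=U_i^2(T,t)-2\bigl(r_{ij}P_{ij}(T,t)+x_{ij}Q_{ij}(T,t)\bigr)+\bigl(r_{ij}^2+x_{ij}^2\bigr)I_{ij}^2(T,t), \\
& I_{ij}^2(T,t)\,U_i^2(T,t)=P_{ij}^2(T,t)+Q_{ij}^2(T,t),
\end{aligned}
```

together with the capacitor, battery, and PV inverter constraints given in the device models above.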

Deep Reinforcement Learning Solution
In this section, the voltage control problem is first formulated as an MDP, and then a model-free solution based on deep reinforcement learning is put forward, in which the control variables of different controllers are assigned to different agents. The solution of discrete variables is based on a DQN agent while the solution of continuous variables is based on a DDPG agent.

Markov Decision Process
In order to solve the voltage control problem with DRL algorithms, the optimal configurations of the different controllers have to be modeled as an MDP. An MDP is defined by the tuple (S, A, P, R, γ), and it describes the interaction process between the agents (i.e., the different controllers) and the environment (i.e., the power flow of the distribution system). In this paper, for each agent, the state space S is continuous while the action space A is either discrete or continuous. P, usually unknown, is the state transition probability, indicating the probability density of the next state s_{t+1} ∈ S under the current state s_t and action a_t. R is the reward on each transition, denoted as r_t = R(s_t, a_t), and γ ∈ [0, 1] is the discount factor. Then, the goal of the voltage control problem is to solve the MDP; that is, to learn the optimal policy of each agent to maximize the reward, which is associated with the long-term average voltage deviation.
In DRL algorithms, the policy µ, expressed as µ(a|s), is a mapping function from the state s_t to the action a_t taken by the agent. During the training process, the action-value function, also called the Q-function, represents the expected discounted reward after taking action a in state s with policy µ, and can be denoted as

Q^µ(s, a) = E_µ[ Σ_{k=0}^{T_all} γ^k r_k | s_0 = s, a_0 = a ],

where T_all is the episode length. Using the Bellman equation, the Q-function can be further expressed as

Q^µ(s_t, a_t) = E_µ[ r_t + γ Q^µ(s_{t+1}, a_{t+1}) | s_t, a_t ].

Then, solving for the optimal policy µ* is equivalent to solving for the optimal Q-function; that is,

Q*(s, a) = max_µ Q^µ(s, a),  µ*(s) = argmax_a Q*(s, a).
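The Bellman relation above underlies the temporal-difference updates used by both agents. For intuition, a tabular version of the backup (plain Q-learning, with illustrative names) can be written as:

```python
def q_backup(q, s, a, r, s_next, actions, gamma=0.99, alpha=0.1):
    """One Q-learning backup on a tabular Q-function.

    q: dict mapping (state, action) -> value; alpha: learning rate.
    Moves Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a').
    """
    old = q.get((s, a), 0.0)
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return q[(s, a)]
```

The DQN and DDPG agents described below replace the table with neural network approximators, but the target being regressed toward has the same structure.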

DQN-Based Agent for Discrete Variables
The configurations of the capacitors are made at the beginning of each interval T. For the discrete variables of capacitors, a DQN-a value-based DRL method-is introduced to handle the control problem with continuous state space and discrete action space.
The classic DQN method, based on the Q-learning method, uses a deep neural network (DNN) to estimate the continuous Q-function. This DNN is referred to as the Q network, Q(s, a; θ_Q), whose input is the state vector and whose output is the Q-values of all possible actions. An experience replay buffer D is used to store experiences e_T = (s_T, a_T, r_T, s_{T+1}), and a mini batch of M randomly sampled experiences is applied. In order to update the parameters of the Q network, a target Q network with parameters θ_Q' is employed. Then, using stochastic gradient descent (SGD), the parameters of the Q network can be updated based on the mini batch and the loss function, which can be expressed as

L(θ_Q) = (1/M) Σ_{j=1}^{M} [ r_j + γ max_{a'} Q'(s_{j+1}, a'; θ_Q') − Q(s_j, a_j; θ_Q) ]²,

where the parameters of the target Q network θ_Q' are updated by periodically copying the parameters of the Q network θ_Q, every B steps.
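The mini-batch TD loss can be sketched without any framework dependency; here `q` and `q_target` stand in for the Q network and the (frozen) target Q network, each mapping a state to a list of per-action Q-values (the interface is illustrative):

```python
def dqn_loss(batch, q, q_target, gamma=0.99):
    """Mean squared TD error over a mini batch of (s, a, r, s_next).

    q, q_target: callables, state -> list of Q-values (one per action).
    The target y = r + gamma * max_a' Q'(s', a') uses the fixed target
    network, which stabilizes training.
    """
    loss = 0.0
    for s, a, r, s_next in batch:
        y = r + gamma * max(q_target(s_next))  # fixed-target Bellman backup
        loss += (y - q(s)[a]) ** 2             # squared TD error
    return loss / len(batch)
```

In practice this quantity is minimized by SGD over the Q network's parameters, with the target network copied from the Q network every B steps.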
In order to ensure that the agent can both explore the unknown environment and exploit the knowledge it has already acquired, the ε-greedy strategy is employed to select actions; that is,

a_T = a random action in A, if β < ε;  a_T = argmax_a Q(s_T, a; θ_Q), otherwise,

where ε ∈ [0, 1] is a constant and β ∈ [0, 1] is randomly generated by the computer. When β < ε, the agent randomly selects an action from the action space; otherwise, the agent selects the action with the maximum Q-value in the current state. The capacitors' configuration based on the DQN is depicted in Figure 1. During the training period, the agent selects actions based on (18), while during the execution process, based on the current state, the agent selects the action with the maximum Q-value.
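The ε-greedy selection rule is a few lines of Python (names illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Select an action index from a list of Q-values.

    With probability epsilon, explore (uniform random action);
    otherwise exploit (action with the maximum Q-value).
    """
    if rng.random() < epsilon:                      # beta < epsilon: explore
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

During execution, the same function with epsilon = 0 recovers the purely greedy policy used after training.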

DDPG-Based Agent for Continuous Variables
Based on the configuration of capacitors in the intervals, the outputs of PV inverters and batteries are adjusted at the beginning of each slot t. For the continuous control variables of PV inverters and batteries, DDPG is applied to deal with the control problem with continuous state space and continuous action space.
The DDPG method not only employs a DNN to approximate the Q-function, but also uses a DNN to estimate the policy function. It adopts a typical actor-critic framework, which realizes the policy action and action evaluation by designing the actor network µ(s; θ_µ) and the critic network Q(s, a; θ_Q), respectively. As in the DQN, a target actor network µ'(s; θ_µ') and a target critic network Q'(s, a; θ_Q') are also applied.
During the training process, the continuous action is decided based on the following function: a_t = µ(s_t; θ_µ) + ξ_t (19), where ξ_t is the noise used to randomly explore actions in the action space. An experience replay buffer and mini batch are also employed. Then, the critic network can be updated by minimizing a loss function analogous to (17); that is,

y_j = r_j + γ Q'(s_{j+1}, µ'(s_{j+1}; θ_µ'); θ_Q'), (20)

L(θ_Q) = (1/M) Σ_{j=1}^{M} [ y_j − Q(s_j, a_j; θ_Q) ]². (21)

The actor network can be updated using the policy gradient, which can be expressed as

∇_{θ_µ} J ≈ (1/M) Σ_{j=1}^{M} ∇_a Q(s, a; θ_Q)|_{s=s_j, a=µ(s_j; θ_µ)} ∇_{θ_µ} µ(s; θ_µ)|_{s=s_j}. (22)

Then, the target networks are soft-updated as follows:

θ_Q' ← λ θ_Q + (1 − λ) θ_Q',  θ_µ' ← λ θ_µ + (1 − λ) θ_µ', (23)

where the parameter λ ≪ 1.
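The soft (Polyak) update of the target networks is simple enough to state directly; here parameters are represented as plain lists of floats for illustration:

```python
def soft_update(target, source, lam=0.005):
    """Polyak averaging: theta' <- lam * theta + (1 - lam) * theta'.

    target, source: parameter vectors as lists of floats (illustrative);
    lam << 1, so the target network tracks the learned network slowly,
    which stabilizes the bootstrapped targets.
    """
    for i in range(len(target)):
        target[i] = lam * source[i] + (1 - lam) * target[i]
    return target
```

With a deep learning framework the same loop runs over the corresponding parameter tensors of the target and learned networks.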

For the PV inverter and battery agent, the state consists of the voltage amplitudes of all buses and the state of charge of the batteries at time t, that is, s_PVbatt(t) = [U^T(t), SOC^T(t)]^T. The action collects the control variables of the PV inverters and batteries, that is, a_PVbatt(t) = [a_pv^T(t), a_batt^T(t)]^T.
Furthermore, the action implementation method of forced constraint output is adopted in order to take the capacity boundaries of the batteries into account; that is, when SOC_i(t + 1) = SOC_i(t) + a_batt,i(t) · P_batt,i^max · Δt falls outside the upper and lower boundaries, only the chargeable (dischargeable) amount SOC_i,max − SOC_i(t) (respectively SOC_i(t) − SOC_i,min) is charged (discharged).
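The forced-constraint output can be sketched as a clipping step that converts the commanded power into the power actually applied (names and units are illustrative; SOC is in kWh here):

```python
def constrained_charge(soc, p_cmd, dt, soc_min, soc_max):
    """Clip a charge/discharge command to respect SOC boundaries.

    soc: current state of charge (kWh); p_cmd: commanded power (kW,
    positive = charging); dt: slot length (hours). Returns the power
    actually applied and the resulting SOC.
    """
    soc_next = min(max(soc + p_cmd * dt, soc_min), soc_max)  # enforce bounds
    p_applied = (soc_next - soc) / dt  # only the feasible amount is charged
    return p_applied, soc_next
```

For example, a battery at 580 kWh with a 600 kWh ceiling, commanded to charge at 100 kW for half an hour, only absorbs 40 kW on average.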
The control of PV inverters and batteries based on DDPG is demonstrated in Figure 2. During the training period, the state s PVbatt (t) is fed into the actor network and an action a PVbatt (t) is generated based on (19). Then, the state and action enter the critic network and the corresponding Q-value is generated. During the execution period, with well-trained networks, the agent chooses its action based on the state, that is, a PVbatt (t) = µ(s PVbatt (t); θ µ ).

Algorithm and Computation Process
The two-timescale voltage control for distribution systems based on DRL is demonstrated in Algorithm 1. In each slot, the experience is stored in the replay buffer D_DDPG; a random mini batch is then sampled from D_DDPG and the actor and critic networks are updated using (20), (21), and (22); finally, the target actor and critic networks are soft-updated using (23).
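The overall two-timescale loop can be sketched as follows; `env`, `dqn`, and `ddpg` are hypothetical objects exposing the usual select/store/update interface, and the function returns the accumulated reward purely for illustration:

```python
def train_two_timescale(env, dqn, ddpg, n_episodes, n_T, n_t):
    """Training loop sketch: one DQN decision per interval T,
    one DDPG decision per slot t (interfaces are illustrative)."""
    total = 0.0
    for _ in range(n_episodes):
        s_long = env.reset()
        for _T in range(n_T):
            a_cap = dqn.select(s_long)             # epsilon-greedy capacitor action
            s_short = env.apply_capacitors(a_cap)
            for _t in range(n_t):
                a_cont = ddpg.select(s_short)      # noisy actor action
                s_next, r = env.step(a_cont)       # power flow + reward
                ddpg.store(s_short, a_cont, r, s_next)
                ddpg.update()                      # critic loss, policy gradient, soft update
                total += r
                s_short = s_next
            dqn.store(s_long, a_cap, env.interval_reward(), s_short)
            dqn.update()                           # SGD on TD loss, periodic target copy
            s_long = s_short
    return total
```

The key structural point is the nesting: the capacitor agent observes the state once per interval and its reward aggregates the slot-level outcomes, while the continuous agent acts and learns at every slot.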

Numerical Study
In this section, the implementation details of the proposed two-timescale voltage control scheme based on DRL are described.

Simulation Setup
A modified IEEE 123-bus distribution test system was applied to carry out the numerical tests. Based on the original 123-bus multi-phase unbalanced network [21], the system was changed into a balanced system and the numbering of each node was reorganized as shown in Figure 3. Twelve PV units with smart inverters were installed at 12 buses, and their capacities and locations are listed in Table 1. Four capacitors were installed in the grid at buses 20, 59, 66, and 114, each with a capacity of 40 kvar. Four energy storage batteries were installed at buses 56, 83, 96, and 116, each with a maximum capacity of 600 kWh and rated charge/discharge power of 100 kW. The load power was modified from the real data of an area in Jiangsu province, China (i.e., on each bus, the peak load value was set to the sum of the loads on the three phases in the original 123-bus distribution network, and the standardized load curve was the same as the standardized real load curve of that area). Thus, the load value of each bus was equal to the load curve multiplied by its peak load. All parameters in this distribution system were converted to a consistent base, where the base voltage was 4.16 kV and the base power was 100 MVA.

In the theory of DRL, the parameters that define the architecture of the NNs are of great importance, and the selection of architecture depends on the actual application scenario. For example, convolutional neural networks (CNNs) are often used to deal with complex problems in the image domain, whereas recurrent neural networks (RNNs) are often used to process sequence data. For the voltage control problem raised in this paper, a fully connected NN was sufficient for the task at hand. Based on [22], the number of hidden layers was chosen to be 2, and the number of neurons in each hidden layer was first selected according to the sizes of the adjacent layers.
Then, the model was trained, the output was checked for overfitting, and the parameters were adjusted until a satisfactory output was obtained. According to the above system setting, the four capacitors generated 2^4 = 16 combinations of discrete actions, and the PV inverters and batteries produced 16 continuous actions. For the DQN agent, the Q network consisted of fully connected layers: one input layer, two hidden layers with 95 and 22 neurons, respectively, and one output layer with 16 neurons. The sigmoid function was used at the end of the output layer to keep the Q-value within [0, 1]. For the DDPG agent, the actor and critic networks were also composed of fully connected layers, with hidden layers of 90 and 30 neurons (actor) and 46 and 14 neurons (critic), respectively. The output layer of the actor network consisted of 16 neurons and the output layer of the critic network had 1 neuron. The tanh function was applied at the end of the actor network to keep the action variables within [−1, 1]. All the hidden layers used the rectified linear unit (ReLU) as the activation function. The detailed settings of the other hyper-parameters are given in Table 2. The power flow computation was employed as the environment for these DRL agents. The proposed algorithm was implemented in Python using the PyTorch framework, and the training process was executed on a CPU.
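As a sanity check on the network sizes above, the parameter count of a fully connected network follows directly from the layer widths; the helper and the assumed input width of 123 (one voltage magnitude per bus) are illustrative, since the exact state dimension is not stated here:

```python
def mlp_param_count(sizes):
    """Weights + biases of a fully connected network with the given
    layer widths (input, hidden..., output)."""
    return sum(sizes[i] * sizes[i + 1] + sizes[i + 1]
               for i in range(len(sizes) - 1))

# Q network of the DQN agent, assuming a 123-dimensional state:
# hidden layers of 95 and 22 neurons, 16 output Q-values.
dqn_params = mlp_param_count([123, 95, 22, 16])
```

Networks of this size (on the order of 10^4 parameters) train comfortably on a CPU, consistent with the setup described above.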

Case Study
In this subsection, we evaluate the performance of the proposed DRL scheme using the modified IEEE 123-bus system. A total of 2880 data points, comprising load data and PV outputs, were used as training data, as demonstrated in Figure 4. Meanwhile, another 288 data points were used as test data, as depicted in Figure 5.

First, based on the power flow, the voltage distribution of the grid without any voltage control was analyzed. In this paper, the interval T was defined as 30 min and the slot t was assumed to be 5 min. The PV outputs were based on the clear day in Figure 5. The buses experiencing voltage issues were bus 1, bus 2, bus 7, and bus 123, which violated the maximum voltage limit of 1.05. Take for example the voltage amplitudes at bus 1 and bus 24 (with PV unit installation), which are depicted by the blue curve in Figure 6.

Then, the learning performance of the proposed DRL method was investigated. Following the procedure shown in Algorithm 1, the DQN and DDPG agents were trained. During the training process, the daily PV generation and load consumption combinations were randomly chosen from the training set, as demonstrated in Figure 4, to represent different grid operation conditions. Training was performed for 300 episodes, and each episode finished after training on the 288 samples of one day. Figure 7 displays the episode reward values and average rewards in the training period, where the episode reward value is the sum of all the rewards obtained during a given episode and the average reward value is the average of every four episode reward values. It can be observed that in the early stages the reward value was very low because of the limited learning experience. As the training process continued, the agents gradually evolved and the reward value increased. Later, the reward curves fluctuated due to the random control attempts of both the DQN and DDPG agents to determine the correct policy actions.
After about 60 episodes, the reward curve flattened out gradually, indicating the DRL agents' ability to realize voltage control. During the test period, the trained DRL agents were employed to control the capacitors, batteries, and PV inverters according to the test data in Figure 5, and the PV outputs were based on the clear day. As demonstrated in Figure 6, compared with the case without voltage control, these trained DRL agents demonstrated an effective performance for voltage control and all bus amplitudes were maintained within the safety limit, especially the buses having voltage issues. Thus, we can conclude that the proposed algorithm enables the controllers to explore the relationship between their configurations and the inherent uncertainty and variability in the PV outputs and load power, and to take corresponding policies when faced with new operating conditions.

Comparison with the Model-Based Optimal Control Scheme
In order to compare the performance of the proposed DRL method, a model-based optimal control scheme, namely a two-stage optimal control scheme, was applied. The model-based optimal control scheme aims to minimize the daily voltage deviation, and was assumed to have full knowledge of the model and parameters of the distribution network. In this method, the configurations of the capacitors, batteries, and PV inverters were decided at the beginning of each interval, and the outputs of the batteries and PV inverters were further adjusted based on the capacitors' configuration at the beginning of each slot.
As demonstrated in Figure 6, in most cases the control effect of the DRL-based method was similar to or even better than that of the model-based method: the model-based method optimizes the control over an entire day, while the DRL-based method can realize real-time control according to the current state of the power grid.
The execution time of the model-based control method and our proposed DRL-based method over all of the day's 288 samples is demonstrated in Table 3. It can be seen that the proposed DRL-based method took only 0.1964 s, much less than the 1107.7629 s of the model-based method. Therefore, the proposed algorithm shows high computational efficiency and has competitive potential for online application.

Table 3. Execution time of model-based control method and DRL-based method.

Method                                   Time (s)
Model-based control method               1107.7629
Our proposed DRL-based control method    0.1964

In order to evaluate the dynamic response performance of the proposed DRL-based controller under time-varying PV outputs, the case of a cloudy day was studied. The outputs of the PV units were based on the cloudy day in Figure 5, which shows a great deal of fluctuation. The results of the model-based optimal control scheme and the proposed DRL-based voltage control scheme are depicted in Figure 8. It can be seen that the proposed DRL-based controller could respond quickly to the PV fluctuations, which is very important for meeting the demands of real-time control. Additionally, in the model-based scheme, the controller needs prior knowledge of the PV outputs over a period of time, which is often inaccurate in the case of PV fluctuations. In the proposed DRL-based scheme, the controller adjusts its action based on the current state of the power grid, and is therefore more reliable.

In order to evaluate the control performance of the proposed DRL-based voltage controller under extreme weather conditions, a case study was carried out for a scenario with no PV outputs at all. The results of the model-based optimal control scheme and the proposed DRL-based voltage control scheme are demonstrated in Figure 9. The voltage amplitudes of all buses were controlled below 1.05, and it can be seen that the proposed DRL-based controller still had better control performance without PV outputs. From the results of Figures 7-9, it can be observed that the trained DRL-based controller worked very well in different scenarios and could adapt to similar but slightly different data, which verifies the generalization ability of the proposed algorithm.

Conclusions
In this paper, a two-timescale voltage control scheme based on a DRL method is proposed to control multiple types of equipment, including capacitors, energy storage batteries, and PV inverters, for optimal voltage control in the distribution network. The control variables are assigned to different agents according to their properties; these agents share the same environment and are trained simultaneously to cooperate with each other. Specifically, the discrete variables are solved using a DQN agent and the continuous variables are solved using a DDPG agent. A specially designed reward is applied to achieve the goal of minimizing the long-term average voltage deviation. Case studies showed that the proposed algorithm had similar or even better performance than the model-based optimal control scheme, and had high computational efficiency, enabling real-time control. Additionally, the proposed DRL-based controller could adjust its action based on the current state of the power grid; it had better dynamic response performance and could respond quickly to PV fluctuations. Future work will focus on designing the reward function to achieve more control objectives and to take various operating constraints into consideration.

Data Availability Statement: No new data were created or analyzed in this study; data sharing is not applicable to this article.

Conflicts of Interest:
The authors declare no conflict of interest.