Twin-Delayed Deep Deterministic Policy Gradient for Low-Frequency Oscillation Damping Control

Due to the large scale of power systems, latency uncertainty in communications can cause severe problems in wide-area measurement systems. To resolve this issue, a significant amount of past work focuses on emerging technology, including machine learning methods such as Q-learning, for addressing latency issues in modern controls. Although Q-learning can deal with the stochastic characteristics of communication latency, its Q-values can be overestimated, leading to high bias. To address the overestimation bias issue, we redesign the learning structure of the deep deterministic policy gradient (DDPG) and develop a twin-delayed deep deterministic policy gradient method for damping control under unknown latency in the power network. This paper also creates a novel reward function, taking into account the machine speed deviation, episode termination prevention, and the feedback from the action space. In this way, the system optimally damps down frequency oscillations while maintaining stable and reliable operation within defined limits. The simulation results verify the proposed algorithm from various perspectives, including a latency sensitivity analysis under high renewable energy penetration and a comparison with conventional and machine learning control algorithms. The proposed method shows a fast learning curve and good control performance under varying communication latency.


Introduction
Inter-area low-frequency oscillations pose significant challenges to reliable control and economic operation in a typical cyber-physical system such as a transmission network. For example, if an inter-area oscillation is poorly damped, it can trigger catastrophic disturbances, such as cascading outages that lead to widespread oscillations [1]. The failure to control frequency oscillation can severely damage power system stability and reliability; in the worst case, it can cause large-scale power outages and even blackouts. There have been several incidents of low-frequency inter-area oscillation, such as the one on 14 August 2003 in the Eastern Interconnection of the United States [2]. This incident caused 45 million people to lose their power supply for periods of up to three hours, and the outage was attributed to poor damping of low-frequency oscillations. Another incident took place in the southern region, where the power system broke down on 15 September 2011, again as a result of poor frequency oscillation damping. Such events are hard to avoid with traditional solutions because the maximum available transfer capability is limited [3][4][5]. For example, traditional power engineering approaches damp oscillations with power system stabilizers (PSSs) that rely on local measurements. However, it is hard to satisfy controllability and observability over inter-area modes if signals are measured only locally [6]. Fortunately, the observability of inter-area modes improved with the advent of wide-area measurement systems (WAMS) and the deployment of phasor measurement units (PMUs) [7][8][9][10][11]. Because modern Information and Communication Technologies (ICT) introduce communication delays [12], the uncertainty of the communication delay has become an essential research focus in power system damping control.
Since the communication delay significantly affects the damping control performance, researchers have proposed various control methods to solve this issue [13][14][15]. Mokhtari et al. develop a WAMS controller using fuzzy logic algorithms [16] to mitigate the adverse impact of continuous latency in the inter-area mode. Meanwhile, a series of inter-area oscillation damping controllers was developed to handle varying-latency issues [17,18]. Ref. [19] utilizes a special controller to keep the latency below 300 ms. The work in [20] employs networked predictive control to mitigate inter-area oscillations in power systems, while [21] utilizes H∞ control to achieve effective damping.
Furthermore, ref. [22] proposes a multi-input multi-output (MIMO) ARMAX model; it has similar accuracy to the MIMO subspace state space model but a lower model order. From a device perspective, static VAR compensators are adopted under various system conditions and types of generation [23][24][25]. Although these methods have merits from different perspectives, some of them neglect communication delays, and some assume correct network topology and system parameters. Unfortunately, such assumptions are hard to satisfy in reality, due to limited accessibility, network growth, and instantaneous communication congestion.
To target these issues, intelligent methods that are physically model-free have been proposed. For example, ref. [26] takes advantage of the data management technique CART, while ref. [27] utilizes a data-driven control method for damping control. Ref. [28] uses a deep learning WADC, but such a method relies extensively on past data and is unable to adapt to changes in transmission networks. One observation is that exploration of the system does help in capturing such changes. Fortunately, reinforcement learning (RL) provides a platform to understand the environment and learn the control policy accordingly. Among different RL methods, Q-learning can handle problems with stochastic transitions and rewards using a value function. Ref. [29] leverages this capability of learning a stochastic control by exploring the network with Q-learning. However, Q-learning fails when the state space is large [30][31][32][33].
The contributions of this paper are summarized as follows. First, to overcome the control challenges of a large state space, we combine the advantages of RL and deep learning to provide stochastic and robust control through WAMS, so that both the uncertainties and the time-varying delays can be taken into account through interactive learning. Second, this paper introduces a novel policy-based RL method to address the damping control of inter-area oscillation under unknown latency. Specifically, we build a power system testbed as the RL's interactive environment. Third, we define the state and action most suitable for damping control. Finally, a reward function that considers the physical measurements and the sustainability of RL is proposed. The remainder of this paper is organized as follows: Section 2 explains the background knowledge of RL and the policy gradient method. Section 3 elaborates on the specific design of the RL-based controller, including the design of state, action, and reward to maximize the control benefits and merge them into the power system context. Simulation results and shortcomings of the policy-based RL method are described in Section 4, followed by the conclusions in Section 5.

The Principles of RL Algorithms
Power system damping control deals with uncertainties and ambiguities across the entire power system. RL is a suitable tool for such issues because it helps control agents learn the optimal control actions with the highest accumulative reward. Essentially, most RL algorithms explore the environment by viewing it as a Markov decision process (MDP) [34]. The MDP is usually expressed as a tuple (S, A, E, r), composed of a state space S, an action space A, and a transition function E[s_{t+1} | s_t, a_t] that predicts the next state s_{t+1} based on the current state-action pair (s_t, a_t). Each pair (s_t, a_t) is used to calculate a reward for taking the present action in the present state. In power systems, the state-action pair is defined as the control action taken under the present operating conditions, whereas the reward is the score obtained after a control action. The policy function π reflects the agent's "intelligence", since it decides the action to take in a given state s_t. Therefore, the objective of the RL agent is to formulate an optimal model for policy-making; this model can use a gradient-based learning rule for the policy to achieve the highest accumulative reward. The corresponding RL operational principle diagram, shown in Figure 1, clearly displays the relationship between the environment and the agent. The agent acquires the state of the system under study through measurement and communication devices. Based on the observations, the agent then determines the corresponding action to control the state of the environment through the calculation of the reward. To improve control performance, the agent updates its policy at each time step of the process.
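The agent-environment loop described above can be sketched as follows; the scalar environment, the proportional policy, and the noise level are illustrative assumptions, not the paper's actual testbed.

```python
import random

def run_episode(env_step, policy, s0, horizon=50):
    """One episode of the agent-environment loop:
    observe state, act, receive reward, accumulate the return."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)            # action chosen by the current policy pi(s_t)
        s, r = env_step(s, a)    # environment returns next state and reward
        total += r
    return total

# Toy environment: drive a scalar "speed deviation" toward zero.
def env_step(s, a):
    s_next = s + a + random.gauss(0.0, 0.01)  # stochastic transition E[s_{t+1}|s_t,a_t]
    return s_next, -abs(s_next)               # reward penalizes the deviation

policy = lambda s: -0.5 * s                   # simple proportional policy
ret = run_episode(env_step, policy, s0=1.0)
```

A better policy yields a less negative return, which is exactly the signal a gradient-based learning rule would exploit.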

DDPG Algorithm
Many researchers focus on machine learning methods such as the deep Q-network (DQN) to address latency issues in modern controls. Although the DQN overcomes the exploration challenges in high-dimensional state spaces, it cannot work in a continuous action space. This is not acceptable in power system damping control, where the action space is continuous. Therefore, the DDPG algorithm is a promising solution because it relies on an actor-critic model to explore the continuous action space [35]. Besides, DDPG has good accuracy in learning under complex environments, as shown in [36]. Its value function is expressed as follows:

Q^µ(s_t, a_t) = E[ r(s_t, a_t) + γ Q^µ(s_{t+1}, µ(s_{t+1})) ].   (1)

In the critic network, we use Q to estimate the state of the WAMS and follow a specific distribution [35], as it estimates the effectiveness of the action being taken. In power system damping control, the action can be the reference value for the generator terminal voltage. To predict the actions, an actor network µ(s_{t+1}) is used, as it samples the states and conducts the value function estimation. However, the estimation is supervised by the critic, and its evaluation is denoted as:

y_t = r(s_t, a_t) + γ Q'(s_{t+1}, µ'(s_{t+1} | θ^{µ'}) | θ^{Q'}).   (2)

The above two functions are approximated by neural networks and parameterized by θ^Q and θ^µ [35]. In power systems, they are the parameters of the control performance models. For a mini-batch of M samples, we use the following loss function:

L(θ^Q) = M^{-1} Σ_i ( y_i − Q(s_i, a_i | θ^Q) )²,   (3)

where i is the mini-batch sample index [35]. The actor network is iteratively updated through the chain rule based on the expected return from the initial reward distribution, parameterized by θ^µ:

∇_{θ^µ} J ≈ M^{-1} Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=µ(s_i)} ∇_{θ^µ} µ(s | θ^µ)|_{s=s_i}.   (4)

The target critic and actor are denoted by Q' and µ'. Starting with their initial values, they are updated iteratively through soft updates [35].
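As a rough illustration of the DDPG critic update, the following sketch computes the bootstrapped targets and the mean-squared TD loss over a mini-batch; the linear critic and target actor are stand-ins for the neural networks, and all parameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(s, a, theta):
    """Linear critic Q(s, a | theta) as a stand-in for the neural network."""
    return theta[0] * s + theta[1] * a

def ddpg_critic_loss(batch, theta_q, theta_q_target, theta_mu_target, gamma=0.99):
    """Mean-squared TD loss over a mini-batch of M transitions:
    L = (1/M) * sum_i (y_i - Q(s_i, a_i | theta_q))^2,
    with y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    s, a, r, s_next = batch
    a_next = theta_mu_target * s_next                        # target actor mu'(s')
    y = r + gamma * critic(s_next, a_next, theta_q_target)   # bootstrapped target
    td = y - critic(s, a, theta_q)
    return np.mean(td ** 2)

# Mini-batch of M = 32 synthetic transitions (s, a, r, s').
M = 32
batch = (rng.normal(size=M), rng.normal(size=M),
         rng.normal(size=M), rng.normal(size=M))
loss = ddpg_critic_loss(batch, theta_q=np.array([0.5, 0.5]),
                        theta_q_target=np.array([0.5, 0.5]),
                        theta_mu_target=-0.3)
```

In the full algorithm this loss would be minimized by gradient descent on θ^Q, with the target parameters updated slowly toward the learned ones.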

Reinforcement Learning Based Controller
Based on the DDPG algorithm, we further realized that the twin-delayed deep deterministic policy gradient algorithm is needed to avoid the overestimation bias of DDPG. The framework of the proposed damping control scheme is shown in Figure 2 and is motivated by WAMS, in which the PMUs transmit phasor data to the phasor data concentrator (PDC). Since different media are deployed, the transmission delay affects the receiving time of the information that determines the algorithm inputs. The goal of Figure 2 is to achieve state observability of the whole system for RL control. The twin-delayed deep deterministic policy gradient controller interprets the PDC data and calculates the states and rewards. After that, control actions are output through the learning process. Neural networks are used to represent the actor and the critic; the backpropagation of the networks is realized by minimizing the loss function, as shown in Figure 3. To mitigate the oscillations and prevent the overestimation of Q-values, we use a pair of actors and critics to form the twin-delayed deep deterministic policy gradient algorithm. However, further work is needed to define the states and the actions. One of the challenges is the reward calculation, which is the key to the RL controller design. In the following subsections, we demonstrate how the RL controller is designed.

State of the Controller
The twin-delayed deep deterministic policy gradient controller has three elements: the state, the action, and the reward. We start with the design of the states. The generator voltage, current, and phase angle are monitored through PMUs [37]. We define the states s_t for the generators under study, denoted by g = 1, ..., G. Then, we define ω_{t,g} as the generator speed, and the associated speed perturbation as ∆ω_{t,g}. We use θ_{t,b} for the generator phase angle. In addition, the bus number and time are denoted by b and t, with upper bounds of B and T, respectively. Simply, the speed perturbation is:

∆ω_{t,g} = ω_{t,g} − ω_0,   (5)

where ω_0 is the nominal synchronous speed. Meanwhile, the states are summarized as follows:

s_t = {s_{t,1}, s_{t,2}, s_{t,3}} = {ω_{t,g}, ∆ω_{t,g}, θ_{t,b}},  g = 1, ..., G,  b = 1, ..., B.   (6)

We design the state to directly capture the rotor speed; therefore, we include s_{t,1}. Meanwhile, to capture the dynamic characteristics of the rotor, we utilize the rotor speed deviation to monitor the direct results after an action; therefore, we have s_{t,2} in Equation (6). Since the voltage angle is another significant factor that quantifies the state of the generators, we incorporate s_{t,3} in the design of the state.
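A minimal sketch of assembling the state vector from PMU measurements might look like the following; the function name, the 1.0 p.u. nominal speed, and the flat concatenation of the three state groups are assumptions for illustration.

```python
def build_state(omega, omega_nominal, theta_bus):
    """Assemble the per-step RL state from PMU measurements:
    s_t1: rotor speeds, s_t2: speed deviations, s_t3: bus voltage angles."""
    s1 = list(omega)                            # omega_{t,g}
    s2 = [w - omega_nominal for w in omega]     # delta-omega_{t,g}
    s3 = list(theta_bus)                        # theta_{t,b}
    return s1 + s2 + s3

# Example: G = 2 generators at 1.0 p.u. nominal speed, angles in radians.
state = build_state(omega=[1.001, 0.999], omega_nominal=1.0,
                    theta_bus=[0.12, -0.05])
```

The resulting flat vector is what the actor network would consume as its input layer.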

Action of the Controller
After designing the controller's state, it is important to identify the control action. The twin-delayed deep deterministic policy gradient controller serves as a special power system stabilizer (PSS) that regulates synchronous generator g's field winding voltage (v_{t,g}) at time t. Consequently, the controller's action is to increase or decrease the field voltage of each generator. Since the voltage is established and controlled immediately, we simply use the resulting voltage as the action. When there are multiple generators, the voltage control actions are grouped in an action vector:

a_t = {v_{t,1}, v_{t,2}, v_{t,3}, ..., v_{t,G}}.   (7)
By grouping the voltage signals, the learning agent can easily process and deliver the control signal to the automatic voltage regulator of each generator.

Reward Design for Enhanced Control Results
Now, we create the reward function. To damp the power system's frequency oscillations, different indicators are collected, including the rotor speed, its deviation, and the phase angle deviation between remote buses. To handle the high dimensionality and complexity of stability problems, we not only use power system variables but also add control effort from the RL agent into the reward function to improve the performance. In Equation (8), v_{t−1} is the previous action vector; it keeps a record of the feedback from the previous time steps, which helps achieve higher reward values and reduce the system oscillations. There are five terms in Equation (8), ranging from the physical-quantity-associated rewards (the first three terms) to the episode control (the fourth term) and the feedback of actions (the fifth term). The first term rewards controlling the speed of the generators ω_{t,g}. The second term captures the generator speed variation. The third term measures the bus angle differences for the generators under study; thanks to WAMS, we are able to obtain the angle differences across a large area, which provides a fresh perspective for better observation. The fourth term is a constant reward that prevents termination of the episode due to zero reward. The fifth term is the feedback of the action from previous time steps; to emphasize this feedback, it is designed in a square form. As we capture the major variables that impact the system stability, we combine them into one reward function. The reward design is novel, since it quantifies the physical values, includes the episode control, and adds the feedback from the RL agent's action. Together, these features help to achieve superior control performance.
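Since the exact coefficients of the reward terms are not reproduced here, the following sketch only illustrates the five-term structure; the weights w, the signs of the individual terms, and the function signature are all assumptions, not the paper's actual parameterization.

```python
def reward(omega, d_omega, d_theta, v_prev,
           w=(1.0, 1.0, 1.0, 0.1, 0.5), omega_nominal=1.0):
    """Five-term reward sketch:
    (1) rotor speed error, (2) speed deviation, (3) inter-area angle
    difference, (4) constant episode-keeping bonus, and
    (5) squared feedback of the previous action v_{t-1}."""
    w1, w2, w3, w4, w5 = w
    r = -w1 * sum(abs(x - omega_nominal) for x in omega)  # term 1: speed error
    r -= w2 * sum(abs(x) for x in d_omega)                # term 2: speed deviation
    r -= w3 * abs(d_theta)                                # term 3: angle difference
    r += w4                                               # term 4: anti-termination constant
    r -= w5 * sum(v ** 2 for v in v_prev)                 # term 5: squared action feedback
    return r

# Example: two slightly perturbed generators, previous actions of 1.0 p.u.
r_example = reward([1.001, 0.999], [0.001, -0.001], 0.17, [1.0, 1.0])
```

At the ideal operating point the reward reduces to the constant fourth term, so a non-terminated episode always accumulates value.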

Twin-Delayed Deep Deterministic Policy Gradient Method
Conventional solutions for damping control are usually model-based. However, parameter variation over time and communication latency are major issues for model-based solutions. Under these circumstances, RL algorithms are gaining popularity. However, some RL algorithms, like Q-learning, suffer from overestimation of the Q-values. Therefore, we solve the overestimation issue by applying the TD3 algorithm [38] to power system damping control. The twin-delayed deep deterministic policy gradient algorithm is an off-policy RL method that uses deep neural networks to compute an optimal policy that damps down the power system oscillations.
In the RL-based controller design, we expect the RL-based controller to have some salient features. First, the control agent manages and stores the actor and critic networks, both of which continue to improve stability during the learning process. The twin-delayed deep deterministic policy gradient algorithm maintains the actor and critic networks. The actor network takes the system state as input and outputs the damping control action. The mapping from the input to the output is a control policy learned in the twin-delayed deep deterministic policy gradient algorithm, whereas the critic evaluates the action made by the actor.
Next, the twin-delayed deep deterministic policy gradient controller maintains a dynamic replay memory collected from an environment interface. The replay memory stores the "experience" of how the environment reacts to the control agent's actions.
The replay memory mitigates the correlation in the sampled data by randomly selecting a batch of data from the replay memory at each time step. To damp oscillations of highly non-linear power systems under communication delays, we design the twin-delayed deep deterministic policy gradient controller with two deep neural networks, for the actor and the critic, respectively.
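A minimal replay memory of the kind described above could be sketched as follows; the capacity and the transition tuple layout are illustrative assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size experience replay: random sampling breaks the temporal
    correlation between consecutive transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=100)
for t in range(50):
    memory.push(s=t, a=0.0, r=-1.0, s_next=t + 1, done=False)
batch = memory.sample(8)  # random mini-batch for one training step
```

Each training step would draw such a mini-batch and feed it to the critic loss and the policy gradient described earlier.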
Lastly, for the loss function design of the twin-delayed deep deterministic policy gradient controller, we use the deterministic policy gradient method for the actor training:

∇_φ J(φ) = M^{-1} Σ_i ∇_a Q_{θ_1}(s, a)|_{a=π_φ(s)} ∇_φ π_φ(s),   (9)

where M is the mini-batch size.

Avoiding the Overestimation Bias
To ensure the control tasks operate in a continuous action space, an actor-critic setting is adopted. To avoid high bias in the damping control, the clipped double Q-learning method is used. This method treats the overestimation bias as an upper limit on the true estimated value [38]. It is inspired by Double DQN [39], where the target network is used to approximate the value function and extract the policy. When translated into the actor-critic setting, we update the present policy instead of the target policy, with two actors (π_{φ_1}, π_{φ_2}) and two critics (Q_{θ_1}, Q_{θ_2}), whose targets are:

y_1 = r + γ Q_{θ'_2}(s_{t+1}, π_{φ_1}(s_{t+1})),
y_2 = r + γ Q_{θ'_1}(s_{t+1}, π_{φ_2}(s_{t+1})).   (10)

To prevent the propagation of the overestimation when the smaller Q_θ has already overestimated the true value, the strategy is to use the biased Q_θ value as the upper bound of the less biased one [38]. Summing the experience reward r and the minimum of the critics' estimates, we formulate the value function target as follows:

y = r + γ min_{i=1,2} Q_{θ'_i}(s_{t+1}, π_{φ_1}(s_{t+1})).   (11)
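A sketch of the clipped double-Q target computation, including TD3's target-policy smoothing noise, is given below; the linear critics, the toy actor, and the noise parameters are illustrative assumptions rather than the paper's trained networks.

```python
import numpy as np

def td3_target(r, s_next, actor_t, critics_t, gamma=0.99,
               noise_std=0.2, noise_clip=0.5, rng=np.random.default_rng(0)):
    """Clipped double-Q target: smooth the target action with clipped noise,
    then bootstrap from the MINIMUM of the two target critics, so the more
    biased estimate upper-bounds the value used in the target."""
    a_next = actor_t(s_next)
    eps = np.clip(rng.normal(0.0, noise_std, size=np.shape(a_next)),
                  -noise_clip, noise_clip)      # target policy smoothing
    q1, q2 = (q(s_next, a_next + eps) for q in critics_t)
    return r + gamma * np.minimum(q1, q2)       # y = r + gamma * min(Q1', Q2')

# Toy linear target actor and critics.
actor_t = lambda s: -0.5 * s
q_a = lambda s, a: 1.0 * s + 0.5 * a
q_b = lambda s, a: 0.8 * s + 0.7 * a
y = td3_target(r=np.array([0.0]), s_next=np.array([1.0]),
               actor_t=actor_t, critics_t=(q_a, q_b))
```

Taking the element-wise minimum of the two critics is the mechanism that keeps the target from inheriting whichever critic happens to overestimate.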

Numerical Validation
In this section, we first discuss the validation setup. Then, we demonstrate the performance of the proposed control agent under communication latency. Lastly, a detailed comparison with the DDPG and a classical damping control method is presented.

Benchmark System
Various simulation studies were performed on benchmark systems such as the Kundur system and the IEEE 39-bus system. The performances are similar, but due to the space limit, we use Kundur's system in Figure 4 for illustration, as it is a widely used benchmark for dynamic oscillation studies [40]. There are two areas in the system. Area 1 includes generators G1 and G2, while area 2 has generators G3 and G4, as shown in Figure 4. The two areas use identical 20 kV, 900 MVA generators, except for the inertia values. The two areas are connected through two 230 kV transmission lines. The distance between the two areas is 220 km; in Figure 4, this is the distance between buses 7 and 9. Table 1 summarizes the benchmark system parameters. Two solar farms are connected to buses 7 and 9, respectively. The benchmark system presents a stressed operating condition, where area 2 imports 413 MW from area 1.

Twin-Delayed Deep Deterministic Policy Gradient Control Agent: Fast Learning Curve
To show the robustness of the damping controller, the controller is comprehensively trained under various communication latencies. Using the proposed controller design in Section 3, we obtain the learning results shown in Figure 5. The average reward reflects the control performance directly. We also notice that attempts at the initial stage vary considerably and that the average reward gradually increases at a later stage. As observed in Figure 5, after 100 episodes of simulation, the episode reward of the agent reaches a value close to zero, the highest value in the whole learning curve.
Being out of synchronization can result in system collapse, so we assign a low reward to this scenario. When the system loses synchronization in an episode, the agent learns the associated parameters so that it does not drive the system toward an outage and knows which parameters to avoid next time. Therefore, we use such cases as an indicator to terminate the episode in the learning process. We observe that the agent gradually converges to an appropriate policy that reflects proper control strategies when the episode number is below 40, as shown in Figure 5. As the attempts increase, a high average reward is established. Episodes 40 and 85 present the convergence points for the average reward and the episode reward, respectively. As the learning process accumulates, extensive exploration leads to a stabilization stage, as shown in Figure 5. The control agent then becomes very effective in damping the oscillation. See Table 2 for the parameter details.
The advantage of this reward function design is presented in Figure 6, which clearly shows the necessity of the five terms in the reward function. When the reward function contains fewer than three terms, the learning curve does not converge. With only the first three terms of Equation (8), the green learning curve in Figure 6 is much worse than the blue curve obtained with all five terms. The proposed twin-delayed deep deterministic policy gradient technique with the five-term reward function gains higher reward values in general, which indicates that the system reaches the stable condition faster than the regular TD3 with only three terms.

Control Agent: Robust to Communication Latency
We analyze the control performance by altering the average communication delay in the system. The communication latency depends on the communication medium. For example, fiber optic cables have a one-way delay of 100-150 ms, whereas satellite links exhibit a latency range of 500-700 ms. A full list of the latency ranges can be found in [41]. Based on these ranges, four scenarios (S1-S4) that capture the delay range from 0.13 to 0.19 s are created. The details of the four testing scenarios are shown in Appendix A.
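The latency scenarios could be emulated by sampling delays as sketched below; the per-scenario means and standard deviations shown are hypothetical placeholders, since the actual values are given in Appendix A and not reproduced here.

```python
import random

def sample_delay(mean, std, lo=0.13, hi=0.19):
    """Draw one communication delay (in seconds) from a Gaussian,
    clipped to the studied 0.13-0.19 s range."""
    return min(hi, max(lo, random.gauss(mean, std)))

# Hypothetical per-scenario (mean, std) pairs for S1-S4.
scenarios = {"S1": (0.13, 0.005), "S2": (0.15, 0.010),
             "S3": (0.17, 0.010), "S4": (0.19, 0.005)}
delays = {name: [sample_delay(m, s) for _ in range(100)]
          for name, (m, s) in scenarios.items()}
```

Each episode would then apply a freshly sampled delay to the measurement path, so the agent experiences the latency variability it must learn to tolerate.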
Since the signal routing and its internal mechanism present uncertainty and variability, the controller learns these characteristics in each episode. Figure 7 shows that the proposed controller has excellent performance in controlling the generators and successfully damps down the oscillations under the varying average communication delay. To show the improvement of the proposed control scheme, we compare it with an existing control scheme, the multi-band power system stabilizer (MBPSS), one of the classical practices in IEEE Std 421.5 [42]. We test the performance of the MBPSS under the same system operating conditions as the proposed method. In steady-state conditions, the control difference is subtle. However, under transient conditions, the proposed twin-delayed deep deterministic policy gradient method shows faster control performance than the MBPSS. As shown in Figure 7, the MBPSS, with a communication latency of 0.13 s, takes almost 20 s to bring the machine speed back to normal under the test scenario, almost twice the time of the proposed method. As discussed in the literature, conventional methods suffer from latency during their control efforts. Through multiple simulations, we observe that the performance of the proposed control agent is no longer dependent on the communication delay: the largest time latency is not always associated with the highest speed deviation. Figure 7 shows one of the selected control results, where, unlike conventional methods, the shortest time delay does not result in the least speed deviation. Here, the yellow line with 0.18 s of time delay presents the smallest speed deviation. The results show that the proposed RL agent decouples its performance from the communication latency; it is the hyper-parameters of the RL algorithm that affect the damping performance.

Performance Comparison with DDPG
The comparison between the twin-delayed deep deterministic policy gradient and DDPG agents is demonstrated in Figure 8, using the agent discount factors and batch sizes in Table 2. The twin-delayed deep deterministic policy gradient agent achieves the optimal policy faster than the DDPG agent, whereas the DDPG agent could not converge to the stable condition within the limited number of episodes. However, the proposed agent performs only a few explorations before reaching the optimal policy; in other words, there could be cases in which the agent does not fully learn the parameters. The hyper-parameters are listed in Table 2.

Conclusions
We address instability in large systems by proposing a twin-delayed deep deterministic policy gradient control agent that takes into consideration the uncertainties of the unbalanced system. In this way, the system is optimally explored to damp down frequency oscillations while keeping the system's balance within defined limits. Besides, the novel design of the state, the action, and the reward is described in detail. The learning curves show that the twin-delayed deep deterministic policy gradient algorithm significantly improves low-frequency oscillation damping. The simulation results show that the proposed controller has a fast convergence rate and is robust to communication latency variation. Compared with conventional damping control methods, the proposed twin-delayed deep deterministic policy gradient algorithm dampens the speed oscillation most effectively. Future work can focus on the knowledge transfer of the learned control experience. Specifically, if the control agent does not need to acquire the control knowledge from scratch, the proposed twin-delayed deep deterministic policy gradient method would show great advantages in generality when applied to various power systems.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. The Parameters of Four Testing Scenarios
The four testing scenarios used in Section 4 are listed here. They include four representative latency cases with different means and variances of the signal delays.