A Sparse Neural Network Based Control Structure Optimization Game under DoS Attacks for DES Frequency Regulation of Power Grid

With the rapid growth of distributed energy sources, the power grid has become a flexible and complex networked control system. However, this also increases its exposure to denial-of-service attacks, which degrade the performance of the power grid and can even cause cascading failures. To mitigate the negative effects of denial-of-service attacks and enhance the reliability of the power grid, we propose a networked control system structure optimization scheme derived from a Stackelberg game model for the frequency regulation of a power grid with distributed energy sources (DES). The proposed game model considers both the denial-of-service attacker and the control system designer as a defender, without using any analytical model. For the defender, we propose a sparse neural network based DES control and system structure design scheme. The neural network approximates the desired control output and the reinforcement signal to improve both short- and long-term performance. A column-group sparse regularization is introduced into the neural network learning process to explore the control system structure, which involves the placement of sensors and DES actuators and the communication topology. For the denial-of-service attacker, the related attack constraints and attack rewards are established. The game equilibrium is regarded as the optimal solution for both the denial-of-service attack strategy and the control structure, and an offline optimization algorithm is proposed to solve for it. The effectiveness of the proposed scheme is verified by two cases, which illustrate the optimal solutions of both the control structure and the denial-of-service attack strategy.


Introduction
With the development of network techniques, electricity supply via the modern power grid increasingly depends on networked control systems (NCSs). In the integrated power grid and communication network, the efficiency and reliability of the power grid are gradually enhanced [1,2]. However, new networked control techniques introduce corresponding vulnerabilities into the control systems of power grids. Since NCSs connect the cyber and physical worlds, cyber attacks against NCSs can cause large disturbances of the power grid, as has been confirmed during the past few years [3].
As a common type of cyber attack, a denial-of-service (DoS) attack occupies communication resources to block the transmission of measurement and control signals in NCSs [1]. Compared with deception attacks [4], DoS attacks require little prior knowledge about the NCS yet still severely degrade its performance. In this paper, the system performance under limited cyber resources is improved by optimizing the control structure, which involves the placement of DES, RTU sensors, and the communication topology. The structure is optimized by imposing a group sparse regularization on the neural network weights. A Stackelberg game model is used, which yields a minimax optimization of system performance, so that the optimized control structure is robust to DoS attacks under the consideration of worst-case attacks. The contributions of this paper are summarized as follows:
1. Sparse neural network based reinforcement learning is proposed to improve the frequency regulation of DES in control systems without using a power system analytical model, jointly addressing adaptiveness, performance, and structure.
2. The Stackelberg game model is used to derive the optimal control scheme and structure, so that the proposed frequency regulation system is robust to the worst case of DoS attacks; the reliability of the proposed frequency regulation system is thereby enhanced.
The remainder of this paper is organized as follows: the system model and related problems are formulated in Section 2, which introduces the power system and DoS attack model, and describes the frequency regulation as well; Section 3 elaborates control structure and control law design by sparse neural network based reinforcement learning; Stackelberg game model and the optimization scheme of control structure are derived under DoS attacks in Section 4; and Section 5 demonstrates the simulation results to verify the proposed algorithm.

Problem Formulation
This section introduces the formulation of the power grid and the control objective. We consider a multi-area system integrated with distributed energy sources (DES). The control objective is to mitigate the frequency deviation. A DoS attack may degrade the control system performance and even lead to failures. Thus, a design problem involving both controller design and structure design under cyber attack is also introduced.

Power Grid Frequency Dynamic Model
We consider an interconnected multi-area power system. Each of the n areas is connected to the others by tie-lines (also called transmission lines). As shown in Figure 1, each area is equipped with a turbine generator and DES, such as wind power, solar power, batteries, etc. [20].
It also contains a load frequency controller (LFC) and a tie-line bias controller (TBC) for frequency synchronization. Even if these synchronization measures regulate the frequency, an auxiliary control offered by DES may be necessary to enhance the system performance when the power system encounters a severe disturbance, such as a system fault or a sudden large load drop. Considering the auxiliary control, the dynamic model of area i can be formulated as a discrete linear difference equation [24]:

x_i(k+1) = A_i x_i(k) + B_i u_i(k) + Σ_{j∈N_p(i)} B_ji x_j(k) + E_i w_i(k),

where x_i = [∆f_i, ∆P_mi, ∆P_vi, ∆P_tie−i, ACE_i]^T is the area state; ∆f_i is the deviation from the synchronized frequency; ∆P_mi is the mechanical power deviation of the generator; ∆P_vi is the valve position deviation of the turbine; ∆P_tie−i is the deviation of tie-line power injection from other physically neighbored areas; ACE_i is the ACE signal of area i, with ACE_i = α_i ∆f_i + ∆P_tie−i; u_i, the auxiliary control output of DES for frequency regulation, is the sum of all the powers generated from power-electronic interfaced DES; w_i is the disturbance caused by model error or other time-varying factors; A_i is the system transition matrix; B_i and B_ji are the gains of the control effect and of other physically neighbored systems, respectively; E_i is the disturbance gain; and N_p(i) denotes the physically neighbored areas of area i. In this linear model, loads are assumed to be constant because the variation of loads is slow relative to the frequency regulation dynamics. Therefore, A_i, B_i, and B_ji can be modeled as time invariant [24], and the system is a linear time invariant (LTI) NCS. The DES controller of area i for frequency regulation is written as

u_i = ϕ_i(x_i, {x_j : j ∈ N_c(i)}),

where the time stage k is omitted and N_c(i) denotes the cyber-connected areas of area i. The controller calculates the DES control output from the local state x_i and the remote states x_j, j ∈ N_c(i).
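The per-area discrete-time dynamic can be sketched numerically as below. This is a minimal illustration of the update x_i(k+1) = A_i x_i(k) + B_i u_i(k) + Σ B_ji x_j(k) + E_i w_i(k); the matrices here are stable placeholders, not the parameters from Table 1.

```python
import numpy as np

def step_area(A_i, B_i, x_i, u_i, neighbor_terms=None, E_i=None, w_i=0.0):
    """Advance one area's 5-dim state (Δf, ΔP_m, ΔP_v, ΔP_tie, ACE) by one period."""
    x_next = A_i @ x_i + B_i * u_i
    if neighbor_terms:  # list of (B_ji, x_j) pairs for physically neighbored areas
        for B_ji, x_j in neighbor_terms:
            x_next += B_ji @ x_j
    if E_i is not None:
        x_next += E_i * w_i
    return x_next

# Illustrative placeholder system: a stable transition matrix and a control
# gain that only acts on the frequency deviation channel.
A = 0.9 * np.eye(5)
B = np.array([0.1, 0.0, 0.0, 0.0, 0.0])
x = np.ones(5)
x = step_area(A, B, x, u_i=-0.5)
```

With these placeholder values, the auxiliary control u_i = -0.5 reduces only the first state component, while the stable A matrix damps all components uniformly.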
The control objective is to mitigate the frequency deviation and reduce the overall cost defined in Section 3, which penalizes the quadratics of the state x_i and the control output u_i.

DoS Attack Model
For the previously mentioned NCS, we consider an attacker who launches attacks when the power system requires the emergency auxiliary control offered by DES. A DoS attack blocks communication channels to degrade the control performance, possibly even causing system failures. Blocked communication channels may result in the absence of some remote states x_j, j ∈ N_c(i), at the current time stage k [25]. When the controller cannot obtain the remote state from cyber-connected area j, the "zero-control" strategy is applied, assigning x_j = 0, so u_i becomes

u_i = ϕ_i(x_i, {α_ji x_j : j ∈ N_c(i)}),

where α_ji ∈ {0, 1} follows a Bernoulli distribution:

Pr(α_ji = 0) = δ_ji, Pr(α_ji = 1) = 1 − δ_ji,

where δ_ji ∈ [0, 1] is the probability of packet drop in the communication from area j to area i, k ∈ N^+. A larger δ_ji corresponds to a higher intensity of the DoS attack. The DoS attacker has limited cyber resources, which constrains the attack intensity to a certain degree; the constraint in our model is

Σ_i Σ_{j∈N_c(i)} δ_ji ≤ C.

Since the attack intensity is limited by the cyber cost, the sum of the probabilities δ_ji must be less than a maximal real number C.
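The packet-drop model above can be sketched as follows. Each remote state x_j arrives with probability 1 − δ_ji, and a dropped packet is replaced by the "zero-control" convention x_j = 0; the δ values below are hypothetical and merely satisfy the resource constraint.

```python
import numpy as np

def received_states(remote_states, deltas, rng):
    """Apply Bernoulli drops: alpha_ji = 0 with probability delta_ji ('zero-control')."""
    out = []
    for x_j, d in zip(remote_states, deltas):
        alpha = rng.random() >= d          # alpha_ji = 1 with probability 1 - delta_ji
        out.append(x_j if alpha else np.zeros_like(x_j))
    return out

# Hypothetical drop probabilities; the attacker's resource constraint bounds their sum.
deltas = [0.3, 0.2, 0.4]
C = 1.4
assert sum(deltas) <= C
rng = np.random.default_rng(0)
states = received_states([np.ones(5)] * 3, deltas, rng)
```

Each entry of `states` is either the original remote state or an all-zero vector, which is exactly what the controller ϕ_i sees under attack.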

Control, Structure Design and Optimization Problem
Even though the NCS of the power grid suffers DoS attacks, one objective of the NCS is to enhance the system performance by mitigating the frequency deviation ∆f_i and minimizing the control cost. To achieve this objective, the control law ϕ_i should be investigated. Moreover, the design of the control structure and its optimization also need to be considered. Since the control structure determines the placement of sensors (in remote terminal units (RTUs)) and actuators (in DES) as well as the communication topology, these placements strongly affect the control performance in a large-scale multi-area power system. Moreover, the placed resources (RTUs, DES, etc.) are generally expensive.
In mathematical form, the control design seeks a ϕ_i that achieves the desired system performance. We define Q in Equation (6) as the control cost that measures the system performance:

ϕ*_i = argmin_{ϕ_i} Q,

where "argmin" here denotes functional minimization. The control structure design aims to select the cyber-connected area sets N_c(i) for i = 1, ..., n:

N*_c(i) = argmin_{N_c(i)} Q.

Therefore, the overall optimization searches a combination of ϕ_i and N_c(i), i = 1, ..., n, that minimizes the control cost in the worst case of DoS attacks under the constraint of Equation (5), as shown in Equation (8):

{ϕ*_i, N*_c(i)} = argmin_{ϕ_i, N_c(i)} max_δ Q s.t. Σ δ_ji ≤ C.

We assume that the attacker knows the information of the designed power system, so the attacker can drive the system to its minimal performance even when the controller reaches its maximal performance. The details of this optimization in a Stackelberg game are discussed in Section 4.

Control Design by Reinforcement Learning
Before introducing the control design, a cost function is defined to measure the control cost Q. Control performance usually involves the quadratics of the state and the control output. Thus, we define the control cost Q by the control strategy utility function used in our previous work [24]:

Q_i(k) = Σ_{m=0}^{N} α^m p_i(k + m),

where 0 < α < 1 is the discount rate and N is a positive integer. The binary performance index p_i is defined as

p_i(k) = 1 if a_1 ||x_i(k)||² + a_2 ||u_i(k)||² < c, and p_i(k) = 0 otherwise,

where ||·|| denotes the 2-norm, a_1 and a_2 are the weights of the state and control cost, respectively, and c is the threshold of the performance indication. If the current cost is in the allowed range (less than c), the binary performance index in Equation (10) is 1; otherwise, it is 0. The binary performance index keeps the strategy utility function bounded: with a limited time horizon, it avoids a numerical crash in the learning process even if the system diverges. To prevent the system from diverging, we construct a desired control output u_di that imposes a damping rate L_i ∈ R^{5×5}, ||L_i|| < 1, on the system, so that the power grid frequency dynamic becomes

x_i(k + 1) = L_i x_i(k) + E_i w_i(k).

If we approximate u_di by a neural network output û_di, the neural network is defined as

û_di = M_a W_i φ,

where φ ∈ R^{nl×1} is the radial basis vector calculated from the states x_1, x_2, ..., x_n, W_i ∈ R^{nl×nl} is the trainable weight, and M_a ∈ R^{1×nl} is a given constant matrix. As shown in Figure 2, the proposed neural network structure is based on radial basis functions (RBFs); an RBF-based neural network can identify nonlinear dynamical systems [26]. l is the dimension of the radial basis for one area. Under the control u_i = û_di, the system of area i becomes

x_i(k + 1) = L_i x_i(k) + B_i (û_di − u_di) + E_i w_i(k).

We assume that the disturbance and the approximation error in Equation (13) are bounded; according to stability theory, the system is then uniformly ultimately bounded (UUB) [27]. The control cost reflecting system performance is also approximated by the neural network, as shown in Equation (14):

Q̂_i = M_c W_i φ,

where M_c ∈ R^{1×nl} is a given constant matrix.
We train the neural network weight W_i to approximate the desired control output u_di as well as the performance measurement, in order to improve the system performance. Therefore, we define a loss function V_i(k) for training, whose first term is the approximation error of the desired control output, whose second term is the approximation error of the system performance, and whose third term improves the long-term system performance. As shown in Equation (16), the desired control output u_di and the system performance Q_i can be expressed by the neural network:

u_di = M_a W*_i φ + υ_ai(k), Q_i = M_c W*_i φ + υ_ci(k),

where ||υ_ai(k)|| ≤ v and ||υ_ci(k)|| ≤ v are the optimal approximation errors, v is a small number, and W*_i is the optimal constant weight matrix. Differentiating the defined loss function V_i(k) with respect to W_i(k) and neglecting υ_ai and υ_ci yields the online iteration formula

W_i(k + 1) = W_i(k) − β_i ∂V_i(k)/∂W_i(k),

where β_i is the learning rate.
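The approximation and gradient-descent update can be sketched as below. This is a simplified two-term loss (actor error plus critic error); the paper's full loss adds a long-term performance term, and all dimensions, targets, and the learning rate here are illustrative.

```python
import numpy as np

def rbf(x, centers, sigma):
    """Gaussian radial basis features for one area's state x."""
    return np.exp(-np.sum((centers - x) ** 2, axis=1) / sigma ** 2)

def grad_step(W, phi, M_a, M_c, u_d, Q, beta):
    """One descent step on 0.5*(u_hat - u_d)^2 + 0.5*(Q_hat - Q)^2 w.r.t. W."""
    u_hat = M_a @ W @ phi                  # scalar actor output, approximates u_di
    Q_hat = M_c @ W @ phi                  # scalar critic output, approximates Q_i
    grad = (u_hat - u_d) * np.outer(M_a, phi) + (Q_hat - Q) * np.outer(M_c, phi)
    return W - beta * grad

rng = np.random.default_rng(1)
nl = 4
W = rng.normal(0.01, 0.1, (nl, nl))        # small random initial weight
M_a, M_c = rng.normal(size=nl), rng.normal(size=nl)
phi = rbf(rng.random(2), rng.random((nl, 2)), sigma=1.0)

def loss(W):
    return 0.5 * (M_a @ W @ phi - 0.5) ** 2 + 0.5 * (M_c @ W @ phi - 1.0) ** 2

before = loss(W)
for _ in range(200):
    W = grad_step(W, phi, M_a, M_c, u_d=0.5, Q=1.0, beta=0.05)
after = loss(W)
```

Because the loss is a convex quadratic in W, a sufficiently small learning rate β guarantees monotone decrease, which mirrors the stability condition on β_i stated later in the paper.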

Structure Design by Sparse Neural Networks
The structure design involves the placement of sensors and actuators and the communication topology, and it is encoded by the neural network weight W_i. The weight matrix is separated into column groups. The radial basis vector is defined as

φ = [φ_1^T, φ_2^T, ..., φ_n^T]^T, φ_i = [exp(−||x_i − x_c1||²/σ²), ..., exp(−||x_i − x_cl||²/σ²)]^T,

where x_cj is a given radial center, j ∈ N^+ ∩ [1, l], and σ is a given radial width. According to the definition of φ, the neural weight is separated into n column groups, and each group of l columns corresponds to the radial basis φ_j of one area. If the weight matrix W_i is group sparse [28] with respect to these column groups, the structure of the control system can be read off. For instance, if column group j of W_i, corresponding to the gain of φ_j, is zero, the controller of area i does not need the information from area j, and the communication channel from area j to area i is also unnecessary. If no controller requires the information from area j, no sensor is necessary in area j. If W_i = 0, the actuator in area i is not required. The key is to force W_i to be column-group sparse. Thus, a group sparse regularization term is added to the loss function, which becomes

V_i^s(k) = V_i(k) + γ_i Σ_{j=1}^{n} ||W_iG(j)||_F,

where W_iG(j) denotes column group j, i.e., the gain of the radial basis φ_j, and γ_i is the regularization weight. The sparsity-regularized learning iteration in Equation (20) can be derived in the same way as Equation (17):

W_i(k + 1) = W_i(k) − β_i ∂V_i(k)/∂W_i(k) − γ_i ∂(Σ_j ||W_iG(j)||_F)/∂W_i(k),

where W_ij denotes the jth weight block in W_i, which involves φ_j. The stability of the iteration in Equation (20) can be analyzed in a way similar to [24]: if β_i and γ_i are small enough, the iteration in Equation (20) is stable and the resulting errors are bounded.
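The effect of the group-sparse penalty can be illustrated with a proximal soft-threshold step, which is the closed-form shrinkage operator for the sum-of-block-norms penalty; the paper instead folds a subgradient term into Equation (20), and the block values below are hypothetical.

```python
import numpy as np

def group_shrink(W, n_groups, gamma):
    """Proximal operator of gamma * sum_j ||W_G(j)||_F over column blocks of W."""
    blocks = np.hsplit(W, n_groups)
    out = []
    for B in blocks:
        norm = np.linalg.norm(B)           # Frobenius norm of one column group
        scale = max(0.0, 1.0 - gamma / norm) if norm > 0 else 0.0
        out.append(scale * B)              # weak blocks are zeroed out entirely
    return np.hstack(out)

# One strong block (entries 1.0) and one weak block (entries 0.05).
W = np.hstack([np.full((2, 2), 1.0), np.full((2, 2), 0.05)])
W_s = group_shrink(W, n_groups=2, gamma=0.2)
```

The weak column group is driven to exactly zero, signalling that the corresponding remote area's information, sensor, and channel are unnecessary, while the strong group only shrinks slightly.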

Structure Optimization under DoS Attacks
In this section, the control system structure design is solved by a structure optimization problem. Considering DoS attack, a Stackelberg game is formulated in the structure optimization. In addition, the optimization algorithm is also presented.

Stackelberg Game Formulation
As shown in Equation (20), the iteration searches for a system structure and control scheme in simulation, and then a DoS attacker observes the system and takes attack actions. This is a leader-follower sequence; therefore, a Stackelberg game model is proposed to obtain the optimal solution of both the control structure optimization and the DoS attack.
There are two actors in the proposed model: a defender, which is the system designer, and an attacker, which is the DoS attacker. The defender's action is N_c = {N_c(1), N_c(2), ..., N_c(n)}, and the defender's reward function r(N_c, δ) measures the achieved system performance under structure N_c and attack strategy δ. The attacker's action is δ = [δ_ji], j ∈ N_c(i), i = 1, ..., n, and the attacker's reward is −r(N_c, δ); therefore, this is a zero-sum game. The structure optimization becomes the min-max problem

(N*_c, δ*) = arg max_{N_c} min_δ r(N_c, δ) s.t. Σ δ_ji ≤ C,

where N*_c and δ* form the equilibrium of the game model, which can also be treated as the optimal solution for both defender and attacker. At the equilibrium, r(N_c, δ*) ≤ r(N*_c, δ*) ≤ r(N*_c, δ).
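The saddle-point inequality above can be illustrated with a toy continuous zero-sum game; the scalar variables a (standing in for the defender's choice) and d (for the attacker's), and the reward r(a, d) = −(a − 1)² + (d − 0.5)², are purely hypothetical.

```python
# Toy illustration of the zero-sum saddle point
# r(Nc, d*) <= r(Nc*, d*) <= r(Nc*, d), using continuous placeholders.

def r(a, d):
    """Hypothetical defender reward: concave in a (defender), convex in d (attacker)."""
    return -(a - 1.0) ** 2 + (d - 0.5) ** 2

a, d, lr = 0.0, 0.0, 0.1
for _ in range(500):
    a += lr * (-2.0 * (a - 1.0))   # defender ascends r
    d -= lr * (2.0 * (d - 0.5))    # attacker descends r (i.e., ascends its reward -r)
# Gradient ascent-descent converges to the saddle point a* = 1, d* = 0.5.
```

At the resulting (a*, d*), deviating alone cannot help either player: any other defender action lowers r, and any other attacker action raises it, which is exactly the two-sided inequality stated above.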

Structure Optimization under DoS Attacks
For a given DoS attack strategy δ, the optimal structure and control law can be obtained by the iteration of Equation (20). If j ∉ N*_c(i), the iteration of Equation (20) usually yields a small ||vec(W_ij)|| instead of vec(W_ij) = 0, because it is a numerical method. Therefore, a threshold method is used in our algorithm to obtain N*_c|δ. We estimate N*_c|δ by Equation (25):

N*_c(i)|δ = {j ∈ N^+ ∩ [1, n] : ||vec(W_ij)|| > ρ_i},

where N^+ is the set of positive integers and ρ_i is a given threshold, a small positive number. The equilibrium of the Stackelberg game can then be reached by solving the optimization problem of Equation (23), or equivalently the constrained minimization

δ* = arg min_δ r(N*_c|δ, δ) s.t. Σ δ_ji ≤ C, δ_ji ∈ [0, 1].

In summary, the algorithm of structure optimization under DoS attacks is described in Algorithm 1.
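The threshold rule can be sketched directly: area j is kept in the estimated cyber neighborhood N*_c(i) only if the corresponding weight-block norm exceeds ρ_i. The block norms below are hypothetical.

```python
def estimate_topology(block_norms, rho):
    """block_norms[i][j] = ||vec(W_ij)||; returns the estimated N_c*(i) for each area i."""
    return [{j for j, v in enumerate(row) if v > rho} for row in block_norms]

# Hypothetical learned block norms for a 3-area system: small entries are
# numerical residue that the threshold rho filters out.
norms = [[0.8, 0.01, 0.3],
         [0.02, 0.9, 0.04],
         [0.5, 0.02, 0.7]]
Nc = estimate_topology(norms, rho=0.1)
```

Here area 1's controller ends up using only its own information, so under this hypothetical solution no remote channel into area 1 is needed.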

Algorithm 1 Algorithm of Structure Optimization under DoS Attacks
Input: the radial basis dimension l, the area number n, and the learning rates β_i and γ_i.
Output: the system structure N*_c and the optimal attack strategy δ*.
a. Initialize M_a, M_c, W_i, i = 1, 2, ..., n, and δ(0); each element follows a Gaussian distribution with small mean 0.01 and variance 0.14. Set t = 0 and the maximal iteration number MAXITER.
b. For the given δ(t), find N*_c|δ(t) by the iterations of Equations (20) and (25).
c. Measure the gradient of r[N*_c|δ, δ(t)] with respect to δ.
d. Update δ to obtain δ(t + 1) based on the gradient obtained in step c, handling the constraints of Equation (26) by the Lagrange method [29] or the barrier function method [30].
e. Check whether the update variation is less than a threshold or t equals the maximal iteration number MAXITER. If yes, end the algorithm and go to step f; otherwise, set t = t + 1 and return to step b.
f. Output N*_c and δ*.
The N*_c obtained by Algorithm 1 is the optimized control structure of the power system, and Algorithm 1 also reveals the worst-case system performance under our controller. However, we do not further consider the unstable control system caused by a large C [5], i.e., an attacker with abundant cyber resources to launch DoS attacks. These issues will be investigated in the future.
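A high-level sketch of the outer loop of Algorithm 1 follows. The inner learning of Equation (20) is replaced by a hypothetical smooth reward r(N*_c|δ, δ), and the constraints Σδ ≤ C, δ ∈ [0, 1] are handled by a simple clip-and-rescale projection instead of the Lagrange or barrier methods cited in the paper; all numerical values are illustrative.

```python
import numpy as np

def project(delta, C):
    """Approximate projection onto {delta : 0 <= delta_ji <= 1, sum(delta) <= C}."""
    delta = np.clip(delta, 0.0, 1.0)
    s = delta.sum()
    return delta * (C / s) if s > C else delta

def defender_reward(delta):
    """Placeholder for r(N_c*|delta, delta); hypothetically minimized at delta = 0.3."""
    return np.sum((delta - 0.3) ** 2)

C, lr = 1.4, 0.05
rng = np.random.default_rng(0)
# Step a: Gaussian initialization with mean 0.01 and variance 0.14, then projection.
delta = project(rng.normal(0.01, np.sqrt(0.14), size=3), C)
for t in range(300):                       # steps b-e with a fixed iteration budget
    grad = 2.0 * (delta - 0.3)             # step c: gradient of r w.r.t. delta
    delta = project(delta - lr * grad, C)  # step d: attacker descends the defender reward
# Step f: delta now approximates the worst-case attack strategy delta*.
```

Under this placeholder reward, the attacker's projected gradient descent concentrates the jamming probabilities at the interior minimizer while always respecting the cyber resource budget C.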

Experiments and Analysis
This section illustrates two cases, the IEEE 14 bus system and the IEEE 24 bus system [31], to show the effectiveness and advantages of the proposed scheme. The scheme takes DoS attacks into account and considers the worst case of attacks under the constraints in the Stackelberg game model described above, while using the optimal structure design and reinforcement learning to enhance system performance. The sub-system parameters M_i and D_i (the inertia and damping constants), T_gi and T_di (the governor and gas turbine constants), and R_gi and T_ij (the regulation and synchronizing constants) [20] are listed in Table 1 [24]. The constraint of DoS attacks is set as

Σ δ_ji ≤ C = 1.4.

The subsystems in the grid communicate with each other through communication channels. The above constraint means that the sum of all communication channel jamming probabilities is at most 1.4, i.e., an average jamming probability of 0.1 per communication channel. The larger C in Equation (26) is, the more cyber resources the attacker possesses. In the following simulations, we set the sampling period to 0.1 s.

Case I: IEEE 14 Bus Test System
The IEEE 14 bus test system is simulated under reinforcement learning control and DoS attacks. The initial state of the power grid simulation is assumed to be affected by a large disturbance, and the simulation lasts 5 s. Without any control, the system collapses under serious DoS attacks after a large power grid disturbance occurs, as shown in Figure 3: the frequency deviation grows rapidly, reaching a maximal value of about 40 Hz within 5 s. Therefore, a controller is required to maintain stability. The parameters of the reinforcement learning algorithm are listed in Table 2; the parameter selection principle can be found in [2]. The parameter β_i determines the result of the IEEE 14 bus test: if β_i is too large, the online learning of Equation (20) diverges, whereas a too small β_i causes a slow convergence rate. The selection of γ_i is another key factor; a too large γ_i may result in poor control performance and excessive sparsity of W_i. Figures 4 and 5 show the results obtained with the proposed control scheme and the optimal DoS attack strategy. The optimal DoS attack strategy is obtained by solving Equation (26), and the optimal controllers are learned by the online iteration of Equation (20). According to the results in Figure 4, the frequency deviation curves of all sub-systems converge to a steady value around 0. The maximal fluctuation magnitude of these curves is a small value of about 1.2 Hz, and the swings end in about 4 s.
For the optimal DoS attack strategy shown in Figure 5a, the attacks focus on the communications from buses 3, 6, 10, 11, and 13. The optimal control structure, i.e., the placement of sensors, actuators, and the communication topology, is shown in Figure 5b in terms of the F-norm of each block W_ij in W_i, i, j = 1, 2, ..., n. A large norm value means that the controller of sub-system i needs information from sub-system j; thus, sub-system j needs to install sensors, sub-system i requires actuators, and a communication link from sub-system j to i is needed. The sensors are mainly installed in buses 2, 3, 6, 10, 11, 13, and 14. According to the solution, the communication lines that are not attacked maintain system stability. Thus, the obtained solution is optimal for both the attacker and the defender (the control system designer), i.e., it is a Nash equilibrium.

Case II: IEEE 24 Bus Test System
This section verifies the effectiveness of the proposed scheme on a relatively large system, the IEEE 24 bus test system. The simulation lasts 10 s. Both the reinforcement learning control and DoS attacks are applied. The optimal attack strategy and the optimized control structure are obtained by solving the optimization of Equation (26): the optimal attack strategy is solved by the offline optimization algorithm described in Section 4.2, and the optimized control structure is obtained by the online learning of Equation (20).
The parameters of the reinforcement learning algorithm are listed in Table 3; the parameter selection details can be found in [2]. The simulation results are shown in Figures 6 and 7. Figure 6 shows the frequency deviation curves of the power grid under the worst case of DoS attacks. The results illustrate the effectiveness of the proposed solution under the large disturbance and DoS attacks. The swings of all the frequency deviation curves end in about 5 s, and their magnitude is small, with a maximal value of 1.5 Hz. From the optimal attack strategy shown in Figure 7a, the attacker should focus on the communications from buses 2, 3, 4, 14, 19, 21, and 22 with attack probabilities of 0.2-0.35. The optimal control structure under the worst case of DoS attacks is shown in Figure 7b: the sensors should be installed in buses 2, 3, 4, 6, 14, 15, 19, 22, 23, and 24. The weight of bus 19 is high, which means the information obtained from bus 19 is highly important; hence, the DoS attack strategy attacks the communication lines of bus 19 with a relatively high attack rate.

Conclusions
This paper proposes a novel optimization method for the control structure together with a reinforcement learning method based on an integrated neural network under DoS attacks. The reinforcement learning, which involves frequency damping and control performance, is approximated by the integrated neural network. The frequency damping includes the desired control input, which keeps the system stable and accelerates the convergence rate of the online learning. The approximation of the control performance is adopted as the reinforcement signal to enhance the long-term system performance. To obtain a reasonable control scheme and structure under DoS attacks, a Stackelberg game model is also proposed: the worst case of DoS attacks is considered and solved by the proposed optimization solution, which also derives the optimal control structure, involving the placement of sensors and actuators (DES) and the communication topology. The simulation results illustrate the effectiveness of the proposed scheme, yielding both the optimal DoS attacks and the control structure. In the future, the convexity of the game and the existence of the game equilibrium will be analyzed, the constraints of DES considering environmental factors and variation will be taken into account, and the error of the proposed algorithm will be analyzed theoretically and numerically.