Preventive Control Policy Construction in Active Distribution Network of Cyber-Physical System with Reinforcement Learning

Once an active distribution network of a cyber-physical system is in an alert state, it is vulnerable to cross-domain cascading failures. It is therefore necessary to transfer the network from the alert state back to a normal state using a preventive control policy against cross-domain cascading failures. In practice, it is difficult to construct and analyze a preventive control policy via theoretical analysis methods or physical experimental methods: theoretical analysis may be inaccurate due to approximated models, and physical experiments are expensive and time consuming because prototypes must be built. This paper presents a preventive control policy construction method based on the deep deterministic policy gradient idea (abbreviated as PCMD) to generate and optimize a preventive control policy with Artificial Intelligence (AI) technologies. It adopts reinforcement learning to make full use of the available historical data and thereby overcome the problems of high cost and low accuracy. First, a preventive control model is designed based on finite automaton theory, which guides the data collection and the selection of the learning policy. The control model takes voltage stability, frequency stability, current overload prevention, and control cost reduction as feedback variables, without requiring the specific power flow equations and differential equations. Then, after sufficient training, a locally optimal preventive control policy can be constructed under the compatibility condition between a fitted action-value function and a fitted policy function. The constructed preventive control policy contains control actions that achieve a low cost and follow the principle of shortening a cross-domain cascading failure propagation sequence as far as possible. The PCMD is more flexible and closer to reality than theoretical analysis methods and has a lower cost than physical experimental methods.
To evaluate the performance of the proposed method, an experimental case study, the China Electric Power Research Cyber-Physical System (abbreviated as CEPR-CPS), provided by the China Electric Power Research Institute, is carried out. The results show that the preventive control policy constructed with the PCMD outperforms most current methods, such as the multi-agent method, in terms of reducing the number of failure nodes and avoiding state space explosion.


Introduction
Active distribution network (ADN) is a typical cyber-physical system (CPS), which consists of a wide variety of distributed generators connected to a power network (PN) and a communication network (CN) [1]. An ADN at any moment can be abstracted as a directed graph, where equipment is abstracted as nodes and power lines or communication connections between equipment are abstracted as edges [1]; the directed graph at different time points may differ due to changes in the edges or nodes of the ADN. There are five types of nodes in the PN according to their roles: a power generation node type V_P1, a substation node type V_P2, a distribution node type V_P3, a load node type V_P4, and an external node type V_P5 (the energy from power transmission lines). Similarly, nodes in the CN are categorized into an information relay node class V_C1, a sensor node class V_C2, an actuator node class V_C3, and a control node class V_C4. Nodes in the PN are mutually interdependent with nodes in the CN. Specifically, a substation node n_P2 ∈ V_P2 in the PN supplies power to nodes in the CN, a sensor node n_C2 ∈ V_C2 in the CN collects information from nodes in the PN, and an actuator node n_C3 ∈ V_C3 drives the behaviors of nodes in the PN. Due to this interdependence between the PN and CN, there is a risk of potential cross-domain cascading failures (CCF) spreading alternately across the PN and CN of an ADN. The propagation process of CCF can be divided into two stages [2]. The first stage is a slowly changing process lasting several minutes. The second stage is an avalanche-collapse process, which can lead to large-area power outages and cause large economic losses. Hence, it is essential to prevent the propagation of CCF so as to alleviate its side effects and improve the safe and stable operation of an ADN.
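The interdependent PN/CN graph described above can be sketched in code. The node names, feature fields, and topology below are hypothetical examples, not the paper's case data; the sketch only shows how type-labeled nodes and cross-domain dependency edges might be represented.

```python
from dataclasses import dataclass, field

# Node type labels for the power network (PN) and communication network (CN).
PN_TYPES = {"V_P1", "V_P2", "V_P3", "V_P4", "V_P5"}  # generator, substation, distribution, load, external
CN_TYPES = {"V_C1", "V_C2", "V_C3", "V_C4"}          # relay, sensor, actuator, control

@dataclass
class Node:
    name: str
    ntype: str                                     # one of PN_TYPES | CN_TYPES
    features: dict = field(default_factory=dict)   # e.g. voltage, current

@dataclass
class ADNGraph:
    """Directed graph of an ADN at one sampling instant i*dt."""
    nodes: dict         # name -> Node
    edges: set          # intra-network directed edges (u, v)
    dependencies: set   # cross-domain edges, e.g. substation -> powered CN node

    def powered_cn_nodes(self, substation: str):
        """CN nodes that depend on a given PN substation for power."""
        return sorted(v for (u, v) in self.dependencies
                      if u == substation and self.nodes[v].ntype in CN_TYPES)

# A tiny example topology (hypothetical names).
nodes = {
    "Bus_3": Node("Bus_3", "V_P2", {"voltage": 1.0}),
    "S_1":   Node("S_1", "V_C2"),   # sensor powered by Bus_3
    "A_1":   Node("A_1", "V_C3"),   # actuator powered by Bus_3
}
g0 = ADNGraph(nodes, edges=set(), dependencies={("Bus_3", "S_1"), ("Bus_3", "A_1")})
```

A failure of `Bus_3` in this representation immediately identifies which CN nodes lose power, which is the cross-domain step of a CCF.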
A directed graph at each time point can be regarded as a state of an ADN. These states can be divided into steady states and transient states during ADN operation. A steady state means that the ADN is in stable operation, and a transient state is an intermediate state between steady states. During the propagation of CCF, an ADN goes through a series of transient states from one steady state to another. If the CCF is not blocked while it is propagating, the ADN is driven through successive transient states until it stops at a final steady state, at which point the ADN may have collapsed. Hence, once the propagation process of CCF is initiated, it is better to use preventive control to block it.
The preventive control process against CCF in an ADN is shown in Figure 1; there are five steps in each routine loop: state perception, failure diagnosis, decision, preventive control policies, and control action. The state perception step gathers the real-time data (such as the voltage and current of nodes) of an ADN. The failure diagnosis step detects failure symptoms and predicts failures that could be triggered along some propagation paths of the CCF. The goal of the decision step is to select an optimal preventive control policy (PCP) to prevent potential accidents and CCF propagation in an ADN. The preventive control policies (PCPs) are the strategies available to the decision step; they are constructed in advance according to the safety and stability requirement specifications of an ADN. The control action step is the last step of the routine; it carries out a sequence of operations on an ADN according to the decisions and affects the state transition process of the ADN. In the preventive control routine, the PCP is crucial: it provides the roadmap of operations against CCF and determines the performance of the routine. A PCP needs to minimize the length of CCF sequences so as to block the propagation of CCF in an ADN by exerting control actions with low cost. The selected PCP also needs to meet some prevention goals optimally, such as preventing voltage collapses, transient instabilities, etc. Theoretical analysis methods [3][4][5][6][7][8] and physical experimental methods [9][10][11] are two typical solutions for PCP construction. Theoretical analysis methods need approximated models of the controlled plants, and experimental methods need physical prototypes and incur heavy costs. Hence, traditional methods for constructing PCPs face challenges such as high cost, long duration, and inaccuracy.
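The five-step routine can be outlined as a loop; the function bodies below are stubs with hypothetical thresholds and a lookup table of pre-constructed PCPs, intended only to show how the steps hand data to one another.

```python
def perceive(adn):
    """State perception: gather real-time measurements (stubbed)."""
    return adn["measurements"]

def diagnose(measurements):
    """Failure diagnosis: flag nodes whose voltage leaves a hypothetical band."""
    return [n for n, v in measurements.items() if not 0.95 <= v <= 1.05]

def decide(symptoms, policies):
    """Decision: pick the pre-constructed PCP that covers the symptoms."""
    return policies.get(frozenset(symptoms), [])   # empty plan if no PCP matches

def act(adn, actions):
    """Control action: apply the operations, mutating the ADN state."""
    for node, new_v in actions:
        adn["measurements"][node] = new_v

# One pass through the routine on a toy ADN with one low-voltage node.
adn = {"measurements": {"Bus_1": 1.00, "Bus_3": 0.90}}
policies = {frozenset(["Bus_3"]): [("Bus_3", 1.00)]}   # hypothetical stored PCP
symptoms = diagnose(perceive(adn))
act(adn, decide(symptoms, policies))
```

After the pass, re-running the diagnosis on the perceived state reports no remaining symptoms, closing the loop.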
A data-driven approach can make full use of the large amount of available historical data to overcome the shortcomings of the existing methods, generating and optimizing a PCP through AI technologies at a lower cost. According to the goals of preventive control against failures, many AI technologies could be adopted for constructing PCPs. Chengxi Liu et al. propose a dynamic security assessment method and obtain a PCP based on a decision tree to ensure dynamic security [12]. In order to improve the voltage stability margin of power systems, Mauricio C. Passaro et al. propose a preventive control method for rescheduling power generation based on a neural sensitivity model; time-series data samples from time-domain simulations and the dynamic behavior of the system are used, and the sensitivity is used to select the most effective set of generators to improve the security of the power system [13]. C. Fatih Kucuktezcan et al. use population optimization techniques to construct a PCP; they reduce the search space of each algorithm according to the size of an objective function and make multiple optimization algorithms run continuously, so as to improve the transient security of a power CPS [14]. Soni B. P. et al. use a wide area measurement system (WAMS) and phasor measurement devices to conduct real-time transient stability assessment; specifically, a least-squares support vector machine (SVM) is used to identify the steady state of the power system in real time, and then the appropriate dispatching generators are selected for preventive control to ensure transient stability [15]. Kou P. et al. use the deep deterministic policy gradient algorithm with a safe exploration layer for preventive control, ensuring that the node voltages of an active distribution network remain within limits [16].
However, these studies focus on the construction of PCPs for only a single failure in a CPS, so the constructed PCPs cannot prevent the occurrence of cascading failures. In order to prevent cascading failures, researchers have constructed many PCPs. Rabie Belkacemi et al. use a distributed adaptive multi-agent algorithm to obtain a PCP, which blocks the propagation paths of cascading failures by dispatching the power of generators based on the N-1 criterion [17]. Sina Zarrabian et al. use neural network techniques to construct a PCP to prevent the propagation of cascading failures [18], and they also use a reinforcement learning method based on Q-learning to obtain a PCP for preventing cascading failures [19]. Mojtaba Khederzadeh and Arash Beiranvand use a strategy of specific thresholds to evaluate the lines with high vulnerability and then obtain a PCP based on a genetic algorithm; the PCP eliminates the overload of highly vulnerable lines through load shedding, so as to prevent the spread of cascading failures in a CPS [20]. Dutta O. et al. develop a distributed management system using adaptive critic design based on adaptive dynamic programming; the system can flexibly take preventive and corrective actions to deal with the thermal overload of lines in an active distribution network [21]. The existing data-driven research focuses only on a single PN and does not consider the interdependence between the PN and CN. Since the CCF propagates between the PN and CN alternately, and the existing data-driven methods for PCP construction only collect observation data from the PN, these methods are not applicable for preventing the CCF due to insufficient data. In order to prevent the propagation of CCF in an ADN, the measured data from both the PN and the CN should be collected.
It is difficult to establish a collaborative simulation of the physical process and the computational process in an ADN, such as a digital twin. However, reinforcement learning techniques can analyze and summarize the interaction between the physical process and the computational process of an ADN by using existing empirical data, playing the role of simulation and experiment. Using reinforcement learning techniques, constructing a PCP has three advantages: less time, lower cost, and higher accuracy. Therefore, a PCP construction method based on the deep deterministic policy gradient idea (abbreviated as PCMD) is proposed to prevent the propagation of CCF in an ADN; this method needs to collect data about voltage stability, frequency stability, current overload prevention, robustness against CCF, and control cost. However, there are some challenges during the construction process: how to choose the concrete objectives and weight them into the reward function; how to adjust parameters so as to ensure that the construction process of the PCMD converges and a PCP against CCF exists; how to ensure that the construction process converges to a local optimal solution; and how to verify the effectiveness of the proposed PCMD.
The contributions of this paper are summarized as follows.
(1) A modeling method is proposed for preventive control using finite automaton theory. The preventive control model describes the effect of CCF in an ADN after the intervention of control actions and can guide the collection of the required trajectory dataset and the construction of a PCP.
(2) The PCMD for PCP construction is presented; it guarantees the voltage stability, the frequency stability, the current overload prevention, and the improved robustness against CCF of an ADN. The constructed PCP can generate control actions with low cost based on the principle of shortening a CCF propagation sequence as far as possible. The results show that the PCP constructed with the PCMD is more effective than others, such as a multi-agent method, in terms of reducing the number of failure nodes and avoiding state space explosion.
The paper is organized as follows. A preventive control model is given in Section 2. In Section 3, the specific construction method is presented. Section 4 gives a case provided by China Electric Power Research Institute to validate the effectiveness of the construction method. Section 5 outlines the discussion.

Preventive Control Model
In order to obtain the learning dataset for PCP construction, a preventive control model is built to describe the steady-state transition process under external interventions against CCF in an ADN. The model describes the transition relationships among the steady states after preventive actions are applied to block the propagation of CCF, where CCF sequences should be kept as short as possible. For example, there are three steady states (denoted g_0, g_1, and g_2), as shown in Figure 2. In the steady state g_0, the substation nodes (Bus_1 and Bus_3) together supply power to all connected nodes of the CN, and Bus_1 is a backup power that enables all alternative power edges when the primary power Bus_3 fails and its edges to the nodes in the CN are disconnected; in this case, the steady state g_0 transits to another steady state g_1 after the nodes (PG_1, Bus_3, and EN_1) fail. Similarly, when the power generation node PG_1, the substation node Bus_3, and the external node EN_1 are repaired via external interventions, all disconnected edges are reconnected but remain inactive, and the ADN transfers from the state g_1 to a steady state g_2. This transfer process can be described by a finite state machine, whose elements are defined as follows (Definition 1).
(1) G = {<V, E>} represents a finite set of directed graphs, each of which is a steady state under the stable operation of an ADN. A steady state g_i = <V_i, E_i> ∈ G (i ≥ 0) represents the directed graph of an ADN at the discrete time i·∆t, where ∆t denotes the sampling interval and V_i represents the node set in the steady state g_i. g_0 = <V_0, E_0> is the initial steady state of an ADN, which is in the normal working condition shown in Figure 2a. A node n_i ∈ V_i is represented by a feature vector (a_i1, a_i2, . . . , a_iNOF)^T, where NOF is the number of features of the node n_i. The set E_i represents the connection relations between pairs of nodes in the PN and CN under the steady state g_i of an ADN at the discrete time i·∆t.
For example, a power generation node n_i = PG_1 has 6 feature attributes: voltage a_i1 = v_i, current a_i2 = I_i, active power a_i3 = P_Gi, reactive power a_i4 = Q_Gi, power adjustment a_i5 = ∆P_Gi, and frequency a_i6 = fref.
(2) S represents the set of CCF propagation sequences, where s_i ∈ S denotes the CCF propagation sequence in a steady state g_i.
(3) A ⊆ R^m represents a set of control actions, which is composed of the control actions that follow a PCP to prevent CCF propagation. A PCP is a mapping (l⊥: S→A) from the CCF sequence set to the control action set. For example, in order to prevent the instability of an ADN caused by the failure of the source node Bus_3, the backup power Bus_1 is activated to enable alternative power edges to the CN nodes primarily powered by Bus_3, and all connections from the nodes (PG_1, Bus_3, and EN_1) to other devices are cut off; the topology of the ADN is thus changed, as shown in Figure 2b.
(4) F: G×S→G represents the steady-state transition function between two steady states of an ADN. For example, the CCF propagation sequence s_0 of the state g_0 = <V_0, E_0> ∈ G is blocked by a control action l(s_0) ∈ A, so that the ADN quickly reaches a new steady state g_1 = F(g_0, s_0) ∈ G.
(5) Prb: F→R represents the state transition probability function under a certain action during the propagation of CCF, where R represents the set of real numbers. The propagation process of CCF has the Markov property [22] and is described as a Markov stochastic process, so the state transition probability is related only to the current state and action.
(6) r: F→R is an immediate reward function after taking a certain action on a transition.
The preventive control model is a finite automaton, which can be described as a transition system. The state transition process of an ADN shown in Figure 3 is described by such a finite automaton, corresponding to Figure 2.
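The finite automaton view can be sketched as a small transition system matching the g_0 → g_1 → g_2 example: states G, CCF sequences S, the transition function F, the probability Prb, and the reward r. The probabilities and rewards below are illustrative values, not data from the paper.

```python
import random

# F: (steady state, CCF sequence) -> candidate next states, each carrying its
# transition probability Prb and immediate reward r (values are made up).
transitions = {
    ("g0", "s0"): [("g1", 1.0, -2.0)],   # failures of PG_1, Bus_3, EN_1 blocked
    ("g1", "s1"): [("g2", 1.0, +1.0)],   # repaired nodes brought back online
}

def step(state, seq, rng):
    """Sample the next steady state according to Prb; return (state, reward)."""
    u, acc = rng.random(), 0.0
    for nxt, prob, reward in transitions[(state, seq)]:
        acc += prob
        if u <= acc:
            return nxt, reward
    raise ValueError("transition probabilities must sum to 1")

# Walk the automaton through the two transitions of the example.
rng = random.Random(0)
state, total = "g0", 0.0
for seq in ["s0", "s1"]:
    state, reward = step(state, seq, rng)
    total += reward
```

Because the transition probabilities here are degenerate (1.0), the walk is deterministic; with genuine Markov transition probabilities the same `step` function would sample among several successor states.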

Preventive Control Policy Construction
In the preventive control model, a PCP (l⊥: S→A) against CCF is described. Theoretical analysis methods and physical experimental methods can be used to construct this mapping; however, the former may not be accurate due to approximated models, and the latter are expensive and time consuming because prototypes must be built. A PCP construction method based on the deep deterministic policy gradient idea (abbreviated as PCMD) is therefore proposed to obtain the mapping (l⊥: S→A). In the PCMD, two problems need to be solved: the convergence of the construction process and its optimization conditions. The specific construction method, the solutions to these two problems, and an algorithm implementation are explained below.

Construction Method
The PCMD is divided into three steps: data collection, failure diagnosis, and preventive control policy fitting. In the first step, the PCMD collects data including the voltage of nodes, the frequency, the current of nodes, and the robustness against CCF, and it also considers the control cost reduction as a feedback variable. In the second step, it detects failure symptoms and predicts a CCF propagation sequence from the occurrence of an initial failure. In the third step, it uses a deep neural network to maintain a parameterized PCP l(s,w_l) that pursues the maximum of the infinite-horizon discounted expected cumulative reward. Based on the model defined in Definition 1, the infinite-horizon discounted expected cumulative reward from an initial state g_0 is J = E[Σ_{i=0}^{∞} γ^i r(F_i)], where E is the expectation over different trajectories, F_i is the steady-state transition at step i of a trajectory, and γ is the discount factor. In addition, the PCMD also needs deep neural networks to maintain a parameterized target PCP l'(s,w_l') and parameterized action-value functions (Q(s,l(s,w_l),w_Q) and Q'(s,l'(s,w_l'),w_Q')) so that l(s,w_l) can be constructed stably, where s ∈ S is a CCF propagation sequence, s_i is the CCF propagation sequence in a steady state g_i, and w_l, w_l', w_Q, and w_Q' are the parameters of the fitted functions.
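The third step's function approximators can be sketched in miniature. The linear "networks" below stand in for the four deep networks l, l', Q, and Q', and the soft-update rule with a hypothetical mixing rate is one illustrative reading of how the target parameters track the online ones; none of the constants come from the paper.

```python
# Linear stand-ins for the four fitted functions: state s and action a are
# scalars, and each "network" is a single weight.
w_l, w_lp = 0.5, 0.5   # actor l(s, w_l) and target actor l'(s, w_l')
w_q, w_qp = 0.0, 0.0   # critic Q and target critic Q'

def actor(s, w):
    """l(s, w_l) = w * s: a toy deterministic policy."""
    return w * s

def critic(s, a, w):
    """Q(s, a, w_Q) = w * s * a: a toy action-value function."""
    return w * s * a

def soft_update(target, online, alpha):
    """Target tracks the online weights: w' <- alpha*w + (1 - alpha)*w'.

    alpha plays the role of a mixing parameter: larger alpha makes the
    target functions converge faster but less stably."""
    return alpha * online + (1.0 - alpha) * target

a = actor(1.0, w_l)        # action proposed for state s = 1.0
q = critic(1.0, a, w_q)    # its (still untrained) value estimate

# One soft-update step with a hypothetical alpha = 0.1, after the online
# weights have moved to 0.9 and 2.0 respectively.
w_lp = soft_update(w_lp, 0.9, 0.1)
w_qp = soft_update(w_qp, 2.0, 0.1)
```

In the full method the four weights are deep network parameter vectors and the critic is trained by temporal-difference targets computed from l' and Q'; the sketch only shows the bookkeeping.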
According to Definition 1, many possible expected steady states could be reached from the current steady state when a PCP attempts to block a CCF propagation sequence. The goal of the constructed PCP is to reach an optimal steady state from the current steady state with minimum cost. These requirements on the expected steady state are described by an objective function, as shown in Formula (1).
where V_P1 denotes the set of power generation nodes (distributed generators), and V_P5 denotes the set of external nodes (the energy from power transmission lines). W_Gi denotes the weight of the control cost objective, and W_RP denotes the weight of the robustness (against CCF) objective. ∆P_Gi is defined in Definition 1. R_PF (a discrete quantity) denotes the proportion of failure nodes during the propagation of CCF in an ADN. Formula (1) describes the control cost objective and the robustness (against CCF) objective. Instead of using the Pareto idea, the commonly used weighting method [19] is applied to integrate the multiple objectives into a single objective function.
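Assuming Formula (1) combines the objectives by the weighting method as described, a minimal sketch of the scalarization might look as follows; the functional form, weights, and sample values are all hypothetical.

```python
def scalar_objective(dP, W_G, R_PF, W_RP):
    """Weighted single-objective surrogate for Formula (1).

    dP:   control adjustments ΔP_Gi of the generator nodes
    W_G:  per-generator control-cost weights W_Gi
    R_PF: proportion of failure nodes during CCF propagation
    W_RP: weight of the robustness objective
    The paper's exact functional form is not reproduced; a weighted sum of
    absolute adjustments plus the weighted failure proportion is used as a
    plausible stand-in."""
    cost = sum(w * abs(dp) for w, dp in zip(W_G, dP))
    return cost + W_RP * R_PF

# Two candidate control plans on a toy 2-generator system.
plan_a = scalar_objective([0.2, 0.1], W_G=[1.0, 1.0], R_PF=0.10, W_RP=5.0)
plan_b = scalar_objective([0.5, 0.4], W_G=[1.0, 1.0], R_PF=0.05, W_RP=5.0)
```

With these weights, plan_a's small adjustments outweigh its slightly worse robustness, so it scores lower (better); shifting W_RP upward would reverse the ranking, which is exactly the trade-off the weights W_Gi and W_RP encode.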
During the process of blocking a CCF propagation sequence, the safety requirements and the basic physical constraints should be obeyed. Formula (2) describes inequality constraints (including continuous quantities and discrete quantities) corresponding to safety requirements. Formula (3) describes the equality constraint of power flow equations in the PN.
where P_Gi and Q_Gi denote the active power and the reactive power of a power generator node n_Gi, respectively, P_Li and Q_Li denote the active power and the reactive power of a load node n_Li, respectively, v_i denotes the voltage of a node n_i in the PN, I_Pi denotes the current of a node n_i in the PN, and fref denotes the frequency of an ADN; P_Gi and fref are defined in Definition 1. To be safe, it is necessary to limit the value ranges of these variables. Correspondingly, P_Gi_min and P_Gi_max denote the lower and upper active power bounds of a power generator node n_Gi, respectively, and Q_Gi_min and Q_Gi_max denote the lower and upper reactive power bounds of a power generator node n_Gi, respectively. Similarly, ∆P_Gi_min and ∆P_Gi_max denote the lower and upper control action bounds, P_Li_min and P_Li_max denote the lower and upper active power bounds of a load node n_Li, respectively, Q_Li_min and Q_Li_max denote the lower and upper reactive power bounds of a load node n_Li, respectively, v_i_min and v_i_max denote the lower and upper voltage bounds of a node n_i in the PN, and I_Pi_max denotes the upper current bound of a node n_i in the PN. fref_min and fref_max denote the frequency limits of an ADN. NON_P is the number of nodes in the PN. ζ_i denotes the voltage phase angle of a node n_i in the PN. Y_ij and α_ij denote the admittance and admittance phase angle between a node n_i and a node n_j in the PN, respectively. In order to obtain the fitted PCP, the objective function and inequality constraints shown in Formulas (1) and (2) are integrated into the infinite-horizon discounted expected cumulative reward, and then the immediate reward r of a transition should be given.
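The inequality constraints of Formula (2) are box bounds on individual quantities. A minimal sketch of checking the bounds and projecting an out-of-range control action back into its range is shown below, with illustrative per-unit limits that are not taken from the paper.

```python
def within(value, lo, hi):
    """Check one inequality constraint of Formula (2): lo <= value <= hi."""
    return lo <= value <= hi

def clip_action(dP, dP_min, dP_max):
    """Project a control action ΔP_Gi back into [ΔP_Gi_min, ΔP_Gi_max]."""
    return max(dP_min, min(dP_max, dP))

# Hypothetical per-unit bounds and readings.
v_ok = within(1.03, 0.95, 1.05)    # node voltage v_i inside its band
f_ok = within(50.4, 49.8, 50.2)    # frequency fref outside its band
a = clip_action(0.8, -0.5, 0.5)    # over-large adjustment is clipped to the bound
```

Clipping is one simple way to keep sampled actions feasible; the reward shaping described next penalizes constraint violations instead of forbidding them outright.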
The immediate reward r is composed of a reward r 1 reflecting voltage stability, a reward r 2 for preventing the CCF caused by the current overload, a reward r 3 reflecting the frequency stability, a reward r 4 reflecting the control cost, and a reward r 5 reflecting robustness against CCF. The reward r 1 is shown in Formula (4).
where r_1 is the weighted sum of r_ni_1, and the definition of r_ni_1 is shown in Formula (5). w_ni is the weight of a node n_i (proportional to the importance of the node), and V_P is the set of nodes in the PN. r_1_ni_1 (greater than zero) and r_1_ni_2 (less than zero) denote the specific immediate reward values when the voltage of the node n_i is in different ranges, ‖·‖ denotes the norm, v_i_normal is the reference voltage value of a node n_i, and v_ni = min{v_i_normal − v_i_min, v_i_max − v_i_normal} is the voltage threshold of a node n_i.
The reward r 2 is shown in Formula (6).
where r_2 is the weighted sum of r_ni_2, and the definition of r_ni_2 is shown in Formula (7). E_P is the set of edges in the PN, r_2_ni_1 (greater than zero) and r_2_ni_2 (less than zero) represent the specific reward values when the current of the node n_i is in different ranges, I_i_normal is the reference current value of a node n_i in an ADN, and I_ni = I_Pi_max is the current threshold value of the node n_i.
The reward r 3 is shown in Formula (8).
where r_3_fref_1 (greater than zero) and r_3_fref_2 (less than zero) represent the specific reward values when the frequency is in different ranges, fref_normal is the reference frequency value, and c_fref = min{fref_normal − fref_min, fref_max − fref_normal} is the frequency threshold value of an ADN. The reward r_4 is shown in Formula (9).
where a ∈ A ⊆ R^m is the control action, r_4_c_1 (greater than zero) and r_4_c_2 (less than zero) denote the specific reward values when the control action is in different ranges, and c_Pc is the control action threshold value. The reward r_5 is the sum of r_5_1 and r_5_2, as shown in Formula (10).
where r_5_1 denotes the number of failure nodes at the current time t, r_5_2 denotes the cumulative number of failure nodes from the start time 0 to the current time t, r_nf1 and r_nf2 denote the weights of r_5_1 and r_5_2, respectively (both greater than zero), and n_failure(i·∆t) (i ∈ {0, 1, 2, . . . , t/∆t}) denotes the number of failure nodes in an ADN at the discrete time i·∆t.
Larger values of the rewards r_1, r_2, r_3, and r_4 are better, while a smaller value of the reward r_5 is better. Thus, the total immediate reward r is shown in Formula (11).
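Since Formulas (4)-(11) are not reproduced above, the sketch below only mirrors the verbal description: each of r_1 through r_4 is positive inside a threshold band and negative outside it, r_5 accumulates failure counts, and r_5 enters the total with a minus sign because smaller is better. All weights, thresholds, and sample readings are made-up values, and the node weights w_ni are taken as 1.

```python
def band_reward(x, x_normal, threshold, r_in, r_out):
    """Positive reward r_in inside the band |x - x_normal| <= threshold,
    negative reward r_out outside it: the shape used for r1..r4."""
    return r_in if abs(x - x_normal) <= threshold else r_out

def immediate_reward(volts, currents, fref, action_norm, n_fail_now, n_fail_total):
    r1 = sum(band_reward(v, 1.0, 0.05, +1.0, -2.0) for v in volts)     # voltage stability
    r2 = sum(band_reward(i, 1.0, 0.20, +1.0, -2.0) for i in currents)  # overload prevention
    r3 = band_reward(fref, 50.0, 0.2, +1.0, -2.0)                      # frequency stability
    r4 = band_reward(action_norm, 0.0, 0.5, +0.5, -1.0)                # control cost
    r5 = 0.5 * n_fail_now + 0.5 * n_fail_total                         # robustness penalty
    return r1 + r2 + r3 + r4 - r5   # r5 subtracted: fewer failures is better

r = immediate_reward(volts=[1.0, 1.02], currents=[0.9], fref=50.0,
                     action_norm=0.3, n_fail_now=1, n_fail_total=3)
```

With all quantities inside their bands, the only negative contribution comes from the failure-count term, so the total stays positive but is pulled down by the ongoing CCF.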
The convergence and optimization conditions of the construction process in the PCMD need to be guaranteed, and they are considered in the following sections.

Convergence
In the PCMD, it is necessary to ensure that the construction process converges and that the PCP against CCF generated from the construction process exists. These requirements are guaranteed by a two-step parameter adjustment. In the first step, the convergence of the construction process is guaranteed by adjusting the parameters of the fitted functions (l(s,w_l), l'(s,w_l'), Q(s,l(s,w_l),w_Q), and Q'(s,l'(s,w_l'),w_Q')). In the second step, the existence of the PCP against CCF is guaranteed by additionally adjusting the parameters of the reward r.
The parameters of the fitted functions that need to be adjusted include the learning rate δ, the initial weights of the fitted functions (l(s,w_l), l'(s,w_l'), Q(s,l(s,w_l),w_Q), and Q'(s,l'(s,w_l'),w_Q')), the parameter α, and the other parameters of the fitted functions based on deep neural networks. If the preconditions of the fixed-point theorem (Theorem 1) hold, then: (1) ϕ(w) has a unique fixed point w⊥ on the closed interval [w_1, w_2]; and (2) the iteration sequence w_{k+1} = ϕ(w_k) converges to w⊥ for any initial value in [w_1, w_2].
The process of fitting l⊥(s) is actually a fixed-point iteration scheme, and the weight w_l of the fitted PCP l(s,w_l) is obtained by solving the equation w_l = ϕ(w_l). The iterative function ϕ is defined as shown in Equation (13),
where N_5 is the size of a trajectory data sample drawn from the data pool, and s_i is the CCF sequence in a steady state g_i. If the adjustments of the learning rate δ and the initial weight of the fitted function l(s,w_l) meet the preconditions of Theorem 1, then the weight sequence of l(s,w_l) converges to the solution w_l^(⊥) of the equation w_l = ϕ(w_l). The following analysis explains how the initial weights of the fitted functions (l'(s,w_l'), Q(s,l(s,w_l),w_Q), and Q'(s,l'(s,w_l'),w_Q')) are adjusted. For example, due to the update of the weight w_Q at step k, there is an error equation, as shown in Equation (14).
where w_Q^(k) denotes the weight of the Q'(s,l'(s,w_l'),w_Q') at step k, w_Q^(u) denotes the weight of the Q(s,l(s,w_l),w_Q) at step u with u ≥ k, and w_Q^(⊥) denotes the true value of the weights w_Q and w_Q'. An inequation is obtained by taking the norm of the elements in Equation (14), as shown in Formula (15).
The true action-value function Q⊥(s_i, a_i) of a steady state g_i satisfies the relation shown in Formula (16).
Thus, the target value y_i of the Q(s_i,l(s_i,w_l),w_Q) in a steady state g_i is derived from Formula (16), as shown in Formula (17).
where s_{i+1} is the CCF sequence of a steady state g_{i+1}. From Formula (17), a loss function L_Q for updating the weight w_Q of the Q(s,l(s,w_l),w_Q) is shown in Equation (18).
According to Formulas (17) and (18), the weight sequence of the Q'(s,l'(s,w_l'),w_Q') converges to the true value w_Q^(⊥); likewise, the weight sequence of the Q(s,l(s,w_l),w_Q) converges to the true value w_Q^(⊥). Formula (19) is obtained by substituting Equation (14) into itself repeatedly and taking the norm of the elements,

where w_Q^(0) denotes the initial weight of the Q(s,l(s,w_l),w_Q), and w_Q'^(0) denotes the initial weight of the Q'(s,l'(s,w_l'),w_Q').
According to Formula (19), the weight of the fitted function Q'(s,l'(s,w_l'),w_Q') tends to the true value w_Q^(⊥) as u and k both tend to infinity. Because the parameters satisfy α < 1 and η_Q ≤ 1, the initial weight w_Q'^(0) of the Q'(s,l'(s,w_l'),w_Q') can be selected randomly. The initial weight w_Q^(0) of the fitted function Q(s,l(s,w_l),w_Q) can also be selected randomly due to Formulas (16) and (17). By a similar analysis, the weight sequence of the l'(s,w_l') converges to the true value w_l^(⊥), and the initial weight w_l'^(0) of the l'(s,w_l') can also be selected randomly. As for the adjustment of the parameter α, its value (α < 1) can be set high in order to speed up the convergence of the fitted functions Q'(s,l'(s,w_l'),w_Q') and l'(s,w_l'). However, because η_Q ≤ 1 and η_Q' ≤ 1, the value of α should not be set too high, in order to preserve the computational stability of the fitted function Q(s,l(s,w_l),w_Q).
In addition, in order to prevent the outputs of the fitted functions (l(s,w_l), l'(s,w_l'), Q(s,l(s,w_l),w_Q), and Q'(s,l'(s,w_l'),w_Q')) from saturating, the data should be standardized (per-unit values are used in this paper), the number of neural network layers should be appropriately reduced, and batch normalization layers should be added, with the activation function layers placed after them. After applying the above parameter adjustments, the convergence of the construction process can be guaranteed, and a PCP can be generated.
In order to guarantee the existence of the PCP against CCF in an ADN, the parameters of the reward r should be adjusted in addition to the parameters of the fitted functions. The idea of the parameter adjustment in the reward r is to reduce the propagation width and depth of CCF. To reduce the propagation width of CCF in an ADN, the number of simultaneous failure nodes should be kept as small as possible while meeting the requirements of voltage stability, frequency stability, current overload prevention, and control cost reduction; specifically, the parameters v_ni, I_ni, r_1_ni_2, r_2_ni_2, c_fref, r_3_fref_2, and r_4_c_2 should be set as low as possible, and the values of the parameters r_1_ni_1, r_2_ni_1, r_3_fref_1, r_4_c_1, and r_nf1 should be increased. Similarly, to reduce the propagation depth of CCF in an ADN, the number of successive failure steps of nodes should be kept as small as possible while meeting the same requirements; specifically, the values of the parameters v_ni, I_ni, r_1_ni_2, r_2_ni_2, c_fref, r_3_fref_2, and r_4_c_2 should be set as low as possible, and r_1_ni_1, r_2_ni_1, r_3_fref_1, r_4_c_1, and r_nf2 should be set as high as possible.
So, after the two-step parameter adjustment, the PCMD converges, and the PCP against CCF generated by the construction process exists.
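The fixed-point iteration w_{k+1} = ϕ(w_k) that underlies the convergence argument above can be demonstrated on a scalar example; the contraction ϕ below is an arbitrary illustrative choice, not the paper's Equation (13).

```python
def phi(w):
    """A contraction on the reals (|phi'(w)| = 0.5 < 1), so by the
    fixed-point theorem the iteration converges to the unique solution
    of w = phi(w), here w = 2."""
    return 0.5 * w + 1.0

def fixed_point(phi, w0, tol=1e-10, max_iter=1000):
    """Iterate w <- phi(w) until successive iterates agree within tol."""
    w = w0
    for _ in range(max_iter):
        w_next = phi(w)
        if abs(w_next - w) < tol:
            return w_next
        w = w_next
    return w

w_star = fixed_point(phi, w0=10.0)
```

Starting far from the solution, the iterates still converge geometrically because the contraction factor is below one; this is the property the learning-rate and initial-weight adjustments are meant to secure for the weight iteration of l(s,w_l).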

Satisfiability of Suboptimal Solution
Given that the PCMD has been guaranteed to converge, it is further necessary to ensure that the process converges to an optimal solution. The following analysis shows that if the compatibility condition is satisfied, the PCP against CCF generated by the PCMD converges to a local optimal solution.
The performance function for evaluating the PCP against CCF generated in the PCMD is defined as the expectation of the fitted function Q(s,l(s,w_l),w_Q), as shown in Equation (20).

J(l(s, w_l)) = E[Q(s, l(s, w_l), w_Q)]    (20)

The gradient theorem of the deterministic control policy and the condition of compatibility between the fitted function Q(s,l(s,w_l),w_Q) and the fitted PCP l(s,w_l) against CCF are given, respectively. The gradient theorem of the deterministic control policy, shown in Theorem 2, ensures the existence of the gradient of the deterministic control policy. In addition, once the compatibility condition between the fitted function Q(s,l(s,w_l),w_Q) and the fitted PCP l(s,w_l), shown in Theorem 3, is satisfied, a local optimal PCP l(s,w_l) against CCF can be constructed through the construction process proposed in this paper.

Theorem 2. (Gradient Theorem of Deterministic Control Policy):
It is assumed that the environment (the controlled object) satisfies the conditions of a Markov decision process (MDP) and that the fitted action-value function Q(s,a) and the fitted PCP l(s, w_l) are both continuously differentiable. It is also assumed that the gradient ∇_a Q(s, a) of Q(s,a) with respect to the control action a and the gradient ∇_{w_l} l(s, w_l) of l(s, w_l) with respect to the parameter w_l exist. Then, according to Equation (20), the gradient of the performance function J(l(s, w_l)) exists and is given by Equation (21) (the proof of Theorem 2 is given in Appendix A):

∇_{w_l} E[Q(s, l(s, w_l))] = E[∇_a Q(s, a = l(s, w_l)) ∇_{w_l} l(s, w_l)] (21)

In the PCMD, there are two fitted PCPs l(s, w_l) and l'(s, w_l'), and two fitted action-value functions Q(s, l(s, w_l)) and Q'(s, l'(s, w_l')). It is assumed that the gradients ∇_{w_l} Q(s, l(s, w_l)), ∇_{w_l} l(s, w_l), ∇_{w_l'} l'(s, w_l'), and ∇_{w_l'} Q'(s, l'(s, w_l')) exist. Then, according to Theorem 2, the gradient of the performance function for the fitted PCP l(s, w_l) is given by Equation (22):

∇_{w_l} J(l(s, w_l)) = E[∇_a Q(s, a = l(s, w_l)) ∇_{w_l} l(s, w_l)] (22)

Correspondingly, the gradient of the performance function for the fitted target PCP l'(s, w_l') is given by Equation (23).
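The identity in Equation (21) can be checked numerically on a toy problem. The sketch below uses an assumed quadratic action-value function Q(s, a) = −(a − s)² and a linear policy l(s, w) = w·s, so that ∇_a Q = −2(a − s) and ∇_w l = s; the chain-rule estimate from Theorem 2 is compared against a direct finite-difference gradient of J(w).

```python
import random

# Numerical check of the deterministic policy gradient theorem (Eq. (21))
# on a toy problem: Q(s, a) = -(a - s)^2 and a linear policy l(s, w) = w*s.
random.seed(0)
states = [random.uniform(-1.0, 1.0) for _ in range(10000)]
w = 0.3

# Chain-rule estimate: E[grad_a Q(s, a = l(s, w)) * grad_w l(s, w)],
# with grad_a Q = -2*(a - s) and grad_w l = s.
pg = sum(-2.0 * (w * s - s) * s for s in states) / len(states)

# Direct finite-difference estimate of grad_w J(w).
def J(w_):
    return sum(-(w_ * s - s) ** 2 for s in states) / len(states)

eps = 1e-5
fd = (J(w + eps) - J(w - eps)) / (2 * eps)

print(abs(pg - fd))  # the two gradient estimates agree closely
```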
∇_{w_l'} J(l'(s, w_l')) = E[∇_a Q'(s, a = l'(s, w_l')) ∇_{w_l'} l'(s, w_l')] (23)

Theorem 3. (Compatibility Theorem of the Fitted Function Q(s,a,w_Q) and the Fitted PCP l(s,w_l)):
In a learning process, Q(s, a, w_Q) is a fitted function approximating the action-value function Q⊥(s, a) of the control action a. If Q⊥(s, a) and the fitted PCP l(s, w_l) are both continuously differentiable, their gradients ∇_a Q(s, a, w_Q), ∇_a Q⊥(s, a), and ∇_{w_l} l(s, w_l) exist, the second gradient ∇_{w_Q} ∇_a Q(s, a, w_Q) also exists, the compatibility condition shown in Equation (24) is satisfied, and the parameter w_Q minimizes the function shown in Equation (25), then there is a local optimal PCP l(s, w_l), as shown in Equation (26). (The proof of Theorem 3 is given in Appendix B.)

∇_a Q(s, a = l(s, w_l), w_Q) = ∇_{w_l} l(s, w_l)^T w_Q (24)

E[ ( ∇_a Q(s, a = l(s, w_l), w_Q) − ∇_a Q⊥(s, a = l(s, w_l)) )^2 ] (25)

E[∇_a Q(s, a = l(s, w_l), w_Q) ∇_{w_l} l(s, w_l)] = E[∇_a Q⊥(s, a = l(s, w_l)) ∇_{w_l} l(s, w_l)] (26)

where ∇_{w_l} l(s, w_l)^T is the transpose of ∇_{w_l} l(s, w_l). According to Theorem 1, if the parameter adjustment conditions are satisfied, then the construction process for a PCP against CCF converges. According to Theorem 2, the existence of the gradient of the performance function J(l(s, w_l)) is guaranteed. According to Theorem 3, if the compatibility condition between the fitted function Q(s, l(s, w_l), w_Q) and the fitted PCP l(s, w_l) is satisfied, then the alternative gradient formed by the fitted function Q(s, l(s, w_l), w_Q) and the fitted PCP l(s, w_l) is equal to the gradient of the performance function J(l(s, w_l)), and a local optimum PCP l(s, w_l) can be constructed through the PCMD.

Algorithm Implementation
Reinforcement learning algorithms based on the stochastic control policy gradient can deal with problems with discrete or continuous control action spaces and discrete or continuous state spaces. However, because a stochastic control policy assigns a probability to each value in the control action space, it is computation intensive and can be practically infeasible in large-scale networks. Hence, the deterministic control policy is adopted. Deep deterministic policy gradient (DDPG) is a deep reinforcement learning algorithm with a deterministic control policy [23]. Unlike the stochastic control policy gradient method mentioned above, it integrates deep neural networks with the deterministic control policy [24] and updates the parameters of the deep neural networks by gradient descent along the deterministic control policy gradient. DDPG uses the deterministic control policy gradient to deal with the continuous action space, and it is applicable to both continuous and discrete state spaces. The DDPG idea is used to implement the construction process for the PCP against CCF in an ADN through offline interactive learning, as shown in Algorithm 1. Algorithm 1 adopts a deep neural network to fit the PCP l(s, w_l), and the weight w_l in l(s, w_l) is learned by the PCMD algorithm.
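The core DDPG update can be sketched without any deep learning library. The sketch below is a minimal, library-free illustration with a scalar state and action: the actor l(s, w_l) = w_l·s and the critic Q(s, a, w_Q) = w_Q[0]·s + w_Q[1]·a are deliberately tiny linear fits standing in for the deep neural networks of Algorithm 1, and the hyperparameter values are illustrative assumptions.

```python
import random

# Minimal sketch of one DDPG-style update step: critic TD update against
# target networks, actor update along the deterministic policy gradient,
# and soft (Polyak) target updates with rate alpha.
gamma, lr, alpha = 0.99, 0.01, 0.005

w_l, w_Q = 0.1, [0.0, 0.0]        # actor / critic weights (linear stand-ins)
w_l_t, w_Q_t = w_l, list(w_Q)     # target actor / target critic weights

def ddpg_step(s, a, r, s_next):
    global w_l, w_Q, w_l_t, w_Q_t
    # Critic: temporal-difference target built from the *target* networks.
    a_next = w_l_t * s_next
    y = r + gamma * (w_Q_t[0] * s_next + w_Q_t[1] * a_next)
    td = y - (w_Q[0] * s + w_Q[1] * a)
    w_Q[0] += lr * td * s         # grad of Q w.r.t. w_Q[0] is s
    w_Q[1] += lr * td * a         # grad of Q w.r.t. w_Q[1] is a
    # Actor: deterministic policy gradient grad_a Q * grad_{w_l} l.
    w_l += lr * w_Q[1] * s        # grad_a Q = w_Q[1], grad_{w_l} l = s
    # Soft update of both target networks.
    w_l_t = (1 - alpha) * w_l_t + alpha * w_l
    w_Q_t = [(1 - alpha) * t + alpha * w for t, w in zip(w_Q_t, w_Q)]

random.seed(1)
for _ in range(100):
    s = random.uniform(-1, 1)
    ddpg_step(s, w_l * s + random.gauss(0, 0.1), 1.0, s * 0.9)
```

The exploration noise added to the action and the slow target updates mirror the structure of Algorithm 1; the reward and transition model here are placeholders.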
Here, Rand denotes a random number, RB denotes the size of the replay buffer, and randN denotes the selected random process. N_1, N_2, N_4, and N_5 denote parameters in Algorithm 1, and N_3 denotes a random process. α and δ are described in Section 3.2.
The execution order of the PCMD algorithm for an ADN is as follows. For each time step ∆t of each episode, the algorithm first detects failure symptoms and predicts a CCF propagation sequence s_t*∆t starting from the occurrence of an initial failure. Then, the fitted PCP l(s, w_l) selects a control action a_t*∆t at discrete time t*∆t according to the current CCF propagation sequence s_t*∆t, the ADN transits from the current state g_t*∆t to another state g_(t+1)*∆t under the influence of the CCF propagation sequence s_t*∆t, and the transition (s_t*∆t, a_t*∆t, r_t*∆t, s_(t+1)*∆t) is put into the replay buffer. Finally, the weights of Q(s, l(s, w_l), w_Q), Q'(s, l'(s, w_l'), w_Q'), l(s, w_l), and l'(s, w_l') are updated.
The replay buffer mechanism used in Algorithm 1 was first proposed in [25]. The replay buffer mechanism makes correlated data independent and lets noise cancel out, which makes the learning process of Algorithm 1 converge faster. Without a replay buffer mechanism, Algorithm 1 could perform gradient descent in the same direction for a period of time; under the same step size, calculating the gradient directly in this way may prevent the learning process from converging. The replay buffer mechanism selects some samples randomly from a memory pool, updates the weight of the fitted Q(s, l(s, w_l), w_Q) by temporal-difference learning, uses this weight information to estimate the gradients of Q(s, l(s, w_l), w_Q) and l(s, w_l), and then updates the weight of the fitted PCP. The introduction of random noise in Algorithm 1 ensures the execution of the exploration process. According to Section 3.2, the convergence of Algorithm 1 can be guaranteed via parameter adjustments, and a PCP against CCF can be obtained. In addition, the introduction of the parameter α in Algorithm 1 ensures the stability of the weight updates in l'(s, w_l') and Q'(s, l'(s, w_l'), w_Q') and thus the stability of the whole learning process. According to Section 3.3, once the compatibility condition is satisfied, the generated PCP against CCF in an ADN converges to a local optimal solution. Thus, after interacting iteratively with the offline trajectory dataset of an ADN, a fitted PCP can be learned by Algorithm 1.
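The replay buffer itself is a simple bounded store with uniform random sampling. A minimal sketch (capacity and batch size are illustrative, not the values used for the CEPR-CPS):

```python
import random
from collections import deque

# Sketch of the replay buffer mechanism: transitions (s, a, r, s') are
# stored up to a fixed capacity RB, and minibatches are drawn uniformly
# at random so gradient steps are computed on decorrelated samples.
class ReplayBuffer:
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest transitions evicted first

    def add(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of the
        # trajectory data, which is what stabilizes the gradient descent.
        return random.sample(self.buf, min(batch_size, len(self.buf)))

rb = ReplayBuffer(capacity=1000)
for t in range(50):
    rb.add(t, 0.0, 1.0, t + 1)  # placeholder transitions
batch = rb.sample(8)
```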

Case Study
The experimental CEPR-CPS is an ADN case designed from Table 1. A directed independence edge points from a node to its successor node. Nodes of the PN supply power to nodes of the CN; accordingly, nodes of the CN collect information about nodes in the PN and thus control their actions. For example, node 43 of the PN supplies power to nodes 104, 106, and 107 of the CN (shown by the dotted orange line in Figure 4), and node 99 of the CN collects information about node 16 in the PN (shown by the dotted blue line in Figure 4). The CEPR-CPS is used to verify the effectiveness of the PCP against CCF generated by the PCMD compared with the one generated by the multi-agent method.

Trajectory Data Collection
In order to construct a PCP against CCF in the CEPR-CPS using the PCMD, the offline trajectory data {(g_i*∆t, a_i*∆t, r_i*∆t, g_(i+1)*∆t) | i∈Tra} must be collected from the CEPR-CPS, where Tra = {0, 1, 2, . . . , Tr/∆t} and Tr denotes the duration of a trajectory. According to Definition 1, the data are collected in the CEPR-CPS at discrete times i*∆t (i ≥ 0), and the collected data include the state data g_i*∆t, the control action data a_i*∆t, and the reward data r_i*∆t of the CEPR-CPS.
The state data g i*∆t of CEPR-CPS include the three-phase voltage and the three-phase current values (complex number, per unit) of the 80 nodes, the rotor mechanical angle (deg) and the rotor speed (per unit) of a hydraulic turbine node, the active and the reactive power values (complex number, per unit) of two wind power generation nodes, and the three-phase voltage values and the three-phase current values (complex number, per unit) from inverters of one photovoltaic node, one micro-gas turbine node, and three energy storage nodes.
According to different operating conditions, leading conditions and subsequent constraints, the control action data a i*∆t of CEPR-CPS include the phase voltage of the controllable voltage source of energy storage nodes, the reactive power of wind power generation nodes, the three-phase voltage of the controllable voltage source of micro-gas turbine nodes and photovoltaic nodes, and the active power of a hydraulic turbine node.
The reward data r i*∆t include a reward r 1 reflecting the voltage stability, a reward r 2 for preventing the CCF caused by the current overload, a reward r 3 reflecting the frequency stability, a reward r 4 reflecting the control cost, and a reward r 5 reflecting robustness against CCF.
After the state data g_i*∆t, the action data a_i*∆t, and the reward data r_i*∆t are obtained at discrete time i*∆t (i ≥ 0), the transition data (g_i*∆t, a_i*∆t, r_i*∆t, g_(i+1)*∆t) can be constructed. After the transition data are chained together in a successively dependent manner, trajectory data are obtained. Multiple different trajectories then form the offline trajectory dataset {{(g_i*∆t, a_i*∆t, r_i*∆t, g_(i+1)*∆t) | i∈Tra}}.
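The chaining of per-step samples into transitions and trajectories can be sketched as follows. The states, actions, and rewards here are placeholders; in the CEPR-CPS they are the measured quantities listed above.

```python
# Sketch of assembling the offline trajectory dataset: successive samples
# (g_i, a_i, r_i) are chained into transitions (g_i, a_i, r_i, g_{i+1}),
# and multiple trajectories are collected into one dataset.
def build_trajectory(states, actions, rewards):
    """Chain per-step samples into (g_i, a_i, r_i, g_{i+1}) transitions."""
    return [(states[i], actions[i], rewards[i], states[i + 1])
            for i in range(len(states) - 1)]

# Two toy trajectories with Tr/dt = 3 steps each (placeholder data).
traj1 = build_trajectory([0, 1, 2, 3], ["a0", "a1", "a2"], [0.1, 0.2, 0.3])
traj2 = build_trajectory([5, 6, 7, 8], ["b0", "b1", "b2"], [0.4, 0.5, 0.6])
dataset = [traj1, traj2]  # the offline trajectory dataset
```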

Learning Process
After the trajectory data collection step, the PCMD interacts iteratively with the offline trajectory dataset of the CEPR-CPS. Once the parameters in the PCMD are adjusted and the compatibility condition between the fitted action-value function Q(s, l(s, w_l), w_Q) and the fitted PCP l(s, w_l) against CCF is satisfied, the weights of the fitted action-value functions and a local optimal preventive control policy are learned. The parameters used in the PCMD for the CEPR-CPS are shown in Table 2, where T_S is the sampling period (seconds), T_F = Tr is the simulation time (seconds), N_3.var is the variance of the random process, N_3.decR is the decay rate of that variance, gr_Th denotes the gradient threshold, stopTrainV denotes the stopping training threshold, cl_RB denotes whether to clear the replay buffer, and isD denotes the stopping threshold of every simulation subprocess. The state is 504-dimensional in total, and the control action is 29-dimensional, corresponding to the outputs of the distributed generators. In the fitted Q function, the CCF propagation sequence input layer is regularized and fully connected with 600 neuron nodes, followed by a ReLU fully connected layer with 400 neuron nodes; the action input layer is fully connected with 400 neuron nodes, and the layer merging the CCF propagation sequence path and the action path is a ReLU fully connected layer containing only one neuron node. In the fitted PCP l, the first layer is regularized and fully connected with 300 neuron nodes, and the output layer is a tanh fully connected layer with 29 neuron nodes; there are no further hidden layers.
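The layer dimensions stated above can be summarized compactly. This is only a sketch of the stated shapes: the regularization and initialization details are not reproduced, and the assumption that the merge layer concatenates the 400-dimensional state path with the 400-dimensional action path before its single output neuron is ours, not stated in the text.

```python
# Sketch of the stated layer dimensions for the two fitted networks.
STATE_DIM, ACTION_DIM = 504, 29

critic_layers = [
    ("state_input_fc", STATE_DIM, 600),    # CCF-sequence input path
    ("state_fc_relu", 600, 400),
    ("action_input_fc", ACTION_DIM, 400),  # action input path
    ("merge_fc_relu", 400 + 400, 1),       # assumed concatenation -> 1 Q value
]

actor_layers = [
    ("fc_regularized", STATE_DIM, 300),
    ("fc_tanh", 300, ACTION_DIM),          # 29 distributed-generator outputs
]

def param_count(layers):
    """Weights plus biases of the fully connected layers."""
    return sum(n_in * n_out + n_out for _, n_in, n_out in layers)
```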

Results and Analysis
The case study focuses on a scenario in which the open-circuit failure of a node is the initial failure, that is, the initial failure is a physical equipment failure in the CEPR-CPS. The multi-agent method [11] is selected as the comparison control method. Comparison diagrams of CCF initiated by a node failure in the PN and in the CN are shown in Figures 5 and 6, respectively.
According to Figures 5 and 6, the PCMD proposed in this paper can block the occurrence of CCF in the CEPR-CPS, in contrast to the existing multi-agent method. Meanwhile, the conclusion that the convergence of the PCMD can be guaranteed by parameter adjustments is verified. The conclusion that a local optimal PCP can be constructed under the compatibility condition between a fitted Q function and a fitted policy function is also verified. The PCMD is better than the multi-agent method in terms of reducing the number of failure nodes and avoiding the state space explosion problem. The multi-agent method suffers from the curse of dimensionality (feature dimension) and is not suitable for large-scale networks due to its poor scalability, whereas the PCMD overcomes the curse of dimensionality and can be extended to large-scale networks.
The PCMD is also compared with two other kinds of methods, theoretical analysis methods and physical experimental methods, in terms of time, cost, reliability, and accuracy. The specific comparison results are shown in Table 3. They show that the PCMD is better than the theoretical analysis methods and the physical experimental methods in terms of time and cost.

Discussion
In this paper, a preventive control model based on finite automaton theory is designed. It is a six-tuple describing preventive actions for blocking the propagation of cross-domain cascading failures in an active distribution network of a cyber-physical system, under the principle that cross-domain cascading failure sequences should be as short as possible. This model is helpful for guiding the trajectory data collection and learning policy selection.
Then, this paper proposes a methodology (named PCMD) for constructing a preventive control policy to block the propagation of cross-domain cascading failures in an active distribution network of a cyber-physical system. The methodology is based on the deep deterministic policy gradient idea: it trains deep neural networks with trajectory data samples originating from simulations and does not need to consider the specific power flow equations. In addition, parameter adjustments are proposed to guarantee the convergence of the construction process for generating a preventive control policy against cross-domain cascading failures. The gradient theorem of the deterministic control policy and the compatibility condition between the fitted state-action function and the fitted preventive control policy are given to ensure the suboptimality of the generated preventive control policy.
Finally, the proposed PCMD has been verified on the CEPR-CPS provided by the China Electric Power Research Institute. It is shown that the PCMD is better than the multi-agent method in terms of reducing the number of failure nodes and avoiding the state space explosion problem; the scalability of the multi-agent method is poor, and it is not suitable for large-scale networks due to the curse of dimensionality. However, the space complexity of the PCMD is high, since it must store a large amount of data. In addition, the proposed PCMD is compared with the theoretical analysis methods and the physical experimental methods in terms of time, cost, reliability, and accuracy. It is shown that the PCMD is better than the theoretical analysis methods and the physical experimental methods except in terms of reliability.
In future work, one point remains to be considered. The propagation process of cross-domain cascading failures is time-dependent, and it also takes time for a preventive control policy to take effect. Therefore, how to reduce the deviation between the time when the preventive control policy takes effect and the time of failure occurrence, so that the control actions applied by the preventive control policy take effect before the next failure in a cross-domain cascading failure propagation sequence occurs and an accurate preventive control effect is achieved, remains an open question.

Conflicts of Interest:
The authors declare that they have no conflicts of interest.