A Hierarchical Self-Adaptive Method for Post-Disturbance Transient Stability Assessment of Power Systems Using an Integrated CNN-Based Ensemble Classifier

Abstract: Data-driven approaches using synchronous phasor measurements are playing an important role in transient stability assessment (TSA). For post-disturbance TSA, there is no definite conclusion about how long the response time should be. Furthermore, previous studies seldom considered the confidence level of prediction results or the specific stability degree. Since transient instability can develop very fast and cause tremendous economic losses, there is an urgent need for faster response speed, credible and accurate prediction results, and a specific stability degree. This paper proposes a hierarchical self-adaptive method using an integrated convolutional neural network (CNN)-based ensemble classifier to solve these problems. Firstly, a set of classifiers are sequentially organized at different response times to construct the different layers of the proposed method. Secondly, confidence-integrated decision-making rules are defined. Instances predicted as credible stable/unstable cases are sent into the stable/unstable regression model built at the corresponding decision time. The simulation results show that the proposed method can not only balance the accuracy and rapidity of transient stability prediction, but also predict the stability degree with very low prediction errors, allowing more time and an instructive guide for emergency controls.


Introduction
Transient stability or large-disturbance rotor angle stability is referred to as the ability of an interconnected power system to maintain synchronism when subjected to a large disturbance, such as a three-phase short-circuit fault on a transmission line [1]. Transient instability may develop into catastrophes, such as cascading failures and/or widespread blackouts. Therefore, transient stability assessment (TSA) has significant importance in security monitoring of power systems. It is an essential requirement to maintain transient stability in power system operation.
To achieve rapid real-time TSA, the data-driven artificial intelligence (AI) approach [2][3][4][5][6][7] was identified as a novel and promising solution. First of all, through offline training on massive sets of data generated through time-domain (T-D) simulation, it can capture the potential useful knowledge to map the relationship between inputs (features of power system) and outputs (the corresponding dynamic security indices, such as stability status or stability margin/degree). Then, when it comes to online

Generation of Dataset
In terms of data-driven AI methods for TSA, an offline stage of training on a massive amount of data generated by T-D simulations needs to be done as the first step. The methods aim to map the relationship between the inputs (i.e., features) and the outputs (i.e., stability status or stability degree) [4], expressed as follows:

y = f(x),

where x represents the input predictors, namely, features of the power system, y denotes the corresponding dynamic security indices, such as stability status or stability degree, and f represents the mapping function. The training dataset can be simulated via T-D simulation on a certain power system with various load/generation patterns, network topologies, and fault types, locations, and duration times. TSI [4,22] is defined as follows:

TSI = (360° − |∆δ|max) / (360° + |∆δ|max),

where |∆δ|max is the maximum of the absolute value of the angle separation (in degrees) between any two generators. For the classification of transient stability status in this paper, when TSI > 0, the system is considered stable, and vice versa. For the regression of transient stability degree in this paper, a continuous value of TSI is used as the label y. For a stable instance, the TSI lies between 0 and 1; the greater the value of TSI, the more stable the system. For an unstable instance, the TSI lies between −1 and 0; the smaller the value of TSI, the more unstable the system.
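Under this definition, the TSI label of one simulated instance follows directly from the generator rotor-angle trajectories. A minimal sketch (assuming angles are given in degrees; the function names are illustrative):

```python
import numpy as np

def transient_stability_index(rotor_angles_deg):
    """Compute TSI from post-disturbance rotor-angle trajectories.

    rotor_angles_deg: array of shape (n_generators, n_timesteps), in degrees.
    Returns TSI in (-1, 1): TSI > 0 means stable, TSI < 0 means unstable.
    """
    angles = np.asarray(rotor_angles_deg, dtype=float)
    # Maximum angle separation between any two generators over the trajectory:
    # at each timestep, the spread is max(angle) - min(angle) across machines.
    delta_max = np.max(angles.max(axis=0) - angles.min(axis=0))
    return (360.0 - delta_max) / (360.0 + delta_max)

def stability_label(rotor_angles_deg):
    # Classification label: 1 for stable (TSI > 0), 0 for unstable.
    return 1 if transient_stability_index(rotor_angles_deg) > 0 else 0
```

The continuous TSI value serves as the regression target, while its sign yields the binary classification label.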
Since the label of each simulation instance is determined by TSI, which is calculated in terms of generator rotor angle, the rotor angle-related variables such as rotor angle/speed/acceleration and kinetic energy are usually used as the input features [11][12][13]. The trajectory of relative electromagnetic power, which represents the ratio of the electromagnetic and mechanical power, also contains abundant information of system transient stability status, because it reflects the recovery of the electromagnetic power after the fault clearance [13]. In the literature [4,14], the voltage magnitudes of all the generator buses are often used as inputs. In this research, the dynamic trajectories of voltage magnitude, relative rotor angles, speeds, accelerations, and kinetic energy, electromagnetic power, and relative electromagnetic power of all the generators are used as the inputs.
The input feature types are described in Table 1, where n is the number of total generators in the test system, and s indexes the feature sampling times, which reflect the response time after fault clearance. When s = 1, t1 represents the first sampling time after fault clearing. The calculation of input features and the data normalization process are explained in detail in Appendix A.

Table 1. Input feature types.

No. | Feature type                   | Dimension | Elements
1   | Voltage magnitude              | k × n     | U_i(t_s), i = 1, ..., n; s = 1, ..., k
2   | Relative rotor angle           | k × n     | δ_i(t_s), i = 1, ..., n; s = 1, ..., k
3   | Relative rotor speed           | k × n     | ω_i(t_s), i = 1, ..., n; s = 1, ..., k
4   | Relative rotor acceleration    | k × n     | α_i(t_s), i = 1, ..., n; s = 1, ..., k
5   | Relative kinetic energy        | k × n     | EK_i(t_s), i = 1, ..., n; s = 1, ..., k
6   | Electromagnetic power          | k × n     | P_ei(t_s), i = 1, ..., n; s = 1, ..., k
7   | Relative electromagnetic power | k × n     | P'_ei(t_s), i = 1, ..., n; s = 1, ..., k
Taking feature type 1 for example, the normalized input feature type 1 of each instance can be stored as an array which has a similar data structure to that of an image, as shown in Figure 1. Its two axes named generator and time can be viewed as the width and length axes of the image. This characteristic inspired us to apply CNN, a widely recognized deep learning method for image classification, to extract the mapping knowledge for TSA.

Convolutional Neural Network

As a well-known deep learning method, CNN works as a deep-structure feedforward neural network. It consists of an input layer, multiple hidden layers, and an output layer. There are three common types of hidden layers: the convolutional layer, the pooling layer, and the fully connected layer. Referring to LeNet-5 [23], the order of the layers in a common CNN structure is as follows: input layer - convolution layer - pooling layer - convolution layer - pooling layer - fully connected layer - output layer, as illustrated in Figure 2. The difference in the forward propagation process between the convolution layer and the fully connected layer is the convolutional neuron, which shares the same weights across each receptive field (i, j), greatly reducing the number of parameters of the network. The output g(i, j) of a convolution layer at input location (i, j) is as follows:

g(i, j) = f(W · X(i, j) + b),

where W is the weight matrix of a filter, b is the bias, "·" is the dot product, representing the sum of the products of the corresponding elements in the matrix, X(i, j) is the receptive field at location (i, j), and f is the activation function, the rectified linear unit (ReLU), where ReLU(x) = max(0, x). The forward propagation process of the convolutional layer is described with a simple example in Appendix B.

A pooling layer is often added between the convolutional layers. It can effectively reduce the size of the matrix, thereby reducing the parameters in the final fully connected layer. Using a pooling layer can speed up calculations and prevent over-fitting problems. The most commonly used pooling layer, max-pooling, is utilized here.
In CNN, a larger convolution kernel size corresponds to more free parameters (i.e., weights and biases), which means more computation and more time to train the model. The most commonly used convolution kernel size is 3 × 3, and the size of max-pooling is usually selected as 2 × 2. After several convolutional and pooling layers, the output node matrix is flattened and connected to a fully connected layer, whereby the calculation process is the same as in an artificial neural network (ANN) with ReLU as the activation function. The training algorithm of the CNN used in this paper is outlined in Appendix C [24]. The final output layer uses the softmax function [11,23], described as follows:

P(C_k|X) = exp(y_k(X)) / Σ_j exp(y_j(X)), k = 1, 2,

where y_k(X) is the k-th input of the softmax layer for instance X, and P(C_1|X) and P(C_2|X) are the probabilities of instance X being identified as category-1 and category-2 for a binary classification problem (i.e., transient stability status prediction). The final output of the CNN is ŷ = (P(C_1|X), P(C_2|X)); if P(C_1|X) > P(C_2|X), the instance is identified as category-1, and vice versa. In previous literature [11,17-19], it was validated that the CNN is effective on a number of benchmark models and real-life problems in both classification and regression fields. It shows stronger recognition ability for highly nonlinear patterns, learns more useful features automatically from a massive amount of time-series data, dramatically reduces the number of network structure parameters, and has better generalization capacity than some conventional techniques [25-27].
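The forward computations just described (single-filter convolution with ReLU, 2 × 2 max-pooling, and softmax) can be sketched in plain NumPy; this is an illustrative toy, not the paper's TensorFlow implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv2d_single(X, W, b):
    """Valid convolution with one filter: g(i, j) = ReLU(W . X(i, j) + b)."""
    h, w = X.shape
    kh, kw = W.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product over the receptive field at location (i, j).
            out[i, j] = np.sum(W * X[i:i + kh, j:j + kw]) + b
    return relu(out)

def max_pool2(X):
    """2x2 max-pooling with stride 2 (input dimensions assumed even)."""
    h, w = X.shape
    return X.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(y):
    """Softmax over the output-layer inputs y_k(X)."""
    y = np.asarray(y, dtype=float)
    e = np.exp(y - np.max(y))  # subtract the max for numerical stability
    return e / e.sum()
```

Note how every output location reuses the same W and b, which is exactly the weight sharing that shrinks the parameter count relative to a fully connected layer.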

The Proposed IS Model

In this section, following the basic idea illustrated in Section 3, we develop an IS model for TSA based on CNN. The workflow chart of the proposed IS model is shown in Figure 3.

CNN-Based Ensemble Learning

Ensemble learning has been widely applied in many research fields, such as machine learning [28,29], electrical power systems [3,11,20], and computational biology [30-33]. It is an effective strategy to increase accuracy: thanks to the diversity of the individual learners, the single learners can compensate for each other and tend to reduce aggregate variance, achieving better results. Different dynamic response trajectories of power systems contain rich dynamic information, and making full use of this information helps better predict the transient stability of power systems. However, the determination of a better classifier with favorable input features can only be done by repeated time-consuming trials. Moreover, it is hard to say the selected features are versatile for all application scenarios. With this in mind, different CNNs with the same model structure but different types of features as inputs were trained to construct an ensemble classifier. The obtained single CNN models were different even with the same network structure and learning rate, because the input features were different and the initial weights and biases were randomly selected. With this stochastic nature, comprehensively utilizing the diversity of these single classifiers can further improve classification performance. On the basis of previous work [3,11,14,20], this paper proposes a novel IS based on CNN-based ensemble classifiers to more efficiently achieve post-disturbance TSA of power systems. As mentioned in Section 2, seven kinds of input features are utilized, which means that there are seven individual CNN classifiers, each constructed using an individual input feature type. The ensemble classifier consisting of m (m = 1, 2, . . . , 7) different individual CNNs is defined as type m. The performance of the different types of classifiers is analyzed in Section 5.4.

Integrated Decision-Making Rule
Different from most ensemble models, which determine the final output by voting on the majority or averaging the individual outputs [28,29,34], a tailored integrated decision-making rule is proposed in this paper. Each individual CNN classifier exports two probabilities, ŷ = (P(C_1|X), P(C_2|X)) [11,23], which represent the confidence of instance X belonging to each class. The specific integrated decision-making rule [35] for ensemble classification is described in Algorithm 1.

Algorithm 1: Integrated decision-making rule for CNN-based ensemble classification
Given m single CNNs with different types of input features, CNN1, CNN2, . . . , CNNm, whose first outputs (the probabilities of being stable) are represented by ŷ1, ŷ2, . . . , ŷm:
If min(ŷ1, . . . , ŷm) > α, the instance is a credible stable instance at the current decision time;
Else if max(ŷ1, . . . , ŷm) < β, the instance is a credible unstable instance at the current decision time;
Else it is an uncertain instance at the current decision time,
where α and β are user-defined thresholds to judge whether the instance is credible stable/unstable or an uncertain instance.
end
With the integrated decision-making rule, the ensemble classifier can identify the credible stable/unstable instances and uncertain instances. The credible instances are sent into the corresponding regression model to regress the TSI, which reflects their stability degree. The uncertain instances need further identification at the next TSA decision time. The stable regression model and the unstable regression model were built based on CNN by training on the stable dataset and unstable dataset, respectively, with inputs corresponding to the time series of post-disturbance voltage magnitude trajectories from the first post-disturbance sampling to the different decision times.
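Algorithm 1 can be sketched in a few lines, assuming each single CNN's first output is its stable-class probability (the function name and return labels are illustrative):

```python
def integrated_decision(stable_probs, alpha=0.9, beta=0.1):
    """Integrated decision-making rule for the CNN-based ensemble classifier.

    stable_probs: first outputs (stable-class probabilities) of the m
    single CNNs for one instance.
    Returns 'credible_stable', 'credible_unstable', or 'uncertain'.
    """
    if min(stable_probs) > alpha:
        return 'credible_stable'    # every single CNN is confident: stable
    if max(stable_probs) < beta:
        return 'credible_unstable'  # every single CNN is confident: unstable
    return 'uncertain'              # defer to the next decision time
```

Requiring agreement of all m classifiers (min/max rather than an average) is what makes a "credible" verdict strict: a single dissenting CNN pushes the instance into the uncertain set.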

Hierarchical Self-Adaptive Method for TSA
The structure of the proposed hierarchical self-adaptive IS is described in Figure 4. There is a series of individual CNN-based ensemble classifiers, each performing TSA at a different response time (note that each classifier is trained using a different time series of the features mentioned above, from the first post-disturbance sampling time to the corresponding response time). For the ensemble classifier at response time t = T_s, only when the minimum value of the first outputs of the single CNN classifiers is greater than α can the instance be determined as credible stable, and only when the maximum value of these first outputs is smaller than β can the instance be determined as credible unstable; in either case, the IS delivers the transient stability prediction result at t = T_s. Otherwise, the instance is determined as an uncertain instance and the classification continues at time t = T_{s+1}. The instances far away from the classification boundary are easier to identify, even at a very short response time. As time goes by, the dynamic characteristics of the power system become more and more obvious and separable; thus, the uncertain instances are recognized as credible instances at longer response times. Therefore, the proposed hierarchical self-adaptive method balances the rapidity and accuracy of transient stability prediction. The stability degree of the instances identified as credible stable and credible unstable can then be predicted using the stable regression model and the unstable regression model, respectively.
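The hierarchical workflow can be sketched as a loop over decision times; the classifier interface below is a hypothetical stand-in for the ensembles trained at each response time:

```python
def hierarchical_tsa(instance, classifiers, decision_times, alpha=0.9, beta=0.1):
    """Query the ensemble at successive decision times until credible.

    classifiers: dict mapping each decision time to a callable that returns
    the m single-CNN stable-class probabilities for the trajectory observed
    up to that time (interface assumed for illustration).
    Returns (decision_time, verdict); the verdict is 'uncertain' only if no
    ensemble produced a credible result by the last decision time.
    """
    for t in decision_times:
        probs = classifiers[t](instance)
        if min(probs) > alpha:
            return t, 'credible_stable'    # hand off to the stable regression model
        if max(probs) < beta:
            return t, 'credible_unstable'  # hand off to the unstable regression model
    return decision_times[-1], 'uncertain'
</```

Easy cases exit at the earliest decision time, while borderline cases automatically wait for more post-disturbance data, which is the rapidity/accuracy trade-off the method exploits.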

Case Study
The developed IS was tested on a popular benchmark system: IEEE 10-machine 39-bus system. Detailed parameters of this system can be found in Reference [36]. The system frequency was 60 Hz. The IS network was implemented in TensorFlow, a state-of-the-art open-source machine learning framework, by using the computer programming language of Python [37]. Experiments were carried out on a 64-bit personal computer with an Intel Core i5-7200U central processing unit and 4.00 GB of random-access memory. The T-D simulations were conducted by MATLAB Power System Toolbox 3.0.

Data Generation
During the generation of the dataset, the system generation and load patterns were randomly varied within the 75-120% level of the initial operation condition. The contingencies were mainly three-phase permanent short-circuits at ten locations ranging from the beginning to 90% of the length of each transmission line with an increment of 10%. The simulation time was 5 s, and the simulation step was 0.0167 s (one cycle was 0.0167 s for the 60-Hz system). There were 11 assumed types of failure duration times, ranging from 0.0167 s to 0.1837 s with an increment of 0.0167 s. For this test system, 37,400 instances including 24,864 stable instances and 12,536 unstable instances were generated. For the proposed IS model, 22,400 instances were selected for model training, 5000 instances were selected to build the validation dataset which was used for model selection and prevented over-fitting, and the remaining 10,000 instances were selected to form the testing dataset.
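The contingency grid described above can be enumerated directly; the line identifiers below are placeholders, and the operating-point variation (75-120% of the initial condition) is sampled separately:

```python
import itertools

CYCLE = 1.0 / 60.0  # one cycle of the 60-Hz system, approximately 0.0167 s

# Ten fault locations: 0%, 10%, ..., 90% along each transmission line.
fault_locations = [0.1 * k for k in range(10)]
# Eleven fault duration times: 1 to 11 cycles (about 0.0167 s to 0.1837 s).
fault_durations = [CYCLE * k for k in range(1, 12)]

def contingency_grid(lines):
    """Enumerate (line, fault location, fault duration) contingencies."""
    return list(itertools.product(lines, fault_locations, fault_durations))
```

Each line thus contributes 10 × 11 = 110 contingencies; crossed with the randomly varied generation/load patterns, this produces the 37,400 labeled instances reported.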

Indices for Performance Evaluation
Transient stability prediction is a typical imbalanced classification task, where the number of stable cases exceeds that of unstable cases. In the meantime, the costs of misdetections and false alarms are quite different: it is much more serious to misclassify an unstable instance as a stable one. If a stable case is misclassified as unstable, it may cause malfunction of some control devices, but it has little effect on the safe and stable operation of the entire system. If an unstable case is misclassified as stable, no measure is taken to prevent the instability accident, which causes tremendous financial losses and disastrous consequences. Therefore, the accuracy over the overall dataset does not fully reflect the classification performance of the classifier. With this in mind, in addition to the accuracy of the whole dataset, an index named HRP (harmonic mean of recall and precision) [4] is introduced to better evaluate the performance of unstable instance identification; it is modified to focus on the unstable instances. According to the confusion matrix of TSA shown in Table 2, the related indices are defined as follows:
• Accuracy, abbreviated as ACC, is the proportion of instances that are correctly predicted by the classifier.
• HRP, the harmonic mean of recall and precision, is commonly used in evaluating classifiers; here, recall and precision are computed with the unstable class as the positive class.
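A minimal sketch of the two indices, assuming the unstable class is treated as the positive class (consistent with HRP being modified to focus on unstable instances); the count names are illustrative:

```python
def acc(tp, fn, fp, tn):
    """Overall accuracy: proportion of correctly predicted instances."""
    return (tp + tn) / (tp + fn + fp + tn)

def hrp(tp, fn, fp):
    """Harmonic mean of recall and precision for the unstable (positive) class.

    tp: unstable predicted unstable; fn: unstable predicted stable
    (misdetections); fp: stable predicted unstable (false alarms).
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return 2 * recall * precision / (recall + precision)
```

On an imbalanced test set, a classifier can reach high ACC while missing many unstable cases; HRP drops sharply in that situation, which is why both indices are tracked.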

Implementation Details
In this research, each individual CNN classifier consisted of two convolution layers, two pooling layers, and two fully connected layers. The detailed parameters of each individual CNN classifier are shown in Table A1 (Appendix C). Both fully connected layer 1 and fully connected layer 2 use the dropout technique with a dropout rate of 0.5. For specific details of the dropout technique, the reader can refer to Reference [38]. Each individual CNN classifier was trained with mini-batch stochastic gradient descent with an exponentially decaying learning rate. Its initial learning rate was set to 0.01 with an attenuation coefficient of 0.99. The exponentially decaying learning rate allows the model to quickly approach a good solution in the early stage of training, without too much fluctuation in the later stage, ending up much closer to the local optimum. For specific details of the exponentially decaying learning rate, the reader can refer to Reference [37]. The scripts of the proposed hierarchical self-adaptive post-disturbance TSA method and the training process of a single CNN are described in Appendix D, where Figure A2 illustrates the processes of offline training and online application of the proposed method.
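The exponential decay schedule can be written explicitly; `decay_steps` is an assumed granularity parameter not stated in the text (TensorFlow's `tf.train.exponential_decay` uses this same form):

```python
def exponential_decay_lr(step, initial_lr=0.01, decay_rate=0.99, decay_steps=100,
                         staircase=False):
    """Learning rate after `step` training steps:
    initial_lr * decay_rate ** (step / decay_steps).
    With staircase=True the exponent is truncated to an integer, so the
    rate decays in discrete jumps instead of continuously.
    """
    exponent = step / decay_steps
    if staircase:
        exponent = step // decay_steps
    return initial_lr * decay_rate ** exponent
```

With initial_lr = 0.01 and decay_rate = 0.99 as in the paper, the rate shrinks by 1% every `decay_steps` steps: fast progress early, small stable updates late in training.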

Parameter Determination
In this test, it was assumed that the generator voltage phasor measurements, rotor angles, electromagnetic power, and mechanical power were sampled at a rate of one sample per cycle. Therefore, we trained each individual CNN classifier with the input trajectories from the first sampling time after fault clearance up to each response time. To build the hierarchical model shown in Figure 4, the composition of the ensemble classifier should be determined first. It is helpful to evaluate the performance of the CNN-based ensemble classifier without the integrated decision-making rule, just using the average value of the individual outputs [11]. We randomly selected m (m = 1, 2, . . . , 7) individual CNNs from the seven kinds of individual CNNs to construct classifier type m and calculated the ensemble performance. Then, this operation was repeated many times. Lastly, the average value of these performances was calculated as the performance of classifier type m. The performances of these seven types of classifiers defined in Section 4.1 are shown in Figure 5. The values of the two evaluation indices ACC and HRP obviously increased with the increment of the response time. In addition, the type 7 classifier performed the best among the others, which means that the ensemble classifier with all seven individual CNN classifiers was a good choice; therefore, m was set as 7 in this paper.
After determining the specific composition of the ensemble CNN, we further propose a hierarchical self-adaptive method using an integrated decision-making rule to improve the credibility of prediction results and balance the trade-off between TSA speed and accuracy. Before analyzing the results, for the sake of convenience, several definitions are introduced.
is the i-th decision time; CS( ) and CU( ) are the total numbers of credible stable instances and credible unstable instances as of the current decision time, respectively; U( ) and ( ) are the total number of uncertain instances and the rate of uncertain instances (the percentage of the uncertain instances with respect to the total testing instances) as of the current decision time; M(T ) and M( ) are the total numbers of misdetections at and as of the current decision time, respectively; F(T ) and F( ) are the total numbers of false alarms at and as of the current decision time, respectively; A(T ) and A( ) are the current and accumulative TSA accuracy at and as of the current decision time, respectively, calculated as follows: The credibility estimation parameters and in the integrated decision-making rule for CNN-based ensemble classification mentioned in Algorithm 1 are very important, since they directly determine the performance of the IS. The value of parameter is usually set as 0.5 to 0.95, and is usually set as 0.05 to 0.5. We now use the control variable method to observe the effect of varying α while fixing as 0.05. The results are shown in Figure 6. We also observed the effect of varying while fixing as 0.95. The results are shown in Figure 7. The results are based on response time t equal to 30 cycles (0.5 s).
It can be observed in Figure 6 that, with α increasing, the rate of uncertain instances U_r(T) and the accumulative accuracy A(T) increased, the count of F(T) remained unchanged at three, and the count of M(T) decreased to zero when α increased to 0.9. The reason is that a larger α means the credibility estimation criterion for stable instances is stricter; thus, more classifications tend to be evaluated as uncertain instances, while those classified as credible stable instances become much more accurate, with less misdetection.
According to Figure 7, with β increasing, the rate of uncertain instances U_r(T) and the accumulative accuracy A(T) decreased, the count of M(T) remained unchanged at zero, and the count of F(T) increased. The reason is that a larger β means the credibility estimation criterion for unstable instances is looser; thus, fewer classifications tend to be evaluated as uncertain instances, while those classified as credible unstable instances become less accurate and, as such, the count of F(T) increases. As mentioned before, the costs of misclassification of unstable instances (misdetection) are enormous and unacceptable for real-time utilization. Thus, considering the impact of these two parameters on the classification results of TSA, we chose parameters giving high accumulative accuracy, a low uncertain instance rate, few false alarms, and zero misdetections. Therefore, in this paper, α and β were set as 0.9 and 0.1, respectively, achieving 99.97% accumulative accuracy, an 8.87% uncertain instance rate, zero misdetections, and three false alarms at a response time t equal to 30 cycles (0.5 s). The results are shown in Table 3.
Table 3. Prediction results of the hierarchical self-adaptive method.
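The control-variable sweep amounts to re-scoring the same ensemble outputs under different (α, β) pairs. A minimal sketch of the per-setting bookkeeping follows; labels are encoded as 1 = stable, 0 = unstable, and the accuracy-over-decided-instances formula is an assumption, since the original equations are not reproduced in this section.

```python
def evaluate_thresholds(probs, labels, alpha, beta):
    """Count credible stable/unstable and uncertain instances, misdetections
    (unstable judged credible stable) and false alarms (stable judged
    credible unstable) for one (alpha, beta) setting."""
    cs = cu = unc = miss = false_alarm = 0
    for p, y in zip(probs, labels):  # p: mean stable-class probability
        if p >= alpha:
            cs += 1
            if y == 0:
                miss += 1           # misdetection: the costly error
        elif p <= beta:
            cu += 1
            if y == 1:
                false_alarm += 1
        else:
            unc += 1
    decided = cs + cu
    acc = (decided - miss - false_alarm) / decided if decided else 1.0
    return {"CS": cs, "CU": cu, "U": unc, "M": miss, "F": false_alarm, "ACC": acc}
```

Sweeping `alpha` over 0.5 to 0.95 with `beta` fixed (and vice versa) reproduces the kind of trade-off curves plotted in Figures 6 and 7.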

Hierarchical Self-Adaptive Method for Transient Stability Prediction
To balance the rapidity and accuracy of TSA, the hierarchical self-adaptive method was used. The response times of the layers were set to three cycles (0.05 s), six cycles (0.10 s), nine cycles (0.15 s), . . . , 30 cycles (0.50 s), which can be adjusted in different situations. The prediction results are shown in Table 3. Table 3 illustrates that the test instances can be identified at an earlier time with high accuracy, and there is no misdetection. In total, 4265 instances and 1991 instances out of 10,000 were classified as credible stable instances and credible unstable instances, respectively, at the first layer with a response time of 0.05 s, and the accuracy was as high as 100%, without any misdetections or false alarms. The remaining 3744 uncertain instances moved to the next layer with a response time of 0.10 s, where 516 and 251 of them were classified as credible stable instances and credible unstable instances, respectively, with an accuracy of 100%. Then, the remaining 2977 uncertain instances moved to the third layer with a response time of 0.20 s, where 363 and 271 of them were classified as credible stable instances and credible unstable instances, respectively, with an accuracy of 99.68% (i.e., two false alarms). There still remained 887 uncertain instances at a response time of 0.50 s, of which 455 were unstable. The uncertain unstable instances were further examined by comparing their instability occurrence times, calculated as the time between fault clearance and the occurrence of instability. Figure 8 shows the instability occurrence time histograms for the total set of testing unstable instances and for the uncertain unstable instances.
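The layered cascade described above can be sketched as a simple loop. The `classifiers` list, one callable per response time returning an averaged stable-class probability, is a hypothetical interface used only for illustration.

```python
def hierarchical_predict(classifiers, instances, alpha=0.9, beta=0.1):
    """Hierarchical self-adaptive prediction (sketch): at each layer, the
    classifier trained for that response time scores the remaining
    instances; credible ones are decided immediately, uncertain ones are
    deferred to the next (longer) response time."""
    decided = {}                      # instance index -> (layer, verdict)
    pending = list(range(len(instances)))
    for layer, clf in enumerate(classifiers):
        still_uncertain = []
        for i in pending:
            p = clf(instances[i])     # mean stable-class probability
            if p >= alpha:
                decided[i] = (layer, "stable")
            elif p <= beta:
                decided[i] = (layer, "unstable")
            else:
                still_uncertain.append(i)
        pending = still_uncertain
        if not pending:
            break                     # everything decided early
    return decided, pending           # pending: uncertain after last layer
```

Instances far from the stability boundary are decided at the earliest layers, which is what yields the large 0.05 s batch in Table 3, while boundary cases naturally flow to later layers.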

For the sake of easy observation and comparison, the red line in Figure 8a illustrates the maximum vertical axis value in Figure 8b. It can be observed in Figure 8a that the majority of unstable instances in the total testing dataset had shorter instability occurrence times. Figure 8b shows that the instability of the remaining uncertain unstable instances occurred more than 3 s after the clearance of the faults. This implies that the proposed self-adaptive method can rapidly identify a large number of instances which are far away from the classification boundary. The sooner it identifies unstable instances, the more time will be reserved for emergency control. Those instances with longer instability times are always critical unstable instances near the classification boundary. It is very difficult to identify these instances quickly and accurately with the existing approaches. It is more reasonable to identify them as uncertain ones temporarily, rather than directly deliver wrong prediction results. Therefore, they need further identification at the next decision time. As time goes by, the dynamic characteristics of the power system become more and more obvious and separable. Thus, the uncertain instances are recognized as credible instances at longer response times.

TSI Regression Results
Through the above hierarchical self-adaptive transient stability prediction, credible stable and unstable instances were exported at each decision time. Then, the TSIs of credible stable and unstable instances were regressed using the stable regression model and unstable regression model, respectively. TSI is the transient stability index defined in Equation (2), reflecting the stability degree of power systems. The regression mean squared errors (MSEs) defined in Equation (A10) (Appendix C) at different decision times of the stable regression model for the total testing stable instances and the unstable regression model for the total testing unstable instances are shown in Table 4. It can be observed in Table 4 that the regression MSEs of both the stable and unstable regression models tended to decrease with the increment of the response time. The TSIs of the testing credible stable instances obtained until 30 post-disturbance cycles were predicted using the stable regression model. Figure 9 illustrates the detailed prediction results for the stable regression model at a response time of 0.50 s. It can be found in Figure 9a,b that the prediction results of the testing credible stable instances were closer to the actual TSI value than those of the total testing stable instances. The MSE of the testing credible stable instances was 0.0003, which is much smaller than the 0.0014 for the total testing stable instances. At the same time, the TSIs of the testing credible unstable instances obtained until 30 post-disturbance cycles were predicted using the unstable regression model. Figure 10 illustrates the detailed prediction results for the unstable regression model at a response time of 0.50 s. It can be found in Figure 10a,b that the prediction results of the testing credible unstable instances were closer to the actual TSI value than those of the total testing unstable instances.
The MSE of the testing credible unstable instances was 0.0023 which is much smaller than the 0.0080 for the total testing unstable instances. As mentioned in Section 5.5, those uncertain instances are always critical instances near the classification boundary. It is very difficult to identify these instances quickly and accurately with the existing approaches. Thus, it is more accurate and reasonable to predict the stability degree of the credible instances obtained through the proposed hierarchical self-adaptive method. Therefore, the first step of identifying the credible stable and unstable instances is of great help to the next step of predicting the transient stability degree. This method achieves not only predicting the stability of the instances, but also obtaining the stability degree of each instance, which is instructive for emergency control.
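The MSE comparison between the credible subset and the full test set reduces to evaluating Equation (A10) over two different index sets. A minimal helper, with the numbers here being hypothetical stand-ins rather than the paper's data:

```python
def mse(y_true, y_pred):
    """Mean squared error over paired actual/predicted TSI values,
    as in Equation (A10)."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

# Illustration only: credible instances exclude boundary cases, so their
# regression error is expected to be smaller than that of the full set.
# full_mse     = mse(all_tsi,      all_predictions)       # e.g. ~0.0014
# credible_mse = mse(credible_tsi, credible_predictions)  # e.g. ~0.0003
```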

Construction and Incompleteness of Input Features
The selection of reasonable input features has a significant impact on the performance of TSA classifiers [11,20]. Different researchers approached transient stability uniquely, and their feature extraction and feature selection methods were different. Therefore, there is no general feature set for TSA, and it is difficult to say that there exists a feature set that always has the best performance for any IS in any situation.
In addition, in practical applications, unexpected PMU failure, communication link delay, signal interruption, cyber-attack, etc. [39,40] limit the availability of the features. An incomplete feature input would detrimentally influence the accuracy of individual learning models or even make the transient stability assessment process unavailable under this condition. Some researchers proposed a decision tree with surrogate (DTWS) to solve missing features [41]. Others used the feature estimation method to predict the missing data directly [42], or used an emerging deep-learning technique called generative adversarial network (GAN) to address the missing data problem [43].
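The cited remedies (DTWS [41], feature estimation [42], GAN-based imputation [43]) are beyond the scope of this section, but the simplest baseline they all improve upon, replacing a missing PMU sample with the mean of the available samples in the window, can be sketched as follows. This is purely an illustrative placeholder, not any of the cited methods.

```python
def impute_missing(window, missing=None):
    """Naive missing-data handling for a PMU measurement window: replace
    missing samples (marked with None) by the mean of the valid samples.
    Real deployments would use model-based estimation instead."""
    present = [v for v in window if v is not missing]
    if not present:
        raise ValueError("no valid samples in window")
    fill = sum(present) / len(present)
    return [fill if v is missing else v for v in window]
```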


Classifier Updating for Performance Enhancement
In practical applications, it is necessary to update the classifier depending on the practical situation for performance enhancement. With variations in load levels, network topologies, and so on, the operating conditions change considerably. Through pre-disturbance TSA simulation, the data for updating can be obtained. However, this is time-consuming for huge T-D simulations. In addition, if the classifier is retrained on a newly generated comprehensive dataset considering all the uncertainties associated with the new operating conditions, this will also be time-consuming, especially for classifiers constructed from deep learning networks. Therefore, in order to solve the time-consuming problem of updates in this paper, active learning and fine-tuning methods can be adopted to select informative and representative instances [11]. Firstly, new instances are generated via short-term simulations. Secondly, these unlabeled instances are predicted by the pre-trained classifier to identify the important instances for the updating process. Only those judged to be uncertain by the integrated decision-making rule mentioned in Section 4.2 are given target labels through a long-term simulation. As analyzed in Section 5.5, those uncertain instances are close to the classification boundary and are relatively indistinguishable. Thus, they are important instances for the updating process. Through fine-tuning the pre-trained classifier with these newly labeled instances, the computational time of both the T-D simulation and classifier training can be greatly reduced, making the proposed method more applicable for online industry applications.
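The screening step of this update scheme can be sketched in a few lines, assuming the classifier exposes an averaged stable-class probability (a hypothetical interface): only instances falling between the credibility thresholds are worth the cost of a long-term labeling simulation.

```python
def select_for_labeling(classifier, candidates, alpha=0.9, beta=0.1):
    """Active-learning style screening (sketch): keep only the instances
    the integrated rule would judge uncertain; these boundary cases are
    then labeled via long-term T-D simulation and used for fine-tuning."""
    return [x for x in candidates if beta < classifier(x) < alpha]
```

Confidently classified instances carry little new information, so discarding them before labeling is what saves most of the simulation and retraining time.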

Increment of System Size
With the increase of system size, the sample size and feature dimensions grow exponentially, leading to increased computational memory and computation time. However, statistically, the larger the dataset, the more sufficient the information it is likely to contain. With the development of computing technologies like graphics processing units (GPUs) and distributed techniques like the alternating direction method of multipliers (ADMM), the advantages of big data can be exploited. All the sub-classifiers can be trained on GPUs combined with the ADMM algorithm, making the proposed approach more suitable for larger test systems in industry.

Misdetections and False Alarms
The costs of misdetections and false alarms are definitely different in TSA. The misclassification of stable samples as unstable samples (false alarm) may cause malfunction of some control devices, but it has little effect on the safe and stable operation of the entire system. On the other hand, the misclassification of unstable samples as stable samples (misdetection) may lead to a chain collapse or even catastrophic accidents, and the consequences are very serious. Therefore, the cost of misdetections is greater than that of false alarms. When using a data-driven AI method for TSA, we should not just pursue high accuracy but also focus on the counts of misdetections and false alarms, achieving as high a prediction accuracy as possible, with few false alarms and zero misdetections. In this paper, through the suitable setting of thresholds, we could achieve 99.97% accumulative accuracy, with three false alarms and zero misdetections for credible instances. This indicates the effectiveness of the proposed hierarchical self-adaptive method and the importance of suitable threshold settings.
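This asymmetry can be made explicit with a cost-weighted score. The weights below are hypothetical, chosen only to illustrate that one misdetection should outweigh many false alarms; the paper itself reports the raw counts rather than a weighted cost.

```python
def tsa_cost(labels, preds, c_miss=100.0, c_false=1.0):
    """Asymmetric TSA evaluation (sketch). Labels/preds use 1 = stable,
    0 = unstable. Misdetections (unstable predicted stable) are weighted
    far more heavily than false alarms (stable predicted unstable)."""
    miss = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    false_alarm = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return c_miss * miss + c_false * false_alarm, miss, false_alarm
```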

Conclusions
Based on a well-known deep learning technology called CNN, this paper developed an IS for the rapid, accurate, and reliable post-disturbance TSA of power systems. It can not only export credible, accurate classification results, but also provide the stability degree of each instance, which is instructive for emergency control. With the CNN algorithm and ensemble technologies, this IS uses a strategically designed learning structure and an integrated decision-making rule to achieve a hierarchical self-adaptive method which makes correct transient stability predictions at an appropriately early time. Through the suitable setting of thresholds, it can achieve the goal of high prediction accuracy, few false alarms, and zero misdetections, allowing more time for emergency control and reducing economic losses. Specifically, this two-step method of TSA (the first step is the identification of the credible stable/unstable instances, and the second step is the prediction of the transient stability degree) can avert unreliable results, making all the decided instances accurate. More comprehensive case studies will be done on a larger power grid in the future.

where N is the total number of instances in the training dataset, µ is the regularization weight, and w_i are the network weights included in the regularization loss; y_n = [y_n^(1), y_n^(2)] is the true label vector and ŷ_n = [ŷ_n^(1), ŷ_n^(2)] is the predicted label vector of the n-th instance.
The loss function of mean squared error (MSE) is adopted in the regression problem, as shown in Equation (A10).
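Equation (A10) is not reproduced in this excerpt; consistent with the symbols defined below, the standard MSE over N instances reads:

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \hat{y}_n\right)^2 \qquad \text{(A10)}
```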
where y_n and ŷ_n are the actual and predicted TSI of the n-th instance, respectively.

else it is an uncertain instance at the current decision time
12: if the number of remaining uncertain instances is zero
13: break
14: end
15: Output predicted stability status and stability degree TSI

The script of the training process of a single CNN is given in Algorithm A2.

Algorithm A2. CNN Training Process
1: Input normalized input features and labels
2: Initialize the weights and biases of the network randomly
3: Forward-propagate the input data through the convolutional layer, the max-pooling layer, and the fully connected layer to obtain an output value
4: Calculate the error between the output value of the network and the target value
5: Adjust the network connection weights and biases by mini-batch stochastic gradient descent and back propagation to minimize the loss function
6: Update the weights and biases according to the obtained error, then return to step 3
7: End training when the error is equal to or less than the expected value

Figure A2. The process of off-line training and on-line application of the proposed method.