Combination with Continual Learning Update Scheme for Power System Transient Stability Assessment

In recent years, power system transient stability assessment (TSA) based on data-driven methods has been widely studied. However, the topology and operating modes of power systems may change frequently owing to their complex time-varying characteristics, which makes it difficult for prediction models trained on stationary distributed data to meet the requirements of online application. When a new operating scenario causes the prediction accuracy to fall below requirements, the model must be updated in a timely manner. With limited storage space and model capacity, but a potentially unbounded stream of new scenarios to learn, model updates must be sustainable and scalable. To address this problem, this paper introduces the continual learning Sliced Cramér Preservation (SCP) algorithm to perform update operations on the model, and selects a deep residual shrinkage network (DRSN) as the classifier to construct the SCP-DRSN TSA model. With SCP, the model can be extended and updated using only new-scenario data. The updated prediction model not only gains prediction capability for new scenarios but also retains its prediction ability under old scenarios, which avoids frequent model updates. Test results on a modified New England 10-machine 39-bus system and an IEEE 118-bus system show that the proposed method can effectively update and extend the prediction model using only new-scenario data, improving the coverage of the updated model for new scenarios.


Introduction
With the rapid development of modern power systems, increasing penetration of renewable energy sources and power electronics, as well as the expanding scale of power systems with regional interconnections, the power system is running closer to its stability limits [1]. When the power system is disturbed, the problems of transient instability are more likely to occur, which is an influential trigger for large-scale blackouts in the grid [2][3][4]. Thus, a fast and accurate method of transient stability assessment is essential for the safe and stable operation of power systems [5].
The current methods for transient stability assessment include time-domain simulation methods [6,7], direct methods [8,9], and data-driven methods [10]. The time-domain simulation method can model the system in detail with high computational accuracy, but it is time-consuming and difficult to apply online. The direct method is hard to apply accurately and reliably because of its highly simplified model, and it adapts poorly to large, complex grids. In recent years, with the continuous development of phasor measurement unit (PMU) [11] technology and the ongoing improvement of wide-area measurement systems (WAMS) [12], PMUs installed in the grid can obtain a large amount of system dynamic information simultaneously, providing a powerful data foundation for data-driven approaches. These methods construct a mapping relationship between transient stability data and stability conclusions, which enables fast online assessment without detailed mechanistic modeling of the system.

• Introducing the continual learning algorithm SCP to resolve the problem of catastrophic forgetting when the model is updated. It guarantees the evaluation requirements of all scenarios simultaneously within a limited data range.
• Using a deep residual shrinkage network as the classifier to reduce the impact of noise on the learned distribution and ensure the learning ability of the model.
• Introducing the focal loss function to address the problems caused by unbalanced training samples and by hard and easy samples during training.
The rest of the paper is organized as follows. Section 2 describes the transient stability problem and the proposed method. Section 3 introduces the classifier model DRSN and the continual learning algorithm SCP. Section 4 describes the implementation of the SCP-DRSN model, including its evaluation metrics, loss function, and evaluation process. Section 5 presents the case studies. Section 6 discusses the proposed method. Section 7 concludes the whole paper.

Transient Stability Assessment Problem
The power system is subject to various large disturbances during operation. If the disturbed system can transition to a new stable operating state or return to its original state after the transient process, the system is transiently stable; otherwise, transient instability occurs [35]. The essence of transient stability assessment is to find a boundary that divides the stable and unstable regions of a system. Data-driven methods treat power system transient stability assessment as a binary classification problem: a prediction model is trained on operational data that reflect the transient stability information of the power system together with the corresponding stability conclusions. In this study, the input X of the prediction model consists of all bus voltage magnitudes and phase angles in the system, and the output of the model is the transient stability category. The stability classes of the training samples are labeled according to the transient stability index [36]:

η = (360° − ∆δmax) / (360° + ∆δmax) × 100%

where ∆δmax is the maximum relative power angle difference between any two generators during the simulation time. If η > 0, the system is transiently stable and the sample is labeled 0; otherwise, the system is transiently unstable and the sample is labeled 1.
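As a minimal sketch, the labeling rule above can be written as a small helper, assuming the standard form of the transient stability index η = (360° − ∆δmax)/(360° + ∆δmax) × 100%:

```python
def tsa_label(delta_max_deg):
    """Label a sample from the transient stability index.

    eta > 0  -> transiently stable   (label 0)
    eta <= 0 -> transiently unstable (label 1)
    """
    eta = (360.0 - delta_max_deg) / (360.0 + delta_max_deg) * 100.0
    return 0 if eta > 0 else 1
```

Note that η > 0 is equivalent to ∆δmax < 360°, i.e., no generator pair has pulled a full revolution apart.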

Proposed Method
The input distribution of a TSA model applied online is a dynamically changing data stream, owing to the constantly changing operating, maintenance, and control conditions of the system. Let the dataset under the i-th scenario be denoted as D_i = {X_i, Y_i}, where Y_i ∈ {0, 1} corresponds to the class label and X_i follows a probability distribution P(X_i). The input data obtained in different scenarios all belong to the same task, so the class labels remain unchanged, denoted {Y_i} = {Y_i+1}, while the probability distribution of the data changes across scenarios, denoted P(X_i) ≠ P(X_i+1). Nevertheless, deep models are commonly trained on static, identically distributed data and cannot adapt or extend their behavior as the external environment changes. Hence, a continual learning scheme is proposed, as shown in Figure 1. The TSA model constructed in combination with continual learning can sustainably integrate and optimize the learned knowledge from a non-stationary data distribution over time.

Algorithms
This section introduces the two learning algorithms used in the proposed method: the deep residual shrinkage network and the continual learning algorithm SCP. They are described in detail as follows.

Deep Residual Shrinkage Network
The deep residual shrinkage network is a modified network based on the residual network (DRN) [37], which is a feature learning method for strong noise or highly redundant data. It is mainly founded on three components: deep residual network, soft threshold function, and attention mechanism. Among them, the deep residual network is a modified convolutional neural network, and "shrinkage" refers to "soft thresholding", which is a key procedure in the signal noise reduction algorithm. In the deep residual shrinkage network, the threshold required for soft thresholding is automatically set by the attention mechanism.

Deep Residual Network
Compared with a regular convolutional neural network, the deep residual network adopts identity shortcut connections across layers; its structure is shown in Figure 2. With this path, information is transmitted more smoothly and efficiently both forward and backward. In the forward pass, the input signal can be propagated directly from any lower layer to a higher layer, alleviating the degradation problem of deep networks. In the backward pass, the parameter gradients of the deep structure can be transmitted to the input layer faster, mitigating the gradient vanishing or explosion problem and reducing the training difficulty of deep networks.
The output of a residual unit is H(x) = F(x) + x, where F(x) is the residual mapping learned by the stacked layers. When F(x) = 0, then H(x) = x, which is an identity mapping.
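The identity-shortcut idea can be illustrated with a tiny NumPy sketch; the residual function `F` here is an illustrative stand-in for the stacked convolutional layers:

```python
import numpy as np

def residual_unit(x, F):
    """Residual unit: H(x) = F(x) + x.

    When the learned residual mapping F(x) is zero,
    the unit reduces to the identity H(x) = x.
    """
    return F(x) + x

x = np.array([1.0, -2.0, 3.0])
# Zero residual -> identity mapping.
assert np.allclose(residual_unit(x, lambda v: np.zeros_like(v)), x)
```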



Soft Thresholding
Based on the residual unit, the residual shrinkage unit inserts soft thresholding into the structural unit as a non-linear transformation layer. The input signal is mapped into a range by the learning activities of the neural network layers, and values close to 0 in that range are considered less important. Soft thresholding therefore sets features close to 0 to exactly 0, which reduces the noise. The soft thresholding function is as follows:

y = x − τ, x > τ
y = 0, −τ ≤ x ≤ τ
y = x + τ, x < −τ

where x is the input, y is the output, and τ is the threshold. Its derivative is:

∂y/∂x = 1, |x| > τ
∂y/∂x = 0, |x| ≤ τ

The derivative of the soft thresholding function thus takes only the values 0 and 1, the same property as the ReLU activation function. Therefore, the soft thresholding function also helps prevent gradient vanishing and gradient explosion.
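A compact NumPy transcription of the piecewise soft thresholding function above (an illustrative sketch, not the paper's code):

```python
import numpy as np

def soft_threshold(x, tau):
    """Soft thresholding:
    y = x - tau  (x >  tau)
    y = 0        (|x| <= tau)
    y = x + tau  (x < -tau)
    Equivalent one-liner: sign(x) * max(|x| - tau, 0).
    """
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)
```

Features with magnitude below τ are zeroed out, which is exactly the denoising behavior described above.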

Attentional Mechanism
In practical situations, the amount of redundant information usually differs between samples. The attention mechanism adaptively sets a different threshold for each sample, focusing attention on locally critical information. The deep residual shrinkage network sets thresholds adaptively through a small sub-network. The residual shrinkage module comes in two forms: the Residual Shrinkage Building Unit with Channel-Shared thresholds (RSBU-CS) and the Residual Shrinkage Building Unit with Channel-Wise thresholds (RSBU-CW); both structures are shown in Figure 3. The threshold of RSBU-CS is a scalar applied to all channels of the feature map:

τ = α · average_{i,j,c} |x_{i,j,c}|

The threshold of RSBU-CW is a vector, i.e., each channel of the feature map has its own shrinkage threshold:

τ_c = α_c · average_{i,j} |x_{i,j,c}|

where τ is the threshold and α is a scaling factor learned by the sub-network; i, j, and c index the width, height, and channel of the feature map X, respectively.
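Assuming the threshold formulas take the usual DRSN form (a learned scaling factor times the average absolute activation), the two variants can be sketched as:

```python
import numpy as np

def rsbu_cs_threshold(feature_map, alpha):
    """Channel-shared: one scalar threshold from the mean |x| over the whole map.

    feature_map shape: (C, H, W); alpha: scalar scaling factor.
    """
    return alpha * np.mean(np.abs(feature_map))

def rsbu_cw_threshold(feature_map, alpha_c):
    """Channel-wise: one threshold per channel, averaging |x| over H and W.

    feature_map shape: (C, H, W); alpha_c shape: (C,).
    """
    return alpha_c * np.mean(np.abs(feature_map), axis=(1, 2))
```

In the actual network, α (or α_c) is produced by the small attention sub-network from the feature map itself, so each sample gets its own threshold.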

Continual Learning
The main difficulty in achieving continual learning with deep neural networks is overcoming catastrophic forgetting when models are updated. The knowledge and features learned by a neural network are stored in its parameters (e.g., convolutional kernel parameters). When the network learns a new task, its parameters are updated and the knowledge of the old task is overwritten, resulting in a catastrophic drop in the performance of the updated model on the old task. The process is depicted in Figure 4, where a darker gray color corresponds to a smaller loss. The best parameter obtained by the model on Task 1 is θ_b. When faced with Task 2, the model is trained directly from the previous task, and θ_b is updated to θ*. In this case, θ* is a set of poorly performing parameters when the model returns to Task 1.
If, while chasing the optimum on Task 2, only updates to θ_1 on the horizontal axis are allowed and changes to θ_2 on the vertical axis are restricted as much as possible, then a set of parameters that performs well on both tasks can be obtained.
The regularization-based continual learning algorithm adds a regularization term to the loss function of the new task, limiting the variation of each weight parameter of the model so that old knowledge is not overwritten by new knowledge. The loss function of the new task is:

L'(θ) = L(θ) + λ Σ_i b_i (θ_i − θ_b,i)²

where λ is the regularization factor, L(θ) denotes the original loss function of the model, and the summation term is the regularization term. θ_i is the i-th parameter of the model, and θ_b,i is the i-th parameter learned on the old task. The importance factor b_i represents the importance of the i-th parameter to the old task: the larger b_i is, the more important that parameter is, and the less θ_i may depart from θ_b,i.
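The regularized loss can be transcribed directly from the formula above; the parameter names here are illustrative:

```python
def regularized_loss(base_loss, theta, theta_old, b, lam):
    """L'(theta) = L(theta) + lam * sum_i b_i * (theta_i - theta_old_i)^2.

    base_loss : value of the original loss L(theta) on the new task
    theta     : current parameters
    theta_old : parameters learned on the old task
    b         : per-parameter importance factors
    lam       : regularization factor lambda
    """
    penalty = sum(bi * (t - to) ** 2
                  for bi, t, to in zip(b, theta, theta_old))
    return base_loss + lam * penalty
```

A parameter with large b_i contributes a steep quadratic penalty, so the optimizer keeps it near its old value while freely moving unimportant parameters.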

SCP
The SCP [33] introduced in this paper is a continual learning algorithm based on distribution regularization. Compared with earlier methods that regularize individual samples, SCP imposes fewer constraints on the network and can better exploit its learning ability. When learning a new task, SCP uses the sliced Cramér distance to determine the importance of the model parameters, obtaining a matrix that represents this importance. With the importance parameters so determined, the distribution of any layer of the neural network over the previous task can be preserved, enabling the inheritance of knowledge learned on the previous task. After task A has been learned, the loss function of SCP on the new task B takes the regularized form

L'_B(θ) = L_B(θ) + λ Σ_i Ω_i (θ_i − θ*_A,i)²

where θ*_A,i is the i-th parameter learned on task A and Ω_i is its importance computed from the sliced Cramér distance. To extend to sequential learning of multiple tasks while keeping memory requirements constant, SCP follows the same framework as EWC++ [38].
The sliced Cramér regularizer of task (t + 1) is obtained by accumulating importance across tasks:

Ω^(t+1) = Γ Ω^(t) + Ω_(t+1)

where Γ is a hyperparameter representing the importance of the new task relative to the old one.
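Under this EWC++-style accumulation, the running importance update keeps only one matrix in memory regardless of how many tasks have been seen. A sketch, with Γ and the per-task importances treated as given:

```python
def update_importance(omega_prev, omega_new, Gamma):
    """Running importance: Omega^(t+1) = Gamma * Omega^(t) + Omega_(t+1).

    omega_prev : accumulated importance after task t (one value per parameter)
    omega_new  : importance computed on task t+1
    Gamma      : hyperparameter weighting old vs. new importance
    """
    return [Gamma * p + n for p, n in zip(omega_prev, omega_new)]
```

Because only the accumulated matrix is stored, memory cost stays constant as the task sequence grows.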

Transient Stability Assessment of Power System Based on SCP-DRSN
Based on the introduction in the previous section, the structure of the SCP-DRSN model proposed in this paper is shown in Figure 5. It includes an input layer, a convolutional layer, a series of residual shrinkage modules, a global average pooling layer, and a fully connected output layer. In the update phase, SCP constructs a regularized loss term by computing the parameter importance matrix, limiting the forgetting of old knowledge while learning new data. In addition, to balance computational complexity and model performance, two RSBU-CW modules are used near the input layer and two RSBU-CS modules near the output layer. The activation function is ReLU.


Input Features
In this paper, the bus voltage magnitude and phase angle of the power system are chosen as the input features of the prediction model. First, they can be obtained directly from PMUs; second, many studies [39][40][41] have shown that they yield the highest precision for transient stability assessment. Based on the graphical transient features, the voltage magnitudes and phase angles are each constructed as a two-dimensional image and stacked along the channel dimension to form a two-channel, three-dimensional input feature X ∈ R^(2×T×B), where T is the number of sampled time steps and B is the number of buses. In summary, for all samples, the input feature is X = {V, φ}.
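Assuming T sampled time steps and B buses (the values below are illustrative), the two-channel input can be assembled as:

```python
import numpy as np

# Illustrative dimensions: 5 sampled cycles, 39 buses.
T, B = 5, 39

V = np.random.rand(T, B)    # bus voltage magnitudes, one row per time step
phi = np.random.rand(T, B)  # bus voltage phase angles

# Stack magnitude and angle as two channels (channels-first layout).
X = np.stack([V, phi], axis=0)
assert X.shape == (2, T, B)
```

This channels-first (2, T, B) layout matches what convolutional layers in frameworks such as PyTorch expect.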

Evaluation Indicators
In power system TSA, the sample labels are imbalanced, and missing an unstable sample is costlier than falsely flagging a stable one. To evaluate model performance comprehensively, this paper uses three metrics: accuracy (Acc), misdetection (Mis) rate, and false-alarm (Fal) rate:

Acc = (TP + TN) / (TP + TN + FP + FN)
Mis = FN / (TP + FN)
Fal = FP / (TN + FP)

where TP and TN denote the numbers of correctly predicted unstable and stable samples, respectively, and FP and FN denote the numbers of incorrectly predicted stable and unstable samples, respectively.
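The three metrics follow directly from the confusion-matrix counts (formulas as reconstructed above; unstable is treated as the positive class):

```python
def tsa_metrics(TP, TN, FP, FN):
    """Acc / Mis / Fal from confusion-matrix counts.

    TP, TN : unstable / stable samples predicted correctly
    FP, FN : stable / unstable samples predicted incorrectly
    """
    acc = (TP + TN) / (TP + TN + FP + FN)  # overall accuracy
    mis = FN / (TP + FN)                   # fraction of unstable samples missed
    fal = FP / (TN + FP)                   # fraction of stable samples mislabeled unstable
    return acc, mis, fal
```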

Focal Loss Function
To address sample imbalance and the serious consequences of misclassification in transient stability assessment, the focal loss function [42] is introduced in this paper to guide model training. It not only adjusts the weights of positive and negative samples but also controls the weights of hard- and easy-to-classify samples:

FL = −α y (1 − ŷ)^γ log(ŷ) − (1 − α)(1 − y) ŷ^γ log(1 − ŷ)

where y is the true label of the sample and ŷ is the predicted probability of the sample label. α ∈ [0, 1] is a balancing factor for the disproportion of positive and negative samples, and γ > 0 is a modulation factor that makes the model focus more on hard samples that are predicted badly. In this paper, α = 0.75 and γ = 2 were chosen through extensive simulation experiments.
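A direct transcription of the binary focal loss, with the paper's α = 0.75 and γ = 2 as defaults (the reconstructed formula above is the standard form from [42]):

```python
import math

def focal_loss(y, y_hat, alpha=0.75, gamma=2.0):
    """Binary focal loss for one sample.

    y     : true label (0 or 1)
    y_hat : predicted probability of class 1
    """
    if y == 1:
        # (1 - y_hat)^gamma shrinks the loss on easy, confident positives.
        return -alpha * (1.0 - y_hat) ** gamma * math.log(y_hat)
    return -(1.0 - alpha) * y_hat ** gamma * math.log(1.0 - y_hat)
```

The modulation term makes confidently correct samples contribute almost nothing, so the gradient is dominated by hard samples.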

Evaluation Process
The flow chart of the proposed method is shown in Figure 6 and consists of three parts: offline training, online application, and model update. During offline training, to avoid frequent updates of the TSA model in later use, the various basic operating conditions of the system should be covered as comprehensively as possible when constructing the initial TSA database. The required basic dataset is generated by time-domain simulation software, and the model is trained on this database.
For the online application, the operational data of the power system are collected in real-time through the PMUs. The data are processed into the structure required by the model and input into the TSA model, and the real-time evaluation results are quickly and accurately derived using the TSA model applied online.
For the model update phase, which is the focus of this paper, the power system operating conditions will change due to economic dispatch, maintenance, and other needs, and the offline initial database cannot cover all operating situations. In general, power companies can obtain a list of potential operating events for the power system through forecasting. When a new operating situation emerges that was not considered before, the corresponding new scenario dataset D_new is obtained by time-domain simulation software, and then the prediction precision P_new of the TSA model is tested.
When the test results fall below a predetermined threshold A_set, the model is updated in time with the new scenario dataset D_new. As the model update process is executed, the probability of encountering unknown operating situations gradually decreases and the generalization capability of the model gradually improves.
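The update trigger can be sketched as follows; `evaluate` and `scp_update` are hypothetical stand-ins for the accuracy test and the SCP update step:

```python
def maybe_update(model, D_new, A_set, evaluate, scp_update):
    """Update the model with new-scenario data only when needed.

    evaluate(model, D_new)   -> prediction precision P_new on the new scenario
    scp_update(model, D_new) -> model updated via the SCP continual-learning step
    A_set                    -> predetermined accuracy threshold
    """
    P_new = evaluate(model, D_new)
    if P_new < A_set:
        model = scp_update(model, D_new)  # update only on the new data
    return model
```

Because the update uses only D_new, old-scenario data need not be stored; SCP's regularizer is what protects the old knowledge.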


Case Study
The proposed method was tested on the modified New England 10-machine 39-bus system and the IEEE 118-bus system. The TSA model of this paper is implemented in the Pytorch environment, and the programming language is Python. The computer is configured with an Intel(R) Core (TM) i5-10200H 2.40 GHz CPU and 16.0 GB RAM.

Dataset Generation
The New England 10-machine 39-bus system contains 10 generators, 39 buses, and 46 transmission lines. The standard example is modified in this paper by connecting wind farms at buses 2, 29, and 39. The Python API of the simulation software PSS/E is used to run batch transient simulations and generate the datasets for three scenarios with different distributions. The generator is set to the GENROU model, the load to the constant impedance model, the simulation step to 0.01 s, and the sampling frequency to 100 Hz. The bus voltage magnitudes and phase angles for the 5 cycles after fault clearing are selected as the initial input features, and the data are labeled using the stability criterion. The labeled sample dataset is divided into training, test, and validation sets in a ratio of 8:1:1: the training set is used to train the model, the validation set to select hyperparameters, and the test set to evaluate performance.
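An 8:1:1 shuffle-and-split can be sketched as follows (the fixed random seed is an illustrative choice for reproducibility):

```python
import numpy as np

def split_811(X, y, seed=0):
    """Shuffle and split a dataset into train/test/validation at 8:1:1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n = len(X)
    n_tr, n_te = int(0.8 * n), int(0.1 * n)
    tr, te, va = idx[:n_tr], idx[n_tr:n_tr + n_te], idx[n_tr + n_te:]
    return (X[tr], y[tr]), (X[te], y[te]), (X[va], y[va])
```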
The three typical scenarios mentioned above are scenario 1 (which covers as many operating conditions of the system as possible), scenario 2 (which simulates system generation and dispatch with a major change in power flow), and scenario 3 (which reflects a huge change in the system topology). The datasets generated by simulating these three typical scenarios correspond to the basic dataset D_base, the new scenario dataset D_new1, and the new scenario dataset D_new2, respectively. The simulation settings are as follows.
The dataset D_base for scenario 1: the renewable penetration rate varies from 5% to 20% in steps of 5%, and the system load level varies from 75% to 125% in steps of 5%. A three-phase short-circuit fault is set on each of the 34 non-transformer branches, with fault durations of 0.02 to 0.2 s in steps of 0.02 s. The simulation generates a total of 14,960 samples, of which 8450 are stable and 6510 are unstable.
To avoid an excessive number of simulation samples, the new-scenario datasets are appropriately reduced: the renewable penetration rate varies from 10% to 20% in steps of 5%, the load level from 80% to 120% in steps of 5%, and the fault duration from 0.04 to 0.2 s in steps of 0.04 s.
The dataset D_new1 for scenario 2: the generator output distribution is changed significantly; the topology of this scenario is shown in Figure 7. Three-phase short-circuit faults are set on the 34 non-transformer branches. The simulation generates a total of 4590 samples, of which 3018 are stable and 1572 are unstable. The dataset D_new2 for scenario 3: one transformer branch and two non-transformer branches are disconnected and one generator is removed; the topology of this scenario is shown in Figure 8. A three-phase short-circuit fault is applied to each of the remaining 32 non-transformer branches. The simulation generates 5280 samples, of which 3171 are stable and 2109 are unstable.

Comparison with Other Models
To verify the superiority of the basic model of the proposed method, the DRSN is compared with the commonly used machine learning methods RF and SVM and the deep learning methods MLP, CNN, and DRN on the scenario 1 dataset D_base. The results are shown in Table 1. Note that, to test the effectiveness of the focal loss function, the normal cross-entropy loss function is used for all methods except the one proposed in this paper. The neural network parameters are initialized with the Xavier function, the Adam optimizer adaptively adjusts the learning rate to accelerate convergence, and dropout regularization is used to avoid overfitting. The shallow machine learning models RF and SVM clearly perform worse than the deep learning models on all metrics, with Mis reaching 3.45% and 3.30%, respectively. Although the Acc of MLP, CNN, and DRN improves sequentially, Mis and Fal remain relatively high because the sample imbalance problem is not considered. The DRSN is an improved network based on the DRN; with the shrinkage module and the focal loss function, Acc increases by 0.68%, and Mis and Fal decrease by 1.37% and 0.39%, respectively. Compared with the regular CNN, Acc improves by 0.95%, and Mis and Fal decrease by 2.31% and 0.62%, respectively.

Testing the Generalizability of the Model in New Scenarios with Large Disturbances
In practical system operation, new scenarios with large changes in topology and power distribution may be encountered. To test the generalizability of the models in such scenarios, the models are trained on the scenario 1 dataset D_base and then tested on the scenario 2 dataset D_new1 and the scenario 3 dataset D_new2, respectively. The test results are shown in Figure 9, from which it can be seen that the Acc of every model decreases clearly in the new scenarios. The deep learning models decline less than the machine learning models, but their prediction accuracy no longer meets the requirements: the Acc of the models drops by about 15% on average under scenario 2 and by about 20% on average under scenario 3. Analysis of the test results indicates that the distribution of the operating data generated in the two new scenarios differs significantly from the initial basic data. In the face of scenario 3 in particular, the model almost loses its effectiveness for TSA. Therefore, the model needs to be updated before such scenarios occur.

Comparison of Different Update Schemes
To demonstrate the superiority of the update scheme proposed in this paper, the update effects of two different schemes, fine-tuning (FT) from transfer learning and continual learning with SCP, are tested under the condition that only the new scenario dataset is used for updating. The training and updating process of the DRSN model is as follows. First, training is completed on the basic dataset D_base of scenario 1; then the model is updated on scenario 2 and scenario 3 in turn, following the two different updating schemes. After each update, the model is tested on the current scenario, on the past scenarios, and on the union of all seen scenarios. The test results are shown in Figure 10.
According to Figure 10a, the Finetuning-DRSN model constructed with the fine-tuning update scheme achieves 99.13% and 99.33% accuracy in the new scenarios, but only 84.63% and 83.09% in the previous scenarios, a catastrophic drop in performance. Figure 10b shows that the accuracy of the SCP-DRSN model constructed under the continual learning update scheme is still maintained at 97.65%, 98.31%, and 98.47% for the three scenarios after training on the third scenario is completed, which verifies the ability of the model to learn continuously under this method. The performance of the models on all seen scenarios is shown in Figure 10c. The test results show that the TSA model combined with the continual learning update scheme maintains high and smooth performance as new scenarios emerge, and the coverage of the model over new scenarios improves continually.

Robustness Analysis
In practical applications, PMU measurements are influenced by noise. In order to test the robustness of the models to noise, Gaussian white noise with signal-to-noise ratios (SNRs) of 40 dB, 30 dB, and 20 dB was added to the original test data. The test results are shown in Table 2.
It is observable from Table 2 that the prediction performance of each model decreases to some extent as the noise increases. When the SNR is 20 dB, in terms of Acc, the models RF, SVM, and MLP decrease by 2.22%, 2.54%, and 2.01%, respectively, and the models CNN, DRN, and DRSN by 1.78%, 1.44%, and 0.76%, respectively; the Acc decrease of DRSN is 0.68% smaller than that of DRN. For Mis, RF peaks at 5.81%, an increase of 2.36% over the noise-free case. The Fal of SVM peaks at 6.98%, an increase of 4.3%. Owing to the noise immunity of DRSN, its Fal and Mis are only 2.24% and 0.71% under the severest test noise; compared with the results of DRN, the anti-noise advantage is obvious.
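A common way to realize such a noise test is to scale the noise power from the measured signal power so that a target SNR is hit exactly; the sketch below assumes this procedure (the paper's exact implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_awgn(x, snr_db):
    """Add white Gaussian noise so the signal-to-noise ratio equals snr_db."""
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))   # SNR(dB) = 10*log10(Ps/Pn)
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)

x = np.sin(np.linspace(0, 8 * np.pi, 4000))      # stand-in for a PMU trajectory
for snr in (40, 30, 20):
    noisy = add_awgn(x, snr)
    measured = 10 * np.log10(np.mean(x ** 2) / np.mean((noisy - x) ** 2))
    print(f"target {snr} dB, measured {measured:.1f} dB")
```

Applying `add_awgn` to every test trajectory at 40 dB, 30 dB, and 20 dB reproduces the three noise conditions of the robustness test.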

A Larger Test System
In order to verify the effectiveness of the proposed TSA method in large-scale power systems, the SCP-DRSN-based TSA framework is applied to the IEEE 118-bus system.

Dataset Generation
The IEEE 118-bus system consists of 19 generators, 35 synchronous capacitors, 177 transmission lines, 9 transformers, and 91 loads. In the same manner, three scenario datasets with different distributions are generated with PSS/E. Scenario 1 has 18,588 samples, including 11,025 stable samples and 7563 unstable samples. Scenario 2 has 8910 samples, of which 5009 are stable samples and 3910 are unstable. Scenario 3 has 4150 samples, of which 1795 are stable samples and 2355 are unstable.

Model Performance Analysis
The test procedure and configuration in this section are the same as in the aforementioned case of the modified New England 39-bus system. The comparison of the basic DRSN model with the other models on the IEEE 118-bus system is shown in Table 3, and the robustness tests in the PMU noise environment are shown in Table 4. The generalizability test of the model on the new IEEE 118-bus scenarios, together with the effects of the two update schemes, fine-tuning (FT) and continual learning, is shown in Figure 11.

Discussion
In order to solve the update problem that frequent changes in the topology and operation of the power system pose to the TSA model, this paper introduces the continual learning Sliced Cramér Preservation (SCP) algorithm to perform the update operation of the model. For the proposed SCP-DRSN model, the experimental results show that, in terms of classifier selection, DRSN has stronger data mining ability and noise resistance than the other machine learning and deep learning algorithms. Meanwhile, with the focal loss function, the test results in the two provided cases clearly show that DRSN achieves the best performance on the metrics Acc, Mis, and Fal.
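The focal loss mentioned above can be written compactly. The sketch below is a generic binary focal loss with illustrative parameter values for gamma and alpha, not necessarily those used in the paper:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss; p is the predicted probability of the unstable class.
    The (1 - p_t)**gamma factor down-weights easy, well-classified samples so
    training focuses on hard (e.g. minority-class) samples."""
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    a_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return float(np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t)))

y = np.array([1, 1, 0, 0])
easy = np.array([0.95, 0.90, 0.10, 0.05])  # confident, correct predictions
hard = np.array([0.60, 0.55, 0.45, 0.40])  # uncertain predictions
print(focal_loss(easy, y), focal_loss(hard, y))  # hard samples dominate the loss
```

Setting gamma to 0 and alpha to 0.5 recovers (half of) the ordinary cross-entropy, which makes the down-weighting effect of gamma easy to verify.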
The test results on the generated datasets with three different distributions confirm the necessity of TSA model updates. The study compares the experimental results of the two update schemes under the condition that only new-scenario data are used for updating. The fine-tuned model can only meet the assessment requirements of the current scenario, so the model would need to be updated or switched frequently across scenarios. With the continual learning update scheme proposed in this paper, the assessment capability of the model is effectively supplemented as new scenarios emerge, and the updated model covers all operational scenarios, which also makes the method advantageous in terms of data storage. The test results show that the accuracy of the model under the continual learning update scheme fluctuates to a certain degree across scenarios; this fluctuation stems from the way regularization-based continual learning algorithms operate. SCP balances the model's performance across all scenarios by adjusting the regularization factor λ and the relative importance parameter α of the new and old tasks. If λ and α are set too high in order to maintain performance on the old scenarios, the model parameters cannot be updated effectively; if they are set too small in pursuit of high accuracy on the new scenario alone, the updated parameters lose applicability to the old scenarios. When such fluctuations exceed the allowed limits, it means that the model with a fixed number of parameters has reached its capacity limit; the capacity of the continual learning TSA model can then be maintained by increasing the number of parameters over the original size.
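The role of the regularization factor λ can be illustrated with a generic importance-weighted quadratic penalty of the kind used by regularization-based continual learning methods. SCP itself measures preservation with a sliced Cramér criterion, which is not reproduced here; all names and values below are hypothetical:

```python
import numpy as np

def cl_loss(new_task_loss, theta, theta_old, omega, lam):
    """Regularized continual-learning objective: the new-task loss plus a
    penalty that anchors important parameters (large omega) near the values
    they held after the previous task."""
    penalty = float(np.sum(omega * (theta - theta_old) ** 2))
    return new_task_loss + lam * penalty

theta_old = np.array([1.0, -2.0, 0.5])   # parameters after the old task
omega = np.array([5.0, 0.1, 1.0])        # per-parameter importance weights
theta = np.array([1.2, -1.0, 0.5])       # candidate update for the new task

for lam in (0.0, 1.0, 10.0):             # larger lam = stronger anchoring
    print(lam, cl_loss(0.3, theta, theta_old, omega, lam))
```

With λ = 0 the objective reduces to the new-task loss alone (risking forgetting); as λ grows, deviations of important parameters from their old values dominate the objective, reproducing the trade-off discussed above.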

Conclusions
A TSA model combined with a continual learning update scheme is proposed for the situation where the accuracy of the prediction model no longer satisfies the requirements because of system changes under large disturbances. Compared with updating by fine-tuning in transfer learning, continual learning solves the problem of catastrophic forgetting during model updates: it retains the knowledge the model learned in previous scenarios and provides a scalable way to update the model. Case studies on the modified New England 39-bus system and the IEEE 118-bus system show that the framework updates the model using only the new-scenario dataset, and that the updated model meets the assessment requirements under both old and new scenarios. As the updates proceed, the model's coverage of the system operation scenarios also increases.
In future work, continual learning will be of great significance for building models with multiple assessment capabilities, for example, a single model capable of transient stability assessment, frequency stability assessment, and voltage stability assessment at the same time.