Frequency Stability Prediction of Power Systems Using Vision Transformer and Copula Entropy

This paper addresses the problem of frequency stability prediction (FSP) following active power disturbances in power systems by proposing a vision transformer (ViT) method that predicts frequency stability in real time. The core idea of the FSP approach employing the ViT is to use the time-series data of power system operations as ViT inputs to perform FSP accurately and quickly so that operators can decide frequency control actions, minimizing the losses caused by incidents. Additionally, due to the high-dimensional and redundant input data of the power system and the O(N2) computational complexity of the transformer, feature selection based on copula entropy (CE) is used to construct image-like data with fixed dimensions from power system operation data and remove redundant information. Moreover, no previous FSP study has taken safety margins into consideration, which may threaten the secure operation of power systems. Therefore, a frequency security index (FSI) is used to form the sample labels, which are categorized as “insecurity”, “relative security”, and “absolute security”. Finally, various case studies are carried out on a modified New England 39-bus system and a modified ACTIVSg500 system for projected 0% to 40% nonsynchronous system penetration levels. The simulation results demonstrate that the proposed method achieves state-of-the-art (SOTA) performance on normal, noisy, and incomplete datasets in comparison with eight machine-learning methods.


Introduction
To respond to the global environmental crisis, 137 countries agreed to achieve carbon neutrality by approximately 2050 after COP26 [1]. As an essential part of this climate action plan, the large-scale replacement of fuel resources with renewable resources will be able to effectively reduce carbon emissions [2]. Nevertheless, grid frequency support becomes weakened as large-scale renewable generators are connected, which is a challenging issue for power systems that must accommodate renewable energy penetration [3,4]. Specifically, fuel resources in power systems usually belong to synchronous generators that provide inertia and primary operating reserves to maintain system frequency stability [5]. However, the inertia that maintains the system frequency stability is declining because synchronous generators are being replaced by nonsynchronous generators (renewable resources) [6]. To arrest and stabilize frequency volatility in renewable-energy-penetrated power systems, it is essential to accurately and quickly predict system frequency stability, which helps system planners and dispatchers to determine the corresponding control measures in advance, such as under frequency load shedding [7,8], frequency regulation using renewable-power production units [9], frequency regulation using loads [10], and frequency regulation using storage devices [11].
The traditional methods for frequency stability prediction (FSP) are model-driven methods, including the time-domain simulation (TDS) approach and its equivalent models,

Transformer Models in DL
The original transformer, as a powerful DL model, was first proposed in [35]. Unlike a CNN or LSTM, a transformer can extract global features via an attention mechanism. The most representative work is the bidirectional encoder representations from transformers (BERT) [36]. At present, most state-of-the-art (SOTA) natural language processing (NLP) tasks involve BERT. Inspired by the stunning performance of transformers in NLP, the Brain Team of Google Research proposed a vision transformer (ViT) [37] model to solve computer vision (CV) tasks, and the ViT outperformed ResNet [38] in image classification. After this, researchers studying other CV tasks tried to use the transformer architecture as their backbone networks to achieve object detection [39] and semantic segmentation [40]. Inspired by the fantastic performance of transformers in NLP and CV, this paper first proposes a ViT-based FSP method to fill the research gap regarding the use of transformers in power system stability prediction.

Feature Selection Methods
The current feature selection approaches can be classified into three categories: (1) embedded methods; (2) filter-based methods; and (3) wrapper-based methods. Specifically, an embedded method is injected into the learning process of a forecasting model. A filter-based method is independent of any prediction model. A wrapper-based method is based on an optimization algorithm and a forecasting model [41]. In terms of this study, DL methods can automatically perform feature extraction, but at a high cost. We hope that there is a simple and effective way to remove apparently redundant features. CEbased feature selection is a filter-based method, so it can meet the above requirements. Compared with embedded and wrapper-based methods, filter-based methods have higher execution efficiency and greater generalization capabilities [42]. In particular, mutual information (MI) [43] and RReliefF are typical filter-based methods. In [44], CE is proven to be equivalent to MI but has a lower computational burden than MI.

Frequency Security Indices
Frequency prediction indicators can be divided into two kinds: frequency curve prediction [32,45] and frequency characteristics prediction [46,47]. However, the above frequency indicators only distinguish between frequency security and insecurity. None of the previous studies paid attention to safety margins. Delkhosh and Seifi [48] proposed an FSI considering all frequency key characteristics. Due to the decline in power system inertia, this paper combines the center-of-inertia frequency (COIF) and FSI to divide the system frequency responses into three categories, i.e., insecurity, relative security, and absolute security. Note that relative security reflects the safety margin of the frequency response, which can help operators avoid the risk caused by low inertia. For the above reasons, the FSI is used as the prediction indicator of the DL methods.

Our Contributions
Finally, the main contributions of this paper are summarized as follows. This paper proposes a ViT-based FSP method that predicts frequency security online following a disturbance. • A CE-based feature selection method is used to construct image-like data with fixed dimensions, which can decrease the computational burden of the proposed model by removing redundant information. • This paper develops a novel FSI as the predicted result of the model, which considers the safety margin and comprehensive characteristics of frequency compared with the traditional indicators. • Case studies are conducted on a modified IEEE 39-bus system and a modified AC-TIVSg500 system for projected 0% to 40% nonsynchronous system penetration levels, aiming to validate the proposed method's efficacy and scalability.

ViT-Based FSP Method
Normal qkv self-attention (SA) [49] is a building block for DL, and it is given by Equation (1). We compute a weighted sum over all values for every element in an input sequence z ∈ R N×D . The attention weights A ij are based on the pairwise similarity between two elements in the sequence and their respective query q i and key k j representations. Figure 1 introduces a multihead self-attention (MSA) [35] mechanism that can be used to increase the performance of the SA layer, in which we run h self-attention operations (named "heads") in parallel and project their concatenated outputs. It is given by Equation (2), and when changing h, D h is usually set to D/h to maintain the number of calculated parameters constant.
solute security. Note that relative security reflects the safety margin of the frequency response, which can help operators avoid the risk caused by low inertia. For the above reasons, the FSI is used as the prediction indicator of the DL methods.

Our Contributions
Finally, the main contributions of this paper are summarized as follows.
• This paper proposes a ViT-based FSP method that predicts frequency security online following a disturbance. • A CE-based feature selection method is used to construct image-like data with fixed dimensions, which can decrease the computational burden of the proposed model by removing redundant information. • This paper develops a novel FSI as the predicted result of the model, which considers the safety margin and comprehensive characteristics of frequency compared with the traditional indicators.

•
Case studies are conducted on a modified IEEE 39-bus system and a modified AC-TIVSg500 system for projected 0% to 40% nonsynchronous system penetration levels, aiming to validate the proposed method's efficacy and scalability.

Vision Transformer (ViT)
3.1.1. Multihead Self-Attention Normal qkv self-attention (SA) [49] is a building block for DL, and it is given by Equation (1). We compute a weighted sum over all values for every element in an input The attention weights Aij are based on the pairwise similarity between two elements in the sequence and their respective query q i and key k j representations. Figure 1 introduces a multihead self-attention (MSA) [35] mechanism that can be used to increase the performance of the SA layer, in which we run h self-attention operations (named "heads") in parallel and project their concatenated outputs. It is given by Equation (2), and when changing h, Dh is usually set to D/h to maintain the number of calculated parameters constant.

ViT
The ViT adopts a pure transformer architecture, which has minimal changes for performing image classification tasks and achieves better performance than ResNet [37]. It follows the raw design of transformers as much as possible. Figure 2 depicts the framework of the ViT.

ViT
The ViT adopts a pure transformer architecture, which has minimal changes for performing image classification tasks and achieves better performance than ResNet [37]. It follows the raw design of transformers as much as possible. Figure 2 depicts the framework of the ViT.
where E denotes a linear projection that is equivalent to a 2D convolution [50], and P1d denotes 1D position embeddings that are added to the patch embeddings to retain positional information. The tokens are passed through an encoder consisting of a sequence of transformer layers. Each layer  comprises layer normalization (LN) [51], multilayer perception (MLP) [36], and MSA blocks, as follows: To process 2D images, an image x ∈ R H×W×C is reshaped into N nonoverlapping image patches x p ∈ R N×(P 2 ·C) such that p 2 is the resolution of each image patch, C is the number of channels, and (H, W) is the resolution of the original image. The ViT performs a trainable linear projection that maps each vectorized path to 1D tokens z i ∈ R d . The sequence of 1D tokens input into the subsequent transformer encoder is as follows: where E denotes a linear projection that is equivalent to a 2D convolution [50], and P 1d denotes 1D position embeddings that are added to the patch embeddings to retain positional information. The tokens are passed through an encoder consisting of a sequence of transformer layers. Each layer comprises layer normalization (LN) [51], multilayer perception (MLP) [36], and MSA blocks, as follows: The MLP is made up of two linear projections split by a GELU activation function [37], and the token dimensionality remains constant throughout all layers. Finally, a linear classifier is utilized to classify the encoded input.
As shown in Figure 2, power system operation data are reshaped into an equivalent form R H×W×C , where H is the sampling time series of the sensors, W is the dimensionality of the data, and C is the number of channels. Given such a transformation, FSP of power systems can also be carried out by the ViT model.

CE-Based Feature Selection
Statistical independence is a fundamental concept in the fields of statistics and ML. Copulas provide theoretical tools for uniformly representing the statistical associations between random variables [52]. The core of copula is the Sklar theorem [53], which shows that a multivariate density function can be denoted as a product of its marginal and copula density functions, indicating a dependence structure among the associated random variables.
Suppose that X represents random variables whose marginals and copula density are u and c(u), respectively. According to the copula density, the CE of X can be defined as follows: where c(u) = d N C(u) In [44], a parameter-free CE estimation approach was proposed, including two steps: (1) Estimating the empirical copula density (ECD) (2) Estimating the CE For step 1, if independent identically distributed samples {x 1 , . . . , x T } are generated from random variables X = {x 1 , . . . , x N } T , one can easily estimate the ECD as follows: where i = 1,..., and N and χ represent an indicator function. Let u = [F 1 , . . . , F N ]; then, one can derive a new sample set {u 1 , . . . , u T } as data from the ECD c(u).
Once the ECD is estimated, step 2 is essentially an entropy estimation problem. The knearest-neighbor method [43] is utilized to estimate the CE. A larger CE denotes a stronger correlation between the tested variables. The desired features can be obtained by measuring the CE values between the input features and the target features.
In this work, power system operation data are reshaped into three 32 × 32 dimensional matrices, i.e., image-like data with three channels and 32 pixels, via CE-based feature selection. In this process, considerable redundant information is removed from the power system operation data. Therefore, such input data with fixed dimensions are utilized as the inputs of the ViT to avoid an unnecessary computational burden.

Center-of-Inertia Frequency
The frequency of each generator fluctuates around the COIF when a sudden incident occurs in the power system. Therefore, the COIF is commonly used to represent the power system frequency in load shedding schemes [54]. The COIF is given by Equation (8).
where H i , S i , and f i represent the inertia constant, rated apparent power, and frequency of generator i, respectively. N stands for the number of synchronous generators. In this work, the COIF is used to calculate the proposed FSI.

Insecure Boundaries and Secure Boundaries
Insecure boundaries (IBs) are provided by standards and policies to maintain the system stability and reliability, i.e., the maximum frequency deviation (FD), rate of change of frequency (RoCoF), and quasi-steady-state frequency deviation (QSSFD). As depicted in Figure 3, an IB is a constant boundary distinguishing between the secure (stable) and insecure (unstable) frequencies after an active power disturbance.

Insecure Boundaries and Secure Boundaries
Insecure boundaries (IBs) are provided by standards and policies to maintain the system stability and reliability, i.e., the maximum frequency deviation (FD), rate of change of frequency (RoCoF), and quasi-steady-state frequency deviation (QSSFD). As depicted in Figure 3, an IB is a constant boundary distinguishing between the secure (stable) and insecure (unstable) frequencies after an active power disturbance.  Secure boundaries (SBs) distinguish between absolute security and relative security. As depicted in Figure 3, an SB is a flexible boundary determined by the disturbance size, and different values of α, β, and γ lead to different SBs, where α, β, and γ are dependent on the disturbance size, as defined in Equations (9)- (12).
The detailed calculation process of the IB and SB is shown in Table 1. Δfc, RoCoF, and Δfs in Table 1 represent the FD, RoCoF, and QSSFD, respectively. Δfc max , RoCoF max , and Δfs max in Table 1 represent the maximum FD, RoCoF, and QSSFD, respectively. α, β, and γ in Table 1 represent the security coefficients of the FD, RoCoF, and QSSFD, respectively. The security coefficients (α, β, and γ) are defined in Equations (9)- (12).   Secure boundaries (SBs) distinguish between absolute security and relative security. As depicted in Figure 3, an SB is a flexible boundary determined by the disturbance size, and different values of α, β, and γ lead to different SBs, where α, β, and γ are dependent on the disturbance size, as defined in Equations (9)- (12).
The detailed calculation process of the IB and SB is shown in Table 1. ∆f c , RoCoF, and ∆f s in Table 1 represent the FD, RoCoF, and QSSFD, respectively. ∆f c max , RoCoF max , and ∆f s max in Table 1 represent the maximum FD, RoCoF, and QSSFD, respectively. α, β, and γ in Table 1 represent the security coefficients of the FD, RoCoF, and QSSFD, respectively. The security coefficients (α, β, and γ) are defined in Equations (9)- (12).
where M represents the disturbance size; M max and M min respectively represent the maximum and minimum disturbance sizes, which are reference values determined by the historical disturbances in the power system; and k T represents the linear coefficient of the three security coefficients. Table 1. SB and IB.

Calculation of the FSI
The proposed FSI aims to qualitatively evaluate the frequency stability of a power system for a specified operating condition. When an incident occurs, the system frequency response can be divided into three states: insecurity, relative security, and absolute security. In Equation (13), the numbers 0, 1, and 2 indicate insecurity, relative security, and absolute security, respectively.
where ϕ i indicates three frequency characteristics, such as ∆f c , RoCoF, and ∆f s . SB(ϕ) and IB(ϕ) are presented in Table 1. Furthermore, the minimum value among the three frequency characteristics is the FSI, which is given by Equation (14).
For a clear understanding of the FSI, the effect diagram of the FSI is illustrated in Figure 4. If the COIF curve is located in the red zone, the system frequency is absolutely secure. If some parts of the COIF curve are located in the orange or blue area, the system frequency is relatively secure or insecure, respectively.  : the red zone denotes absolute security, the orange zone denotes relative security, and the blue zone denotes insecurity: (a) P 0 It is worth noting that the occurrence times of the maximum FD and the maximum QSSFD are not included in this paper. Due to their weak correlations with the disturbance size, it is not appropriate for the occurrence times of the maximum FD and the maximum QSSFD to undergo a similar process.

Raw Database
Original feature formation is critical for ensuring the accuracy of the FSP results. For a sudden disturbance in a power system, generators withstand the unbalanced power based on the corresponding synchronization factor. The synchronization factor between node j and node k is represented as: where V and δ denote the voltage amplitude and phase angle difference, respectively, and Bjk and Gjk denote the transfer impedance. Therefore, the voltage amplitude and phase angle of each bus [32] should be added to the original features. Furthermore, the power imbalance P Δ for a generator is defined as: where Pm and Pe represent the mechanical power and electrical power of the generator, Figure 4. The effect diagram of the FSI for ∆P > 0 and ∆P < 0: the red zone denotes absolute security, the orange zone denotes relative security, and the blue zone denotes insecurity: (a) ∆P > 0; It is worth noting that the occurrence times of the maximum FD and the maximum QSSFD are not included in this paper. Due to their weak correlations with the disturbance size, it is not appropriate for the occurrence times of the maximum FD and the maximum QSSFD to undergo a similar process.

Raw Database
Original feature formation is critical for ensuring the accuracy of the FSP results. For a sudden disturbance in a power system, generators withstand the unbalanced power based on the corresponding synchronization factor. The synchronization factor between node j and node k is represented as: where V and δ denote the voltage amplitude and phase angle difference, respectively, and B jk and G jk denote the transfer impedance. Therefore, the voltage amplitude and phase angle of each bus [32] should be added to the original features. Furthermore, the power imbalance ∆P for a generator is defined as: where P m and P e represent the mechanical power and electrical power of the generator, respectively. H stands for the inertia constant of the generator. f N stands for the system operational frequency. Referring to Equation (16), the electrical power values of the generators [32] are also selected as original input features. Note that the electrical power values of the nonsynchronous generators also might be related to frequency stability, according to [55][56][57]. Thus, in our work, the electrical power values of all generators are selected as original input features. Furthermore, the active power load of each bus and the apparent power of each line are also selected as original input features, as they can reflect the current power flow situation. In practice, sensors (i.e., PMUs) and TDS software (i.e., PSS/E, DIgSILENT) are able to provide the above data as a raw database. Specifically, they are listed in Table 2. Active power load of each bus from t 0 to 32 f t 3 Voltage amplitude of each bus from t 0 to 32 f t 4 Voltage phase angle of each bus from t 0 to 32 f t 5 Apparent power of each line from t 0 to 32 f t Note: t 0 is the initial sampling point when a disturbance occurs. f t is the sampling period.

Offline Training
As illustrated in Figure 5, the offline training process of the proposed method includes two parts: (1) performing CE-based feature selection and (2) training and building the ViT model.

Number
Original Feature 1 Electrical power of each generator from t0 to 32 ft 2 Active power load of each bus from t0 to 32 ft 3 Voltage amplitude of each bus from t0 to 32 ft 4 Voltage phase angle of each bus from t0 to 32 ft 5 Apparent power of each line from t0 to 32 ft Note: t0 is the initial sampling point when a disturbance occurs. ft is the sampling period.

Offline Training
As illustrated in Figure 5, the offline training process of the proposed method includes two parts: (1) performing CE-based feature selection and (2) training and building the ViT model.
The first part calculates the CE values between the input data at the initial moment t0 and the FSIs. By sorting the CE values, the desired feature subset is obtained, the dimensionality of which is 96. Then, the data shape of the feature subset is reshaped into three channels, 32 features, and 32 sampling points, similar to an RGB image of 32 pixels. Moreover, zero-mean normalization [30] is used to eliminate the magnitude differences between different features before inputting them into the model, and this process is defined as follows: where µ is the mean of all data, σ is the standard deviation of all data, and x * is the normalized data. For the second part, the loss function is the cross-entropy error function [30]. The optimization solver adopts Adam [58] to update parameters. The number of epochs is set to 200, and the batch size is set to 200 for each epoch. The initial learning rate is 0.0005, and the CosineAnnealingLR [59] schedule is used to decrease the learning rate to yield improved training efficiency.    The first part calculates the CE values between the input data at the initial moment t 0 and the FSIs. By sorting the CE values, the desired feature subset is obtained, the dimensionality of which is 96. Then, the data shape of the feature subset is reshaped into three channels, 32 features, and 32 sampling points, similar to an RGB image of 32 pixels. Moreover, zero-mean normalization [30] is used to eliminate the magnitude differences between different features before inputting them into the model, and this process is defined as follows: where µ is the mean of all data, σ is the standard deviation of all data, and x * is the normalized data.
For the second part, the loss function is the cross-entropy error function [30]. The optimization solver adopts Adam [58] to update parameters. The number of epochs is set to 200, and the batch size is set to 200 for each epoch. The initial learning rate is 0.0005, and the CosineAnnealingLR [59] schedule is used to decrease the learning rate to yield improved training efficiency.

Online Application
All steps of the online application are shown in Figure 5. The post-fault input data in the feature subset can be sampled by a WAMS with PMUs, and then these data input into the well-trained model, enabling the quick prediction of the FSIs, i.e., insecurity, relative security, and absolute security. The operators utilize corresponding controls and strategies by referring to the prediction results to minimize the loss caused by the incident. Moreover, note that the model only needs data from the feature subset in the online application stage, which means that the PMU does not cover the entire power system. Because PMUs are expensive, this can increase the economy of the ML system in online applications [60]. Finally, online updating of the database and retraining of the model can be carried out hourly or daily to adapt to various system operation situations [27].

Evaluation Indicators
In this paper, the FSP of power systems is a classification task. Therefore, accuracy (ACC), precision (PRE), recall (REC), and F-measure (F1) are employed as the evaluation metrics. These metrics are defined in Equation (18).
where TP i , FP i , and FN i are the number of true-positive samples, the number of falsepositive samples, and the number of false-negative samples under each security state i, respectively. n total is the total number of samples. PRE i , and REC i are the precision and recall under each security state i, respectively, whose average values are PRE and REC.

Equipment and Software
All tested algorithms are implemented in Pytorch-v1.10.1 and Scikit-learn-v1.0.2. The CE-based feature selection process is provided by pycopent-v0.2.3, which is available in R or Python. Moreover, all algorithms are trained on a personal computer with an Intel Core(TM) i5-12600KF CPU @ 3.70 GHz (Santa Clara, CA, USA), 32 GB of RAM and an RTX 3060 GPU ((Santa Clara, CA, USA)).

A Modified New England 39-Bus System
Numerical simulations were implemented on a modified New England 39-bus system with PSS/E [61] to simulate the data acquired by a WAMS and PMUs. To approximate the Entropy 2022, 24, 1165 11 of 24 system behavior [27], the load levels of the power system were set to 50%, 51%, . . . , and 100% of the original load levels [32]. The same ratio was also used to scale the generation power, but extra modifications were provided to ensure that all input data fall within a reasonable range [27]. Under different power flow levels, the sudden load volatility was set to the active power disturbance [32]. The disturbance was assumed to occur on all load buses. The fault sizes were set to range from −500 MW to 500 MW at intervals of 100 MW. The fault occurrence time was set at the simulation start moment (0 s), the simulation duration was 60 s, and the sample rate was 100 Hz for each incident. Additionally, wind farms were connected to bus 2, bus 29, and bus 39 of the system. Dynamic wind farm models were provided by the Western Electricity Coordinating Council (WECC) [62]. Specifically, the wind turbine converter module adopted WT3G2, the electrical control module adopted WT3E1, the mechanical control module adopted WT3T1, and the pitch control module adopted WT3P1. The detailed parameters of dynamic wind farm models are listed in Appendix A. Changing the power output of the generating units controlled the renewable energy penetration rate (REPR). This study set the REPR to 0%, 5%, . . . , and 40% of the total generation power output.
The detailed configurations utilized for dataset generation are summarized in Table 3. The FSI was used for each sample to annotate the corresponding frequency stability category. The required input data for the FSI calculation process are listed in Table 4. Consequently, 69,768 labeled samples were formed under the above conditions. As described in the CE-based feature selection discussion, the input data obtained within 0.32 s were selected as a subset with 96 features. The number of sample points was T = 0.32/0.01 = 32, and the input sample of each disturbance was X ∈ R 32×32×3 . In the subsequent experiments, the dataset was divided randomly at a 7:3 ratio into training and test datasets to assess the performance discrepancies among various methods.  In this work, the sample annotation process was based on the FSI. Each sample belonged to one of three frequency categories, i.e., insecurity, relative security, and absolute security. To observe the correlation between the REPR and FSI, Figure 6 presents the distribution of the FSI (labels) under 0% to 40% REPRs. The number of insecure samples follows an increasing trend with the growth of the REPR. Conversely, the number of absolutely secure samples follows a decreasing trend with the growth of the REPR. Moreover, the number of relatively secure samples approximatively follows a normal distribution. It is assumed that the other operating conditions are not shifted, such that the growth of the REPR deteriorates the frequency stability of power systems.
follows an increasing trend with the growth of the REPR. Conversely solutely secure samples follows a decreasing trend with the growth o ver, the number of relatively secure samples approximatively follows tion. It is assumed that the other operating conditions are not shifted, s of the REPR deteriorates the frequency stability of power systems.  x-axis represents the frequency security index (FSI), i.e., insecurity, relative security, and absolute security. The y-axis represents the renewable energy penetration rate (REPR). The z-axis represents the number of each kind of sample among the FSIs.

Feature Subset
Due to the vast computational resources required by a transformer, a single-layer fully connected network (FCN) is used to verify the effectiveness of CE-based feature selection in this section. Figure 7a depicts a comparison between the raw dataset and the optimal dataset in terms of the accuracy, training time, and parameter size of the FCN. The parameter size denotes the required computational resources, which are mainly dependent on the data dimensionality and model complexity. According to dimensionality reduction, CE-based feature selection can reduce the training time and parameter size while maintaining the original performance as much as possible. Figure 7b shows the component analysis results of the feature subset. In the feature subset, the apparent power of the transmission line accounts for 33.3% of the feature subset, and the voltage phase angle of the bus accounts for 31.3% of the feature subset. This reflects that these physical variables may involve much effective information for the FSP. Subsequently, the active power load accounts for 21.9% of the subset, and the voltage amplitude of the bus accounts for 13.5% of the subset. This reflects that these physical variables only involve some effective information for the FSP. It is worth noting that the active power of the generator accounts for 0%. This reflects that the active power of the generator may provide zero or slight effective information for the FSP. Overall, CE, as a theoretical tool in statistics, tries to analyze the correlation between system frequency and other physical variables of the power system. To some degree, this can help system planners and dispatchers understand what input features are important for the FSP.
Finally, the feature subset provided by CE-based feature selection can substitute for the raw dataset. Thus, the next section adopts this feature subset as the input dataset to compare the performance of different models. accounts for 0%. This reflects that the active power of the generator may provide zero or slight effective information for the FSP. Overall, CE, as a theoretical tool in statistics, tries to analyze the correlation between system frequency and other physical variables of the power system. To some degree, this can help system planners and dispatchers understand what input features are important for the FSP. Finally, the feature subset provided by CE-based feature selection can substitute for the raw dataset. Thus, the next section adopts this feature subset as the input dataset to compare the performance of different models.

Performance Comparison
A support vector machine (SVM), FCN, LeNet [63], AlexNet [64], ResNet [38], VGG [65], MobileNet [66], and InceptionNet [67] are utilized for comparison to test the performance of the ViT on the same dataset. The SVM and FCN are traditional ML models and are the default implementations provided by Sklearn. LeNet, AlexNet, ResNet, VGG, MobileNet, and InceptionNet are traditional DL models that use the same parameters and structure as those in their original papers. In particular, the structure of the ViT contains three transformer encoder layers and one MLP layer for classification. The detailed hyperparameters of the well-trained ViT model are listed in Table 5. For a fair comparison, the ViT and other DL models adopt the same training strategy. As shown in Figure 8, it is evident that the proposed method achieves SOTA performance in comparison with the traditional ML and DL methods. The PRE and REC values are similar to the ACC values, which reflects that the models treat each FSI category fairly. The traditional ML models lack the powerful feature-extraction capability of DL. Thus, they have poorer performance than the DL models. The traditional DL models extract data information by the convolution layer. However, the convolution layer focuses on the extraction of local data features, and obtaining global data information requires a large number of convolution layers. However, according to the results, MSA is a better effective mechanism than convolution. Unlike convolution, MSA can directly extract global features [35], which leads to better performance for the proposed method. Moreover, since FSP has high demands regarding the execution times of models, the proposed algorithm only takes approximately 0.34 s (the time window is 0.32 s and the execution time is 0.02 s) to be executed for predicting each FSI category since the implementation of transformer models has been highly optimized [36]. Thus, the proposed method is acceptable for online applications.
information by the convolution layer. However, the convolution layer focuses on the extraction of local data features, and obtaining global data information requires a large number of convolution layers. However, according to the results, MSA is a better effective mechanism than convolution. Unlike convolution, MSA can directly extract global features [35], which leads to better performance for the proposed method. Moreover, since FSP has high demands regarding the execution times of models, the proposed algorithm only takes approximately 0.34 s (the time window is 0.32 s and the execution time is 0.02 s) to be executed for predicting each FSI category since the implementation of transformer models has been highly optimized [36]. Thus, the proposed method is acceptable for online applications.

Influence of Gaussian Noise
The experiments, as mentioned above, are assumed to sample data from the PMUs in power systems without any noise. However, PMUs usually suffer from noise interference and sampling errors [27]. To analyze the influence of white Gaussian noise on the models, this study added a noise n with different signal-noise ratios (SNRs) to the feature subset. The SNR is given by Equation (19). The accuracies of the ViT and the other models on the noisy data are reported in Table 6, where the best values are highlighted in boldface. As the SNR declines from 50 dB to 10 dB, Gaussian white noise hinders the useful feature vector information extracted by the ML models, reducing their accuracy. In Table 6, it can be observed that the ViT achieves SOTA accuracy on noisy datasets (from 50 dB to 10 dB) relative to those of other methods. Note that the ViT still exceeds 90% accuracy on the noisy data with an SNR of 10 dB. In contrast, the accuracies of the SVM, the FCN, LeNet, AlexNet, and MobileNet are obviously lower than 90% on noisy data, with an SNR of 10 dB. Overall, the ViT ensures the concentration of useful information in the extracted feature vectors, which illustrates that the ViT can tolerate PMU noise in practice.

Incomplete Data Analysis
Another assumption mentioned above is that the PMU measurements of all data are available. In practice, some PMU data may be missing due to PMU losses or communication delays [27]. To analyze the influences of incomplete data on the models, we randomly set the data of each sample to 0 with the same proportions. The incomplete ratio can be described by Equation (20), where N missing is the number of missing data, and N all is the total number of data. The accuracies of the models on the incomplete data are presented in Table 7, where the best values are also highlighted in boldface. As the incomplete ratio rises from 5% to 40%, the missing data also reduce the information contained in the feature vectors extracted by the ML models, which decreases their accuracy. As shown in Table 7, the ViT achieves SOTA accuracy compared with that of other ML methods. Under incomplete ratios of 5% and 10%, the ViT exceeds 95% accuracy. An incomplete ratio of more than 10% is rare in real power systems unless they are maliciously attacked. In the case with malicious attacks, the accuracy of the ViT still remains at 90.78% under an incomplete ratio of 40%. For this situation, one of the reasons for this performance may be attributed to the global feature extraction ability of the transformer model. Additionally, it should be noted that the accuracies of InceptionNet and ResNet are close to that of ViT. The main reason for this may be that stacking a large number of convolution layers can similarly extract global features from local features [38]. However, MSA can naturally extract global features. Thus, the ViT is less affected in terms of performance when data are missing. Finally, the ViT is empirically proven to work more robustly than CNN-based models, even when some PMU measurements are unavailable.

Visualization Analysis of the ViT
TSNE [68] is a popular method for embedding high-dimensional data to visualize them in a low-dimensional space. To further analyze the representation ability of the ViT, we decrease the dimensions of the feature vectors extracted by the model to 2D for visualization purposes. The results are shown in Figure 9, in which the closer the sample points are, the more similar they are, and the different colors distinguish different categories. Figure 9a shows the feature visualization results obtained from the raw data after performing dimensionality reduction via TSNE, and there are no evident demarcation lines between the three categories. After full training, it is clear from Figure 9b that the chaotic feature is separated into several clusters by the well-trained ViT model. The TSNE visualization results demonstrate the powerful feature extraction ability of the transformer architecture. Moreover, the representations learned by the MSA mechanism are useful for the subsequent classification task. Overall, this proves that the ViT model has the ability to find effective representations for classification. (a) (b) Figure 9. Visualization of the ViT feature extraction results. FSIs from 0 to 2 indicate insecurity, relative security, and absolute security, respectively: (a) raw data distribution; (b) output of the last layer.

A Modified ACTIVSg500 System
In addition to the modified New England 39-bus system, a more extensive synthetic system, the modified ACTIVSg500 system, was also employed as a test case to validate the performance and scalability of the proposed method. As shown in Figure 10, the AC-TIVSg500 system was built based on the footprint of western South Carolina, covering approximately 21 counties with approximately 2.6 million people [69]. The ACTIVSg500 system has two voltage levels (345/138 kV). Furthermore, it contains 90 generators with a total generation capacity of approximately 12 GW [70]. The synchronous generators include coal, gas, and hydro generators. The nonsynchronous generators include wind and solar PV power plants. Specifically, the detailed parameters of dynamic wind farm models are the same as those in the previous case. The grid interface module for solar generators adopts REGCAU1, the electrical control module for solar generators adopts REECBU1, and the plant controller module for solar generators adopts REPCAU1. The detailed parameters of solar PV power plants are listed in Appendix A. The wind power plants are connected to nodes 9, 144, and 197 of the system, and the solar PV power plants are connected to nodes 17, 167, and 224 of the system. Changing the power outputs of the generating units adjusts the different REPRs.
Similarly, numerical simulations were carried out on PSS/E, and all simulation models were provided by Texas A&M University. The load levels were set to 50%, 52%, …, and 100% of the basic system load levels. The same ratios also scale the generation power but with extra modifications to ensure that all input data stay within a reasonable range. Under each power flow level, the sudden load volatility was set to active power disturbance. The disturbances are located at nodes 4, 6, 61, 64, 103, 150, 204, 292, 303, 364, 470, and 499. The fault sizes were set to range from −700 MW to 700 MW at intervals of 100 MW. The fault occurrence time was set at the simulation start moment (0 s), the simulation duration was 60 s, and the sample rate was 100 Hz for each disturbance. This study set different REPRs: 0%, 5%, …, and 40% of the total generation power output.

A Modified ACTIVSg500 System
In addition to the modified New England 39-bus system, a more extensive synthetic system, the modified ACTIVSg500 system, was also employed as a test case to validate the performance and scalability of the proposed method. As shown in Figure 10, the ACTIVSg500 system was built based on the footprint of western South Carolina, covering approximately 21 counties with approximately 2.6 million people [69]. The ACTIVSg500 system has two voltage levels (345/138 kV). Furthermore, it contains 90 generators with a total generation capacity of approximately 12 GW [70]. The synchronous generators include coal, gas, and hydro generators. The nonsynchronous generators include wind and solar PV power plants. Specifically, the detailed parameters of dynamic wind farm models are the same as those in the previous case. The grid interface module for solar generators adopts REGCAU1, the electrical control module for solar generators adopts REECBU1, and the plant controller module for solar generators adopts REPCAU1. The detailed parameters of solar PV power plants are listed in Appendix A. The wind power plants are connected to nodes 9, 144, and 197 of the system, and the solar PV power plants are connected to nodes 17, 167, and 224 of the system. Changing the power outputs of the generating units adjusts the different REPRs. The detailed configurations used for dataset generation are summarized in Table 8. The required input data for the FSI calculation process are listed in Table 9. Consequently, 39,312 labeled samples were formed under the above conditions. In further trials, the dataset was also divided randomly at a 7:3 ratio into training and test datasets to assess the Similarly, numerical simulations were carried out on PSS/E, and all simulation models were provided by Texas A&M University. The load levels were set to 50%, 52%, . . . , and 100% of the basic system load levels. The same ratios also scale the generation power but with extra modifications to ensure that all input data stay within a reasonable range. Under each power flow level, the sudden load volatility was set to active power disturbance. The disturbances are located at nodes 4,6,61,64,103,150,204,292,303,364,470, and 499. The fault sizes were set to range from −700 MW to 700 MW at intervals of 100 MW. The fault occurrence time was set at the simulation start moment (0 s), the simulation duration was 60 s, and the sample rate was 100 Hz for each disturbance. This study set different REPRs: 0%, 5%, . . . , and 40% of the total generation power output.
The detailed configurations used for dataset generation are summarized in Table 8. The required input data for the FSI calculation process are listed in Table 9. Consequently, 39,312 labeled samples were formed under the above conditions. In further trials, the dataset was also divided randomly at a 7:3 ratio into training and test datasets to assess the observed performance discrepancies.

Testing Results and Comparison
The ViT and other methods have almost the same configurations as those in the previous case. The performance comparison between the ViT and other methods is shown in Figure 11. The feature subset of the ACTIVSg500 system was similarly analyzed, and the results are shown in Figure 12. The effects of PMU noise and loss on the model are shown in Tables 10 and 11,      In the results presented thus far, this case agrees with the previous case. The proposed ViT-based method achieves the best accuracy among the tested methods, whether they are used in a noisy or incomplete environment or not. This indicates again that the global feature extraction of MSA is better than the local feature extraction of convolution. Note that the performance of the proposed method does not decline as the system size increases, demonstrating its superior efficacy and scalability. The reason behind these results might be that CE retains as many variables as possible that are effective in FSP. In Figure 12b, the variables of power systems follow the same trend as in the previous case, with a slight difference in the number of each feature.  In the results presented thus far, this case agrees with the previous case. The proposed ViT-based method achieves the best accuracy among the tested methods, whether they are used in a noisy or incomplete environment or not. This indicates again that the global feature extraction of MSA is better than the local feature extraction of convolution. Note that the performance of the proposed method does not decline as the system size increases, demonstrating its superior efficacy and scalability. The reason behind these results might be that CE retains as many variables as possible that are effective in FSP. In Figure 12b, the variables of power systems follow the same trend as in the previous case, with a slight difference in the number of each feature.

Discussion
In the present study, we first propose a ViT-based FSP method. Note that the ViT only uses the pure transformer architecture because we aim to explore the potential of the transformer architecture in FSP. Convolution and MSA are all effective mechanisms for extracting useful information. Multimodel combinations may be better when the transformer is reasonably combined with other DL models. For example, the transformer can be combined with a CNN to mine the complex relationships between data because the transformer is good at extracting global features [71] and the CNN is good at extracting local features.
In addition, the transformer is a widespread method in NLP. For NLP, the selfsupervised training approach [36] is common; i.e., the training process does not need data labels. However, most ML methods in power systems are supervised training approaches, i.e., the training process needs data labels. The use of self-supervised training methods combined with transformers is foreseeable in power systems. The advantage of self-supervised learning is that it ensures low-cost access to large amounts of training data and maintains high performance. For example, the topology of a power system may change due to a failure, causing the new data to be completely different from the original training data. Models trained via supervised learning cannot handle new data that are outside the range of the original training data. This means that new data should be collected and labeled to retrain the model to fit such a change, but this is a high-cost training approach. In contrast, self-supervised learning can automatically fit the change by collecting data without manual annotation, which costs less than supervised learning. For the image classification task, DL does not require feature selection because the dimensionality of the image is fixed, and DL can automatically extract useful features. For the FSP of power systems, different power systems have different dimensions, and their data are highly redundant. CE-based feature selection transforms power system operation data into image-like data with three channels and 32 pixels, thereby significantly improving the generalization of the proposed method. Moreover, DL usually belongs to a black-box model that has low interpretability. According to feature selection, we can know feature importance, which increases the interpretability of the resulting model to some extent.

Conclusions and Future Work
This paper proposes a DL method for power system FSP by using ViT and CE. Case studies were carried out on the modified New England 39-bus system and the modified ACTIVSg500 system. The results demonstrate the following:

•
The ViT-based FSP method achieves SOTA performance compared to eight ML methods on normal, noisy, and incomplete datasets, so the proposed method is suitable for practical applications.

•
As for the FSP of power systems tasks, the global feature extraction of MSA is a better mechanism than the local feature extraction of convolution. • When using CE-based feature selection, the proposed method is still efficient and achieves high performance in power systems of any scale without vast computational resources.