Self-Attention Network for Partial-Discharge Diagnosis in Gas-Insulated Switchgear

Abstract: Detecting, measuring, and classifying partial discharges (PDs) are important tasks for assessing the condition of insulation systems in various electrical equipment. Because the phase-resolved PD (PRPD) can be treated as a sequence input, an existing method for processing sequential data, the recurrent neural network with long short-term memory (LSTM), has been applied to fault classification. However, its performance cannot be further improved because it supports neither parallel computation nor the recognition of the relevance among all inputs. To overcome these two drawbacks, we propose a novel deep-learning model based on a self-attention mechanism to classify PD patterns in a gas-insulated switchgear (GIS). The proposed model uses a self-attention block, which offers the advantages of simultaneous computation and selective focus on parts of the PRPD signals, and a classification block to classify faults in the GIS. A combination of LSTM and self-attention is also considered for comparison. The experimental results show that the proposed method outperforms previous neural networks while significantly reducing the model complexity.


Introduction
Power demand is rapidly increasing, and the reliability of the power grid is essential for stable power-system operation. Gas-insulated switchgears (GISs) are filled with SF6 gas, which has excellent insulation characteristics, and have been applied to substations as the main protection devices since the late 1960s owing to their high reliability, safety, and compactness [1]. Various failures can occur with the passage of service time, and insulation defects in a GIS can cause partial discharges (PDs) before breakdown [2][3][4]. Therefore, detecting PDs at an early stage contributes to ensuring the reliability and safety of grid assets [5].
The PDs in a GIS can be measured using electrical, mechanical, and chemical methods [6,7]. High-frequency and ultra-high-frequency (UHF) sensors are used in the electrical methods [4,8,9], acoustic sensors are used for sound measurements [10,11], and dissolved-gas analysis is employed in the chemical methods [12,13]. In particular, the UHF method offers the advantage of high-sensitivity detection [14]. Therefore, the UHF measurement-system verification method has been standardized [15]. The present study uses a UHF sensor for a PD measurement system.
To examine the characteristics of PDs in a GIS, two types of analyses have been studied, namely, time-resolved PD (TRPD) and phase-resolved PD (PRPD) analyses [16-23]. In the TRPD method, time-domain, frequency-domain, and combined time-frequency-domain features are used to analyze the PD pulses [21-23]. The PRPD-based method analyzes the phase-amplitude-number (φ − q − n) measurements, where φ is the phase angle, q is the amplitude, and n is the number of discharges [24]. The number of PD pulses, the maximum amplitude, or the average amplitude in each phase is used as a feature in PD classification [25].
Most of the previous studies on PD analysis using PRPDs focused on either extracting the useful features or accurate classification based on these extracted features. Signal-processing techniques such as time-domain [26], frequency-domain [27], and time-frequency-domain [28] analyses are used to extract the representative features from PRPDs. After feature extraction, the dimension reduction for computational efficiency is achieved through a feature-selection step. Correlation analysis is applied to cluster the PDs into different groups [29]. Principal component analysis is used to reduce the dimensions [30,31]. Based on the useful features from the univariate phase-resolved distributions [32], the final step is to train the classifiers such as neural networks [33], decision trees [34], and support vector machines (SVMs) [35]. However, the PD classification performance significantly varies depending on the particular combination of the existing feature-extraction and classification methods. Therefore, to maximize the PD classification performance, an integrated framework simultaneously considering both the feature-extraction and classification methods is needed.
Motivated by this objective, deep-learning models that combine automatic feature extraction and fault classification have been proposed [36,37]; deep neural networks (DNNs) have recently achieved cutting-edge performance in pattern-recognition tasks such as computer vision, speech recognition, text classification, and many other domains [38-40]. More recently, deep-learning methods have been applied to PRPD classification. DNNs [41] and convolutional neural networks (CNNs) [42] have been proposed to improve the recognition accuracy of PRPDs. CNNs allow systems to learn local responses from temporal or spatial data; however, they lack the ability to learn the sequential correlations of the inputs. Recurrent neural networks (RNNs) with long short-term memory (LSTM) have an advantage over CNNs because they can effectively process sequential data [43]. However, the sequential nature of RNN-based models precludes parallelism, which results in significant training time when the input sequence is long [44].
To overcome the aforementioned drawbacks, we propose new classification methods to classify faults in a GIS using PRPDs, namely, self-attention-based neural network for PRPDs (SANPD) and LSTM SANPD (LSANPD) methods. Self-attention takes advantage of parallel computation and enables the capture of the interactions among inputs [45] because the self-attention function can capture the global dependence of the entire input without requiring recurrence or convolution components [44]. In LSANPD, the combination of LSTM and the self-attention mechanism further improves the performance, because the self-attention mechanism can address the lack of simultaneous computation and focus on the important information from the LSTM inputs.
The proposed SANPD and LSANPD methods employ multi-head self-attention, feed-forward networks, and a classification layer. The multi-head self-attention is used to jointly attend to information from different representation subspaces that correspond to different phase sets of the PRPDs. The feed-forward networks compensate for self-attention being a linear model through the composite mapping of nonlinear processing units [44]. Finally, the classification layer is employed to detect faults in the GIS. The main contributions of this paper are summarized as follows:
• Self-attention is introduced for the first time to classify the PRPDs in a GIS. Self-attention offers advantages in classification accuracy and computational efficiency over DNNs, CNNs, and RNNs [41,42,46] because it can capture the relevance among the phases of the PRPDs by considering their entire interaction sequence input regardless of distance [44].
• The LSTM self-attention method is also considered. In the LSTM self-attention model, the self-attention mechanism assists the LSTM to compute simultaneously and to focus on the important information in the data inputs, which improves the PRPD classification accuracy relative to that of the LSTM RNN [46].
• The experimental results reveal that our models outperform the previous RNN model [46] in terms of PRPD classification accuracy at lower complexity because the self-attention mechanism recognizes the different relevance of the information among the inputs and takes advantage of simultaneous computation [45].
The remainder of this paper is organized as follows. We discuss the PRPDs and on-site noise measurements in a GIS in Section 2. Section 3 describes the proposed self-attention and LSTM self-attention models. The performance evaluation is presented in Section 4, and Section 5 concludes the study while also discussing future research topics.

Preliminaries
In this section, we discuss the experimental PRPDs of a GIS and on-site noise-measurement data in which UHF sensors are used.

PRPD Measurements
For performance comparison with previous results, we obtained the PRPD data using an external UHF sensor in a 345-kV GIS chamber, where a cavity-backed patch antenna served as the external UHF sensor and an amplifier with a gain of 45 dB and a signal bandwidth from 500 MHz to 1.5 GHz was used [46]. Four types of faults are possible (corona, floating, particle, and void PDs), and artificial cells were used to simulate these defects in a GIS [46]. Figure 1 shows the artificial cells that model the four fault types [46,47]. The artificial cell for corona, shown in Figure 1a, simulates the protrusion of an electrode through a needle with a tip radius of 10 µm and a diameter of 1 mm; the distance between the needle and the ground electrode was 10 mm, and the test voltage was 11 kV. As shown in Figure 1b, a floating-electrode cell was fabricated (with distances of 10 mm between the high-voltage [HV] and middle electrodes and 1 mm between the middle and ground electrodes) to simulate an unconnected cell; the test voltage was 10 kV. To simulate free-particle discharge, as shown in Figure 1c, a small sphere with a diameter of 1 mm was placed on a concave ground electrode, and the HV electrode was attached to a sphere of 45 mm diameter (fixed at 10 mm from the ground electrode); the test voltage was 10 kV. A small gap between an epoxy disc and the upper electrode (as shown in Figure 1d) was made to simulate an artificial void defect; the test voltage was 8 kV. All artificial cells were filled with 0.2 MPa of SF6 gas.

Figure 2 shows the PRPDs over 3600 power cycles, and Figure 3 shows the 2D representation of the PRPDs, where the faults over 3600 power cycles are accumulated to generate the 2D PRPD patterns and the number of PD events per 3600 power cycles is indicated by different colors. Corona signals appeared frequently from 255 to 315 degrees and slightly around 45 degrees, with amplitudes close to zero. The floating signal showed an extremely condensed density of signals with a period of 90 degrees, starting from zero degrees. The void signal appeared similar to that of corona faults, found in the regions around 45-90 degrees and, to a lesser extent, around 270 degrees. The particle signal contained numerous pulses that appeared at random phase positions and over different intensity ranges. The PRPD signal at the $m$th power cycle can be defined as
$$\mathbf{x}_m = [x_{m,1}, x_{m,2}, \ldots, x_{m,N}], \quad (1)$$
where $N = 128$ is the number of data points in a power cycle.

On-Site Noise Measurements
External noise can vary with time, location, GIS, and antenna design. The noise was measured for 267 min using a PD measurement system in an on-site field in Korea. In Figure 4a, a block diagram of the PD measurement system for the 154 kV GIS is shown. The PD measurement system consisted of an external UHF sensor, an amplifier, and a data acquisition system (DAS). The external UHF sensor was located outside the spacer, as shown in Figure 4b. The cavity-backed patch antenna was used for the external UHF sensor in the PD measurement system. The amplifier had a gain of 45 dB and a signal bandwidth from 500 MHz to 1.5 GHz. The measured reflection coefficient of the external UHF sensor using an E5017C network analyzer is shown in Figure 5. The measured reflection coefficient was less than −6 dB in the target frequency range from 500 MHz to 1.5 GHz. Figure 6 shows an example of the on-site noise measurement. Here, noise signals existed in all phase regions of the specific power cycles, and the amplitudes of the noise signals were smaller than those of the PRPDs in the GIS.

Proposed Methods
In this section, we describe the architectures of the two proposed methods, namely, SANPD and LSANPD, for detecting the PRPDs in the GIS, as shown in Figure 7. In SANPD, the PRPD is the input, and self-attention blocks (SABs) are utilized to learn the global dependence between the input and output and the relevance among items. A stack of multiple SABs is then used to capture the high-level features of the fault PRPDs. Finally, a classification layer with a cross-entropy loss is adopted to classify the multiple faults in the GIS. In LSANPD, an LSTM architecture is added before the SABs, and the remaining components are the same as those in SANPD. LSTM is good at directly learning the temporal dependence of the PRPD signals [46]. However, it is not capable of learning the alignment between the input and output sequences, which is an essential aspect of structured output tasks [48]. In other words, LSTM cannot determine whether specific parts of the input sequence are important for improving model performance, whereas self-attention can recognize the important information between the input and output sequences. Therefore, LSANPD is proposed to improve the LSTM performance in fault classification.

Proposed SANPD
Before describing the SABs, we give a concise presentation of the attention mechanism. The mechanism selects the critical information from a large amount of information to carry out the current task. In the attention mechanism, different weights adjust the effects of the distinctive parts on the target [49,50]. In other words, the attention mechanism captures the important interactions among the elements of an input sequence to improve the performance of a machine-learning model.
We consider a given input sequence consisting of a vector representation of query $\mathbf{q} \in \mathbb{R}^{1 \times d_q}$ and sequence $\mathbf{X} = [\mathbf{x}_1^T, \mathbf{x}_2^T, \ldots, \mathbf{x}_n^T] \in \mathbb{R}^{n \times d_e}$, where $\mathbf{q}$ is any representative vector created to calculate dependencies (relative to the input items), $d_q$ is the dimension of $\mathbf{q}$, and $d_e = N$, with $N$ the number of data points in a power cycle. To measure the attention between $\mathbf{x}_i$ and $\mathbf{q}$, i.e., the relevance or dependence of the relationship between $\mathbf{x}_i$ and $\mathbf{q}$, the attention mechanism uses a compatibility function $f(\mathbf{x}_i, \mathbf{q})$ as an alignment score [51,52]. The output vector for query $\mathbf{q}$ is denoted as $\mathbf{s} \in \mathbb{R}^{1 \times d_e}$ and calculated as
$$\mathbf{s} = f_{\text{softmax}}(\mathbf{a})\,\mathbf{X}, \quad (2)$$
where $[f_{\text{softmax}}(\mathbf{a})]_i = \exp(a_i) / \sum_{j=1}^{n} \exp(a_j)$, $\mathbf{a} = [a_1, a_2, \ldots, a_n]$, and $a_i = f(\mathbf{x}_i, \mathbf{q})$ is the $i$th alignment score. In this study, we use a dot-product attention mechanism for the compatibility function as [52]
$$f(\mathbf{x}_i, \mathbf{q}) = \langle \mathbf{x}_i \mathbf{W}^{(h_1)}, \mathbf{q} \mathbf{W}^{(h_2)} \rangle, \quad (3)$$
where $\mathbf{W}^{(h_1)} \in \mathbb{R}^{d_e \times d_i}$ and $\mathbf{W}^{(h_2)} \in \mathbb{R}^{d_q \times d_i}$ are learnable parameters, $\langle \cdot, \cdot \rangle$ denotes the inner product, and $d_i$ denotes the number of samples in the input data. The self-attention mechanism is a special case of the attention mechanism in which query $\mathbf{q}$ is obtained from the input itself. Each SAB is composed of a multi-head self-attention sub-block and a feed-forward network sub-block, as shown in Figure 8. A residual connection is employed around each of the two sub-blocks [53]. In the multi-head self-attention sub-block, the attention function has three input arguments, namely, queries $\mathbf{Q} \in \mathbb{R}^{M \times d_k}$, keys $\mathbf{K} \in \mathbb{R}^{M \times d_k}$, and values $\mathbf{V} \in \mathbb{R}^{M \times d_v}$, where $M$ is the number of power cycles, $d_k$ is the dimension of $\mathbf{Q}$ and $\mathbf{K}$, and $d_v$ is the dimension of $\mathbf{V}$ [45]. The output, $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \in \mathbb{R}^{M \times d_v}$, is obtained as
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = f_{\text{softmax}}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right) \mathbf{V}, \quad (4)$$
where $T$ denotes the transpose. In (4), the self-attention mechanism uses a scaled dot-product function to compute the relationship between each query and key, divides each relationship by $\sqrt{d_k}$, and adopts a softmax function to obtain the weighted sum of the values [45].
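For concreteness, the scaled dot-product attention in (4) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation; the shapes (M = 60 power cycles, d_k = d_v = 128 data points per cycle) are assumptions taken from the PRPD dimensions discussed above.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in (4).

    Q, K: (M, d_k) query and key matrices; V: (M, d_v) value matrix.
    Returns an (M, d_v) matrix of attention-weighted value sums.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (M, M) compatibility scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (M, d_v)

# Assumed shapes: M = 60 power cycles, 128 points per cycle.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 128))
out = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V = X
print(out.shape)  # (60, 128)
```

Setting Q = K = V = X in the call above is exactly what makes this *self*-attention: the queries are captured from the input itself.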
To improve the computational effectiveness and take advantage of parallel computation, the multi-head attention is implemented by applying the attention $h$ times on projected $(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ matrices [45]. The multi-head attention is calculated as
$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\mathbf{H}_1, \ldots, \mathbf{H}_h)\,\mathbf{W}^O, \quad (5)$$
where $\text{Concat}(\cdot)$ denotes the merging of the matrices $\{\mathbf{H}_1, \ldots, \mathbf{H}_h\}$ and $\mathbf{W}^O \in \mathbb{R}^{d_k \times d_v}$ is the weight matrix for the multi-head attention. In (5), each head is computed on a learned projection of the inputs as
$$\mathbf{H}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V), \quad (6)$$
where $\mathbf{W}_i^Q$, $\mathbf{W}_i^K$, and $\mathbf{W}_i^V$ are learnable projection matrices [45]. In the feed-forward network sub-block, a linear transformation with a rectified linear unit (ReLU) activation function [54], defined as $f_{\text{ReLU}}(u) = \max(0, u)$, where $u$ is the argument of the function, is applied, and a residual connection is used to retain the low-layer information. Denoting the sub-block input as $\mathbf{A}$, the output of this sub-block is
$$\text{FFN}(\mathbf{A}) = f_{\text{ReLU}}(\mathbf{A}\mathbf{W}_1 + \mathbf{B}_1)\,\mathbf{W}_2 + \mathbf{B}_2, \quad (7)$$
where $\mathbf{W}_1 \in \mathbb{R}^{d_v \times d_1}$, $\mathbf{B}_1 \in \mathbb{R}^{M \times d_1}$, $\mathbf{W}_2 \in \mathbb{R}^{d_1 \times d_2}$, and $\mathbf{B}_2 \in \mathbb{R}^{M \times d_2}$ are learnable parameters, and $d_1$ and $d_2$ are the numbers of neuron units in the first and output layers of the feed-forward network sub-block, respectively. Thus, the output of an SAB, which comprises the multi-head self-attention and feed-forward network sub-blocks, is given as
$$\mathbf{F} = \text{SAB}(\mathbf{X}). \quad (8)$$
To capture different types of features [55], the self-attention network with multiple SABs using $\mathbf{F}$ in (8) is expressed as
$$\mathbf{F}^{(j)} = \text{SAB}(\mathbf{F}^{(j-1)}), \quad j = 2, \ldots, b, \quad (9)$$
where $b$ is the number of SABs and the first SAB output is defined as $\mathbf{F}^{(1)} = \mathbf{F}$. The classification block is applied to detect the faults in the GIS, as shown in Figure 9. A pooling layer reduces the number of network parameters and helps avoid overfitting. As the output of the SABs has a size of $M \times d_2$, maximum pooling is performed along each column (i.e., taking the maximum over the elements in the same column), and the pooling-layer output is a $1 \times d_2$ vector given as
$$p_i = \max_{j = 1, \ldots, M} f^{(b)}_{j,i}, \quad i = 1, \ldots, d_2, \quad (10)$$
where $f^{(b)}_{j,i}$ is the $(j, i)$th element of matrix $\mathbf{F}^{(b)}$. The output of the maximum pooling is then fed into a single-layer neural network, in which a linear transformation with the ReLU activation function creates a fault-representation vector with $d_3$ dimensions, defined as
$$\mathbf{r} = f_{\text{ReLU}}(\mathbf{p}\mathbf{W}_3 + \mathbf{b}_3), \quad (11)$$
where $\mathbf{W}_3 \in \mathbb{R}^{d_2 \times d_3}$ is the weight matrix, $\mathbf{b}_3 \in \mathbb{R}^{1 \times d_3}$ is the bias vector, and $d_3$ is the number of neuron units in the layer. Finally, the characteristic representation vector is passed through a softmax layer as [56]
$$\mathbf{z} = f_{\text{softmax}}(\mathbf{r}\mathbf{W}_4 + \mathbf{b}_4), \quad (12)$$
where $\mathbf{W}_4 \in \mathbb{R}^{d_3 \times d_c}$ is the weight matrix, $\mathbf{b}_4 \in \mathbb{R}^{1 \times d_c}$ is the bias vector, $d_c$ is the dimension of the $C$ classes, and $z_i$ is the predicted score of the $i$th fault category ($i \in \{1, \ldots, C\}$).
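To make the data flow of (5)-(12) concrete, the following is a minimal Keras sketch of a single SAB followed by the classification block. It is an illustrative sketch under stated assumptions, not the authors' released implementation: the head count h, the feed-forward width d_1, the dense width d_3, and the class count C = 5 (four fault types plus noise) are placeholders rather than the tuned values in Tables 2 and 3.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

M, N = 60, 128                  # power cycles per sample, points per cycle
h, d_1, d_3, C = 4, 256, 64, 5  # assumed: heads, FFN width, dense units, classes

inputs = layers.Input(shape=(M, N))

# Multi-head self-attention sub-block with a residual connection, as in (5)-(6).
attn = layers.MultiHeadAttention(num_heads=h, key_dim=N // h)(inputs, inputs)
x = layers.Add()([inputs, attn])

# Feed-forward sub-block (ReLU then linear) with a residual connection, as in (7).
ffn = layers.Dense(d_1, activation="relu")(x)
ffn = layers.Dense(N)(ffn)      # project back to d_2 = N so the residual add works
x = layers.Add()([x, ffn])

# Classification block: column-wise max pooling (10), ReLU layer (11), softmax (12).
x = layers.GlobalMaxPooling1D()(x)          # (M, N) -> (N,) per sample
x = layers.Dense(d_3, activation="relu")(x)
outputs = layers.Dense(C, activation="softmax")(x)

model = Model(inputs, outputs)
model.summary()
```

Stacking b SABs as in (9) simply repeats the two sub-blocks b times before the pooling layer.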

Proposed LSANPD
In LSANPD, we propose a combination of LSTM and self-attention using stacked LSTM layers and the multi-head self-attention sub-block, as shown in Figure 7b. The input sequence $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_M\}$ in (1) is fed into the LSTM block, and $\{\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_M\}$ is obtained using the LSTM mechanism, as shown in Figure 10. All output vectors are then stacked to form the matrix $\tilde{\mathbf{X}} = [\tilde{\mathbf{x}}_1^T, \tilde{\mathbf{x}}_2^T, \ldots, \tilde{\mathbf{x}}_M^T]$ as the SAB input. The subsequent steps are performed in the SAB and classification block, as in the SANPD model.
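A minimal sketch of this LSTM front end is shown below, assuming the same sample shape as before; the number of stacked layers and the LSTM width are illustrative assumptions, not the paper's tuned values.

```python
import tensorflow as tf
from tensorflow.keras import layers

M, N = 60, 128   # power cycles per sample, data points per cycle
units = 128      # assumed LSTM width

inputs = layers.Input(shape=(M, N))

# Stacked LSTM layers; return_sequences=True emits one hidden vector per
# power cycle, so the SAB still receives a sequence X~ of M row vectors.
x = layers.LSTM(units, return_sequences=True)(inputs)
x = layers.LSTM(units, return_sequences=True)(x)

# X~ then passes through the same SAB(s) and classification block as in
# the SANPD sketch above.
attn = layers.MultiHeadAttention(num_heads=4, key_dim=units // 4)(x, x)
x = layers.Add()([x, attn])
```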

Training
The SANPD and LSANPD parameters are learned using training dataset $\mathcal{L}$ to minimize the loss function in the classification block. The parameter set includes the hyperparameters, weight parameters, and bias parameters. In the proposed SANPD and LSANPD, the cross-entropy loss function is used. Thus, the loss function for the $l$th training sample is formulated as
$$\text{Loss}^{(l)} = -\sum_{i=1}^{C} v_i^{(l)} \log z_i^{(l)}, \quad (13)$$
where $\mathbf{v}^{(l)} = [v_1, \ldots, v_C]^T$ is the label corresponding to the $l$th training sample, in which $v_i = 1$ if the true classification is fault type $i$ and $v_c = 0$ for $c \neq i$, and $z_i^{(l)}$ is the predicted score of a type-$i$ fault for the $l$th training sample. The goal of the training process is to find the parameters that minimize the cost function over the entire training dataset:
$$\Theta^* = \arg\min_{\Theta} \frac{1}{|\mathcal{L}|} \sum_{l \in \mathcal{L}} \text{Loss}^{(l)}, \quad (14)$$
where $\Theta$ represents all learnable parameters and $|\cdot|$ is the number of elements in a set.
To minimize the loss function, many variants of the gradient-descent method have been studied in the literature, such as AdaGrad, AdaDelta, Adam, and the incorporation of Nesterov momentum into the Adam optimizer [57-60]. These optimizers adaptively change the learning rate to properly minimize the loss function. We select Nesterov momentum in the Adam optimizer (Nadam) in our experiments.
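In Keras, the training setup of (13)-(14) with Nadam reduces to a few lines; the sketch below uses a trivial stand-in model and random one-hot-labeled data purely to make the calls concrete (the batch size of 512 and learning rate of 0.001 follow the settings reported in Section 4, while the epoch count here is an assumption).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in model and random data; in practice, the model is the SANPD or
# LSANPD network of Section 3 and the arrays are the augmented PRPD samples.
M, N, C = 60, 128, 5
model = tf.keras.Sequential([
    layers.Input(shape=(M, N)),
    layers.GlobalMaxPooling1D(),
    layers.Dense(C, activation="softmax"),
])
x_train = np.random.rand(512, M, N).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, C, 512), C)

# Nadam = Adam with Nesterov momentum; loss is the cross-entropy of (13).
model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=512, epochs=2, verbose=0)
```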

Performance Evaluation
This section presents a performance analysis of the proposed models using PRPD data from PD experiments and noise measurements. The performance of the proposed models is compared to that of the recently developed RNN model [46], which has achieved significant performance improvement over existing machine-learning methods and other techniques. For comparison purposes, we consider this model as the baseline system.
For the PRPD experiments, four types of faults, namely, corona, floating, particle, and void faults, are considered. Table 1 shows the number of experiments for each fault type and for noise, where one experiment for the PRPD and noise measurements includes 3600 power cycles and the total number of experiments is K = 735. Figure 11 shows the data-augmentation process. The data used in this study comprise K = 735 experiments of PRPD faults, each performed over 3600 power cycles. To overcome overfitting during the training process, we applied a data-augmentation technique to increase the number of training samples [61]. We divided every experiment into 60 equal groups, each containing M = 60 power cycles (3600 in total). Afterward, the total number of data samples became KM = 44,100.
We split the dataset into three parts: training, validation, and test sets. We used 81%, 9%, and 10% of the data for these three sets, respectively. Thus, we have a total of 35,721 training, 3969 validation, and 4410 test samples.
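As a sanity check on this arithmetic, the sketch below reproduces the segmentation and the 81/9/10% split in NumPy; the zero-filled array is a placeholder for real PRPD measurements, and shuffling before the split is omitted for brevity.

```python
import numpy as np

# K = 735 experiments x 3600 power cycles each, segmented into groups of
# M = 60 cycles with N = 128 data points per cycle.
K, cycles, N, M = 735, 3600, 128, 60

data = np.zeros((K, cycles, N), dtype=np.float32)   # placeholder PRPDs
samples = data.reshape(K * (cycles // M), M, N)     # (44100, 60, 128)
print(samples.shape[0])                             # 44100 = K * 60

# 81 / 9 / 10 % split into training, validation, and test sets.
n = samples.shape[0]
n_train, n_val = int(0.81 * n), int(0.09 * n)
train, val, test = np.split(samples, [n_train, n_train + n_val])
print(len(train), len(val), len(test))              # 35721, 3969, 4410
```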
Multiple experiments with different numbers of attention heads and SABs were conducted using the validation data. Extensive experiments were also conducted to tune the remaining hyperparameters, such as the batch size, number of epochs, and learning rate. The model parameters for the proposed architectures are listed in Tables 2 and 3, where the output shapes, activation functions, and numbers of trainable parameters are presented. During our experiments, we performed 20 trials to mitigate the effects of the random initialization of the neural network, and the results were averaged to confirm the robustness of the proposed models. During the training process, the optimization step was carried out in mini-batches of 512 samples, and the learning rate was 0.001. The models were implemented using Keras [62] with TensorFlow [63], Google's deep-learning framework.

Table 4 compares the SANPD, LSANPD, and LSTM RNN [46] methods. The overall accuracy and per-fault classification accuracy of SANPD and LSANPD exceed those of the RNN because the SABs can capture the relevance among the PRPD phases by considering their entire interaction sequence input regardless of distance. Moreover, LSANPD improves on the LSTM RNN owing to the self-attention mechanism, which assists the LSTM in simultaneously computing and focusing on the important information of the sequence input; note that the accuracy of the LSTM RNN is already high compared with traditional methods such as the linear SVM and the artificial neural network (Table 2 of [46]). In addition, the proposed SANPD and LSANPD achieve comparable performance to each other. The per-fault accuracy is also presented in Table 4: most models performed efficiently for the corona, void, and noise data, whereas the floating and particle faults were more difficult to classify because the amount of data for these faults was smaller than that for the other faults, as shown in Table 1.

Table 5 compares the models in terms of the number of parameters and the training and test times. SANPD has the lowest number of parameters, approximately 90,000, whereas the LSTM RNN has approximately 264,000 parameters, which is 2.9 times larger. LSANPD has approximately 222,000 parameters, which is 1.2 times fewer than the LSTM RNN. The training and test times of SANPD are shorter than those of the RNN by factors of approximately 1.8 and 1.03, respectively, and shorter than those of LSANPD by factors of 2.3 and 1.3, respectively. In terms of complexity, SANPD is better than LSANPD and the LSTM RNN because its self-attention mechanism has a parallel architecture without recurrent or convolution modules over the PRPD phases, whereas LSANPD includes an LSTM structure, which increases the model complexity. In terms of accuracy, the proposed SANPD and LSANPD are superior to the existing LSTM RNN model [46]. Although SANPD exhibits slightly lower accuracy than LSANPD (by approximately 0.2 percentage points), it is significantly better than both LSANPD and the LSTM RNN in terms of complexity.

Conclusions
Deep learning is a fast-evolving technique with implications for many different applications. However, the performance of existing deep-learning approaches on sequential PRPD data is limited because the models are not capable of simultaneous computation or of learning the relevance among inputs. To address these problems, a state-of-the-art self-attention-based neural-network technique was investigated to classify faults in the GIS. In the proposed model, multi-head self-attention is implemented to learn the interactions of the PD signals by focusing on the important information of the PRPD sequence input and to improve computation and performance through parallelism. The experimental results show that SANPD outperforms the previous LSTM RNN model in terms of accuracy and complexity. SANPD has slightly lower accuracy than LSANPD; however, it reduces the complexity compared with LSANPD because it takes full advantage of parallel computation. Therefore, the proposed method can be successfully applied to fault diagnosis in GISs.
For future work, we will install the PD diagnosis systems in power grids. The proposed scheme will be further verified to improve its robustness under various noise conditions (depending on time, location, GIS, and antenna design) and to detect various faults using experimental data, including corona, floating, particle, and void discharges at three or four different voltage levels.

Conflicts of Interest:
The authors declare no conflict of interest.