Recurrent Neural Network for Partial Discharge Diagnosis in Gas-Insulated Switchgear

The analysis of partial discharge (PD) signals has been identified as a standard diagnostic tool for monitoring the condition of different electrical apparatuses. This study proposes an approach to detecting PD patterns in gas-insulated switchgear (GIS) using a long short-term memory (LSTM) recurrent neural network (RNN). The proposed method uses phase-resolved PD (PRPD) signals as input, extracts low-level features, and finally, classifies faults in GIS. In the proposed method, LSTM networks can learn temporal dependencies directly from PRPD signals. Most existing models use support vector machines (SVMs) and mainly focus on improving feature representation and extraction manually to analyze PRPD signals. However, the proposed model captures important temporal features with the help of its low-level feature extraction capability from raw inputs. It outperforms conventional SVMs and achieves 96.74% classification accuracy for PRPDs in GIS.


Introduction
Power systems are expanding rapidly because of increasing power demands, and the reliability of the power grid is critical to stable power system operation. The gas-insulated switchgear (GIS), which is applied to a substation, is the main protection device for electric power facilities. It protects the electric power system not only by opening and closing normally, but also by quickly shutting off excessive current in case of a fault. If a GIS fails and an overcurrent occurs, the effects are large in scale, recovery from the accident takes a long time, and the resulting power outage is prolonged. Various abnormalities that cause dielectric breakdown of GISs also cause partial discharge before dielectric breakdown. Therefore, detecting partial discharges (PDs) in GISs to avoid failures and ensure high reliability and safety is crucial [1][2][3][4][5][6].
Various electrical, mechanical, and chemical methods have been used to detect PDs in GISs [7,8]. Some existing electrical methods use ultra-high frequency (UHF) sensors [9][10][11][12][13], while sound measurement methods use acoustic sensors [14,15] and chemical methods use dissolved gas analysis [16,17]. In this study, an electrical method that employs a UHF sensor is used for the PD measurement system. Time-resolved PD (TRPD) and phase-resolved PD (PRPD) analyses have been studied in order to examine the characteristics of PDs in GIS [18][19][20][21][22][23][24][25][26][27]. The TRPD-based method analyzes the time-domain features of PD pulses, their frequency-domain features, or both [19][20][21]. The PRPD-based diagnostic method analyzes phase-amplitude-number (φ-q-n) measurements, where φ is the phase angle, q is the amplitude, and n is the number of discharges [26]. It identifies the fault type by analyzing the number of PD pulses, the maximum amplitude, or the average amplitude in each phase [19][20][21]25,27,28]. From these features, fault types are classified by many methods, including a knowledge-based fuzzy logic analysis [26] and machine learning techniques such as K-means cluster analysis [23,29], artificial neural networks (ANNs) [19,27,30,31], or support vector machines (SVMs) [19][20][21]25,32,33]. Between the ANN and fuzzy logic methods, ANNs provide higher accuracy in classifying fault types and fault severities [26]. However, existing methods have studied either feature extraction or classification for PD diagnosis. To improve the accuracy of fault diagnosis, it is necessary to consider feature extraction from raw data and classification simultaneously.
Energies 2018, 11, 1202
In this study, a data-based approach to PRPD diagnosis that combines automatic feature extraction and PRPD classification is proposed. The proposed method is based on a recurrent neural network (RNN) chosen from a variety of deep neural networks that have recently achieved state-of-the-art performance in a range of pattern recognition tasks [34]. The RNN model has been actively applied to various fields, such as language modeling [35], speech recognition [36], and machine translation [37]. In contrast to traditional neural network structures, such as a fully-connected neural network or a convolutional neural network, an RNN model applies a recurrence formula at every time step in order to take sequential information into account [35]. This makes it a candidate for modeling PRPD patterns. The long short-term memory (LSTM) model, which is one of the most widely used RNN models, avoids the long-term dependency problem caused by the vanishing gradient in gradient-based learning methods by using four gates to adjust the flow of information [38]. We use experimentally measured PRPDs as training data to determine appropriate parameters for the LSTM model. The trained network is analyzed using t-distributed stochastic neighbor embedding (t-SNE), a technique for visualizing high-dimensional datasets [39]. The following contributions are made in this study:

• For the first time, an RNN structure is applied to classify PRPDs in a GIS. The proposed LSTM RNN model can learn features from PRPDs without manual feature extraction.

• To obtain training and test data for the proposed LSTM RNN model, we conduct PRPD and noise experiments for a GIS. We collect extensive data with respect to various fault types and noise for a GIS.

• The performance of the proposed LSTM RNN model is verified against conventional ANNs and SVMs. The proposed method yields highly accurate results even for PRPD data observed over a very short time. Therefore, it considerably reduces the number of PRPDs needed for PD classification, and thus the amount of data required for fault diagnosis.
The rest of the paper is organized as follows: we discuss PRPDs and noise experiments for a GIS in Section 2, Section 3 presents the proposed LSTM RNN model, a performance evaluation is presented in Section 4, and Section 5 concludes the study while also discussing future research topics.

Experiments in the GIS
In this section, we present our experimental setup and the results obtained from PRPD measurements after modeling four types of artificial defects, namely protruding electrodes, floating electrodes, free particles, and void defects. In addition, we conducted artificial noise measurements to obtain noise data.

PRPDs in the GIS
In the measurement system, artificial cells for the modeling of PDs and an external UHF sensor were installed in the 345 kV GIS chamber. Figure 1 shows the measurement system for conducting PRPD experiments in the GIS. A high voltage was applied to the AC voltage tester to generate the GIS PD signal in one of our experiments. The cavity-backed patch antenna for the external UHF sensor and an amplifier with a gain of 45 dB and a signal bandwidth ranging from 500 MHz to 1.5 GHz were used for PD detection. Figure 2 shows the reflection coefficient of the external UHF sensor, measured using an E5071C network analyzer. The measured reflection coefficient was less than −6 dB in the target frequency range from 500 MHz to 1.5 GHz, which allowed the external UHF sensor to operate with favorable impedance matching.
Figure 3 displays the artificial cells that model four types of faults (corona, floating, particle, and void PDs) to simulate possible defects in a GIS. As shown in Figure 3a, the artificial cell for modeling the corona simulated the protrusion of an electrode through a needle with a tip radius of 10 µm and a diameter of 1 mm (Ogura), while the distance between the needle and the ground electrode was 10 mm, and the test voltage was 11 kV. To simulate an unconnected cell, the cell of a floating electrode was fabricated with 10 mm between the high-voltage (HV) and middle electrodes, and 1 mm between the middle and ground electrodes, as illustrated in Figure 3b, where the test voltage was 10 kV. To simulate the free particle discharge, a small sphere with a diameter of 1.0 mm was placed on a concave ground electrode, and the HV electrode was a 45-mm-diameter sphere fixed at 10 mm from the ground electrode, where the test voltage was 10 kV, as represented in Figure 3c. For the artificial void defect, there was a small gap between the epoxy disc and the upper electrode, as shown in Figure 3d, where the test voltage was 8 kV. All artificial cells were filled with sulfur hexafluoride (SF6) gas at 0.2 MPa.
Figure 4 presents the PRPDs, with 60 power cycles in the four artificial cells, recorded through the UHF sensor. In the PRPDs recorded by the UHF sensor, the 360° power cycle was divided into small phase windows. Some corona PDs were observed in the positive half-cycle band (from 0° to 180°), but they were more densely distributed in the negative half-cycle band (from 180° to 360°), as shown in Figure 4a. The floating PDs were clearly observed in the positive and negative half cycles, as depicted in Figure 4b. The particle PDs were distributed across all bands, as depicted in Figure 4c. Figure 4d shows that the void PDs were observed around 90° and 270°.
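The division of the 360° power cycle into phase windows can be illustrated with a short sketch. It is not the authors' acquisition code: the function names, the choice of 128 uniform windows (matching the N = 128 data points per power cycle used for the model input), and the rule of keeping the largest amplitude per window are all assumptions made for illustration.

```python
def phase_to_window(phase_deg, n_windows=128):
    """Map a phase angle in degrees to a phase-window index in [0, n_windows)."""
    index = int((phase_deg % 360.0) / 360.0 * n_windows)
    # Clamp to guard against floating-point round-up at the 360-degree boundary.
    return min(index, n_windows - 1)


def bin_power_cycle(pulses, n_windows=128):
    """Build one PRPD row from (phase_deg, amplitude) pulses of one power cycle.

    Assumption: each window keeps the largest amplitude observed in it,
    and windows without a pulse stay at 0.
    """
    row = [0.0] * n_windows
    for phase_deg, amplitude in pulses:
        k = phase_to_window(phase_deg, n_windows)
        row[k] = max(row[k], amplitude)
    return row


# Hypothetical pulses clustered near 90 and 270 degrees (void-like pattern).
pulses = [(88.0, 0.7), (91.0, 0.9), (269.5, 0.4)]
row = bin_power_cycle(pulses)
```

Stacking one such row per power cycle gives the phase-versus-cycle image that the figures of this section display.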

Noise Measurement
Noise measurements were performed by generating artificial noises that might occur in power grids. In the noise measurements, an air purifier (Samsung AC-121B) was used as a noise source and noise signals were obtained using the external UHF sensor. One example of noise signals is shown in Figure 5. Here, noise signals exist in all ranges of phases and power cycles, and the amplitudes of noise signals are smaller than those of PRPDs in the GIS.

Neural Network Model for Diagnosing PRPDs
In this section, we propose an LSTM RNN model to detect PRPDs in the GIS. The first task was to generate an appropriate input vector from the PRPD measurements. The PRPD signal at the m-th power cycle was defined as:

$$x_m = [x_m(1), x_m(2), \ldots, x_m(N)]^T, \tag{1}$$

where m = 1, ..., M and N = 128 was the number of data points in a power cycle.
Figure 6 shows the architecture of the proposed RNN model, which was composed of LSTM modules and an output layer for classification. With the standard RNN structure, gradient descent struggles to minimize the cost function because of the vanishing gradient: the contribution of long-term dependencies shrinks exponentially along the sequence and therefore has less impact on the gradient than that of short-term dependencies. Among the many LSTM-based structures, the proposed RNN is a many-to-one model, which reads the whole sequence of M power cycles, produces a single classification, and learns its feature representation directly from the data.
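The structure of the LSTM module is shown in Figure 7. The inputs to the m-th LSTM module in layer l consisted of $h^{l-1}_m$, $h^{l}_{m-1}$, and $c^{l}_{m-1}$, where $h^{l-1}_m$ was the output of the m-th LSTM module in the previous layer l − 1, and $h^{l}_{m-1}$ and $c^{l}_{m-1}$ were the outputs of the (m − 1)-th LSTM module in the current layer l. The equations below describe the internal structure of the cell at the m-th LSTM module in layer l:

$$f^{l}_m = \mathrm{sigm}\left(W^{l}_f \left[h^{l-1}_m, h^{l}_{m-1}\right] + b^{l}_f\right), \tag{2}$$
$$i^{l}_m = \mathrm{sigm}\left(W^{l}_i \left[h^{l-1}_m, h^{l}_{m-1}\right] + b^{l}_i\right), \tag{3}$$
$$g^{l}_m = \tanh\left(W^{l}_g \left[h^{l-1}_m, h^{l}_{m-1}\right] + b^{l}_g\right), \tag{4}$$
$$o^{l}_m = \mathrm{sigm}\left(W^{l}_o \left[h^{l-1}_m, h^{l}_{m-1}\right] + b^{l}_o\right), \tag{5}$$
$$c^{l}_m = f^{l}_m \odot c^{l}_{m-1} + i^{l}_m \odot g^{l}_m, \tag{6}$$
$$h^{l}_m = o^{l}_m \odot \tanh\left(c^{l}_m\right), \tag{7}$$

where $W^{l}_{\{f,i,g,o\}}$ are N × 2N weight matrices acting on the 2N-dimensional concatenated input, $b^{l}_{\{f,i,g,o\}}$ are N × 1 bias vectors, $\odot$ denotes element-wise multiplication, sigm(·) is a sigmoid activation function, tanh(·) is a hyperbolic tangent activation function, and $[h^{l-1}_m, h^{l}_{m-1}]$ denotes a concatenation. For the first layer, the inputs of the LSTM blocks were the PRPD vectors, $x_m = h^{0}_m$, where m = 1, ..., M. The LSTM model avoids long-term dependency obstacles by using the four hidden layers $f^{l}_m$, $i^{l}_m$, $g^{l}_m$, and $o^{l}_m$ as four gates to adjust the flow of information [38], where each gate controls the information flow in the cell state $c^{l}_m$.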
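The gate computations described above can be traced in a few lines of code. The following is a minimal sketch, not the authors' implementation (which used TensorFlow and Keras): it uses only the Python standard library, toy sizes instead of N = 128 and M = 60, random weights, zero initial states, and weight matrices of shape N × 2N so that they can multiply the concatenated vector. All function and variable names are hypothetical.

```python
import math
import random


def sigm(v):
    """Element-wise sigmoid of a list."""
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh_vec(v):
    """Element-wise hyperbolic tangent of a list."""
    return [math.tanh(x) for x in v]


def affine(W, b, v):
    """Return W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(params, h_below, h_prev, c_prev):
    """One LSTM module: gates f, i, g, o, then the new cell and hidden states."""
    u = h_below + h_prev  # list concatenation of the two input vectors
    f = sigm(affine(params["W_f"], params["b_f"], u))
    i = sigm(affine(params["W_i"], params["b_i"], u))
    g = tanh_vec(affine(params["W_g"], params["b_g"], u))
    o = sigm(affine(params["W_o"], params["b_o"], u))
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def init_params(n, rng):
    """Random n x 2n weights and zero biases (illustrative initialization)."""
    params = {}
    for gate in "figo":
        params["W_" + gate] = [
            [rng.uniform(-0.1, 0.1) for _ in range(2 * n)] for _ in range(n)
        ]
        params["b_" + gate] = [0.0] * n
    return params


rng = random.Random(0)
n, M = 4, 5  # toy sizes for illustration only
params = init_params(n, rng)
xs = [[rng.random() for _ in range(n)] for _ in range(M)]

# Many-to-one pass over one layer: only the final hidden state is kept.
h, c = [0.0] * n, [0.0] * n  # assumed zero initial states
for x_m in xs:
    h, c = lstm_step(params, x_m, h, c)
```

Stacking a second layer would feed each hidden state of this layer as `h_below` to a second set of parameters.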
In the proposed model, the output y for K classes is related to the K by 1 vector z, which the output layer produces from the last LSTM layer, as follows:

$$y = \sigma(z), \tag{8}$$

where σ(·) is a softmax function. In Equation (8), the j-th element of y represents the likelihood that the fault is recognized as the j-th category in K classes and is defined as:

$$y_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}. \tag{9}$$

The parameters of the proposed LSTM RNN model were learned through the training data set G to minimize the following cost function:

$$\mathrm{Cost} = \frac{1}{|G|} \sum_{g=1}^{|G|} \mathrm{Loss}(g), \tag{10}$$

where |G| is the number of elements in the set and Loss(g) is the loss value of the g-th training data. In Equation (10), Loss(g) measures how accurately the proposed LSTM RNN model predicts the label $c^{(g)} = [c_1, \ldots, c_K]^T$ that corresponds to the training data, where $c_j = 1$ and $c_k = 0$ for k ≠ j if the target classification is a fault type j. Among the many choices of loss functions, we used cross-entropy, which is expressed by:

$$\mathrm{Loss}(g) = -\log y^{(g)}, \tag{11}$$

where $y^{(g)} = y_j$ from Equation (9) if the target classification of the g-th training data is a type j fault.
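The softmax output and the cross-entropy cost can be checked numerically with a small sketch. The scores below are hypothetical, and K = 5 assumes the four fault types plus noise; the functions use only the standard library and are illustrative rather than the authors' code.

```python
import math


def softmax(z):
    """Softmax of a list of scores, shifted by the maximum for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y, target):
    """Cross-entropy loss for one sample: minus the log-likelihood of the true class."""
    return -math.log(y[target])


def cost(batch_scores, batch_targets):
    """Average loss over a set of samples."""
    losses = [
        cross_entropy(softmax(z), t) for z, t in zip(batch_scores, batch_targets)
    ]
    return sum(losses) / len(losses)


z = [2.0, 0.5, -1.0, 0.0, 0.3]  # hypothetical scores for K = 5 classes
y = softmax(z)                  # class likelihoods, summing to 1
loss = cross_entropy(y, 0)      # loss if the true class is index 0
```

Because the label vector is one-hot, the general cross-entropy sum reduces to the single term for the true class, which is why `cross_entropy` only reads `y[target]`.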
To minimize the loss function, many variants of the gradient descent method have been examined in previous studies. These include the AdaGrad, AdaDelta, and Adam optimizers [40][41][42]. These optimizers adaptively change the learning rate during training. In this study, the Adam optimization algorithm was applied with a learning rate of 0.001 to train our proposed LSTM RNN model [42]. Adam was chosen because it requires only first-order gradients, which keeps the computational complexity low.
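For readers unfamiliar with Adam, a single update step can be sketched as follows. Only the learning rate of 0.001 comes from the text; the values of `beta1`, `beta2`, and `eps` are the defaults suggested in the Adam paper [42] and are assumed here, and the quadratic objective is a toy stand-in for the network loss.

```python
import math


def init_state(n):
    """Step counter plus first- and second-moment estimates for n parameters."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def adam_step(theta, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated parameters after one Adam step (state is updated in place)."""
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for k, (p, g) in enumerate(zip(theta, grad)):
        # Exponential moving averages of the gradient and the squared gradient.
        state["m"][k] = beta1 * state["m"][k] + (1.0 - beta1) * g
        state["v"][k] = beta2 * state["v"][k] + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = state["m"][k] / (1.0 - beta1 ** t)
        v_hat = state["v"][k] / (1.0 - beta2 ** t)
        new_theta.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_theta


theta = [1.0, -2.0]
state = init_state(len(theta))
grad = [2.0 * p for p in theta]  # gradient of the toy objective sum(p ** 2)
theta = adam_step(theta, grad, state)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly the learning rate regardless of the gradient magnitude.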

Performance Evaluation
In this section, we discuss the performance evaluation results of the proposed RNN model using PRPDs in the GIS. We conducted artificial noise measurements and PD experiments for four types of faults, namely corona, floating, particle, and void PDs. The numbers of experiments for each fault type are given in Table 1, where the PRPD signals or noise signals with 3600 power cycles were obtained in one experiment. We divided the dataset into three parts: training, validation, and test sets. For these three sets, we used 80%, 10%, and 10% of the data, respectively. During the training process, the optimization step was carried out in small batches of 512 samples. To prevent overfitting, we applied an early stopping technique so that the training process stopped when the validation accuracy had not improved for 10 consecutive epochs. The model was implemented using TensorFlow [43] and Keras [44].
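The early stopping rule with a patience of 10 epochs can be sketched as a plain loop. This is an illustration rather than the training script used in the study: `val_accuracy_at` is a hypothetical callback standing in for "train one epoch, then evaluate on the validation set", and the synthetic accuracy curve is invented so that it peaks at epoch 45, mirroring the epoch numbers reported for the trained model.

```python
def train_with_early_stopping(val_accuracy_at, max_epochs=200, patience=10):
    """Return (best_epoch, best_accuracy, last_epoch_run)."""
    best_epoch, best_acc = 0, float("-inf")
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        acc = val_accuracy_at(epoch)
        if acc > best_acc:
            # A real run would record the model parameters here.
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop.
            break
    return best_epoch, best_acc, epoch


def synthetic_curve(epoch):
    """Invented validation accuracy that rises until epoch 45 and then declines."""
    return 0.9662 - 0.0001 * abs(epoch - 45)


best_epoch, best_acc, last_epoch = train_with_early_stopping(synthetic_curve)
```

With this curve the loop keeps the parameters from epoch 45 and stops after epoch 55, that is, after ten further epochs without improvement.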
Without sufficient training samples, a deep learning model easily runs into an overfitting problem [45]. In deep learning models, data augmentation is frequently used to increase the number of training samples in order to enhance the generalization performance [45]. Here, slicing the experimental data with overlap was used to achieve high classification precision in fault diagnosis. This process is shown in Figure 8. For example, a single PRPD experiment with 3600 cycles can provide the proposed RNN with 3600 − M + 1 training samples, each with a length of M, when the shift size is 1.
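Slicing with overlap amounts to a sliding window over the sequence of power cycles. The sketch below is illustrative: the function name is hypothetical, integers stand in for the N-point PRPD vectors, and the window length of 60 is the value of M selected later in this section.

```python
def sliding_windows(cycles, window, shift=1):
    """Slice a sequence of power cycles into overlapping windows of fixed length."""
    last_start = len(cycles) - window
    return [cycles[s:s + window] for s in range(0, last_start + 1, shift)]


# Stand-in data: each integer represents one PRPD vector of a 3600-cycle experiment.
cycles = list(range(3600))
samples = sliding_windows(cycles, window=60, shift=1)
n_samples = len(samples)  # 3600 - 60 + 1
```

With a shift of 1 and a window of 60 cycles, one experiment therefore yields 3541 samples, whereas non-overlapping slicing (a shift of 60) would yield only 60.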

Multiple experiments with different parameters were conducted using the validation data to tune our model. Figure 9 shows the training and validation accuracies of the proposed RNN model according to the number of power cycles M for L = 1 or L = 2 layer models. The accuracy improved as the number of power cycles M increased. This was because more information about PRPDs ($x_1, \ldots, x_M$) could be obtained as M increased. In addition, when the model was expanded to a larger scale, the accuracy increased because more parameters were introduced, thus better fitting the model to the data. We set the number of power cycles to M = 60 and the number of layers to L = 2.
For comparison, we used an ANN model and linear and nonlinear SVMs with a radial basis function (RBF) [46,47] as baseline models. The ANN model consisted of an input layer, 2 hidden layers with 256 hidden nodes at each layer, and an output layer, where the input data was M = 60 PRPDs and the cross-entropy cost function and the Adam optimizer were used. In the SVMs, the feature vector was obtained from the mean of the amplitudes and the occurrence numbers in each phase from M = 60 PRPDs [32].
Classification performance comparisons of the proposed LSTM RNN model, SVMs, and ANN are presented in Table 2. From this table, we can see that the proposed LSTM RNN model achieved the highest overall classification accuracy (96.74%) when compared to the ANN and the linear and nonlinear SVMs. Note that the ANN was superior to the SVMs and the nonlinear SVM with RBF was somewhat superior to the linear SVM. These differences arose because the proposed LSTM RNN method automatically obtained the sequential characteristics of PRPDs from the raw input, whereas the ANN used the raw input without phase information of PRPDs and the SVMs used the manually-created feature vector that combined characteristics of PRPDs. For corona faults, the performance of the proposed method was the highest at 97.04%, approximately 1.5% higher than the ANN and the SVMs. In floating fault classifications, the proposed method achieved the best result, at 79.54%. In the case of particle faults, the performance of the proposed technique was 93.18%, which was 7.84%, 27.71%, and 15.56% better than that of the ANN, the linear SVM, and the nonlinear SVM with RBF, respectively. In the case of void faults, the proposed method achieved a nearly perfect 99.94% accuracy, better than the ANN and the SVMs. For noise classification, the proposed method outperformed all other methods and achieved a 98.26% accuracy rate.
Table 3 shows training and testing timing comparisons for the ANN, the SVMs, and the proposed LSTM RNN methods, where the timing was normalized to a hypothetical 1 GHz single-core CPU to make the measurement meaningful. In our experiments, the models were trained and tested on an NVIDIA Titan X GPU with 3584 cores, each running at 1.4 GHz. It can be seen that the training and testing times of the proposed LSTM RNN model were longer than those of the ANN and the SVMs. This was because the design of the RNN required the output of the previous time step for the current time step output calculation. The test time of the proposed LSTM RNN model was longer than that of other methods, but the test time per sample of the proposed method was only 1 s·GHz.
To better understand what the model learned, we analyzed the internal representations of the trained network at the end of the two layers. Following the training procedure, the hidden state vectors of the last LSTM modules in the two layers were used to visualize the trained network. Figure 11 shows the t-SNE representation of $h^1_M$ and $h^2_M$ using 5000 inputs from the training set, where t-SNE projected 128-dimensional vectors to two-dimensional spaces while retaining their pairwise similarity [39]. Therefore, hidden state vectors $h^1_M$ and $h^2_M$ that are similar according to the network occur close together in Figure 11. The opposite does not have to be true, because large distances in Figure 11 do not necessarily imply that the hidden state vectors $h^1_M$ and $h^2_M$ are dissimilar. In the figure, we can see that the hidden state $h^2_M$ of layer 2 was much more dispersed than the hidden state $h^1_M$ of layer 1. This explains the improved accuracy based on the number of layers, as shown in Figure 9. As shown in Figure 11b, the hidden states $h^2_M$ for some data of corona, floating, particle, and void faults in a GIS were similar to those for some noise data. This was because PRPDs existed with small amplitudes in the whole phase for the power cycles M = 60, as shown in Figure 4a.

Conclusions
Deep learning is a state-of-the-art technique used in many different applications. Using this technique, we proposed a fault diagnosis method using an LSTM RNN structure, which employed a series of PRPDs in a GIS. Instead of utilizing handcrafted features to classify PRPDs in the GIS, the proposed model efficiently learns low-level features and temporal dependencies of PRPDs using training data. To adjust parameters in the proposed model, we conducted extensive PRPD experiments using artificial defects and noise in a GIS. To lower the risk of overfitting, the data sets were obtained using data augmentation for PRPDs and were divided into three sets. These three sets were used for the purposes of training, validation, and performance evaluation. The proposed model achieved a higher accuracy than the conventional ANN and SVM methods for classifying PRPDs in GIS.
The proposed method will be useful in other PRPD detection applications, such as power transformers and wall bushings. We hope this represents a major advancement for grid asset management and will contribute to stable power grid operation in the future.

Figure 1. Measurement system in the gas-insulated switchgear (GIS): (a) block diagram of the measurement system, and (b) high-voltage test site. PD: partial discharge; UHF: ultra-high frequency.

Figure 2. Measured reflection coefficient of the external UHF sensor.

Figure 5. Noise measurements from the air purifier.

Figure 7. Structure of the m-th LSTM block in layer l.

Figure 8. Data augmentation by slicing the experimental data with overlap.

Figure 9. Training and validation accuracies of the proposed RNN model based on the number of power cycles M for L = 1 or L = 2 layer models.

Figure 10 illustrates the convergence of the model over epochs with the training and validation sets. As shown in this figure, the accuracy with training data tended to improve with the epoch, whereas the accuracy with validation data diminished up to a certain epoch and then improved again. After achieving the maximum accuracy, the model first recorded the parameters and then continued the training process for an additional 10 epochs. This was part of the early stopping method for identifying another peak. After determining that the accuracy with validation data could not be further improved, the model stopped the training process to prevent overfitting the training dataset. As the figure shows, the training process finished after 55 epochs. The maximum accuracy of 96.62%, achieved at epoch 45 with the validation data, is marked in Figure 10 with an "×".
PRPDs, and the cross-entropy cost function and Adam optimization function were used. In SVMs, the feature vector was obtained from the mean of the amplitudes and occurrence numbers in each phase over M = 60 PRPDs [32] and was therefore a 2N × 1 vector. The normalized feature vectors were used to optimize and train the SVMs to classify faults in the GIS. The parameter C = 0.01 for the linear SVM and the parameters C = 0.01 and γ = 0.1 could be learned using training data,
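One possible reading of the 2N × 1 SVM feature vector is sketched below. It assumes that a PRPD record is stored as M rows (power cycles) of N phase-bin amplitudes, that a value above `threshold` counts as one discharge occurrence, and that the mean amplitude is taken over all M cycles. The function name, the `threshold` parameter, and these conventions are assumptions for illustration; the normalization step mentioned in the text is not specified there and is left out.

```python
def prpd_feature_vector(prpd, threshold=0.0):
    """Build a 2N-element feature vector from an M x N PRPD record.

    prpd: list of M rows (power cycles), each a list of N amplitudes
          (one per phase bin).
    Returns N mean amplitudes followed by N occurrence counts.
    """
    m = len(prpd)
    n = len(prpd[0])
    # Mean amplitude in each phase bin, averaged over all M cycles.
    mean_amp = [sum(row[k] for row in prpd) / m for k in range(n)]
    # Number of cycles in which a discharge was seen in each phase bin.
    counts = [float(sum(1 for row in prpd if row[k] > threshold))
              for k in range(n)]
    return mean_amp + counts


# Toy record with M = 2 cycles and N = 3 phase bins (made-up values).
toy = [[0.0, 1.0, 0.0],
       [0.0, 3.0, 2.0]]
features = prpd_feature_vector(toy)
print(len(features), features)
```

The resulting vector has length 2N regardless of M, which is why a fixed-size input can be passed to the SVM.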


when compared to the hidden state $h_M^1$ of layer 1.
Figure 11. t-distributed stochastic neighbor embedding (t-SNE) representation of 5000 training samples at: (a) the hidden state $h_M^1$ of layer 1, and (b) the hidden state $h_M^2$ of layer 2.

$f_m^l$ is the forget gate, which decides what information in the cell state is unnecessary. $i_m^l$ is the input gate, which decides which values in the cell state should be updated. $g_m^l$ is the external output gate, which is a vector of new candidate values that could be added to the state. The $f_m^l$, $i_m^l$, and $g_m^l$ gates are used to modify the cell state between time steps, as shown in Equation (2). $o_m^l$ is the output gate, which acts as a filter to decide what parts of the current cell state should go to the output, $h_m^l$. The cell state is then passed through tanh(·) and filtered by $o_m^l$ to become the hidden state $h_m^l$ of the current time step, as shown in Equation (3).
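The gate interactions described above can be illustrated with a single LSTM step in plain Python. This sketch assumes the standard LSTM formulation (sigmoid for the f, i, and o gates, tanh for the candidate vector g); the paper's exact equations are not reproduced here. The parameter layout `params[gate] = (W, U, b)` and the names `h_below` (hidden state from layer l − 1), `h_prev`, and `c_prev` (hidden and cell state from module m − 1) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, U, b, x, h):
    """Return W x + U h + b for matrices given as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]


def lstm_step(params, h_below, h_prev, c_prev):
    """One step of an LSTM module; returns (h, c) for the current module."""
    f = [sigmoid(z) for z in affine(*params["f"], h_below, h_prev)]    # forget
    i = [sigmoid(z) for z in affine(*params["i"], h_below, h_prev)]    # input
    g = [math.tanh(z) for z in affine(*params["g"], h_below, h_prev)]  # candidates
    o = [sigmoid(z) for z in affine(*params["o"], h_below, h_prev)]    # output
    # Cell state update: keep part of the old state, add gated candidates.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Hidden state: squash the cell state and filter it with the output gate.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy check with one hidden unit and all-zero weights: every sigmoid gate
# equals 0.5 and the candidate is 0, so the cell state is simply halved.
zero = ([[0.0]], [[0.0]], [0.0])
params = {"f": zero, "i": zero, "g": zero, "o": zero}
h, c = lstm_step(params, h_below=[1.0], h_prev=[0.0], c_prev=[2.0])
print(h, c)
```

Stacking layers then amounts to feeding the `h` returned by layer l − 1 at module m as `h_below` of layer l at the same module.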

Table 1. Experimental data set.


Table 3. Training and testing time comparisons.