Partial Discharge Pattern Recognition of Gas-Insulated Switchgear via a Light-Scale Convolutional Neural Network

Abstract: Partial discharge (PD) is one of the major manifestations of gas-insulated switchgear (GIS) insulation defects. Because PD accelerates equipment aging, online monitoring and fault diagnosis play a significant role in ensuring safe and reliable operation of the power system. Owing to feature engineering or vanishing gradients, however, existing pattern recognition methods for GIS PD are complex and inefficient. To improve recognition accuracy, a novel GIS PD pattern recognition method based on a light-scale convolutional neural network (LCNN) without artificial feature engineering is proposed. Firstly, GIS PD data are obtained through experiments and finite-difference time-domain simulations. Secondly, the data are enhanced by a conditional variational auto-encoder. Thirdly, the LCNN structure is applied for GIS PD pattern recognition while a deconvolution neural network is used for model visualization. The recognition accuracy of the LCNN was 98.13%. Compared with traditional machine learning and other deep convolutional neural networks, the proposed method can effectively improve recognition accuracy and shorten calculation time, thus making it much more suitable for the ubiquitous-power Internet of Things and big data.


Introduction
Gas-insulated switchgear (GIS) is widely used in power systems because of its small footprint, high reliability, low environmental impact, and virtually maintenance-free operation. Potential risks exist, however, in design, manufacturing, transportation, installation, and operation and maintenance, and these may give rise to latent GIS failures [1][2][3]. Because GIS is one of the main control and protection components of the power system, a GIS failure delivers a significant shock to the power grid, not only causing large-scale power outages and reducing power supply reliability but also resulting in massive economic losses. Therefore, effectively detecting latent GIS defects and taking necessary measures before failures occur are of great importance in ensuring safe and reliable operation of the power grid while reducing maintenance time and cost.
The advancement of the ubiquitous-power Internet of Things (IoT) provides both new opportunities and new challenges for GIS fault diagnosis [4,5]. Developing diagnosis methods that can process GIS fault signals rapidly in real time while accurately identifying fault types has become an urgent problem. According to the statistics, insulation faults are the main cause of accidents in GIS [6], and most insulation faults manifest as partial discharges (PDs), which further accelerate equipment aging. Currently, the detection of insulation

Data Enhancement with Conditional Variation Auto-Encoder
Based on the raw samples, data enhancement aims to learn the sample features and reconstruct samples with the same feature distribution by constructing a deep network. It can greatly increase the number of samples [35][36][37][38], thus improving classification performance. One well-known technique, which originates from game theory, makes a generator and a discriminator in the network gradually reach dynamic (Nash) equilibrium, so that the model learns the approximate feature distribution of the input samples. Since the acquired PD signals are stored in a unified, standardized manner, image scaling can only introduce noise interference and cannot achieve data enhancement. Likewise, because TRPD waveform images are standardized, augmentation through image rotation and transformation produces atypical samples that are difficult to classify. Because the total data set for PD pattern recognition is relatively small, in this paper a conditional variational auto-encoder (CVAE) is adopted as the data enhancement model to increase the number of training samples and improve the generalization ability of the model.
The basic idea of the CVAE algorithm is that each data point, x, is associated with a hidden (latent) variable, z. The reconstruction, x̂, is generated by a probability distribution, p_θ(x|z), which is assumed to be Gaussian. The decoder function, f_θ(z), generates the parameters of this distribution and is itself governed by a set of parameters, θ. At the same time, the encoder function, g_φ(x), generates the parameters of the approximate posterior distribution, q_φ(z|x), which is governed by the parameters, φ. During training and testing, a label factor is added so that the network learns the image distribution conditioned on it. The label factor is denoted y, and, in this sense, an image of a specified class can be generated according to the label value.
The variational derivation indicates that the conditional prior, p_θ(z|y), is used in place of the intractable posterior, and the similarity of the two distributions is measured by the Kullback-Leibler (KL) divergence. Thereby, the objective function (the confidence lower bound) for a single data point is as follows:

L(x, y; φ, θ) = −D_KL(q_φ(z|x, y) ‖ p_θ(z|y)) + E_{q_φ(z|x,y)}[log p_θ(x|z, y)]. (1)

In Equation (1), the first term, D_KL(·), can be regarded as a regularization item, while the second term, E_q(·), can be viewed as the expected auto-encoding reconstruction error. The calculation is greatly simplified by approximating E_q with the mean over S samples drawn from q_φ(z|x, y):

E_{q_φ(z|x,y)}[log p_θ(x|z, y)] ≈ (1/S) Σ_{s=1}^{S} log p_θ(x|z^(s), y),

where z^(s) is one of the S samples. Because p(z) is a Gaussian distribution, re-parameterization can be used to simplify the operation: for the sample z^(s), a small variable, ε, is drawn from the distribution N(0, 1) instead of sampling directly from the normal distribution N(µ, δ), and the sample is computed from the expectation, µ, and standard deviation, δ:

z^(s) = µ + δ · ε, ε ∼ N(0, 1).

With this, all parameters can be optimized by the stochastic gradient descent method. In this paper, the conditional variational auto-encoder is used to randomly select 20% of the data from the training set to generate new samples for model training. The CVAE improves the generalization ability of the model and the recognition and classification performance on PD patterns.
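The two quantities above, the KL regularizer for a diagonal-Gaussian posterior against a standard-normal prior and the re-parameterized sampling step, can be sketched numerically as follows. This is a minimal illustration, not the paper's implementation; the function names and the use of a log-variance parameterization are assumptions.

```python
import numpy as np

def kl_divergence(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions,
    # i.e. the regularization term of Equation (1) in closed form.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, 1): the re-parameterization trick,
    # which keeps the sampling step differentiable with respect to mu, sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

The KL term vanishes exactly when the posterior equals the prior (µ = 0, δ = 1) and is positive otherwise, which is what makes it act as a regularizer.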

Convolutional Neural Network
Developed rapidly in recent years, the CNN has come to be widely regarded as one of the most efficient pattern recognition methods. Generally, the basic structure of a CNN consists of two kinds of layers, the feature extraction layer and the feature mapping layer [39]. In the feature extraction layer, each neuron's input is connected to the local receptive field of the previous layer. Once the local features are extracted, their positional relationships with other features are also determined. In the feature mapping layer, each computing layer of the network is composed of multiple feature maps. Each feature map is a plane in which the weights of all neurons are equal. The feature mapping structure adopts the ReLU function as the activation function of the convolutional network, which helps preserve the shift invariance of the feature map. In addition, since the neurons in one map plane share weights, the number of free network parameters is reduced. Each convolutional layer in the CNN is closely followed by a computational layer for local averaging and secondary feature extraction. This unique two-stage feature extraction structure effectively reduces feature dimensionality.
The CNN is mostly composed of multiple convolution layers and pooling layers, which together with fully-connected layers and a Softmax layer form the network; the feature map is first convolved with multiple convolution kernels and then connected to the next layer through bias addition, an activation function, and a pooling operation. In the convolution layer, each convolution kernel is convolved with the feature maps of the previous layer, and an output feature map is obtained through an activation function, which can be expressed as:

H_i = σ(H_{i−1} ∗ W_i + b_i),

where H_i refers to the feature map of the ith layer of the CNN, σ is the activation function, ∗ is the convolution operator, W_i is the weight matrix of the ith convolution kernel, and b_i is the offset vector of the ith layer. Currently, the main activation functions are tanh, sigmoid, and ReLU. For the pooling layer, the calculation process can be expressed as:

H_i = Pooling(H_{i−1}),

where Pooling represents the pooling operation, which may be average, maximum, or random pooling.
In the fully connected layer, the feature maps of the previous layer are combined by a weighted sum, and the output is obtained through the activation function:

H_i = σ(W_i · H_{i−1} + b_i).

The training goal of the CNN is to minimize the loss function. For classification problems, the loss function is the cross entropy:

J(θ) = −(1/m) Σ_{i=1}^{m} y_i log h_θ(x_i),

where x_i is the ith input, y_i is the true value of the ith input, m is the number of training samples, and h_θ(x_i) is the predicted value for the ith input. For regression problems, the loss function is the mean square error:

J(θ) = (1/m) Σ_{i=1}^{m} (y_i − h_θ(x_i))².

In the training process, the gradient descent method is used for model optimization: the residual is back-propagated layer by layer to update the parameters (W and b) of each layer in the CNN. Variations of the gradient descent method include Momentum, AdaGrad, RMSProp, the stochastic gradient descent algorithm, and the Adam algorithm [40].
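The two loss functions above can be sketched directly in code. This is a generic illustration with one-hot labels and softmax-style outputs, not the paper's training code; the small `eps` guard against log(0) is an added assumption.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # J = -(1/m) * sum_i y_i * log h(x_i): one-hot labels, probability outputs.
    m = y_true.shape[0]
    return -np.sum(y_true * np.log(y_pred + eps)) / m

def mse(y_true, y_pred):
    # Mean square error, the regression loss above.
    return np.mean((y_true - y_pred) ** 2)
```

A perfect prediction drives the cross entropy to (numerically) zero, while any probability mass placed on the wrong class makes it strictly positive.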

Deconvolution Neural Network
Deconvolution, proposed by Zeiler et al. [41], is the process of reconstructing an unknown input from the measured output of a network. In neural networks, the deconvolution process does not involve learning and is merely used to visualize a trained convolutional network model. The visual filter characteristics obtained by the deconvolution network are similar to the Tokens proposed by Marr in Vision.
Assume that the input image of the ith layer is y^i, composed of K_0 channels, y^i_1, y^i_2, . . . , y^i_{K_0}, where c indexes the channels. The deconvolution operation expresses each channel as a linear sum of K_1 feature maps, z^i_k, convolved with filters, f_{k,c}:

y^i_c = Σ_{k=1}^{K_1} z^i_k ∗ f_{k,c}.

If y^i_c is an image with N_r × N_c pixels and the filter size is H × H, then the size of the derived feature map, z^i_k, is (N_r + H − 1) × (N_c + H − 1). The loss function can be expressed as:

C(y^i) = (λ/2) Σ_{c=1}^{K_0} ‖ Σ_{k=1}^{K_1} z^i_k ∗ f_{k,c} − y^i_c ‖₂² + Σ_{k=1}^{K_1} |z^i_k|^p,

where the first term is the mean square error between the reconstructed image and the input image, and the second is a regularization term in the form of the p-norm. In contrast to the CNN, the structure of the deconvolution neural network is mainly composed of an unpooling layer and a deconvolution layer. For a complex deep convolutional neural network, after the transformations of the several convolution kernels in each layer, it is impossible to know directly what information each convolution kernel has extracted. Through deconvolution, however, this information can be clearly visualized: the feature maps obtained by each layer are taken as input, deconvolution is conducted, and the deconvolution results can be used to verify the features extracted by each layer.
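The feature-map size (N_r + H − 1) × (N_c + H − 1) stated above is the size of a "full" 2-D convolution. A minimal sketch (illustrative, single channel, loop-based for clarity):

```python
import numpy as np

def conv2d_full(z, f):
    # "Full" 2-D convolution: each input pixel scatters a copy of the filter,
    # so the output has size (Nr + H - 1) x (Nc + W - 1), matching the
    # deconvolution feature-map size stated in the text.
    nr, nc = z.shape
    h, w = f.shape
    out = np.zeros((nr + h - 1, nc + w - 1))
    for i in range(nr):
        for j in range(nc):
            out[i:i + h, j:j + w] += z[i, j] * f
    return out
```

Because every product z[i,j]·f[u,v] lands in exactly one output cell, the output sum equals the product of the input and filter sums, a quick sanity check on the implementation.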

GIS PD Pattern Recognition Using Light-Scale Convolutional Neural Network
In this paper, the LCNN structure for PD pattern recognition is constructed, as shown in Figure 1. The LCNN consists of seven layers in total, namely, two convolutional layers, two pooling layers, two fully connected layers, and one Softmax layer. In the input layer, data are input as a TRPD single-channel binarized image, converted from 600 × 438 to 64 × 64 by image downsampling. The first convolutional layer consists of 64 3 × 3 convolution kernels, and the second consists of 16 3 × 3 convolution kernels. In the pooling layers, all operations are 3 × 3 maximum pooling with a stride of two. Both fully connected layers contain 128 neurons each, and the second fully connected layer uses dropout to avoid overfitting of the model. In the output layer, Softmax is used as the classifier, and one-hot encoding is used to identify the four PD pattern classes. All activation functions in the model are ReLU functions.
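The tensor shapes implied by this architecture can be traced with a few lines of bookkeeping. This is a sketch, not the paper's code; the 'same'-padding choices for the convolutions and the stride-2 pooling are assumptions consistent with the stated 64 × 64 input.

```python
import math

def pool_same(n, stride=2):
    # Output length of a 'same'-padded pooling layer with the given stride.
    return math.ceil(n / stride)

size = 64                          # 64 x 64 binarized TRPD input
size = pool_same(size)             # conv1 (64 3x3 kernels) keeps 64; pool1 -> 32
size = pool_same(size)             # conv2 (16 3x3 kernels) keeps 32; pool2 -> 16
flat_features = size * size * 16   # flattened input to the 128-neuron FC layers
```

Under these assumptions the flattened feature vector feeding the first fully connected layer has 16 × 16 × 16 = 4096 elements.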
The LCNN-based PD pattern recognition process is presented in Figure 2. Specific steps are as follows: (1) Data preprocessing. The PD time-domain map is shrunk from 600 × 438 to 64 × 64 by image downsampling.
(2) Data enhancement. The CVAE is used to randomly select 20% of the data as training data for image generation to improve the generalization ability of the model.
(3) Model training. The model is trained with the back-propagation algorithm and the stochastic gradient descent algorithm. In the pooling layers, the data are normalized. In the fully connected layers, the dropout method is adopted.
(4) Model testing. Model testing is conducted with the remaining 20% of the images to verify the generalization ability, fault recognition accuracy, and testing time.
(5) Model visualization. TensorBoard and a deconvolution neural network are used to achieve full visualization of the whole training and feature extraction process, which alleviates the "black-box" problem in CNNs.

Data Acquisition
Four typical GIS PD defects were selected for pattern recognition and fault classification: Free metal particle defects, metal tip defects, floating electrode defects, and insulation void defects. A schematic of the experiment is shown in Figure 3.

In the experiment, the rated voltage of the transformer was 250 kV and the rated capacity was 50 kVA. The UHF sensor was composed of an amplifier, a high-pass filter, a detector, and a shielding case, and its detection frequency band was 300 to 2000 MHz. The working bandwidth of the amplifier was 300 to 1500 MHz, and the amplifier gain was 40 dB. The sensor had a mean effective height of He = 10.2 mm over the frequency range of 500 to 1500 MHz, as shown in Figure 4c. The oscilloscope single-channel sampling rate was 10 GS/s, and the analogue bandwidth was 2 GHz. The experimental device and defect model installation location, defect models, TRPD PD waveforms, and frequency spectrum maps are respectively shown in Figures 4-7.
To more thoroughly capture GIS PD characteristics, following [42][43][44][45][46], the four typical kinds of GIS PD signals were simulated with the finite-difference time-domain (FDTD) method. The FDTD method, introduced by Yee in 1966, is a computational method to model electromagnetic wave propagation and its interaction with material properties [47,48].
The simulation model is shown in Figure 8. The center conductor and the tank were, respectively, 120 and 400 mm in diameter, and the tank wall was 10 mm in thickness and 2.2 m in length. Let us consider metal tip defects as an example. A metal needle of 30 mm in length was installed on a high-voltage conductor as an excitation source to simulate an insulation defect. Referring to the discharge signals and experimental conclusions measured by other scholars in actual discharge tests, the signal data and Gaussian pulses under different defect conditions were used as the excitation source of the fault simulation part of this article. A Gaussian pulse with a bandwidth of −30 dB attenuation at 3 GHz was added to the needle as a PD current pulse [45,46]. To guarantee consistency with the actual situation, the needle and current were placed perpendicular to the conductor surface in the y direction, and a seven-layer perfectly matched layer (PML) was applied at the two terminals to match the impedance of the adjacent medium, so that the reflection and refraction at the GIS ends could be neglected. The cavity boundaries in the other two directions were set in the same way as the y axis. Furthermore, with regard to loss at the conductor walls, an ideal conductor was used for the high-voltage conductors and cavities. The relative dielectric constant of the SF6 filling the GIS cavity was 1.00205, and the density was set as 23.7273 kg/m³ under a pressure of 0.4 MPa (absolute). The relative magnetic permeability and electrical conductivity were 1 and 1.1015 × 10⁻⁵ S/m, respectively. The highest frequency of the calculation was 3 GHz and the cell size was set as 10 mm × 10 mm × 10 mm. The simulation time was 250 ns and the time step was 9.661674 × 10⁻⁶ µs. The simulation conditions for the other three defects were the same. The four types of defect simulation models are shown in Figure 9.
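The width of a Gaussian excitation pulse with the stated −30 dB attenuation at 3 GHz can be derived in a few lines. This is an illustrative sketch; the time window, sampling step, and pulse-center time are assumptions, not values from the paper.

```python
import numpy as np

# For g(t) = exp(-t^2 / (2*sigma_t^2)), the amplitude spectrum is
# exp(-(2*pi*f)^2 * sigma_t^2 / 2). Solve for sigma_t so that the
# spectrum is -30 dB down at f = 3 GHz.
f_edge = 3e9
atten = 10 ** (-30 / 20)                       # -30 dB as an amplitude ratio
sigma_t = np.sqrt(-2 * np.log(atten)) / (2 * np.pi * f_edge)

t = np.arange(0.0, 5e-9, 1e-12)                # 5 ns window, 1 ps step (assumed)
t0 = 1e-9                                      # pulse center (assumed)
pulse = np.exp(-((t - t0) ** 2) / (2 * sigma_t ** 2))
```

The resulting sigma_t is on the order of 0.1 ns, i.e. a sub-nanosecond pulse, consistent with the GHz-range UHF content of PD signals.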
For free metal particle defects, 10-mm-long metal particles were placed in the outer shell of the bus bar, which corresponds to the conditions of the experiment performed using the actual GIS. For floating electrode defects, the inner side of a ring with a diameter of 101 mm was tangent to the high-voltage conductor to simulate the floating electrode. For air gap defects, a 3-mm air gap was reserved in the basin insulator model to simulate the air gap in the GIS.
During the simulation process, the sensor (test point) position was varied randomly while the defect position was kept unchanged, to simulate the randomness of partial discharge measurement. The measurement points were set every 100 mm, and each measurement point was measured at angles of 0-180° relative to the defect, in steps of 15°. The FDTD simulation waveforms and frequency spectra of the four kinds of defects are shown in Figures 10 and 11. As can be seen from Figures 6, 7, 10 and 11, there are certain differences in waveform and spectrum between the experimental data and the simulated data, mainly reflected in changes of amplitude and phase. However, for the same type of defect, because the simulation excitation source comes from the experimental process, the simulation and experimental data show the same trend across the entire waveform and spectrum, which is clearly different from that of other defect types.
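The measurement grid just described can be enumerated as a sketch. The 2.2 m tank length comes from the model above; treating the axial positions as 0-2200 mm inclusive is an assumption for illustration.

```python
# Sensor positions every 100 mm along the 2.2 m tank, and measurement
# angles from 0 to 180 degrees relative to the defect in 15-degree steps.
positions_mm = list(range(0, 2201, 100))   # 0, 100, ..., 2200 (assumed inclusive)
angles_deg = list(range(0, 181, 15))       # 0, 15, ..., 180
grid = [(p, a) for p in positions_mm for a in angles_deg]
```

Under these assumptions there are 23 axial positions and 13 angles, i.e. 299 sensor configurations per defect.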

Results and Analysis
In this paper, 3200 TRPD samples, consisting of 1200 experimental samples and 2000 FDTD simulation samples, were used for PD pattern classification. In total, 80% of the data were randomly selected for training, and the remaining 20% were used for testing. We trained our model with TensorFlow on a machine equipped with one 8 GB GeForce RTX 2060 GPU. We chose Anaconda as our Python package manager and used its distribution of TensorFlow, and we developed all our programs in the PyCharm IDE.
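The 80/20 split above (2560 training and 640 testing samples) can be sketched as follows; indices stand in for the TRPD images, and the fixed seed is an added assumption for reproducibility.

```python
import random

# Random 80/20 split of the 3200 TRPD samples.
random.seed(0)                 # illustrative seed, not from the paper
indices = list(range(3200))
random.shuffle(indices)
train_idx = indices[:2560]     # 80% for training
test_idx = indices[2560:]      # remaining 20% for testing
```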

Model Training and Visualization
The CNN is often regarded as a "black-box" model. Thus, in this paper, to visualize the entire training and feature extraction process, a TensorBoard-integrated deconvolution neural network method was adopted. Through the visualization, the features extracted by each layer of the model can be easily observed. The method also helps in monitoring the whole training process and in assessing whether the model overfits. The model training process was visualized by TensorBoard, and the visualized loss function and training accuracy curves are shown in Figure 12. It can be seen from Figure 12 that the loss function decreases as training proceeds, and the rate of decrease itself slows with further training. The loss function fluctuates around 0.1 at the 500th training step, drops to nearly 0 after 750 steps, and fluctuates around 0 by the 1000th step. The trend of the training accuracy is exactly opposite to that of the loss function, with the same pattern in its rate of change. Thus, the model performs quite well on the training set.

Visualization of the automatic feature extraction in the CNN is presented in Figure 13. Obviously, the learned convolution filters are relatively smooth in space, which is the result of sufficient training. Moreover, the feature visualization results indicate that the features extracted by the CNN are sensitive to time. Most of the initially extracted features take the form of waveform contours. In the later feature maps, these waveform features gradually coalesce, and, in the final maps, the initial waveform features can be seen much more clearly.

Accuracy Analysis of Pattern Recognition
To effectively assess the recognition accuracy of the model, the 3200 TRPD maps covering four major defects, namely free metal particle defects (M-type defects), metal tip defects (N-type defects), suspended electrode defects (O-type defects), and insulation void defects (P-type defects), were split into 2560 training maps and 640 testing maps. Support vector machine (SVM), decision tree (DT), back-propagation neural network (BPNN), LeNet5, AlexNet, VGG16, and LCNN models were used for GIS PD pattern recognition. The recognition results are given in Table 1 (with reference to [49], the maximum value, root mean square deviation, standard deviation, skewness, kurtosis, and the peak-to-peak value were selected as feature parameters).


As can be seen in Table 1, the overall recognition rate of the LCNN reached 98.13% on the 640 testing sets, while the rates of SVM, BPNN, DT, LeNet5, AlexNet, and VGG16 were, respectively, 93.76%, 83.78%, 93.44%, 75.04%, 90.63%, and 86.41%. The recognition rate of the LCNN is significantly higher than those of the traditional machine learning methods and the other deep learning methods. The recognition rate also varies significantly across methods for different defect types. In general, the LCNN outperforms the other methods in defect recognition, whereas LeNet5 has the lowest recognition rate. The recognition rates of the traditional machine learning methods and LeNet5 are low because the features are underutilized; AlexNet and VGG16 have low rates because of their inability to address the vanishing gradient during model training. Through transfer learning, the recognition rates of both are improved, but the overall rate remains limited by the size of the data set. All the methods except AlexNet and VGG16, however, have much greater difficulty in identifying insulation void defects in insulators.
This is because the internal defects of the insulator mainly entail minor gaps in the molding resin or voids in the layered region between the insulating material and the metal insert; with long-term accumulation of the electric field, the instability of PD under such defects leads to low recognition accuracy [50].
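For the traditional classifiers, the six statistical feature parameters named above can be computed directly from a PD time-domain waveform. A minimal sketch follows (our own illustration; the paper does not give its extraction code, and the exact feature definitions, e.g. excess versus raw kurtosis, are assumptions):

```python
import numpy as np

def extract_features(waveform):
    """Six statistical features used by the traditional classifiers.
    Definitions (e.g. excess kurtosis) are illustrative assumptions."""
    x = np.asarray(waveform, dtype=float)
    mean, std = x.mean(), x.std()
    centered = x - mean
    return {
        "maximum": x.max(),
        "rms": np.sqrt(np.mean(x ** 2)),           # root mean square deviation
        "std": std,
        "skewness": np.mean(centered ** 3) / std ** 3,
        "kurtosis": np.mean(centered ** 4) / std ** 4 - 3.0,  # excess kurtosis
        "peak_to_peak": x.max() - x.min(),
    }

# Example: a damped-oscillation pulse loosely resembling a PD waveform
t = np.linspace(0.0, 1e-6, 1000)
pulse = np.exp(-t / 2e-7) * np.sin(2 * np.pi * 3e7 * t)
feats = extract_features(pulse)
```

These per-waveform features would then form the input vectors for the SVM, DT, and BPNN models.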
To further compare the performance of the LCNN and traditional machine learning methods, training sets of different sample sizes were used for model training. The recognition accuracy curves of the different algorithms for PD pattern classification are shown in Figure 14. As can be seen from Figure 14, the LCNN demonstrates the largest variation in accuracy before the number of training samples reaches 500. When the number of training samples is fewer than 500, both the SVM and DT methods outperform the LCNN; with more than 500 samples, however, the LCNN is significantly superior to the traditional machine learning methods. In general, therefore, the LCNN performs best, followed by the SVM and DT, both of which retain relatively high recognition accuracy for small samples; the BPNN performs worst. Figure 15 reports the improvement in recognition accuracy of the LCNN over the traditional machine learning algorithms for different numbers of training samples. It can be seen from Figure 15 that, when the number of training samples is 100, the improvement of the LCNN over the DT, SVM, and BPNN methods is −37.79%, −33.1%, and 1.09%, respectively. The recognition accuracies of the SVM and DT methods are then significantly higher than that of the LCNN, so, for small samples, traditional machine learning methods possess evident advantages over the LCNN.
When the training data set reaches 500 samples, the recognition accuracy of the LCNN improves by 1.38%, 0.89%, and 21.78%, respectively, meaning the LCNN has begun to demonstrate its advantage in recognition accuracy. When the training set reaches 2500 samples, the accuracy of the LCNN improves by 4.37%, 4.69%, and 14.35%, respectively; at this point, the LCNN significantly outperforms the traditional machine learning methods. As the training data set grows, the LCNN demonstrates a growing advantage in recognition accuracy, so its overall performance is significantly better than that of the traditional machine learning methods. In the context of big data and the ubiquitous-power IoT, the LCNN therefore has broad application prospects.
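The reported improvements can be collected to locate the training-set size at which the LCNN overtakes each baseline. A small sketch using the figures quoted in the text (the helper name and the percentage-point reading of Figure 15 are our assumptions):

```python
# Reported LCNN accuracy improvement over each baseline at three training
# sizes (values quoted in the text; interpreted as percentage points).
improvement = {
    100:  {"DT": -37.79, "SVM": -33.10, "BPNN": 1.09},
    500:  {"DT": 1.38,   "SVM": 0.89,   "BPNN": 21.78},
    2500: {"DT": 4.37,   "SVM": 4.69,   "BPNN": 14.35},
}

def crossover_size(model):
    """Smallest reported training-set size at which the LCNN overtakes
    the given baseline (None if it never does in the reported range)."""
    for n in sorted(improvement):
        if improvement[n][model] > 0:
            return n
    return None
```

Consistent with the discussion above, the LCNN overtakes the SVM and DT only once roughly 500 training samples are available, while it already edges out the BPNN at 100 samples.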

Model Time Analysis
The time spent in model training and testing directly determines whether the model can be applied under the ubiquitous-power IoT. An overlong testing time cannot meet the requirement of quick, real-time processing under online monitoring, while the training time affects the model's updating ability: it is difficult to retrain the model for better accuracy as more training samples are acquired to form a larger historical knowledge database. To verify the temporal advantage of the LCNN model proposed in this paper, the SVM, DT, and BPNN were trained on the same data for comparison. Figure 16 shows the training time and testing time distribution of the different models on the 3200 groups of the TRPD data set.

As shown in Figure 16, the LCNN training time is 8.39 min, compared with 11.63 min for the SVM, 9.89 min for the DT, and 10.95 min for the BPNN. The LCNN testing time is 7.3 s, versus 11.2 s for the SVM, 9.8 s for the DT, and 12.4 s for the BPNN. The LCNN model therefore has an evident advantage in both training time and testing time. One of the main reasons traditional machine learning models consume much more time is the massive amount of time they spend extracting features.
By comparison with traditional machine learning methods such as the SVM, DT, and BPNN, the LCNN proposed in this paper demonstrates obvious advantages in training time and testing time. The shorter testing time makes it possible to handle fault signals quickly and in a timely manner, which lays a solid foundation for fault diagnoses of GIS PD in the context of the ubiquitous-power IoT and big data.
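Training and testing times of the kind plotted in Figure 16 can be collected with a simple wall-clock wrapper; a sketch follows (the paper does not give its timing code, and the scikit-learn-style usage shown in the comments is an assumption):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic
    wall clock, suitable for comparing training/testing durations."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - t0

# Hypothetical usage with any scikit-learn-style classifier `clf`:
# _,     train_s = timed(clf.fit, X_train, y_train)
# preds, test_s  = timed(clf.predict, X_test)
```

`time.perf_counter` is preferred over `time.time` here because it is monotonic and has the highest available resolution for interval measurement.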

Conclusions
In this paper, an LCNN model was proposed for GIS PD pattern recognition. In the TRPD mode, the light-scale model mitigates the dependence of traditional machine learning on expert experience and avoids the risk that a deep model cannot be trained because of the vanishing gradient. It also maximizes the simulation of various PD working conditions and makes full use of the time-domain waveform characteristics of PD. It therefore effectively improves model accuracy and reduces both training time and testing time, making it more applicable to the ubiquitous-power IoT. The following conclusions can be drawn: (1) Visualization of the entire training and automatic feature extraction process of the LCNN can be realized by the deconvolution neural network integrated with TensorBoard, which solves the "black-box" problem of the CNN. The feasibility of the LCNN was verified by the visualized training process, and the extracted features were visually displayed.

(2) The LCNN has superior feature capture capability for GIS PD signals and effectively implements GIS PD pattern recognition. Based on TRPD, the overall recognition rate of the LCNN reached 98.13%, while the rates of the SVM, BPNN, DT, LeNet5, AlexNet, and VGG16 were 93.76%, 83.78%, 93.44%, 75.04%, 90.63%, and 86.41%, respectively. As the number of samples increases, the advantages of the LCNN grow, making it much more suitable for application to big data and the ubiquitous-power IoT.
(3) Binarization and image downsampling considerably alleviated the burden of time consumption for model training. By comparison with traditional machine learning methods, the TRPD-based LCNN demonstrates significant advantages in training time and testing time. It can not only accomplish quick, real-time online monitoring of fault signals but also rapidly update the model after resetting the fault knowledge base.
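The binarization-plus-downsampling preprocessing in conclusion (3) can be sketched in a few lines. This is our own illustration: the threshold, pooling factor, and block-max scheme are assumptions, as the paper does not specify its preprocessing parameters.

```python
import numpy as np

def preprocess_trpd(image, threshold=0.5, factor=2):
    """Binarize a TRPD map, then downsample by block-max pooling so a
    downsampled pixel is set if any pixel in its block exceeds threshold.
    Threshold and factor are illustrative assumptions."""
    binary = (np.asarray(image, dtype=float) >= threshold).astype(np.uint8)
    h, w = binary.shape
    h, w = h - h % factor, w - w % factor        # crop to a multiple of factor
    blocks = binary[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.max(axis=(1, 3))
```

Block-max pooling is used here (rather than averaging) so that sparse discharge pulses survive the downsampling; both steps shrink the input and hence the training cost.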
Author Contributions: Y.W. and J.Y. conceived and designed the experiments; J.L. provided the experimental data; Y.W. wrote the paper; Z.Y. modified the code of the paper; T.L. revised the contents and reviewed the manuscript; Y.Z. provided the simulation data.
Funding: This research received no external funding.