A Convolutional Neural Network-Based Model for Multi-Source and Single-Source Partial Discharge Pattern Classification Using Only Single-Source Training Set

Abstract: Classification of the sources of partial discharges has been a standard procedure for assessing the status of insulation in high voltage systems. One of the challenges in classifying these sources is deciding on the distinct properties of each one, which often requires the skills of trained human experts. Machine learning offers a solution to this problem by allowing models to be trained on extracted features. However, the performance of such algorithms depends heavily on the choice of features. This can be overcome by using deep learning, where feature extraction is done automatically by the algorithm and the input is the raw data. In this work, an enhanced convolutional neural network is proposed that is capable of classifying single sources as well as multiple sources of partial discharges without introducing multiple sources in the training phase. Training uses only single-source phase-resolved partial discharge (PRPD) patterns, while testing is performed on both single-source and multi-source PRPD patterns. The proposed model is compared with a single-branch CNN architecture. The average classification accuracies of the proposed architecture for single-source PDs and multi-source PDs are 99.6% and 96.7%, respectively, compared to 96.2% and 77.3% for the traditional single-branch CNN architecture.


Introduction
Effective insulation degradation diagnosis is a key prerequisite for monitoring the integrity of any electrical system. An acceptable diagnostic method which has been used over the years is the measurement of partial discharges (PD) [1]. Different parameters have been employed for PD classification over the years. These include the maximum discharge magnitude and number of discharges as a function of time, PD pulses on an elliptic time-base, the phase of the positive half cycle of the PRPD patterns, features extracted using different dimensionality reduction techniques, and the application of mixed Weibull functions and the wavelet transform to discharge patterns. In deep learning approaches, the inputs used are waveform spectrograms, time-domain waveform signals, and PRPD patterns. Okamoto and Tanaka were among the first to work on developing techniques to measure partial discharges in 1986 [2]. Their work demonstrated the existence of a correlation between the distribution profile of the charge against the phase angle and the level of insulation degradation by analysing the skewness of the profile. Another approach for determining partial discharge sources was based on the analysis of different quantities of discharge as a function of time; these include the maximum discharge magnitude, the number of discharges, and the inception voltage [3]. By the 1990s it was evident that distinctive characteristic behaviors of these quantities, such as an increase, a decrease, or strong or weak fluctuations, can be correlated to discharge sources.
With the advancements in the field of pattern recognition, interest increased in automating partial discharge recognition and classification. In 1993, one of the first successful applications of neural networks for automatic recognition of any partial discharge source was reported [4]. The input was extracted from commercial partial discharge detectors that would display PD pulses on an elliptic time-base. The phase position and the spread of the pulses were shown to be correlated with the nature of the PD source, suggesting that PD pulses on an elliptic time-base provided important features for characterization. The rate of correct classification varied between 70% and 90% depending on the number of layers in the neural network and the classes to be classified. The choice of the neural network architecture has been an open question since that time. Poor generalization was recorded on real patterns compared with the synthesised patterns used for training [5]. In [6], phase-resolved partial discharge (PRPD) patterns were considered as inputs to the neural network, wherein the phase of the positive half cycle was considered. The study proposed a way to separate superimposed charge-phase patterns, based on separating contours before passing them to the neural network. The limitation of this method is that it required the patterns to be non-overlapping. Further progress was made by Krivda, who used dimensionality reduction techniques to derive low-dimensional representations of different partial discharge patterns [7]. Krivda [7] concluded that, in order to decide on the right features, a balance should be struck between the number of features and the time needed to compute them; moreover, new types of neural network could yield better results. In [8], the authors discussed automatic recognition of multiple PD sources. A stochastic method based on applying mixed Weibull functions to the pulse-height distribution patterns was investigated. The study concluded that in the case of partially or completely superimposed PD patterns, separation was impossible.
Up to the year 2000, automatic recognition of multiple PD sources was yet to be resolved. The authors in [9] introduced the application of the wavelet transform to PD detection, proposing the use of the Daubechies mother wavelet. Features were extracted from the third-level reconstructed horizontal (H) and vertical (V) component images. A feature vector of 150 elements was composed by averaging the H and V images in the magnitude and phase directions. The neural network used in this model had one hidden dense layer, and multiple-source patterns were used during training. The overall classification accuracy was 88%. However, the authors concluded that further study of actual multiple-source PD was required for a more accurate assessment of the proposed method. In [10], stochastic procedures and a fuzzy classifier were implemented to identify different PD pulses; however, it was noted that the fuzzy classifier was not efficient when PD pulses had similar shapes.
Historically, the input data for any machine learning algorithm had to be pre-processed using the user's knowledge of the domain and an assessment of which features are important for the specific problem. By 2006, automatic feature extraction became possible through the use of deep artificial neural networks, which could accept raw data as input. The first application of deep learning to PD diagnosis was reported in 2015 [11]. In [11], the authors recorded the PRPD patterns for six different PD defects in oil, where the patterns were treated as 50 × 64 dimensional images. The classification accuracy increased with the number of hidden layers, reaching 86% for five hidden layers. The authors in [12] were among the first to use a recurrent neural network (RNN) for the classification of PRPD patterns. Trials were performed to decide on the best values for the number of layers and number of power cycles. They achieved an accuracy of 96.62%, which outperformed simple deep neural networks (with an accuracy of 93.01%) and traditional machine learning techniques using a support vector machine (with an accuracy of 88.63%). Recently, a number of authors have reported the use of deep learning models, such as convolutional neural networks (CNNs), for classifying PD sources [13][14][15]. Among these works, various input formats have been used for PD source identification; these include waveform spectrograms, time-domain waveform signals, and PRPD patterns.
For the waveform spectrogram data, the authors in [16] used a CNN to detect PD signals in the presence of varying noise and interference signals. The input to the network was an image showing the time-frequency spectrum of sound clips, which were measurements recorded from a switchgear using the transient earth voltage (TEV) method. The CNN showed superior performance in terms of detection accuracy and detection time compared to other methods prevalent in the industry. In [17], Che et al. used a 2D-CNN to classify three PD sources in XLPE cable (internal PDs, corona PDs, and surface PDs) in addition to noise. Acoustic signals were generated using an optical fiber distributed acoustic sensing system. The 1D signals were converted to a 2D spectral representation by applying mel-frequency cepstrum coefficient (MFCC) analysis.
For the time-domain waveform data, the authors in [18] used signals from an analog transformer model, which consisted of impulse fault current waveforms for different fault conditions. Each waveform was represented by a 2500-dimensional vector. The training was performed using PD data from sources occurring simultaneously at two different locations within the winding. This resulted in a total of 20,304 classes corresponding to the different fault conditions at different winding locations. The classification accuracy attained was 99.2%. Wang et al. were interested in UHF signals for partial discharge detection in GIS [19]. They collected time-series data from lab experiments and from simulations using the finite-difference time-domain (FDTD) method. The inputs to the CNN were 64 × 64 images downsampled from the original 600 × 438 time-resolved partial discharge (TRPD) images. The classification accuracy was compared with conventional methods based on statistical features as input. It was concluded that the CNN outperforms the traditional methods when the number of training examples is greater than 500.
For the PRPD input data, the authors in [20] obtained mixed onsite and experimental PRPD patterns for six different sources of partial discharge. The input data was represented as a 72 × 50 matrix. An accuracy of 89.7% was achieved. In [21], the authors used a CNN to detect the deterioration of the insulation in high voltage systems using PRPD images. Four classes were distinguished: start, middle, end, and noise. The tested specimens were aged by subjecting them to high electric stress in a lab setup. Different CNN architectures were investigated by changing hyper-parameters such as the number and the size of the kernels. The results were reported in terms of the confusion matrix and the accuracy percentage. In [22], an algorithm was presented to identify multi-source PDs based on a two-step logistic regression model.
It is noteworthy that all of the prior methods reported above depend on the availability of training data from multi-source PD inputs [23]. There are a number of drawbacks associated with this choice. Such training data is difficult and time consuming to collect in practice and, by its very combinatorial nature, precludes the collection of examples for all possible combinations of concurrently occurring defects. In this paper, to address these drawbacks, we propose a novel convolutional architecture for single-source PD and multi-source PD classification using training data with ground truth available only at the level of single-source PDs. Our proposed architecture consists of a convolutional backbone feeding into multiple fully connected neural networks (FCNs). The input to the convolutional part of the network is the PRPD pattern matrix (Section 2.1). The output of this CNN stage is a common feature representation which is broadcast to the different FCNs, wherein each FCN is trained to output the probability of occurrence of a specific PD. Thus the proposed hybrid architecture moves from extracting general representations to more fine-tuned representations in a hierarchical fashion. The overall loss of the network is the combination of the individual binary cross-entropy losses from each of the FCNs. This loss is jointly optimized with respect to the parameters of the CNN stage and the FCNs. At testing time, our network produces a multi-label output vector signifying the probability of the presence of the respective PDs. We show superior performance compared to models trained independently on single-source PDs, demonstrating the value of the shared convolutional stage and the joint optimization of the FCNs.

Experimental Setup
Several experiments have been performed by different groups to classify partial discharges using their phase-resolved partial discharge patterns. PD classification and identification using laboratory data has been used to establish proof of concept for a number of techniques available in the literature (e.g., [24,25]). Lab experiments were done by Janani et al. [26,27] to simulate artificial defects. The experimental setup consisted of a high voltage transformer, a capacitive divider to measure the AC voltage, the test cell, and the PD measurement system, as shown in Figure 1. The lab setups simulate common sources of PD in air, oil, and SF6. PD data collection was conducted in accordance with the IEC 60270 standard [28]. The test cells include three sources (floating electrode, moving particle, and fixed protrusion) of partial discharge in SF6 (Figure 2), two sources of PD (free particle and needle electrode) in transformer oil (Figure 3), and corona in air, which has the same setup as the floating electrode but is filled with air. For the floating electrode, the gap between the two electrodes is 1 mm. For the free particle, a small bearing with a diameter of 3.17 mm was placed on a concave dish ground electrode. For the point plane electrode, the needle has a diameter of 20 µm [27]. More details on the experimental setups are given in [29]. In total, six different PD patterns were generated. In addition, four different combinations of multiple partial discharges were simulated by using two or three test cells simultaneously. A commercial PD measurement system (Omicron MPD 600) was used to acquire the PRPD of each test cell.
The output data from the Omicron software is exported as binary files. This data includes information about the partial discharges taking place relative to the applied phase voltage. The discharge magnitude and phase are divided into 400 and 500 bins respectively.
This results in a 400 × 500 matrix M(x, y), where the number in each bin represents the number of discharges occurring at a specific phase angle (x) and a specific discharge magnitude (y). Figure 4 shows a visual representation of the six single-source PRPD patterns. The 400 × 500 matrix is reduced to 100 × 100 by summing up the counts in each 4 × 5 sub-matrix. Background noise is unavoidable even under ideal measurement conditions; it appears as an offset charge over all the phase windows. In this work, it has been removed from all the PRPD patterns. In addition to the six classes, a no-pattern class corresponding to the cases in which no PD is present is added. In order to encourage the model to learn features related to the shape of the PRPDs, the samples were converted into binary samples using a binarization threshold of zero. A visual representation of the binary matrices is shown beside the Omicron representation of each of the six single PD source classes. Figure 5 shows the PRPD patterns of the four multi-source PD classes. To make the system insensitive to changes in the charge magnitude settings of the Omicron software and to different applied voltages, samples are extracted using multiple magnitude settings, thereby introducing variability into the dataset. The charge magnitude settings employed for each class are summarized in Table 1.
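The reduction and binarization steps described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: it assumes the raw counts are held as a list of 400 rows (magnitude bins) by 500 columns (phase bins), and that a 400 × 500 to 100 × 100 reduction means summing non-overlapping 4 × 5 blocks. The function names and the synthetic data are hypothetical, and the background-noise removal step is not modeled.

```python
def downsample_counts(matrix, block_rows, block_cols):
    """Sum discharge counts over non-overlapping block_rows x block_cols blocks."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_rows % block_rows or n_cols % block_cols:
        raise ValueError("matrix shape must be divisible by the block shape")
    out = [[0] * (n_cols // block_cols) for _ in range(n_rows // block_rows)]
    for r, row in enumerate(matrix):
        for c, count in enumerate(row):
            out[r // block_rows][c // block_cols] += count
    return out


def binarize(matrix, threshold=0):
    """Set a cell to 1 if its count exceeds the threshold, otherwise 0."""
    return [[1 if value > threshold else 0 for value in row] for row in matrix]


# Synthetic 400 x 500 pattern (assumption: rows = magnitude bins, cols = phase bins).
raw = [[0] * 500 for _ in range(400)]
raw[0][0] = 3   # falls in reduced cell (0, 0)
raw[3][4] = 2   # same 4 x 5 block, so also reduced cell (0, 0)
raw[4][5] = 7   # falls in reduced cell (1, 1)

reduced = downsample_counts(raw, block_rows=4, block_cols=5)  # 100 x 100 counts
binary = binarize(reduced)                                    # 100 x 100 zeros/ones
```

Binarizing after the reduction discards count magnitudes and keeps only where discharges occurred, which matches the stated aim of learning from the shape of the pattern.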
The six single-source classes are numbered as follows:
• Class 1 (corona in air)
• Class 2 (floating electrode in SF6)
• Class 3 (free particle in oil)
• Class 4 (free particle in SF6)
• Class 5 (point plane electrode in oil)
• Class 6 (point plane electrode in SF6)
The four multiple-source classes are numbered as follows:
• Class 14 (corona in air and free particle in SF6)
• Class 16 (corona in air and point plane electrode in SF6)
• Class 46 (free particle in SF6 and point plane electrode in SF6)
• Class 146 (corona in air, free particle in SF6 and point plane electrode in SF6)
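Because each multiple-source class number is formed from the digits of its constituent single-source classes, the numbering maps directly onto a multi-label target vector. The sketch below illustrates that mapping; the helper name is hypothetical, and placing the no-pattern class at position 7 is an assumption based on the seven classes used in the evaluation.

```python
NUM_LABELS = 7  # six PD sources plus the no-pattern class (assumed to be class 7)


def class_id_to_label(class_id):
    """Map a class number such as 4, 16 or 146 to a 7-element multi-hot vector."""
    label = [0] * NUM_LABELS
    for digit in str(class_id):
        source = int(digit)
        if not 1 <= source <= NUM_LABELS:
            raise ValueError(f"unknown source {source} in class {class_id}")
        label[source - 1] = 1  # classes are 1-based, list positions are 0-based
    return label


label_146 = class_id_to_label(146)  # corona + free particle in SF6 + point plane in SF6
label_3 = class_id_to_label(3)      # single source: free particle in oil
```

Single-source classes yield one-hot vectors, while the four multiple-source classes yield vectors with two or three ones.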

Method
Convolutional neural networks (CNNs) are a class of deep neural networks that were originally designed for visual images and have shown state-of-the-art performance for a range of applications [30][31][32][33]. Typically, CNNs consist of a cascade of alternating convolutional and pooling layers, as shown in Figure 6. A convolutional layer comprises a bank of linear 2D or 3D filters which are convolved with a multi-channel input image to produce a multi-channel output of feature maps. The output of the convolutions is often passed through a non-linear activation function such as a rectified linear unit (ReLU). A pooling layer subsamples the input in a non-linear fashion (e.g., by taking the maximum value in a local window). The successive convolutional and pooling layers, coupled with non-linear activations, give CNNs the capability to automatically learn feature representations at different spatial scales of an image in a hierarchical fashion. Common applications include classification, regression, and matrix-to-matrix transformations [34,35]. In classification problems, a data point can belong to a single class (mutually exclusive membership) or to multiple categories at the same time. The latter is usually referred to as multi-label classification. Since PRPD patterns from multiple sources can occur concurrently, PD detection is essentially a multi-label classification problem. Given training data with various combinations of co-occurring multi-source PD labels, building a multi-label classification model is tenable. However, as mentioned in Section 1, collecting such a dataset is expensive and time consuming, and may not span all possible combinations of PDs. On the other hand, it is practically more feasible to collect single-source PD data in large quantities. We therefore focus on methods that capitalize on single-source training data to solve the multi-label classification problem.
Let $K$ be the number of PD sources. To enable explicit detection of cases with no PDs, we define a separate category representing the absence of all the PDs. Let the training data consisting of $N$ examples be represented as $\{X_i, y_i\}_{i=1}^{N}$, where $X_i \in \mathbb{R}^{H \times W}$ is the $i$th PRPD pattern image, and $y_i \in \{0, 1\}^{K+1}$ is the corresponding binary label vector, signifying the presence or absence of each PD. The label vector is $(K+1)$-dimensional because, as described above, we have defined an additional class for cases with no PDs. Since only single-source examples are considered during training, each $y_i$ is a one-hot vector. At testing time, the label vector for a test case can contain multiple 1s.

Multiple Single-Source Classifiers (Baseline)
To achieve multi-label classification, a traditional approach has been to learn multiple ($K+1$) independent binary classifiers, each trained to detect an individual PD defect. The loss for the $k$th model, given the training dataset, is given by (1), where $F_{\theta_k}$ is the function encoded by the $k$th model, depending on parameters $\theta_k$, and $y_{ik}$ is the $k$th element of $y_i$. After the training phase, given a test case $X^{(\mathrm{test})}$, one then needs to invoke $K+1$ models to build a multi-label output, $\hat{y}^{(\mathrm{test})} = \{F_{\theta_k}(X^{(\mathrm{test})})\}_{k=1}^{K+1}$. An example convolutional architecture that accepts a PRPD pattern image and performs a binary classification for the presence of a specific single-source PD is shown in Figure 7a.
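A plausible explicit form of the per-class loss (1) is the standard binary cross-entropy, consistent with the binary cross-entropy losses named in Section 1. The averaging over the $N$ training examples and the interpretation of $F_{\theta_k}(X_i)$ as a probability in $(0, 1)$ are assumptions made here for illustration:

```latex
\mathcal{L}_k(\theta_k) = -\frac{1}{N} \sum_{i=1}^{N}
  \Big[ y_{ik} \log F_{\theta_k}(X_i)
      + (1 - y_{ik}) \log\big(1 - F_{\theta_k}(X_i)\big) \Big]
\tag{1}
```

Under this form, each of the $K+1$ losses depends only on its own parameters $\theta_k$, so the classifiers can be trained independently of one another.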

Joint Model with Shared CNN Parameters (Proposed)
While the baseline approach described above may learn excellent single-source classifiers, it is not expected to generalize to the multi-label classification task. This is due primarily to the over-tuned class-specific parameters $\{\theta_k\}_{k=1}^{K+1}$ learned independently for each single-source PD. To address this issue, we propose to decompose the network parameters into two sets: a shared set of common parameters, $\rho_{\mathrm{CNN}}$ (for the convolutional part), and class-specific parameters, $\{\phi_{\mathrm{FCN}_k}\}_{k=1}^{K+1}$ (for the fully connected networks). In particular, our proposed architecture has a shared convolutional stage for feature extraction. These features are then distributed to multiple FCNs. Our motivation is to encourage the CNN to learn to extract more general feature representations which are useful for all classes. The FCNs accept these general features to learn class-specific models in a joint fashion. Our architecture is shown in Figure 7b. Let the CNN part be represented by the network $G$, and each of the fully connected networks be represented by $H_k$. Our joint loss function is then given by (2).
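The joint objective (2) can be illustrated with a small, dependency-free sketch. It assumes the joint loss is the sum over the $K+1$ heads of the per-head binary cross-entropy, averaged over examples, i.e. $\frac{1}{N}\sum_i \sum_k \mathrm{BCE}\big(H_k(G(X_i)), y_{ik}\big)$; the plain sum (rather than a weighted combination) and the averaging are assumptions. The convolutional stage $G$ is replaced by precomputed feature vectors and each head $H_k$ by a hypothetical single logistic unit, so all names and values below are illustrative only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def binary_cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy for one predicted probability p and one target y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def head_probability(features, weights, bias):
    """Hypothetical stand-in for a class-specific head H_k (one logistic unit)."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def joint_loss(shared_features, labels, heads):
    """Average over examples of the summed per-head binary cross-entropies."""
    total = 0.0
    for features, y in zip(shared_features, labels):
        # The same shared feature vector G(X_i) feeds every head.
        for k, (weights, bias) in enumerate(heads):
            total += binary_cross_entropy(head_probability(features, weights, bias), y[k])
    return total / len(shared_features)


# Toy data: two examples, 2-dimensional shared features, three heads.
shared_features = [[0.5, -1.0], [2.0, 0.3]]
labels = [[1, 0, 0], [0, 1, 0]]             # one-hot, as in single-source training
untrained_heads = [([0.0, 0.0], 0.0)] * 3   # zero weights: every head outputs 0.5
loss = joint_loss(shared_features, labels, untrained_heads)
```

Because every head's loss term depends on the same shared features, a gradient step on this joint loss would update the shared stage using error signals from all classes at once, which is the motivation given above for parameter sharing.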

Design Details for Network Layers
The network used in this study consists of two convolutional layers with 36 filters and a kernel of size 3 × 3, followed by two dense layers with 128 and 64 units respectively, and ends with a classification layer of seven nodes. Batch normalization has been used in order to decrease the effect of over-fitting. The schematic for one of the classifiers in Figure 7a is shown in Figure 8. The design details for the implemented classifiers are shown in Table 2. The hyper-parameters of our neural network, such as the number of layers, the number of nodes per layer, and the kernel size, were chosen by running experiments for different values of the parameters and plotting the training and validation accuracy curves as a function of epochs. The design of the layers is kept the same for both the baseline model and the proposed model so that the difference in performance due to the proposed parameter-sharing architecture can be investigated. The proposed model architecture is shown in Figure 9.

Performance Metrics
Since the model is trained using single-source PRPD patterns only, the generalization of the model is tested by evaluating its performance on a new hybrid dataset that includes PRPD patterns from single as well as multiple partial discharge sources. In addition, samples with different charge magnitude settings in the Omicron software are tested. Different standard multi-label classification metrics have been used in the literature to evaluate the performance of trained models. These include mean average precision, 0-1 exact match, macro and micro F1, per-class precision, per-class recall, overall precision, and overall recall [36]. In this paper, the individual recall (Recall(k)) and the individual precision (Precision(k)) are calculated for each of the classes 1 to 7 by taking into account both single-source PDs and multiple-source PDs. The recall reflects the proportion of the positive examples that are correctly classified, and the precision reflects the proportion of the examples predicted to be positive that are actually positive. PCR and PCP represent the arithmetic means of the per-class recall and precision, respectively: $\mathrm{PCR} = \frac{1}{7}\sum_{k=1}^{7} \mathrm{Recall}(k)$ (3) and $\mathrm{PCP} = \frac{1}{7}\sum_{k=1}^{7} \mathrm{Precision}(k)$ (4). In addition, the classification accuracy and the false negative accuracy are evaluated for single as well as multiple classes. The classification accuracy is calculated with equal weights for all classes, while the false negative accuracy takes into consideration only the class or classes that the sample truly belongs to. The false negative accuracy is important in this context because detecting the correct source of PD in high voltage systems is critical. Consistent misidentification of a PD source can leave the high voltage apparatus in a failing condition and pose a safety risk for employees working near it.
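The per-class recall and precision, and their arithmetic means PCR and PCP, can be computed from multi-hot ground-truth and prediction vectors as in the following sketch. The function names are hypothetical, the toy data uses three classes instead of seven for brevity, and returning 0.0 when a class has no positive predictions (or no positive examples) is a convention assumed here, not one stated in the text.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return (precision, recall) lists with one entry per class."""
    num_classes = len(y_true[0])
    precision, recall = [], []
    for k in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        precision.append(tp / (tp + fp) if tp + fp else 0.0)  # assumed convention
        recall.append(tp / (tp + fn) if tp + fn else 0.0)     # assumed convention
    return precision, recall


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Toy example: four samples, three classes; the third sample is multi-source.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

precision, recall = per_class_precision_recall(y_true, y_pred)
pcp = mean(precision)  # arithmetic mean of per-class precision
pcr = mean(recall)     # arithmetic mean of per-class recall
```

Counting is done per class across all samples, so a multi-source sample contributes to the statistics of every class it contains.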
The false negative accuracy complements the individual recall and precision by quantitatively evaluating single classes and multiple classes separately. If a PRPD pattern belongs to classes one, four and six, then the ground truth is [1001010]. The classification accuracy is then calculated by checking, element by element, whether the seven-element prediction vector matches the ground truth [1001010]. The classification accuracy for a single sample is calculated as $\text{Classification accuracy} = \frac{1}{7}\sum_{k=1}^{7} M_k \times 100\%$ (5), where $M_k$ is equal to one when element $k$ in the ground truth vector agrees with the prediction of the model for the corresponding class $k$, and zero otherwise. The ideal classification accuracy is 100% and the worst is 0%. The false negative accuracy for a single sample is calculated as $\text{False negative accuracy} = \frac{1}{T}\sum_{j=1}^{T} N_j \times 100\%$ (6), where the sum runs over the classes that the sample truly belongs to, and $N_j$ equals one when the prediction of the model for true class $j$ does not agree with the ground truth (whose element is equal to one), and zero otherwise. In this metric, the prediction is checked only for the class or classes that the sample truly belongs to. $T$ is the number of classes that the sample truly belongs to. In our tested dataset, $T$ is 1 for single-source PDs, and 2 or 3 for multiple-source PDs. Hence, the ideal false negative accuracy is 0% and the worst is 100%. The classification and false negative accuracies over a number of samples are obtained by averaging (5) and (6) over the samples.
Figure 10 shows the calculated loss of the trained model, using (1), as a function of epochs (or iterations) for both the training and the validation datasets in log scale. An epoch is one pass of the entire dataset forward and backward through the neural network. Over-fitting is clearly seen for classes six and seven, where the gap between the training loss and validation loss increases at epoch 1000. Table 3 shows the classification and the false negative accuracies.
As seen in Table 3, the model does not generalize well to the multiple classes, especially Class 16 and Class 146. The arithmetic means of the precision and the recall are shown in Table 4. The precision is low for class three and class four, as is the recall for class six. Table 5 shows the hybrid confusion matrix, in which the rows and columns represent the input and predicted classes respectively. The true positives are highlighted for better visibility.
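The per-sample classification accuracy and false negative accuracy defined in this section can be sketched as follows. The function names are hypothetical; the example reuses the ground truth [1001010] for a sample belonging to classes one, four and six, paired with an invented prediction that misses class six.

```python
def classification_accuracy(truth, prediction):
    """Percentage of label positions where prediction and ground truth agree."""
    matches = sum(1 for t, p in zip(truth, prediction) if t == p)
    return 100.0 * matches / len(truth)


def false_negative_accuracy(truth, prediction):
    """Percentage of the sample's true classes that the model failed to predict."""
    true_positions = [j for j, t in enumerate(truth) if t == 1]
    if not true_positions:
        raise ValueError("sample must belong to at least one class")
    misses = sum(1 for j in true_positions if prediction[j] != 1)
    return 100.0 * misses / len(true_positions)


truth = [1, 0, 0, 1, 0, 1, 0]        # classes one, four and six (T = 3)
prediction = [1, 0, 0, 1, 0, 0, 0]   # invented prediction that misses class six

ca = classification_accuracy(truth, prediction)    # 6 of 7 positions agree
fna = false_negative_accuracy(truth, prediction)   # 1 of 3 true classes missed
```

The two metrics answer different questions: a single missed source changes only one of seven positions in the classification accuracy, but it is a third of the true classes in the false negative accuracy, which is why the latter is more sensitive to undetected PD sources.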

Proposed Model
As training proceeds, a trade-off takes place between generalization and learning deeper features of single partial discharges. Generalization here refers to the correct classification of multiple-source PRPD patterns. The training of the model is terminated when the validation accuracy is observed to start diverging from the training accuracy. During the training phase, a portion of the dataset is held out for validation, and the loss on this portion is evaluated in each epoch. The stopping decision is made collectively by analyzing the average validation and training loss of the seven classes. At epoch 4000, the percentage difference between the validation and the training loss is 0.8%, compared to 2.7% at epoch 8000, as shown in Figure 11; consequently, the training is stopped at epoch 4000. The calculated loss of the trained model as a function of epochs for both the training and the validation datasets in log scale, using (1), is shown in Figure 12. The classification accuracy and false negative accuracy are shown in Table 6. In comparison with Table 3, better performance is recorded: the average classification accuracy for single classes increased from 96.2% to 99.6%, and the average false negative accuracy for the multiple classes dropped from 23.5% to 8.7%. The arithmetic means of the precision and the recall are given in Table 7. Comparing Table 7 with Table 4, ideal recall is recorded for class 6 and ideal precision is recorded for all classes. This indicates that our proposed model enhanced the prediction of true positives. The hybrid confusion matrix for the proposed model is shown in Table 8. As seen in these tables, compared to Table 5, our proposed model has enhanced classification ability not only for single-source PDs, but also for multi-source PDs.
This is shown in the last four rows of Table 8, corresponding to the multiple classes, when compared with those of Table 5. This indicates that our proposed model decreased false negative predictions.
Multiple PD sources can occur concurrently in high voltage insulation systems. The difficulty of identifying multiple-source PDs using a training set of single-source PDs results from the fact that the PRPD patterns are partially overlapping. As a result, traditional machine learning techniques, which are based on the manual extraction of features, get confused when multiple-source PRPD patterns are to be classified. Different algorithms should be deployed in order to decide on the separation criteria between these overlapped PRPDs. A customized CNN model has been shown to be useful for this problem through the proposed enhanced version based on sharing the weights among different classes. The essence of the proposed model is that the training is done on single sources of PDs only. This is valuable in industry, where additional financial resources and time are needed to acquire data from simultaneous sources of PDs. The model is robust to electric interference as well as to the applied phase voltage. The average classification accuracies of the proposed architecture for single-source PDs and multi-source PDs are 99.6% and 96.7%, respectively, compared to 96.2% and 77.3% for the independent classifiers architecture.