Classification of Partial Discharge Images Using Deep Convolutional Neural Networks

Abstract: Artificial intelligence-based solutions and applications have great potential in various fields of electrical power engineering. The problem of the electrical reliability of power equipment directly refers to the immunity of high-voltage (HV) insulation systems to operating stresses, overvoltages and other stresses, in particular those involving strong electric fields. Therefore, tracing material degradation processes in insulation systems requires dedicated diagnostics; one of the most reliable quality indicators of high-voltage insulation systems is partial discharge (PD) measurement. In this paper, an example of the application of a neural network to partial discharge images is presented, which is based on the convolutional neural network (CNN) architecture and is used to recognize the stages of the aging of high-voltage electrical insulation based on PD images. Partial discharge images refer to phase-resolved patterns revealing various discharge stages and forms. The test specimens were aged under high electric stress, and the measurement results were saved continuously within a predefined time period. The four distinguishable classes of the electrical insulation degradation process were defined, mimicking the changes that occurred within the electrical insulation in the specimens (i.e., start, middle, end and noise/disturbance), with the goal of properly recognizing these stages in the untrained image samples. The results reflect the exemplary performance of the CNN and its resilience to manipulations of the network architecture and values of the hyperparameters. Convolutional neural networks seem to be a promising component of future autonomous PD expert systems.


Introduction
Artificial intelligence (AI) is one of the most active topics of this decade. It has experienced explosive growth and is expected to penetrate almost all domains (engineering, metering and control, biomedicine and autonomous vehicles, to mention a few). This will pave the way for more accurate, faster and more cost-effective solutions. As a subset of AI, machine learning is experiencing unprecedented development, especially in the area of artificial neural networks, with many current variants and deployed applications. Scientists are excited about the potential of deep learning and the performance of convolutional neural networks. Thus, AI-based solutions and applications have great potential in various fields of electrical power engineering. The problem of the electrical reliability of power equipment is directly related to the immunity of high-voltage insulation systems to operating stresses, overvoltages and other stresses-in particular, those involving strong electric fields. Therefore, tracing material degradation processes in insulation systems requires dedicated diagnostics. The electric field exposure in insulation systems is a factor that is responsible for initiating and developing various forms of electrical discharges. These refer to discharges in the internal gaseous cavities, called voids, and on the surface of the insulation systems. The so-called partial discharge (PD) refers to cases in which no full insulation breakdown

Partial Discharge Phase-Resolved Acquisition
The progressive deterioration of high-voltage insulation caused by partial discharges is one of the key factors limiting the lifetime of electrical power equipment. Thus, one of the most commonly accepted and reliable quality indicators of high-voltage (HV) insulation systems is PD measurement. Partial discharges occur in the HV technical insulation systems of power equipment. A PD is usually defined as an electrical discharge that takes place locally, in only one part of an insulation system, that does not directly result in losing the insulating properties of the power device. However, long-lasting partial discharges result in micro and macroscopic deteriorations, leading to insulation breakdowns. Partial discharges may appear in solid, liquid and gaseous insulation systems; those bringing about destructive consequences take a variety of forms of interactions and multi-stage characters of the accompanying processes. The phase-resolved partial discharge analysis method is currently the most popular tool in PD-based diagnostics for high-voltage electrical insulation. This technique allows for the registration of individual PD pulses with respect to the phase angle of the applied voltage. The methodology relies on coupled two-dimensional multi-channel analyzers and is graphically shown in Figure 1. Since PD physical processes demonstrate a statistical behavior, a method based on phase-resolved PD acquisition is especially useful. The phase position of a particular discharge with respect to the high-voltage AC cycle brings additional information, which allows for PD form separation and the recognition of non-coherent forms on a phase-resolved plane. The PD patterns acquired in this way, revealing various stages of discharge development, are treated in this paper as images.
In the monitoring test, the specimens were aged under a high voltage of up to 20 kV, and the measurement results were saved continuously within a predefined time period. The PD measurements, at AC voltage, were performed using a wideband acquisition system, ICM+ (Power Diagnostix, Aachen, Germany), connected to a host computer via a GPIB (General Purpose Interface Bus) interface [39][40][41]. The phase-resolved measurements in AC mode were recorded within 60 s, resulting in a D(ϕ, q, n) pattern (8 × 8 × 16 bit).
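The accumulation of individual pulses into a D(ϕ, q, n) pattern can be sketched as a two-dimensional histogram. The sketch below assumes that "8 × 8 × 16 bit" means 256 phase channels, 256 charge channels and a 16-bit pulse counter per cell; this reading, the function name `accumulate_prpd` and the charge range `q_max` are illustrative assumptions, not details of the ICM+ system.

```python
def accumulate_prpd(pulses, phase_bins=256, charge_bins=256, q_max=1.0):
    """Accumulate PD pulses into a phase-resolved count matrix D[phase][charge].

    pulses: iterable of (phase_deg, charge) pairs, with charge expected
            in the range [-q_max, q_max] (hypothetical normalized units).
    """
    counter_max = 2 ** 16 - 1  # assumed 16-bit counter per cell
    pattern = [[0] * charge_bins for _ in range(phase_bins)]
    for phase_deg, charge in pulses:
        # Map the phase angle (0..360 degrees) to a phase channel.
        i = int((phase_deg % 360.0) / 360.0 * phase_bins)
        # Map the bipolar charge to a charge channel.
        j = int((charge + q_max) / (2.0 * q_max) * charge_bins)
        # Clamp to the valid channel range.
        i = min(i, phase_bins - 1)
        j = min(max(j, 0), charge_bins - 1)
        # Count the pulse, saturating at the counter limit.
        pattern[i][j] = min(pattern[i][j] + 1, counter_max)
    return pattern


# Three hypothetical pulses: two at 90 degrees, one at 270 degrees.
demo_pulses = [(90.0, 0.5), (90.0, 0.5), (270.0, -0.5)]
demo_pattern = accumulate_prpd(demo_pulses)
```

The resulting count matrix can then be rendered with a color palette to obtain the PD image that is passed to the classifier.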

Machine Learning and Partial Discharge Image Recognition
Machine learning (ML) is a very popular topic in various disciplines; great progress in this field has recently been observed. ML refers to the ability of algorithms with tunable parameters that are adjusted automatically and adapted accordingly to previously seen data. In general terms, machine learning can be considered to be a subdomain of artificial intelligence. The applied algorithms can be considered as the building blocks of computer learning, leading systems to behave more intelligently by generalizing rather than only operating on data elements, in contrast to a conventional dataset system. Thus, machine learning is learning based on experience and can be classified into unsupervised learning and supervised learning. The first approach aims to group and interpret data based only on input data, whereas the latter method relies on predictive models based on both input and output data. Unsupervised learning implies that the algorithm itself will find patterns and relationships among different data clusters. The dataset in machine learning usually consists of a multi-dimensional entry associated with several attributes or features. Unsupervised learning can be further divided into clustering and association. Clustering refers to the automatic grouping of similar objects into sets. In turn, supervised learning implies an algorithm's ability to recognize elements based on provided samples with the goal of recognizing new data based on training data. Supervised learning algorithms include, for example, decision trees, support vector machines (SVM), naive Bayes classifiers, k-nearest neighbors and linear regressions [7][8][9][10][11][12][13][14][15][16][17][18][19][20][34][35][36][37]. 
Supervised learning can be further divided into classification and regression: classification means that samples belong to two or more classes, with the goal of predicting the class of unlabeled data from the already-labeled data and thus identifying to which category an object belongs; regression is understood as predicting an attribute associated with an object.
Probably the most popular examples of machine learning are artificial neural networks. The machine-learning workflow consists of several steps. The first step is preprocessing; i.e., data preparation in a form on which the network can train. This involves collecting images and properly resizing and labeling them up to the normalization stage (for example). The second step refers to the definition of the neural network topology in terms of the number of layers to be used in a model, the size of the input and output layers, the type of the implemented activation functions, whether or not dropout will be used, the number of epochs, the sizes of the training sets and many other factors. In fact, setting up all of the hyperparameters of a neural network framework is an art and relies greatly on experience. In the following stage, an instance of the model is fit with training data. Usually, the higher the number of epochs, the greater the performance of the neural network; however, too many training epochs may lead to overfitting and performance degradation. After training comes model evaluation by comparing the model's performance against a validation dataset (a dataset on which the model has not been trained). The performance of a neural network is determined by the application of various metrics. The most common is "accuracy," which reflects the number of correctly classified images divided by the total number of images in the dataset.
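The data split and the accuracy metric described above can be illustrated with a few lines of Python. This is a minimal sketch with hypothetical helper names (`train_validation_split`, `accuracy`); the 20% validation fraction is one of the values mentioned later in the text, and the labels are invented for the example.

```python
import random


def train_validation_split(samples, validation_fraction=0.2, seed=0):
    """Shuffle the samples and split them into (training, validation) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def accuracy(true_labels, predicted_labels):
    """Number of correctly classified samples divided by the total number."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# 1000 sample indices, e.g. 250 images for each of four classes.
train_ids, validation_ids = train_validation_split(range(1000), 0.2)

# Hypothetical predictions for four images: three of four are correct.
demo_accuracy = accuracy(
    ["start", "end", "noise", "middle"],
    ["start", "end", "end", "middle"],
)
```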
Partial discharge image recognition refers to the ability to distinguish and recognize between different PD types and sources within the insulating system of electrical power equipment, which often consists of complex insulation (a combination of gaseous, solid and liquid phases). Moreover, discriminating between the internal discharges that occur in insulating systems and the external interference and disturbances is a critical element of pattern recognition for on-site measurements that are performed in a hazardous environment. The classification stage represents the recognition of the source from the data. Usually, this stage is twofold and consists of extracting a feature vector from the data followed by recognizing a corresponding source. The identification and classification of partial discharge features is a fundamental requirement for an effective insulation diagnosis. Partial discharges can affect an insulation system's reliability in different ways. Some electrical insulation systems (for example, mica-based system) are designed to withstand a certain level of partial discharges occurring in the distributed micro-voids, while others (e.g., polymer-based systems) are degraded very quickly by partial discharge activity.
As PD recognition is very complicated, only experts with extensive personal experience are able to discriminate between various discharge phenomena and assess the risk of potential insulation breakdown. Initially, experts mainly relied on the apparent charge magnitude; later, they also took the pulse density, phase and amplitude distributions into account after the introduction of acquisition systems. In the early 1990s, the introduction of PD phase-resolved acquisition and 3D-representation as well as ultra-wide-band detection and pulse registration created new possibilities and tools for automatic pattern recognition and expert systems. A phase-resolved PD pattern is treated as an image; therefore, image-processing algorithms are used for extracting and distinguishing the features of an image.
Today's PD expert systems are often based on statistical parameters and the stochastic nature of PD processes. The most important attributes of partial discharges include their amplitude, rise time, phase position (with respect to alternating voltage), occurrence rate and time interval to the preceding and successive PD pulses. It is known that a strong correlation exists between PD patterns and the sources causing the discharges [25,[42][43][44]. For pattern recognition, various techniques have been employed (e.g., statistical tools, neural networks, fuzzy logic, time-series analysis, principal component analysis, wavelets, etc.), which are supported in the preprocessing phase by advanced signal and image processing techniques. A PD pattern analysis can be performed on PD pulse shape, PD distributions (pulse-height, pulse-phase, pulse-energy, etc.) or PD-derived images. Expert systems based on statistical operators are very sensitive to the harmonics in high voltages; thus, special attention should be paid to the interpretation and classification stage [45].
The discussed PD pattern-recognition techniques are applied to almost all HV objects, such as transformers, generators, motors, bushings, components, GIS (Gas Insulated Switchgear), cables and joints. Many of the algorithms described above rely on a solid reference library containing either generic defect patterns or a set of attributes. As the amount of data is huge (especially in the monitoring applications), there is a need for data compression and a reduction of identification time (mainly in real-time analyzers). The time needed for identification can be significantly reduced by the proper selection of distinguishing parameters for the features. The most challenging aspects of today's partial discharge pattern recognition applications are related to multi-source PD classification, the separation between real internal PDs and both noise and disturbances. The purpose of applying neural networks for the diagnostics of high-voltage insulating systems is to recognize PD images and their qualifications to the appropriate group of images that are characteristic of the typical discharge forms or to trace the dynamic changes in PD images in the case of monitoring. For image recognition, back-propagation networks and self-organizing Kohonen's maps have often been applied in recent years; currently, convolutional neural networks are particularly promising [12][13][14][21][22][23][24].
Partial discharge images have also introduced a new category in PD evaluation, referring to qualitative analysis and defect discrimination. A kind of system that requires no calibration in absolute units (pC, mV, mA, etc.) and in which qualitative discrimination could be performed by the analysis of the shapes of statistically accumulated images would be very desirable, especially in on-site or monitoring measurements. This direction has been a visible trend in PD expert systems over the last few decades.

Architecture of Deep Convolutional Neural Networks
Artificial neural networks (ANN) have been a constant focus of research since the beginning of the 1990s, evolving from simple multi-layer perceptrons (MLP) to today's advanced deep topologies. One of the key accelerators for this was certainly the growth of computational power, based on both the CPU (central processing unit) and the GPU (graphics processing unit), as well as the rapid development of algorithms, architectures and programming environments such as TensorFlow by Google [46]. This approach has disrupted several industries and businesses due to its unprecedented capabilities, versatility and speed of implementation. One of the most advanced directions in AI currently is the deep learning architecture based on convolutional neural networks (CNN). The CNN topology consists of convolutional layers in which the output of each neuron is a function of only a small subset of the previous layer's neurons, in contrast to the MLP structure, where each layer's neurons connect to all of the neurons in the next layer (fully connected layers); i.e., each neuron's output is a transformation of the previous layer that is introduced with an activation function. In their basic structure, neural networks consist of neurons with learnable weights and biases. The convolution operation transforms an input image (matrix I) and a typically much smaller weight matrix (called a kernel or filter M) into an output matrix O: each output element is the sum, over the L filters of the previous layer (filter index l) and over the M × N filter window, of the products of the image values and the corresponding filter coefficients. In this way, a convolutional layer is spatially focused and lighter than a fully connected one, allowing the convolutional representations to learn much more quickly. Typically, various kernels are applied in order to preserve the complexity of an image, give depictions of different parts of the image and generate a collection of feature maps (FM).
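For reference, a common textbook form of this operation, consistent with the symbols named above, can be written as follows. The index convention (unit stride, no padding, no bias term) is an assumption and may differ from the authors' exact notation; the kernel is set in bold to distinguish it from the scalar filter height M.

```latex
O(i, j) = \sum_{l=1}^{L} \sum_{m=1}^{M} \sum_{n=1}^{N}
          I_l(i + m - 1,\; j + n - 1)\, \mathbf{M}_l(m, n)
```

Here I_l denotes the l-th channel (feature map) of the input, and each output position (i, j) depends only on an M × N neighborhood of the input, which is what makes the layer spatially local.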
A graphical illustration of feature map layer creation is shown in Figure 2. The size and number of filters of each convolution layer should be predefined during the network architecture phase. The coefficients of the filter M_k are the unknown elements of a CNN, which are determined by the backpropagation method during the training process. Backpropagation is a technique used to evaluate how each connection between neurons contributes to the output error. When it receives an input, each neuron can completely and independently calculate its output value and the local gradient of that output with respect to its inputs. This phase is called the forward pass. When this phase is over, the backpropagation training phase starts. During backpropagation, each neuron receives the gradient of the overall network output with respect to its own output value and multiplies it by the local gradients of the connections leading into it (the chain rule). The common filter sizes used in CNNs are 3 or 5, creating a 3 × 3 or 5 × 5 mask of pixels, respectively. In the case of a full-color image (e.g., RGB), the dimensions of a size-3 filter are 3 × 3 × 3 (Figure 3). The filter is shifted across the image according to a parameter called "stride"; this defines the number of pixels by which the filter will be moved after each iteration. A conventional stride value for a convolutional neural network is 2. The basic assumption in CNNs (especially in image processing) is that each neuron is strongly affected by its neighbors and that distant neurons have only a small impact. This reflects the property of an image where the spatial correlation between pixels usually decreases as the pixels become more distant from each other. The convolutional neural network topology consists of four main elements: convolution, activation, pooling and classification by fully connected layers.
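The effect of the filter size and the stride on the output can be sketched with a plain-Python, single-channel convolution. This is an illustration only: it assumes no padding and, like most deep-learning frameworks, does not flip the kernel; the function name `conv2d` and the demo values are hypothetical.

```python
def conv2d(image, kernel, stride=1):
    """Single-channel 2D convolution without padding (kernel not flipped)."""
    k_h, k_w = len(kernel), len(kernel[0])
    # Number of positions the filter can take along each axis.
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Sum of element-wise products inside the filter window.
            for m in range(k_h):
                for n in range(k_w):
                    acc += image[i * stride + m][j * stride + n] * kernel[m][n]
            row.append(acc)
        output.append(row)
    return output


# 5 x 5 test image with pixel values 0..24 and a 3 x 3 filter of ones.
demo_image = [[r * 5 + c for c in range(5)] for r in range(5)]
demo_kernel = [[1, 1, 1] for _ in range(3)]

out_stride_1 = conv2d(demo_image, demo_kernel, stride=1)  # 3 x 3 output
out_stride_2 = conv2d(demo_image, demo_kernel, stride=2)  # 2 x 2 output
```

With a stride of 2 the filter visits only every second position, so the 5 × 5 input yields a 2 × 2 feature map instead of 3 × 3.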
In the first step, the feature kernel and an image are lined up, and each image pixel is multiplied by the corresponding feature pixel. Then, the sum of these products is divided by the total number of pixels in the feature, giving the output strength. This signal is passed through an activation function in the following step. In most cases, the rectified linear unit (ReLU) transform function, f(x) = max(0, x), is applied [1,29,32]. This activates a node in a network only if the input is above a threshold of zero: if the input is below zero, the output is zero, whereas above zero the output has a linear relationship with the input variable. In this way, all of the negative values from the convolution are changed to zero, while all the positive values remain unchanged. The next step is called pooling, in which the downsampling (compression) and smoothing of the feature map (FM) are performed. This process is usually undertaken by taking the average or the maximum of a sample of the signal and helps to prevent overfitting, which is where the network learns facets of the training cases too well and fails to generalize to new data. MaxPooling is the most commonly applied approach; it takes the maximum value of the pixels within an individual pooling window. Thus, in the case of a 2 × 2 pooling filter, a quarter of the values are preserved, and three-quarters are dropped. A similar function is executed by the dropout operation, which is performed by multiplying the weight matrix of neurons W with a mask vector D. A visualization of the dropout operation is shown in Figure 4. Depending on the positions of 1 and 0 in the mask vector D, certain neurons are discarded in matrix W (i and j refer to the position of a neuron in a layer; k is the layer number), providing a reduced topology W_D (Figure 4b), which results in a shorter training time for each epoch. Another tuning operation is batch normalization.
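The activation, pooling and dropout steps described above can be sketched in plain Python as follows. The function names are hypothetical, the dropout mask is applied row-wise to a weight matrix as one possible reading of "multiplying W with a mask vector D", and the rescaling of the surviving weights used by many frameworks is deliberately omitted.

```python
import random


def relu(feature_map):
    """Rectified linear unit: negative values become zero."""
    return [[max(0.0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


def dropout_mask(n_neurons, rate, rng):
    """Mask vector D: 0 discards a neuron (probability `rate`), 1 keeps it."""
    return [0 if rng.random() < rate else 1 for _ in range(n_neurons)]


def apply_dropout(weights, mask):
    """Zero the weight rows of the neurons discarded by the mask."""
    return [[w * keep for w in row] for row, keep in zip(weights, mask)]


demo_map = [[-1, 2, 0, -3],
            [4, -5, 6, 1],
            [0, 0, -2, -2],
            [3, -1, -4, 5]]
activated = relu(demo_map)      # negatives replaced by zero
pooled = max_pool(activated)    # 4 x 4 map reduced to 2 x 2

demo_mask = dropout_mask(10, 0.5, random.Random(0))
reduced = apply_dropout([[1, 2], [3, 4], [5, 6]], [1, 0, 1])
```

The 2 × 2 pooling keeps one value out of every four, which corresponds to the quarter of the values preserved in the description above.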
When the input data to the network are normalized, the hidden layers in the CNN can also be normalized to speed up the learning. Thus, batch normalization normalizes the output of a previous activation layer by subtracting the mean of a batch and dividing it by the standard deviation of a batch. The final classification step is based on fully connected layers (FCs), meaning that the neurons of the preceding layers are connected to every neuron in the subsequent layers (similar to an MLP neural network). As a preparation, a prior flattening operation is required to present the data in the form of a vector. The primary purpose of the ANN is to analyze the cluster of input features and fuse them into different attributes that will be used in the classification. Thus, the layers form collections of neurons that represent different features of an image. When enough of these neurons are activated in response to an unknown input image, the image will be classified properly. The general architecture of a convolutional neural network is shown in Figure 5.
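The normalization step itself amounts to a few lines of arithmetic. The sketch below normalizes a single feature over one batch; the small constant `epsilon`, added for numerical stability, is an assumption borrowed from common practice, and the learnable scale and shift parameters that frameworks usually add afterwards are omitted.

```python
import math


def batch_normalize(batch, epsilon=1e-5):
    """Subtract the batch mean and divide by the batch standard deviation."""
    n = len(batch)
    mean = sum(batch) / n
    variance = sum((x - mean) ** 2 for x in batch) / n
    return [(x - mean) / math.sqrt(variance + epsilon) for x in batch]


# Hypothetical activations of one neuron over a batch of four samples.
normalized = batch_normalize([1.0, 2.0, 3.0, 4.0])
normalized_mean = sum(normalized) / len(normalized)
normalized_variance = sum(v * v for v in normalized) / len(normalized)
```

After normalization the batch has a mean close to zero and a variance close to one, regardless of the original scale of the activations.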

Experimental Results
The application of convolutional neural networks to a sequence of partial discharge images is presented here. As one of the key indicators of high-voltage insulation deterioration, partial discharges are often used in monitoring systems. The test specimen was aged under high electric stress, and the measurement results were saved continuously within a predefined time period. The sequence of the phase-resolved PD images taken from the long-term aging experiment was analyzed. The four distinguishable classes of the electrical insulation degradation process were defined, mimicking the changes that occurred within the electrical insulation in the specimens (i.e., start, middle, end and noise/disturbance), with the goal of properly recognizing these stages in the untrained image samples. Representative PD images of the distinctive classes in the long-term monitoring of the electrical insulation aging are shown in Figure 6. The presented results were developed in the Python environment with the TensorFlow, Keras, and Scikit-learn deep-learning frameworks [46,47]. The machine-learning algorithms implemented in these environments expect the data to be represented and stored in a two-dimensional array in a particular format ([n_samples, n_features]), where a sample can be a PD image and a feature is a distinct label of the class. The PD image has three spatial dimensions; i.e., width, height and a third dimension (the depth corresponding to the number of channels of the image). The input set consisted of 250 training images of each class and 25 test images per class. An exemplary set of both the training and test images is shown in Figure 7. In both cases, the four distinguishable classes (start, middle, end and noise/disturbance) of the electrical insulation degradation stages are presented. The exemplary convolutional neural network architecture used in the experiments is presented in Figure ??. 
As mentioned above, designing, configuring and testing neural networks is a complex task with a huge number of hyperparameters and many degrees of freedom. In the presented example, the structure as well as the hyperparameters were chosen by trial and error.
The CNN consists of up to six convolutional Conv2D and MaxPooling layers, followed by up to five fully connected (FC) hidden layers, each having from 128 to 1024 sigmoidal nodes. The output layer has a number of outputs that is equal to the number of recognized classes of PD images.

Discussion
The exemplary parameters of the test CNN network architecture applied to PD images are illustrated in Figure 8. The individual blocks represent consecutive layers in the network structure, indicating the input and output dimensions. The first part refers to the Conv2D convolutional feature maps (FM, including the MaxPooling and DropOut stages), while the fully connected (FC) dense network after the flattening operation is depicted in the second part.
In the preprocessing phase, PD images were resized to a format of 128 × 128 × 3 and individually labeled. The feature mask kernel was set to 5 × 5 pixels (or to 3 × 3 in some test cases), and up to six convolutional layers were implemented. The number of convolution kernels M was varied between 32 and 128. The dense fully connected layers were defined with 512 or 1024 neurons and the ReLU activation function, followed by an output layer with the Softmax activation function and a number of neurons corresponding to the number of classes to be recognized. The model was trained with various batch sizes (e.g., 32-256) and numbers of epochs, and the data were split into training and validation sets (20-30%). A dropout rate of 0.2 to 0.5 after the convolution and feature map layers was tested.
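To get a feeling for how these choices translate into layer dimensions and model size, the following sketch traces the tensor shapes and counts the trainable parameters of one possible configuration. It assumes "same" padding with unit stride in the convolutions and 2 × 2 pooling, and it picks three convolution stages with 64 filters and a 1024-512-4 dense part purely for illustration; the function `trace_cnn` is hypothetical and the resulting numbers are not taken from the paper.

```python
def trace_cnn(input_shape, conv_filters, dense_units, kernel=5, pool=2):
    """Return (flattened_size, total_parameters, layer_list).

    Assumes 'same' padding and stride 1 in each convolution, so only the
    pooling stages reduce the spatial size.
    """
    height, width, channels = input_shape
    layers, total = [], 0
    for filters in conv_filters:
        # Each filter has kernel*kernel*channels weights plus one bias.
        params = (kernel * kernel * channels + 1) * filters
        channels = filters
        layers.append(("Conv2D", (height, width, channels), params))
        total += params
        height, width = height // pool, width // pool
        layers.append(("MaxPooling2D", (height, width, channels), 0))
    flattened = height * width * channels
    layers.append(("Flatten", (flattened,), 0))
    inputs = flattened
    for units in dense_units:
        # Each dense neuron has one weight per input plus one bias.
        params = (inputs + 1) * units
        layers.append(("Dense", (units,), params))
        total += params
        inputs = units
    return flattened, total, layers


flat_size, n_params, layer_list = trace_cnn(
    input_shape=(128, 128, 3),
    conv_filters=[64, 64, 64],
    dense_units=[1024, 512, 4],
)
for name, shape, params in layer_list:
    print(f"{name:<13} output={shape} parameters={params}")
```

In this configuration the first dense layer dominates the parameter count, which illustrates why additional pooling stages or fewer dense neurons reduce the model size considerably.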
According to the general rules, adding more feature map layers is recommended if the existing network is not able to recall the image attributes properly, whereas extra dense layers are needed in the case of network abstraction expansion. On the other hand, network expansion in terms of a feature map and hidden layers might lead to overfitting, in contrast to an overly small model that would be underfit. There are also various strategies regarding the selection of the size of the convolution layers; i.e., either they should be kept at the same size or the size should be increased as they go deeper in subsequent stages. The whole data package of PD images can be divided into three categories: a training set, a validation set and a test set. The first set is used to fit the parameters, such as by adjusting the weights. The role of the validation set is to tune the parameters, including the matching of the architecture. The test set is used to assess the performance in terms of generalization and the power of prediction.
The recognition score presented below takes the form of an array, in which the rows correspond to the distinguishable classes and the columns to the number of the test set. The contents of the score array show the numeric recognition probability along with a color palette. The exemplary trial of PD monitoring stage recognition is shown in Figure 9. The performance was assessed by the accuracy and loss metrics. A training accuracy of 97% and a validation accuracy of 92% were achieved in this case. The accuracy adjustment after 30 epochs is shown in Figure 9a and that after 200 epochs is shown in Figure 9b, where the saturation effect is visible. Comparisons of the partial discharge recognition score after 10 epochs and after 50 epochs are shown in Figures 9c and 9d, respectively. The visualization of the output filter banks provides an illustration of the automatic feature extraction (as shown in Figure 10). A comparison of the filters in the first layer (Figure 10a) and third layer (Figure 10b) shows that when going deeper in the network structure in terms of the convolution number, the feature map changes from fine (focused on details) to coarser (revealing the general aspects and providing compression).
Another assessment criterion of the CNN is a confusion matrix (also called an error matrix), which is a technique used to summarize the performance of a classification algorithm. Each column of the matrix refers to the instances in a predicted class, while each row expresses the instances in an actual (true) class. Thus, this is a way of tabulating the number of misclassifications; i.e., the number of predicted classes that ended up in an incorrect classification slot based on the true classes. By definition, the size of a confusion matrix (its number of rows and of columns) is equal to the number of classes. An exemplary confusion matrix for a set of 400 validation PD images subdivided into four classes (start, middle, end and noise) representing the stages of electrical insulation aging monitoring is shown in Figure 11. The network performance may also be interpreted in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). One general metric for evaluating classification models is the accuracy A; i.e., the fraction of predictions that the model got correct, A = (TP + TN) / (TP + TN + FP + FN). Further, the following two metrics are used: the precision P, which is the proportion of positive identifications that were correct, P = TP / (TP + FP), and the recall R, which is the proportion of actual positives that were identified correctly, R = TP / (TP + FN). These define the overall measure of a model's accuracy (called F1 in statistical analysis), which combines precision and recall as their harmonic mean, F1 = 2PR / (P + R). The extension of the measure F1 to a multiclass case with a number P of classes is given in [48]. A comparison of the multilabel classification performance of the CNN is shown in Table 1. The results are presented for an exemplary PD image belonging to the "end" class from the test data set in the form of the accuracy A of the proper assignment to the actual class "end". The results for the training and validation accuracy show that a tradeoff should be made between the model complexity, the tuning of the hyperparameters and time constraints.
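The metrics above can be derived directly from a confusion matrix whose rows are the actual classes and whose columns are the predicted classes. The following sketch uses an invented 4 × 4 matrix for 400 images (it is not the matrix of Figure 11), and it averages the per-class F1 values with equal weights (macro averaging), which is one common multi-class convention assumed here and not necessarily the formula of [48].

```python
def classification_metrics(confusion):
    """Return (accuracy, per_class, macro_f1) for a square confusion matrix.

    confusion[r][c] is the number of samples of actual class r that were
    predicted as class c; per_class holds (precision, recall, f1) tuples.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n)) / total
    per_class = []
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_as_k = sum(confusion[r][k] for r in range(n))  # column sum
        actually_k = sum(confusion[k])                           # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        per_class.append((precision, recall, f1))
    macro_f1 = sum(f1 for _, _, f1 in per_class) / n
    return accuracy, per_class, macro_f1


# Hypothetical counts for the classes start, middle, end and noise.
demo_confusion = [
    [95, 5, 0, 0],
    [4, 90, 6, 0],
    [0, 8, 92, 0],
    [1, 0, 0, 99],
]
demo_acc, demo_per_class, demo_macro_f1 = classification_metrics(demo_confusion)
```

For the invented matrix, 376 of the 400 images lie on the diagonal, so the overall accuracy is 0.94, while the per-class values show which aging stages are confused with their neighbors.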
Different 2D convolution topologies were tested (usually with 64-128 filter channels); each stage was followed by a MaxPooling layer. Applying a lower number of filters (16-32) resulted in a rapid drop in accuracy of 30%. Two kernel sizes were compared: 3 × 3 and 5 × 5 (input image size of 128 × 128 pixels; the stride was equal to 2). It was noticed that mixing kernel sizes within consecutive convolution stages (e.g., 5 × 5 followed by 3 × 3) decreased the recognition performance. Up to four fully connected (FC) layers were tested (1024-1024-512-4). It was also observed that increasing the number of layers has certain limitations with respect to the accuracy and comes at the cost of network complexity and computational time. At a certain point, the network will overfit, but this is a very complex topic since there is also an interplay between the number of neurons in each layer, the number of layers and other hyperparameters. This effect is highlighted in Table 1, rows 1 and 5, where the results show that topology 1024-1024-512-4 has an accuracy of 99.14%, while the reduced one, 1024-512-4, shows an accuracy of 99.21% for the same kernels. A tradeoff was observed between the numbers of convolutional layers and fully connected layers. The applied DropOut operation did not result in an improvement in accuracy (but rather in an increased speed of calculation). The results presented above reflect the exemplary CNN's performance and its resilience to manipulations of the network architecture and values of the hyperparameters.

Conclusions
This paper reports the application of a convolutional neural network to partial discharge images with the aim of recognizing the stages of aging of high-voltage electrical insulation. The presented example refers to the monitoring of electrical insulation deterioration. The PD images represented the phase-resolved patterns. The performance of the applied architecture was tested by manipulating the number of feature maps, the size of convolutional layers and kernels as well as the values of hyperparameters. The assessment was based on the recognition score, confusion matrix and accuracy metric. A tradeoff between these parameters was demonstrated.
PD images represent a new category of diagnostic evaluation, referring to qualitative analysis and defect discrimination. A system that requires no calibration in absolute units and in which qualitative discrimination could be performed by the analysis of the shapes of statistically accumulated images would be very desirable, especially in on-site diagnostics or monitoring measurements. This research direction is a currently visible trend in future autonomous PD expert systems. The most challenging aspects of today's partial discharge image recognition are related to multi-source PD classification, the separation between real internal PDs and both noise and disturbances. Thus, future work will focus on adjusting the CNN architecture and hyperparameters for multi-source PD recognition for diagnostic applications.