Deep Learning in High Voltage Engineering: A Literature Review

Abstract: Condition monitoring of high voltage apparatus is of great importance for the maintenance of electric power systems. Whether the task is detecting faults or partial discharges in high voltage equipment, or detecting contamination and degradation of outdoor insulators, deep learning, which is a branch of machine learning, has been extensively investigated. Instead of relying on hand-crafted features as input, as traditional machine learning algorithms do, deep learning algorithms take raw data as input and integrate the feature extraction stage into the learning stage, resulting in a more automated process. This is the main advantage of using deep learning instead of traditional machine learning techniques. This paper presents a review of the recent literature on the application of deep learning techniques in monitoring high voltage apparatus such as GIS, transformers, cables, rotating machines, and outdoor insulators.


Introduction
Diagnosis of electrical insulation degradation is essential for monitoring the integrity of an electric power system. A well-known diagnostic method, which has been employed for a number of decades, is the measurement of localized discharges known as partial discharge (PD) [1]. Detecting faults or PD in electric apparatus, such as transformers, rotating machines, cables, gas insulated switchgear (GIS) and outdoor insulators, has always required experts who are able to characterise and differentiate the different sources of fault, PD, defect, or degradation. Throughout the years, different parameters had to be extracted manually from recorded patterns or signals. The aim has been to use the manually-extracted parameters to implement a classifier that performs the task of differentiation and characterization of fault, PD, defect, or degradation. Though the process is partially automated, the fact that experts have to select the features presents a problem, since different features might result in different outcomes. The performance of the classifier therefore depends on the manually-selected features.
Deep learning allows the feature selection stage to be integrated with the learning process, thus making the process fully automated. In high voltage (HV) applications, the aim has mostly been to classify or localize faults, defects, or PD that occur in HV apparatus or to determine the degradation of insulating material. The abundance of computational capabilities and the existence of big data have allowed researchers in different fields to take advantage of deep learning algorithms. Beyond the main purpose of classifying and localizing PD or faults in HV apparatus, a deep learning algorithm, namely the Generative Adversarial Network (GAN), allows researchers to generate more input data from a limited amount of experimental or simulation results (e.g., see [2]).
Classification refers to the process of differentiating between different sources of fault, defect, or PD, or between levels of degradation. Because faults or PD can arise from various sources in real-life scenarios, it is necessary to identify the source. When the source is identified, one can investigate techniques to eliminate that source from the high voltage system. Different sources or causes of fault, defect, PD, or degradation exhibit characteristics that are unique to each source, which makes their classification (differentiation) feasible. On the other hand, localization refers to the process of identifying the position of the fault or PD taking place in high voltage apparatus [3].
Extensive research has been done on the use of traditional machine learning techniques in high voltage applications, e.g., [4][5][6][7]. This paper only considers the literature that employs deep learning techniques. Furthermore, the focus of this review paper is solely on the application of deep learning in high voltage engineering and not on the deep learning algorithms themselves. Figure 1 gives an overview of the flow of this review paper.  A summary of the papers on classification of PD using deep learning is shown in Table 1. Further to PD classification, deep learning has also been used for other applications in transformers and outdoor insulators that are also reviewed in this paper.
The organization of the paper is as follows: In Section 2, four deep learning techniques (namely convolutional neural networks, recurrent neural networks, autoencoders, and generative adversarial networks) are briefly reviewed. These are the commonly-used techniques in the HV application. In Sections 3-8, a review of papers on the application of deep learning in HV apparatus including GIS, transmission line networks, rotating machines, cables and solid insulators, transformers, and outdoor insulators is presented. Finally, the concluding remarks are presented in the last section of the paper, Section 9.

Table 1. Summary of deep learning algorithms used for classification of PD in various HV applications: specification of characteristics of collected data used and whether multiple-labeled sources of PDs are mentioned.
Application | Input data | Data source | Multiple PDs | Algorithm | Reference
… | … | … | … | … | [20,21]
Transformer | PRPD | Lab | Yes * | ResNet | [22]
Cables | PRPD | Lab | No | Transfer Learning on CNN | [23]
Rotating Machines | PRPD | Field | Yes | ANNs incorporated in a hierarchical fashion | [24]
Cables | T-S waveforms | Simulation | No | CNN | [25]
Cables | T-S waveforms | Lab | No | CNN, DBN | [26,27]
Power Cables | T-S waveforms | Field | No | Ensemble of deep learning algorithms (CNN, convolutional RNN, LSTM and bidirectional LSTM) | [28]
Hydrogenerators | T-S waveforms | Field | No | Variational Autoencoder | [29]
GIS | T-S waveforms | Simulation | No | CNN-LSTM | [37]
Power Lines | T-S waveforms | Field | No | 1D-CNN with Global Average Pooling layer, Dual Cycle-consistency network | [38,39]
Conductors | T-S waveforms | Field | No | Time-series decomposition and LSTM, CNN-LSTM with attention layer | [40,41]
Insulators | T-S waveforms | Lab | No | CNN with Bayesian optimization for hyper-parameters tuning | [42]

Deep Learning
Deep learning is a branch of machine learning that enables data-driven learning of feature representations for input data originating in diverse application domains [43][44][45][46]. Unlike traditional machine learning algorithms, where features need to be extracted explicitly through pre-defined hand-crafted rules, deep learning has the advantage of using raw data and learning to extract features depending on the task [47]. This is especially valuable in complex systems where such features are not necessarily known for a given dataset. As a result, deep neural networks subsume the feature extraction step within the learning phase, thereby computing intrinsic representations of the raw input data in an automatic manner.
Similar to traditional machine learning, deep learning also has the following three key paradigms: supervised, unsupervised, and reinforcement learning.
For the supervised setting, a labeled dataset is required. The type of output can either be continuous (used in a regression problem) or discrete/categorical (used for classification). For unsupervised systems, data with no labels are given, and the objective is either to cluster the data according to their intrinsic characteristics or to learn representations which can later be used in downstream supervised or unsupervised settings [48][49][50]. In scenarios involving agent-based learning, exhaustive collection of supervised data is often prohibitively difficult. In such situations, reinforcement learning is a powerful paradigm which allows data collection through interaction with the environment [51]. The agent's goal is to learn policies based on the environment in order to maximize long-term expected rewards [51]. In recent years, research in deep reinforcement learning has gained significant traction, wherein the agent policies are learnt through deep neural networks [52][53][54] (see Figure 3). Depending on the inputs and the desired outputs for most high voltage applications, a handful of mainly supervised deep learning algorithms have been of interest in this area of research. In the next section of this review paper, a brief introduction to the major supervised deep learning algorithms is presented.

Figure 3. Different deep learning branches: supervised, unsupervised, and reinforcement learning. Supervised learning (classification and regression) needs labeled data; unsupervised learning clusters data according to their intrinsic characteristics or learns unsupervised feature representations; reinforcement learning bases decisions on experience and interaction with the environment.

Convolutional Neural Networks
Convolutional neural networks (CNNs) represent a class of deep learning architectures which were originally designed for processing data represented in a grid-like topology, e.g., images [47]. A CNN has four main components: convolutional layer, activation function layer, pooling layer, and fully connected layers. Typically, the output of the convolutional layer is passed to an activation function layer, whose output is in turn passed to the pooling layer. In a deep network, this set of three components is often cascaded multiple times, thereby constituting multiple layers and making the network progressively deeper [43,55,56]. While the initial layers usually end up learning low-level features, the deeper layers tend to learn more complex features. The cascade of these layers constitutes the automatic feature extraction stage, and the fully connected layers constitute the classification stage [57]. More details on each of these components are presented below.
Convolutional layer: This layer consists of a bank of learnable linear 1D, 2D, or 3D filters, which are also called kernels [58]. In high voltage applications, 1D and 2D CNNs are usually used. The 1D-CNN, for example, is used with time-series waveforms, whereas in problems involving phase resolved partial discharge (PRPD) patterns or spectrograms, a 2D-CNN is used. Some researchers have employed a 2D-CNN for time-series waveforms as well, where they considered an image of the signal as the input rather than the 1D data. These filters are convolved with the input data or the output from a previous layer. The output is a set of feature maps, where the number of feature maps is equal to the number of filters.

Activation function layer: The purpose of adding activation layers is to introduce nonlinearity in the input-to-output mapping being learned by the neural network. This is desired because complex data include nonlinear features that need to be detected. The most frequently employed activation functions include sigmoid, ReLU [59] and tanh [60].
Pooling layer: The aim of the pooling layer is to subsample the output feature maps so that wider receptive fields can be spanned during convolution without increasing the size of the filter kernel. Another advantage of this layer is that it provides positional invariance, or shift-invariance, to the network [61]. Commonly-used pooling operations are maximum pooling and average pooling.
Fully connected layers: In a fully connected (FC) layer, every neuron in one layer is connected to every neuron in the next layer. FC layers are also referred to as dense layers in the literature [47]. In a CNN, the input to the first fully connected layer is the output of the last set of the first three components mentioned above, where the corresponding feature maps are flattened into 1D vectors. For classification problems, FC layers are appended to the architecture, which ends with a classification layer where the number of neurons is equal to the number of classes.
A typical CNN architecture is shown in Figure 4. The main advantage of CNNs compared to traditional neural networks is the weight sharing when training the learnable kernels, which reduces the number of learnable parameters in the network [62].
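The convolution, activation, and pooling stages described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the 8-sample waveform, the two filters, and the helper names are hypothetical and are not taken from any of the reviewed papers.

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation (what deep learning libraries call convolution)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def max_pool(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

# Hypothetical 8-sample waveform and two hand-picked filters.
signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, -2.0, 0.0]
filters = [[1.0, -1.0], [0.5, 0.5]]

# One feature map per filter: convolution -> activation -> pooling.
feature_maps = [max_pool(relu(conv1d(signal, f))) for f in filters]

# Weight sharing: each filter has len(kernel) weights regardless of signal length,
# whereas a dense layer mapping 8 inputs to 7 outputs would need 8 * 7 weights.
conv_weights = sum(len(f) for f in filters)
dense_weights = len(signal) * (len(signal) - 1)
```

The number of feature maps equals the number of filters, and the parameter count of the convolutional stage does not grow with the length of the input, which is the weight-sharing advantage mentioned above.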

Recurrent Neural Networks
Recurrent neural networks (RNNs) are another family of deep learning architectures, intended for the processing of sequential data [63,64]. A simple illustration of an RNN model is shown in Figure 5.
An RNN processes data from each time point in a sequential manner. However, the output is not just influenced by the data at the current time, but also by the entire history of inputs that have been fed into the RNN previously. This is reflected by the cycles in the architecture, which are maintained in the hidden unit as a state vector that includes the history of the previous time points. RNN cells have one common set of weights, and when backpropagation runs, data from different time points contribute to updating the same set of weights.
Figure 5. RNN architecture with no output: the network has feedback connections which can be unfolded in time and trained using back-propagation. The input X is processed by incorporating it into the state S that is passed forward through time.
The mathematical representation of the RNN is
$$h_t = f_w(h_{t-1}, x_t),$$
where $h_t$ represents the updated state vector, $h_{t-1}$ is the hidden state vector from the previous time step, $x_t$ represents the input vector at time $t$, and $f_w$ represents a given function parameterized by the learnable weight vector $w$. The input can either be a vector or a sequence, and the output can be a vector, a sequence, or a single value. For example, in a high voltage problem where classification of different PD pulses is required, the input is a vector of PD pulses and the output is a label corresponding to a PD source.
The drawback of a typical RNN is the long-term dependency, where the current state depends on all the previous states, which causes the vanishing gradient problem [47]. The vanishing gradient emerges from the fact that, as the RNN processes more time steps, repeated multiplication by small weights causes the gradients to approach zero. To overcome this problem, the long short-term memory (LSTM) architecture is used [65]. The main difference in an LSTM architecture is that, instead of computing the hidden state directly from the previous one, the LSTM computes additional states, and this structure provides alternative paths for gradients to flow during backpropagation, avoiding repeated matrix multiplications [66]. An LSTM cell has two states, $c_t$ corresponding to the cell state and $h_t$ corresponding to the hidden state, which are calculated as
$$\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix},$$
$$c_t = f \odot c_{t-1} + i \odot g,$$
$$h_t = o \odot \tanh(c_t),$$
where $i$ is the input gate, $f$ is the forget gate, $o$ is the output gate, $g$ is the candidate (cell update) gate, $W$ is the learnable weight matrix, and $\sigma$ is the sigmoid function. The operator $\odot$ denotes element-wise multiplication. The input gate decides what new information will be stored in the cell state, the forget gate decides what information will be removed from the cell state, and the output gate decides what information from the cell state will be used in the output [67].
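The roles of the input, forget, and output gates can be made concrete with a single-unit LSTM step. The sketch below assumes the commonly used LSTM formulation with scalar states; the weight values and the input sequence are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a single-unit LSTM (scalar states), standard formulation.

    params maps each gate name to (w_h, w_x, b); the values are illustrative.
    """
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    i = sigmoid(pre("i"))      # input gate: how much new information to store
    f = sigmoid(pre("f"))      # forget gate: how much of the old cell state to keep
    o = sigmoid(pre("o"))      # output gate: how much of the cell state to expose
    g = math.tanh(pre("g"))    # candidate cell update
    c_t = f * c_prev + i * g   # additive update that eases gradient flow
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Hypothetical weights and a short input sequence.
params = {"i": (0.5, 1.0, 0.0), "f": (0.5, 1.0, 1.0),
          "o": (0.5, 1.0, 0.0), "g": (0.5, 1.0, 0.0)}
h, c = 0.0, 0.0
for x in [0.2, -0.1, 0.4]:
    h, c = lstm_step(x, h, c, params)
```

Because the cell state is updated additively rather than through a repeated matrix product, the gradient with respect to earlier cell states is scaled by the forget gate instead of by a chain of weight multiplications.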

Autoencoder
Autoencoders (AEs) were introduced in the 1980s [68] in order to learn useful representations in an unsupervised fashion using only the input data [69]. They were then reintroduced in 2006 with the boom of deep learning architectures [70]. The idea behind an autoencoder is to train a neural network such that the model learns a latent intrinsic representation of the original input. An autoencoder consists of an encoder-decoder architecture, wherein the role of the encoder is to transform the input to a latent representation, while the decoder is responsible for transforming the latent representation back to the original data. The two parts (encoder and decoder) are learned jointly so as to minimize the reconstruction error between the decoder's output and the network input. A simple illustration of an autoencoder model is shown in Figure 6. Assuming that the input is $x$ and the reconstructed output is $\tilde{x}$, the model is trained to minimize the reconstruction error $L(x, \tilde{x})$. The encoder and decoder can be fully connected layer networks or any deep learning architecture. The encoder is expressed as a function $G$ such that
$$b_i = G(x_i),$$
where $b_i$ represents the latent feature representation (bottleneck) of a single observation sample $x_i$. The decoder $F$ accepts $b_i$ as input and produces $\tilde{x}_i$:
$$\tilde{x}_i = F(b_i) = F(G(x_i)).$$
The goal is then to find the $F$ and $G$ that solve
$$\arg\min_{F, G} \sum_i L(x_i, F(G(x_i))),$$
where the summation is over all the observations during training.
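As a minimal sketch of the reconstruction objective, the following code evaluates the summed reconstruction error for a toy linear autoencoder with a one-dimensional bottleneck. The tied linear encoder and decoder, the squared-error loss, and the sample values are all assumptions made for illustration; practical autoencoders use learned, usually nonlinear, networks.

```python
def encode(x, w):
    """G: project a 2-D sample onto direction w, giving a 1-D bottleneck value."""
    return x[0] * w[0] + x[1] * w[1]

def decode(b, w):
    """F: map the bottleneck value back to 2-D along the same direction."""
    return [b * w[0], b * w[1]]

def reconstruction_error(samples, w):
    """Sum of squared errors between each x_i and F(G(x_i))."""
    total = 0.0
    for x in samples:
        x_hat = decode(encode(x, w), w)
        total += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
    return total

# Hypothetical samples lying on the line x2 = x1.
samples = [[1.0, 1.0], [2.0, 2.0], [-3.0, -3.0]]
s = 2 ** -0.5
good = reconstruction_error(samples, [s, s])      # direction matches the data
poor = reconstruction_error(samples, [1.0, 0.0])  # direction ignores x2
```

Training an autoencoder amounts to searching for the encoder and decoder parameters (here, the direction `w`) that drive this error down; the bottleneck direction aligned with the data reconstructs it almost exactly, while the misaligned one does not.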

Generative Adversarial Networks
Generative models allow sampling data from the probability distribution of a given dataset. There is a long tradition of learning data distributions, including methods for density estimation [71,72]. For high-dimensional continuous data, classical density estimation techniques become intractable. Generative Adversarial Networks (GANs) enable generative models for data stemming from unknown probability distributions [73]. The key idea is to train a neural network, the generator, to map a datapoint sampled from a simple distribution (such as a normal distribution) to data from the training data distribution. To assess whether the generated data match the training distribution, another network, known as the discriminator, is trained to distinguish between the generated examples and examples from the original dataset. The goal of the GAN is to train the generator network such that the best discriminator performs as poorly as possible. In other words, when the generator parameters have been updated in such a way that it becomes difficult to train a classifier for distinguishing between generated and real samples, the generator is producing outputs that resemble examples from the training data distribution. A simple illustration of a GAN model is shown in Figure 7.
Figure 7. GAN architecture: the discriminator network learns a classifier to distinguish between generated and real samples. The generator network updates its parameters so that the discriminator's task is as difficult as possible.
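The adversarial goal can be illustrated numerically. The sketch below assumes the widely used minimax value function, i.e., the mean of log D(x) over real samples plus the mean of log(1 - D(G(z))) over generated samples; the discriminator outputs are hypothetical numbers rather than the outputs of a trained network.

```python
import math

def gan_value(d_real, d_fake):
    """Empirical GAN value: mean log D(x) + mean log(1 - D(G(z))).

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    The discriminator tries to raise this value; the generator tries to lower it.
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs.
confident = gan_value([0.9, 0.95], [0.1, 0.05])  # discriminator separates well
fooled = gan_value([0.5, 0.5], [0.5, 0.5])       # discriminator is guessing
```

When the discriminator can do no better than output 0.5 for every sample, the value equals -log 4, which corresponds to a generator whose samples are indistinguishable from the training data.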

Gas Insulated Switchgear
Classification of PDs has been a standard procedure in the maintenance of high voltage assets. Various distinctive parameters extracted from PD measurements have been introduced for PD classification. Starting with current or voltage signals, time-series waveforms have been shown to exhibit behavior that is unique to each source of fault or PD [74]. Phase resolved partial discharge (PRPD) patterns have also been used to differentiate between different PD sources [75].
Gas insulated switchgear (GIS) is widely used in the industry [76]. A GIS platform is shown in Figure 8. The components of a GIS are close to each other, so a fault in one component can easily propagate to other components. A GIS consists of power conducting components and the control system. The power conducting components are responsible for ensuring the flow of electric current in the system, and the control system monitors the behavior of the conducting components. Thermal, mechanical, and electrical faults are the main faults that take place in a GIS platform. This section reviews the application of deep learning to the classification of PDs (electrical faults) only in GIS. For the condition monitoring of GIS, different detectors have been used in the literature, including antennas, coupling capacitors, high frequency current transformers (HFCTs), UV sensors, thermal sensors, and Rogowski coils.
There are different sources of PD that take place in a GIS platform. Whether using PRPD patterns or time-series waveforms, we can divide the literature into the following two main groups.

PD Classification Using the PRPD Pattern
In 2017, the authors of [8] simulated four sources of PDs that take place in a GIS platform. The four sources of PDs are: protrusion, contamination, gap, and particle defects. For each of the sources of PDs, four severity states of the PD were collected: normal state, attention state, serious state, and dangerous state. The authors proposed a stacked sparse autoencoder (AE) model, where the output of the middle layer (bottleneck) of the preceding AE is the input to the next AE. The output of the middle layer of the final AE is the input to a softmax layer, which decides on the severity level label assigned to each sample. The effect of changing different hyperparameters, such as the number of stacked AEs or the number of nodes in the middle layer, was examined. The proposed model was compared with a support vector machine (SVM), where nine statistical characteristics were extracted from the PRPD patterns. The study reports enhanced average classification accuracy of the PD severity compared to SVM.
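A softmax layer of the kind mentioned above converts the final layer's raw scores into class probabilities. The following is a generic sketch, not the model of [8]: the four severity labels come from the text, but the score values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

states = ["normal", "attention", "serious", "dangerous"]
logits = [0.2, 1.5, 3.1, 0.7]   # hypothetical outputs of the last layer
probs = softmax(logits)
predicted = states[probs.index(max(probs))]
```

The predicted severity state is the one with the largest probability, and the probabilities always sum to one.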
The authors of [12] published in 2018 collected PRPD data from experimental setup and more than 30 live GIS substations. The five defects studied in this work are floating electrode discharge, surface discharge, corona discharge, insulation void discharge, and free metal particle discharge. A known CNN architecture, LeNet-5, was employed and compared with back propagation neural network (BPNN) and SVM, where statistical features were extracted for the latter two and raw data were fed to the CNN. The statistical features extracted from the PRPD patterns include skewness, steepness, asymmetry and cross correlation coefficient of the PD amplitude and rate in both positive and negative half cycles of the applied voltage. In order to optimize the weights of the CNN, the authors trained an autoencoder and used the weights as an initialization for the CNN training. The authors reported an improved average accuracy of the CNN compared to that of SVM and BPNN.
A long short-term memory (LSTM) recurrent neural network (RNN) has been used to classify PRPDs in a GIS in [9]. In this work, four different types of defects have been simulated in a controlled lab environment, where the PRPD patterns have been collected. The PD sources simulated are: protruding electrodes, floating electrodes, free particles, and void defects. In addition to that, the authors simulated noise by using an air purifier and the noise signals were obtained using the external UHF sensor. The authors compared the classification accuracy with SVM and a fully dense artificial neural network. The study reported that, although the proposed model takes more training time, the classification accuracy is superior to the other two machine learning models.
In 2019, the authors of [13] simulated PDs using a laboratory setup and collected PD data from a live substation. The four PD sources considered in this work that take place in a GIS platform are: floating electrode defects, metallic protrusion defects, insulation void discharge defects, and free metal particle discharge defects. A variational autoencoder (VAE) was trained to extract the eigenvalues corresponding to the PRPD data. The training set included a mix of both the laboratory and substation data. For the test dataset, a matching algorithm based on cosine distance was used to decide which class a test PRPD belongs to. The proposed method was compared with statistical features, deep belief networks (DBN) [77] and CNNs. The authors reported that the eigenvalues extracted from the VAE feature vector gave improved results over the other methods used.
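Matching a test sample to a class by cosine distance can be sketched as follows. The exact matching procedure of [13] is not described here, so this is an assumption: each class is represented by one reference vector, and the 3-D latent vectors are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def match_class(feature, references):
    """Return the label whose reference vector is closest in cosine distance."""
    return min(references,
               key=lambda label: cosine_distance(feature, references[label]))

# Hypothetical 3-D latent vectors, one reference per PD source.
references = {
    "floating electrode": [1.0, 0.1, 0.0],
    "metallic protrusion": [0.0, 1.0, 0.2],
    "insulation void": [0.1, 0.0, 1.0],
    "free metal particle": [0.7, 0.7, 0.0],
}
label = match_class([0.1, 0.9, 0.3], references)
```

Cosine distance depends only on the direction of the feature vectors, so two patterns that differ mainly in overall magnitude are still matched to the same class.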
Four sources of PD that usually occur in GIS (protruding electrodes, floating electrodes, void defects, and free particles) were simulated in [10]. The authors used a Siamese network where the raw input data are pairs of PRPDs. The motivation behind using Siamese network is that PDs usually result in small datasets. Two identical independent CNN models are trained, and the distance between the embedded features resulting from the two CNN models is calculated. As a result, a decision is made whether the pair belong to the same or a different class. The authors compared their proposed architecture with SVM and a CNN model. They reported that the proposed method performed better compared to the latter two.
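The pairwise decision described above can be sketched as a distance threshold on two embeddings. In this illustration a single linear layer with tanh stands in for the CNN branch, and both inputs are embedded with the same weights, which is the usual Siamese arrangement; the weights, inputs, and threshold are hypothetical and do not reproduce the architecture of [10].

```python
import math

def embed(pattern, weights):
    """Stand-in for a CNN branch: one linear layer followed by tanh."""
    return [math.tanh(sum(w * p for w, p in zip(row, pattern))) for row in weights]

def same_class(pattern_a, pattern_b, weights, threshold=0.5):
    """Embed both inputs with the same weights and threshold their distance."""
    ea, eb = embed(pattern_a, weights), embed(pattern_b, weights)
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(ea, eb)))
    return distance < threshold, distance

# Hypothetical embedding weights and three flattened input patterns.
weights = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
a = [0.9, 0.1, 0.0]
b = [1.0, 0.0, 0.1]
c = [0.0, 0.2, 1.0]

similar, d_ab = same_class(a, b, weights)
different, d_ac = same_class(a, c, weights)
```

Because the network learns from pairs rather than from individual samples, a small dataset yields many more training examples, which is the motivation given above for using this approach with PD data.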
Ref. [11] presents the simulation of artificial defects that take place in a GIS platform, where the PRPD patterns corresponding to each source of PD were recorded. The four PD sources in this work are: corona, floating electrode, particle, and void. The authors proposed a multi-head self attention LSTM based model for PD (LSANPD) and a self attention based neural network model for PD (SANPD). They compared the classification of the two models with their previous published work which used an LSTM-RNN based model. They reported that SANPD and LSANPD are better in terms of classification accuracy and that SANPD is better than the LSANPD and LSTM-RNN model in terms of complexity.

PD Classification Using Time-Series Waveform
Some researchers have used sensors, such as HFCT or UHF sensors, to record voltage or current waveforms induced by PD. The waveforms are in the time domain and referred to as time-series waveforms. This section reviews the papers that have used time-series waveforms instead of PRPD patterns to train and optimize a deep learning algorithm.
In 2018, five sources of PD were generated in a GIS tank model in a laboratory setup: a floating electrode, a metal protrusion on the conductor, a metal protrusion on the tank, surface contamination, and free metal particles [35]. Four planar spiral antennas were installed at different locations on the tank. For each signal collected, the authors calculated three different short time Fourier transforms (STFTs) by changing the window length. The different window lengths correspond to high time resolution, high frequency resolution, and medium resolution. The proposed model was a CNN-LSTM based model. The three different STFTs calculated from each signal were used to train three different CNN models, and the three outputs of the CNN models were combined by a fully connected layer. The output of the fully connected layer is the input to the LSTM. Since there are four sensors, the model comprises four fully connected layers, which feed four separate LSTMs. The outputs of the LSTMs collectively decide on the label of each input sample.
The authors compared the model performance with other baseline models and with the case where a single window length was used for the STFT. The model showed improved results compared to the other models.
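The effect of the STFT window length on time and frequency resolution can be seen in a small standard-library sketch. The test tone, the three window lengths, the Hann window, and the 50% overlap are assumptions for illustration and are not the settings used in [35].

```python
import cmath
import math

def stft_magnitude(signal, window, hop):
    """Magnitude STFT with a Hann window; returns frames x (window // 2 + 1) bins."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / window) for n in range(window)]
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        seg = [signal[start + n] * hann[n] for n in range(window)]
        # Direct DFT of the windowed segment, non-negative frequencies only.
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / window)
                    for n in range(window)))
            for k in range(window // 2 + 1)
        ])
    return frames

# Hypothetical 64-sample test tone with 8 cycles over the record.
signal = [math.sin(2 * math.pi * 8 * n / 64) for n in range(64)]

# Short, medium, and long windows: time resolution versus frequency resolution.
spectrograms = {w: stft_magnitude(signal, w, w // 2) for w in (8, 16, 32)}
shapes = {w: (len(s), len(s[0])) for w, s in spectrograms.items()}
```

A short window yields many time frames but few frequency bins, while a long window yields the opposite; feeding all three spectrograms to the network lets it use both views of the same signal.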
In 2019, the authors of [30] investigated GIS PD data which were collected from laboratory experiments and finite difference time domain simulations. A conditional variational autoencoder (CVAE) was used to generate more training data. A seven-layer CNN model was used for the classification of four different sources of PD: free metal particle, metal tip defects, floating electrode defects, and insulation void. The authors also reported a visualization of the feature maps from the first two convolutional layers. They compared their results with support vector machine (SVM), decision trees (DT), back propagation neural network (BPNN), and a few CNN architectures (LeNet5, AlexNet and VGG16). The proposed CNN model outperformed the above-mentioned models.
Using a laboratory setup, four defects that take place in medium voltage switchgear were replicated [31]. The four sources of PDs were: cable termination floating earth, earth cable in contact with cable termination insulation, voltage presence indicating systems (VPIS) bushing screen disconnected, and earth grounding spring missing on bus bar connector. The spectrogram of the PD signals collected using a coupling capacitor was generated by applying the continuous wavelet transform (CWT). In addition, spectrograms of noise and other HF signals were generated. The authors proposed a convolutional autoencoder (CAE) that is able to reconstruct the spectrograms of the different sources of PDs and noise. After the CAE is trained, the decoder part of the autoencoder is removed and replaced by a fully connected layer followed by the classification layer. This model is trained using a labeled dataset and outputs, as a percentage, the degree to which a tested spectrogram belongs to each of the four PD source classes and the noise/HF signal class. The study reported high performance for the proposed model.
In another work [32], four sources of PDs in a gas-insulated switchgear were simulated in a lab setup. An existing CNN architecture, AlexNet, was the method used in this work where the inputs are the time series waveforms treated as images. The results of the proposed method are compared with the fractal method and mean discharge method. The time series waveforms were transformed into a PRPD plot for the sake of applying the fractal method and mean discharge method. Both the fractal and mean discharge methods provide features which are considered as input to two fully connected neural networks. The study included reporting the average classification accuracy of the three models with different percentage of noise added to the signals. The proposed method showed improved results compared with the other two methods especially with a high noise percentage. In addition, the author reported that the time consumed for PD classification was the least using the proposed CNN-based method.
In 2021, four types of PDs that take place in a GIS platform (free metal particle defects, metal tip defects, floating electrode defects and insulation void defects) were experimentally simulated and time-series waveforms corresponding to each type of PD were recorded using UHF sensors (butterfly antennas) [33]. The variability in the dataset was introduced by randomly changing the position of the defect. The deep learning model proposed by the authors was based on a depth-wise CNN model where the convolution is divided into two parts: the first part is composed of convolving one channel at a time with the convolution kernel (i.e., depth-wise convolution) and the second part is to mix the feature map using a 1 × 1 convolution kernel (i.e., point-wise convolution). A generative adversarial network (GAN) was also used in order to generate more data. The proposed model was compared with other CNN-based models such as MobileNetV1, MobileNetV2, Xception, ResNet, and LeNet models. The proposed model reported enhanced classification accuracy compared to the other models. Visualization of the feature maps of some layers is presented as well, which highlight what each layer was capable of learning.
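The saving obtained by splitting a convolution into depth-wise and point-wise parts can be checked with simple arithmetic. The sketch below assumes a 1D kernel and ignores biases; the kernel length and channel counts are hypothetical and are not those of the model in [33].

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard 1D convolution (biases ignored)."""
    return kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
    """Depth-wise (one kernel per input channel) plus 1 x 1 point-wise mixing."""
    depthwise = kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer sizes.
std = standard_conv_params(kernel=9, c_in=32, c_out=64)
sep = separable_conv_params(kernel=9, c_in=32, c_out=64)
ratio = sep / std
```

For these sizes the separable layer needs 2336 weights instead of 18432, roughly one eighth, which is why such layers are attractive for lightweight models.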
In the same year, a research group simulated four different sources of PDs in a GIS controlled lab environment [34]. Varying the defect location for each of the artificial defects ensured variability in the collected dataset. The authors proposed a 1D-CNN model where a multiple-scale convolution kernel is used instead of a single-scale convolution kernel. Channel shuffling was applied to the two feature maps produced by the multiple-scale convolutional kernel in order to obtain a unified feature map. Since labelling data is time consuming and requires expert knowledge, the authors proposed a domain adversarial transfer strategy (DATS), which is inspired by the GAN. Four different unbalanced datasets were acquired from an actual GIS in order to test the performance of the proposed model. The 1D-CNN trained on the experimental data is used to classify the on-site GIS PD data, some of which had no labels. The authors compared the proposed 1D-CNN model with traditional 1D- and 2D-CNN models. With regard to the performance of DATS, the authors compared the results with other transfer learning (TL) techniques such as fine-tuning TL and domain adaptation TL. The study reports enhanced results using the proposed framework. The authors suggested that future work will focus on the automatic optimization of hyperparameters and on trying the platform in an online monitoring system.
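Channel shuffling can be sketched as an interleaving of channels drawn from different groups. The version below follows the common group-interleave formulation and is an assumption about the operation, not a reproduction of the layer in [34]; the channel names are placeholders for feature maps.

```python
def channel_shuffle(channels, groups):
    """Interleave channels from `groups` equally sized groups.

    With 2 groups, [a0, a1, a2, b0, b1, b2] becomes [a0, b0, a1, b1, a2, b2].
    """
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

# Hypothetical feature maps: three channels from each of two kernel scales.
small_scale = ["s0", "s1", "s2"]
large_scale = ["l0", "l1", "l2"]
mixed = channel_shuffle(small_scale + large_scale, groups=2)
```

After shuffling, neighbouring channels come from different kernel scales, so subsequent layers see information from both scales together.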
In [36], the aim of the research work was to investigate transfer learning, especially domain adaptive deep transfer learning (DADTL) CNN, for GIS PD diagnosis. The authors used four different datasets for the training of their model. The datasets consisted of measured and simulated data. Dataset A included field data from three types of fault (rolling element, inner ring, and outer ring). Dataset B corresponded to the GIS PD simulation data using the finite difference time domain (FDTD) technique, where four sources of PDs (metal particle, tip, floating electrode, and insulator air gap defect) were simulated. Dataset C corresponded to a 252 kV GIS experimental platform, where signals were captured corresponding to the four defects mentioned in Dataset B. Finally, dataset D corresponded to the PD samples collected from a provincial power company's GIS failure.
Using maximum mean discrepancy to minimize the sliced Wasserstein distance (SWD), the authors aimed to ensure that transferable features have minimal discrepancy. The proposed model starts with data pre-processing, where samples from the larger datasets are classified as source domain samples, and samples from the smaller dataset are classified as target domain samples. The aim is to minimize the differences between the features learned from the source and target datasets. The authors used residual units in the CNN-based architecture. The authors compared the results with traditional CNNs (LeNet and AlexNet) with the same number of layers. The proposed model reported improved results compared to the other deep learning models, especially when the dataset is small.
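As a rough illustration of what a feature-discrepancy measure between source and target domains looks like, the following sketch computes a biased estimate of the squared maximum mean discrepancy with a Gaussian kernel. The kernel choice, the bandwidth, the function names, and the toy feature vectors are assumptions made for illustration; they are not taken from [36].

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    """RBF kernel between two equal-length feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Hypothetical learned features from a large (source) and a small (target) dataset.
source_features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2]]
target_features = [[1.0, 1.1], [0.9, 1.2], [1.1, 1.0]]

print(mmd_squared(source_features, source_features))  # identical sets give (numerically) zero
print(mmd_squared(source_features, target_features))  # well-separated sets give a larger value
```

In a domain-adaptation setting, a term of this kind would typically be added to the classification loss so that the network is encouraged to produce similar feature distributions for both domains.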
Another work that used simulated data is presented in [37], where a CNN-LSTM network is proposed for the classification of PD sources in a GIS system. The model consists of two blocks of convolutional layers followed by pooling layers. The output of the second block is fed to an LSTM layer, which is followed by a fully connected layer and ends with the classification layer. The simulation software XFDTD was used to generate the dataset. The four sources of PDs are metal tip defect, insulator air gap defect, floating electrode defect, and free metal particle defect. The authors reported the precision, recall, and F1-score of the four sources of PDs. They compared the performance of the proposed model with other models such as SVM, LSTM, and CNN. The proposed model reported a higher average classification accuracy than the other models.
Research has focused on using different DL techniques to identify PD sources in a GIS platform, ranging from autoencoders, CNNs, and LSTMs to combinations of these techniques. Authors have compared their proposed models with other DL models or with traditional machine learning models. Despite the advantages of the different DL techniques, future work should focus on the quality of the input data with regard to interference, noise, and the fact that multiple PDs can take place at the same time. Integrating the developed models into real-life systems would present its own challenges, especially when it comes to developing industrial standards or regulations, and implementing condition-based maintenance asset management policies depending on the severity of the situation.

Transmission Line Networks
Transmission line networks are used to enable the long-distance transmission of power. A few research groups have developed deep learning models to distinguish PD from non-PD signals in a publicly-available dataset. The ENET Centre in the Czech Republic developed a meter to measure the voltage signal induced by the stray electric field along covered conductors that contained PD or fault signals. The dataset contains noisy real-world measurements from high-frequency voltage sensors, where the objective is to identify damaged three-phase, medium-voltage overhead power lines [78].
In 2020, the authors of [39] pre-processed the raw data in order to remove noise and low-frequency components of the signals. The output of this process was the time and frequency representation of the signal, obtained by applying the short-time Fourier transform. The time- and frequency-domain positive and negative half-cycle signals are the input to the proposed deep learning algorithm. The proposed model is a Dual Cycle-Consistency network. Both the time-domain and frequency-domain branches consist of three blocks. Each block contains a 2D convolutional layer, a Rectified Linear Unit (ReLU), and a batch normalization layer. The output from block 3 of the time-domain and frequency-domain branches is passed through a global average pooling layer, a shared fully connected layer, and a sigmoid layer. In order to calculate the cycle-consistency loss, the outputs are fed to the dual-domain attention module (DDAM) block for joint learning. The prediction is then based on the weighted average of the output from the fully connected layers and the output from the DDAM block. The results are compared with other models such as Random Forest, ResNet18 + VggNet11, and LSTM. The performance metric used is the Matthews Correlation Coefficient (MCC), in addition to precision, recall, and F1-score. The authors reported better results compared to the other approaches.
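Since the MCC is the headline metric for this dataset, a minimal sketch of its standard definition from the binary confusion-matrix counts may be helpful. The function name and the example counts are illustrative and are not results from the reviewed papers.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, the MCC is set to 0 when any marginal count is zero.
    return numerator / denominator if denominator else 0.0

# Illustrative, imbalanced PD / non-PD counts (not taken from any cited study).
print(matthews_corrcoef(tp=40, tn=900, fp=20, fn=40))
```

Unlike plain accuracy, the MCC stays informative when one class (here, non-PD) dominates, which is the usual situation in fault-detection datasets.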
The authors of [40] developed a model based on time-series decomposition and LSTM to classify PD versus non-PD signals from the same public dataset. Seasonal-Trend decomposition using Loess (STL) was used to decompose each raw signal into three parts: trend, seasonal, and residual. PD is mostly reflected in the residual part. Four different STL modules with different seasonal window lengths were used to generate four different residual components. Feature engineering was then applied to the residual parts, from which a sequential feature vector is extracted. As a result, many-to-one sequential data are generated and used as the input to the long short-term memory network (LSTM) classifier. The proposed model was compared with other classifiers such as fully connected layers, SVM, XGBoost, and Multivariate Logistic Regression (MLR). The proposed model showed enhanced classification accuracy compared to the other models.
In 2021, the authors of [38] aimed to classify PD versus no-PD signals using the same publicly-available dataset of damaged power lines as in the previous paragraphs. The proposed model was a traditional 1D CNN model in which a Global Average Pooling (GAP) layer is employed before the fully connected layer. Each sample in the dataset comprises the voltages of the three phases over one period. For each phase, a highpass filter is used to remove the power frequency, after which a maximum filter is used to extract a set of pulses. Each set of pulses from each phase is the input to the trained 1D CNN. Finally, the decision on the label of the power line is based on the three outputs, one for each phase. A pulse activation map (PAM) was used to visualize what the model looks at when deciding on the label. The evaluation metrics used are Matthews Correlation Coefficient, precision, recall, and accuracy. The authors compared their results with other publicly reported results where models such as LSTM were used. The proposed model showed enhanced results, and the authors suggested that a larger dataset would be more suitable for hyperparameter tuning.
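The pulse-extraction step can be pictured with the following minimal sketch, which assumes the signal has already been high-pass filtered. It keeps the samples that coincide with a sliding maximum of the absolute signal and exceed a threshold. The window length, the threshold, and the function names are hypothetical choices and do not reproduce the exact procedure of [38].

```python
def maximum_filter(signal, window):
    """Sliding maximum with a centered window (edges use the available samples)."""
    half = window // 2
    return [max(signal[max(0, i - half): i + half + 1]) for i in range(len(signal))]

def extract_pulses(signal, window=5, threshold=0.5):
    """Return (index, amplitude) pairs for local maxima of |signal| above a threshold."""
    magnitude = [abs(v) for v in signal]
    envelope = maximum_filter(magnitude, window)
    return [(i, signal[i]) for i, m in enumerate(magnitude)
            if m == envelope[i] and m >= threshold]

# Hypothetical high-pass-filtered samples of one phase.
filtered = [0.0, 0.1, 0.9, 0.2, 0.0, -0.1, 0.0, -1.2, -0.3, 0.1, 0.0]
print(extract_pulses(filtered))  # [(2, 0.9), (7, -1.2)]
```

The extracted pulse set, rather than the full waveform, would then serve as the per-phase input to the classifier.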
In [41], the authors aimed to classify PD versus no PD using the same dataset as the previous papers, which included the three-phase voltage signals. FFT noise reduction algorithms were applied to the raw data. The proposed model was a CNN-LSTM model with an attention layer before the classification layer. Starting with two blocks of convolutional and max-pooling layers, the output is fed to a fully connected layer, whose output is the input to the LSTM layer. The output of the LSTM is the input to an attention layer, where the feature vector obtained from the LSTM is multiplied by learnable weight coefficients. The output of the attention layer is fed to a sigmoid, which decides on the PD versus no-PD label. The performance metrics used are precision, recall, and F1-score. The proposed model is compared with other traditional models such as SVM, CNN, and bidirectional LSTM. The study reported higher average accuracy compared to the other models.
The papers discussed in this section used the same publicly-available dataset, where different DL approaches were adopted to distinguish PD pulses from non-PD pulses. This shows that different DL algorithms can give good results; however, deploying such algorithms in real-life systems would give a better perspective on which models to use. This section is also a good example of how various DL techniques can be evaluated using a common dataset.
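To make the role of an attention layer on top of recurrent features more concrete, the sketch below shows one common formulation: each time step's feature vector is scored against a learnable weight vector, the scores are normalized with a softmax, and the weighted sum is passed to a sigmoid output. This formulation, the weight values, and the names are assumptions for illustration; the exact attention mechanism of [41] may differ.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, attn_weights):
    """Weight each time step by softmax(h_t . w) and return the weighted sum."""
    scores = [sum(h * w for h, w in zip(state, attn_weights)) for state in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(a * state[d] for a, state in zip(alphas, hidden_states))
               for d in range(dim)]
    return context, alphas

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical LSTM outputs for three time steps and hypothetical learned weights.
lstm_outputs = [[0.1, 0.4], [0.9, 0.2], [0.3, 0.3]]
context_vector, step_weights = attention_pool(lstm_outputs, attn_weights=[1.5, -0.5])
output_weights, bias = [2.0, -1.0], 0.0
pd_probability = sigmoid(sum(c * w for c, w in zip(context_vector, output_weights)) + bias)
print(step_weights, pd_probability)
```

The attention weights indicate which time steps contribute most to the final PD versus no-PD decision.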

Rotating Machines
In rotating machines, voltages are generated by time-varying magnetic fields, which result from the change in the flux [79]. The change in the flux results from the mechanical motion of the rotating machine. Rotating machines consist of stator and rotor structures that are made of thin laminations of electrical steel, insulated from each other in order to reduce losses and prevent discharges and faults from taking place. Various types of stress, such as thermal, electrical, ambient, or mechanical stress, can affect the insulation system of rotating machines. Statistical data show that PD activity has preceded a large number of stator failures [80] and, as such, PD detection in rotating machines has been attracting attention. Extensive research has been done on using traditional machine learning techniques, such as Naïve Bayes-, SVM-, and kNN-based techniques [81], for rotating machine electrical insulation diagnosis. This section covers the work done using deep learning for rotating machine PD diagnosis.
In [24], the PRPDs of a number of PD sources found in rotating machines were collected through online PD measurements on hydro-generators operating in real-world conditions. The PD sources are: internal void, internal delamination, delamination between conductors and insulation, slot, corona, surface tracking, and gap discharges. The authors proposed a methodology that first de-noises the PRPDs and then uses an image pre-processing technique to separate the different clouds in the PRPD patterns. The output of this stage was de-noised sub-PRPDs, which represent different sources of PD. Three features extracted from each sub-PRPD are the input to four different artificial neural networks (ANNs). These ANNs were combined in a hierarchical fashion in order to perform the final classification. The authors reported a good overall classification accuracy for all the PD sources.
A framework using visual data analysis was proposed for PD source classification in hydro-generators with a minimum of labeled data [29]. A convolutional encoder was used to project the PD signals acquired from the generator stators to a 2D-visualization latent space. This serves as a visual aid for the expert to analyze the distribution of the training dataset. After being labeled by the experts, the labeled data are used to train a neural network classifier. Other unlabeled data are tested using the already trained classifier, and if any conflict area appears in the 2D latent space, the human experts have to label sample data from that conflict area. The newly labeled data are then added to the dataset, and this procedure is repeated iteratively until the area of conflicted data is minimized. This study reported a framework that integrates both expert knowledge and the advantages of deep learning in order to obtain a correctly-labeled dataset of PD sources.
Although preprocessing of the input data (i.e., denoising in HV applications) is crucial to help the models learn the intrinsic characteristics of patterns, future work should focus on applying different denoising techniques. Investigating the effect of different denoising techniques on the effectiveness of the DL models would yield a better understanding of the learning process. Moreover, it is observed that more research is being directed towards exploring the problem of unlabeled data, which is a crucial step for the deployment of any algorithm in a real-life diagnosis system.

Cables and Solid Insulation
Electric power can be transmitted by underground cables or by overhead transmission lines. The main advantage of underground cables compared to overhead lines is the low maintenance cost. This is linked to the fact that overhead lines are exposed to environmental factors such as storms or lightning. An underground cable consists of one or more conductors which are covered with suitable insulation and the external component is the protecting cover [82]. The major disadvantage of using underground cables though is the problem of degradation and failure of the insulation under high voltage stress. Hence, detecting PDs/faults is crucial for assessing the health of the system. This section reviews the application of deep learning to solve the problem of classification of faults/PDs in cable insulation and solid dielectric using PRPD patterns or time-series waveforms as the input to train the classification model.

PD/Fault Classification Using the PRPD Pattern
Different sources of PD have been classified in a solid insulation in addition to the prediction of the aging stage of the insulation in [14]. The authors of this paper classified three different sources of PD (corona, surface, and internal discharges) which are simulated in a lab environment. They compared the deep belief network (DBN) output with three other machine learning approaches. The input to the DBN was the raw PRPDs, whereas the inputs to the other approaches were features extracted using statistical and vector-norm-based operators. Classification accuracy is the performance metric used in this study. The authors show that the DBN learns distinguishable features without any pre-processing of the PRPDs.
The performance of a CNN model was evaluated on the prediction of the ageing stage of high voltage insulation material using PRPD data [15]. Three classes of start, middle, and end as well as noise/disturbance were defined for the electrical insulation degradation process representing the ageing that occurs in an insulation specimen under electrical stress. Precision, recall, and F1 score were the metrics used for the evaluation of the CNN model. The author reported that the performance is consistent even with changes in the CNN hyper-parameters' values.
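Precision, recall, and F1 score recur throughout this review, so a minimal per-class computation is sketched here using the four class names mentioned above. The label lists are made up for illustration and are not data from [15].

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 score for one class in a multi-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative ageing-stage labels (not taken from the reviewed study).
y_true = ["start", "start", "middle", "middle", "end", "end", "noise", "noise"]
y_pred = ["start", "middle", "middle", "middle", "end", "noise", "noise", "noise"]

for stage in ("start", "middle", "end", "noise"):
    print(stage, per_class_scores(y_true, y_pred, stage))
```

Reporting these values per class, rather than a single accuracy figure, shows which ageing stages are confused with one another.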
The effect of noise in PRPD patterns on the classification accuracy of different artificial defects in 11 kV cross-linked polyethylene (XLPE) cable joints has been investigated in [23]. A total of five PD sources were considered in this study. After training a CNN architecture using noise-free PRPD patterns, transfer learning was performed, where the authors used this model as the starting point for training another CNN architecture, this time with noisy PRPDs. The results were compared with those obtained using traditional machine learning classifiers where hand-crafted features were extracted. The authors reported that the CNN-based model was able to outperform the models that use manual feature extraction with an increase of 16.9% in the classification accuracy.
In [83], the authors simulated five different defects that take place in 36 kV cross-linked polyethylene (XLPE) cable terminations, such as protrusion, void, and corona discharge. The authors used a commercial PD detector in order to capture the PRPD patterns. A CNN architecture was proposed, and the authors investigated the effect of different hyperparameters, such as pooling and kernel size, on the classification accuracy. The authors compared their results with off-the-shelf CNN architectures such as AlexNet, VGG, ResNet and GoogLeNet. They reported higher classification accuracy of their proposed model compared to the other models.

PD/Fault Classification Using Time-Series Waveform
In 2019, a traditional CNN model was used to differentiate between synthetic PD pulses in power cables [25]. The variability in the synthetic dataset was introduced by the signal-to-noise ratio (SNR) and the position at which the PD initiated. The model was compared with a support vector machine (SVM), where the study reported enhanced results using the proposed algorithm.
In the same year, five types of artificial defects were created in ethylene-propylene-rubber cables in a high voltage laboratory to generate signals containing PD data [26]. Seventeen features were extracted from the time-series waveforms corresponding to characteristics such as pulse width, rise time, fall time, peak voltage, pulse polarity, mean voltage, and root mean square (RMS) voltage. In addition, 16 wavelet features were extracted from the transient signals using the Wavelet Transform. In total, 33 features constituted the input corresponding to each signal to the proposed CNN model. Analysis was performed on the effect of changing the hyperparameters of the CNN architecture, such as the number of layers and the convolution kernel sizes. The results were compared with those obtained using SVM and back propagation neural network (BPNN) models. The study reported better classification accuracy when compared with the other two models.
In 2020, four common DC insulation faults were simulated during the operation of XLPE cables, namely conductor burrs, external semi-conductive layer residue, internal air gap, and scratch on the insulation surface [27]. A modified Canny edge detection operator was used in order to extract the part of the time-series signal which includes the PD. A deep belief network is proposed by the authors where the ADAM optimizer is used.
neural network to make the classification decision. CNN, convolutional RNN, LSTM, and bidirectional LSTM (BILSTM) were used in the ensemble frame, with two of these models used at a time. In this paper, two scenarios were considered: in one, where the two deep learning (DL) models differ in their prediction for a sample, a human expert has to decide on the label of that sample; in the other, the outputs from the activation functions of the two models are added together. Five different cables were tested on the trained models. Adaptation training was done for each of the five cables, where the classifier layer is re-trained with the measured calibration pulse specific to each cable. The authors reported the results of the five cables for the two ensemble scenarios and with different pairings of the above-mentioned DL algorithms. They also reported the results when each DL model is used alone. It was reported that the CNN paired with the BILSTM gave the best results.
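The two ensemble scenarios can be sketched as follows, assuming each model outputs one activation value per class. Returning `None` to signal that a sample should be referred to a human expert, as well as the function names and example values, are illustrative choices rather than details from the reviewed paper.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

def ensemble_by_agreement(probs_a, probs_b):
    """Scenario 1: accept the label only if both models agree; otherwise defer."""
    label_a, label_b = argmax(probs_a), argmax(probs_b)
    return label_a if label_a == label_b else None  # None: refer to a human expert

def ensemble_by_sum(probs_a, probs_b):
    """Scenario 2: add the two models' output activations and take the largest."""
    summed = [a + b for a, b in zip(probs_a, probs_b)]
    return argmax(summed)

# Hypothetical class activations from two models for a single sample.
cnn_output = [0.7, 0.3]
bilstm_output = [0.4, 0.6]
print(ensemble_by_agreement(cnn_output, bilstm_output))  # None (models disagree)
print(ensemble_by_sum(cnn_output, bilstm_output))        # 0
```

The first scenario trades automation for reliability, since disputed samples need manual labelling, while the second always returns a label.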
In Ref. [84], the authors aimed to target the problem of losing the voltage signal information that is used for plotting PRPD patterns. Different datasets consisting of three types of cable joint defects were generated in a lab environment, where the signals were recorded using an oscilloscope. Pulse sequence analysis (PSA) was performed by using the change in the magnitude of the PD pulses, resulting in a magnitude difference heat map image that was used as the input to the proposed CNN model. Investigation was carried out to optimize the CNN hyperparameters, in addition to investigating the effect of the different image features including the size, type, color, and marker size of the images. The authors compared the accuracy of the model with PRPDs as input versus PSA as input. The authors claimed that using PSA instead of PRPD yields a higher classification accuracy.
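One way to picture a magnitude-difference heat map is sketched below: differences between consecutive pulse magnitudes are computed, and each pair of successive differences is counted in a 2D grid. Pairing successive differences is a common PSA representation, but the binning, the clipping limit, and the names used here are assumptions and may not match the image construction in [84].

```python
def magnitude_differences(magnitudes):
    """Change in magnitude between consecutive PD pulses."""
    return [b - a for a, b in zip(magnitudes, magnitudes[1:])]

def difference_heat_map(magnitudes, bins=4, limit=1.0):
    """2D histogram of (previous difference, current difference) pairs."""
    diffs = magnitude_differences(magnitudes)
    grid = [[0] * bins for _ in range(bins)]

    def to_bin(value):
        clipped = min(max(value, -limit), limit)
        return min(int((clipped + limit) / (2 * limit) * bins), bins - 1)

    for previous, current in zip(diffs, diffs[1:]):
        grid[to_bin(current)][to_bin(previous)] += 1
    return grid

# Hypothetical normalized magnitudes of five consecutive PD pulses.
pulse_magnitudes = [0.0, 0.5, 0.2, 0.9, 0.1]
for row in difference_heat_map(pulse_magnitudes):
    print(row)
```

Because only pulse-to-pulse changes are used, a representation of this kind does not require the phase of the applied voltage.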
The authors in [85] classified and localized ten faults in an 11-kV, three-phase underground cable consisting of various combinations of phase-to-phase or phase-to-ground faults. The dataset was generated by simulation using PSCAD/EMTDC software by varying system parameters such as fault inception angle and fault location. Additive Gaussian white noise was added to the signals. The authors proposed a CNN-LSTM architecture along with the application of a sliding window technique. The inputs to the architecture are the current and voltage signals, and the outputs are the fault location, fault inception time, and fault type. The authors compared their results with other deep learning architectures such as CNN and LSTM. The authors reported better performance for their model compared with other models.
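A sliding window over a sampled waveform can be sketched in a few lines. The window length, stride, and the (voltage, current) tuple layout are hypothetical and are not the settings used in [85].

```python
def sliding_windows(samples, window, stride):
    """Split a sampled signal into overlapping fixed-length windows."""
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, stride)]

# Hypothetical record: one (voltage, current) pair per time step.
record = [(float(t), float(t) * 0.1) for t in range(10)]
windows = sliding_windows(record, window=4, stride=2)
print(len(windows))    # 4 windows, starting at samples 0, 2, 4, and 6
print(windows[1][0])   # first sample of the second window: (2.0, 0.2)
```

Each window would be passed to the network in turn, so that the fault inception time can be estimated from the first window in which a fault is detected.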
A simulation model using Matlab was developed in [86] to generate fault signals of an underground cable distribution system consisting of sixteen cables. The fault types are ground fault, short-circuit fault, or open-circuit fault for each of the three phases of the sixteen cables. The authors proposed a deep belief network for this purpose and they compared the classification accuracy with a shallow neural network. The authors reported better results for the proposed deep learning model.
The authors of [87] located aged cable segments in underground power distribution systems labeled as even ageing, uneven ageing, and terminal ageing patterns. The signals were captured using an HFCT. The authors proposed a combined stacked autoencoder and CNN architecture for detecting aged segments. When an aged segment is detected, another CNN model was developed to indicate the location and severity of the aged segment. The authors compared the results with other machine learning models such as support vector regression and deep belief network. The proposed model performed better compared with other models for both detection and localization of the aged segments.
Ref. [88] proposed a technique to detect the inception faults that take place in cables. The authors used PSCAD/EMTDC in order to simulate the inception fault signals, over-current disturbance signals, and normal current signals. The authors proposed an architecture that includes a sparse autoencoder followed by a deep belief network. They compared the classification accuracy with support vector machine and K-nearest neighbor, and reported better results for the proposed model.
The authors of [89] also aimed to detect inception faults. They likewise used PSCAD/EMTDC, where variability in the generated signals was introduced by changing some parameters in the simulation model such as fault impedance and fault location. The authors proposed a restricted Boltzmann machine to compress the signals. The compressed signals were the input to the stacked autoencoder. The authors investigated the trained model performance with simulated and measured data. In addition, they compared the classification accuracy with other models such as CNN, deep belief network, and random forest. The proposed model outperformed the other models.
In [90], the authors detected phase-to-ground faults in a typical 10 kV resonant grounding distribution system, which was simulated using PSCAD/EMTDC. The signals were generated under different fault conditions including different fault locations, different grounding resistances, and different fault initial phase angles. Continuous wavelet transform (CWT) was applied to the signals to generate 2D images. The images were transformed to grey-scale and used as the input to the CNN. The authors investigated the robustness of the model to different parameters such as interference. The authors compared the results with SVM and AdaBoost, and they reported that the proposed model gave better results.
Although the investigation of different DL techniques is essential for the sake of completeness of any work, it is observed that having common, publicly-available datasets can help in focusing on the generalization of any developed algorithm. In addition, as mentioned in previous sections, the investigation of the effect of any preprocessing technique is necessary in order to have a deeper understanding of the intrinsic characteristics that the DL model is learning.

Transformers
Power transformers play a significant role in power systems, so any failure in this apparatus may interrupt the power supply and cause outages and loss of profit. A photo of a power transformer is shown in Figure 9. One of the beneficial methods for preventing the failure in the power transformers and raising the reliability of these systems is detecting faults in power transformers accurately and promptly. Whether the target is to classify/localize the sources of PDs/faults taking place in a transformer, or to identify overheat or vibration, deep learning has been used for this purpose. The following section summarizes the literature on the use of deep learning for the transformer application.

PD Classification Using the PRPD Pattern
Four typical transformer insulation defects were simulated in [16] that include metal protrusion, oil paper void, surface discharge, and floating potential defects. The authors developed a CNN-LSTM based model where the input is the PRPD data. They compared the results with a CNN-only model and an LSTM-only model. The evaluation metric used was the classification accuracy where the authors reported that CNN-LSTM has better overall recognition accuracy than CNN and LSTM alone.
In 2020, the authors of [17] simulated six types of PDs that take place in power transformers using artificial cells in a laboratory setup. They collected the PRPDs of the six PDs which include protruding electrode, moving particle, floating object, surface discharges, bad contact between windings, and void. In order to reduce the input size of the PRPD, the authors used the phase-amplitude (PA) response that is extracted from PRPDs. The authors proposed a CNN model for classification of PD sources. Comparing the classification accuracy of the proposed architecture versus other machine learning classifiers, such as linear and nonlinear SVM, the authors reported a better performance. They also reported that using the PA response as an input increases the accuracy by 1.46% compared to using the raw PRPDs as the input to the CNN model.
The PRPD data of different PD sources in a transformer were collected in a laboratory-controlled setup and reported in [18]. The PD sources included tip discharge, surface discharge, air gap discharge and suspended discharge. The squeeze-and-excitation (SE) module, which is a lightweight attention mechanism, and the nonlinear function hard-swish (h-swish) were used in addition to a CNN model in order to further decrease the accuracy loss of the model. The authors performed image pre-processing such as segmentation, binarization and enhancement of the data before feeding it to the training model. They compared the results of their model with other models such as AlexNet, ResNet-18, and VGG16. They reported enhanced average accuracy versus the other models, in addition to lower weight storage and a reduced number of parameters.
In the same year, an investigation of a transformer bushing insulation quality, which was affected by poor drying and impregnation, was reported in [19]. The authors used a simple CNN (i.e., 3300 parameters) for the identification of four types of dry impregnation defects using PRPDs as the input to the proposed CNN. The performance metrics used for the evaluation of the model are the precision rate, recall rate, and F1 score. The authors reported 97.1% average accuracy rate and indicated that their model can be used for online monitoring as it is a small model.
A novel convolutional architecture for single and multiple source PD classification, where the model is trained on single-source PDs, was proposed in [20]. The dataset included PRPDs of single and multiple sources of PD taking place in air, oil, and SF6 which mimic common sources of PD. The six single PD sources of floating electrode in SF6, moving particle in SF6, fixed protrusion in SF6, free particle in transformer oil, needle electrode in transformer oil, and corona in air were simulated in a laboratory setup. The proposed architecture has a convolutional backbone feeding into multiple fully connected neural networks (FCNs). The performance metrics used are the arithmetic mean of recall and precision in addition to the classification accuracy and false negative rate. The authors compared their results with one-versus-all CNN and reported that their model has better results than the traditional single-branch CNN architecture.
Adam et al. [21] simulated six artificial PD sources in a controlled lab environment that mimics PDs in a power transformer. The PD sources include two discharge sources in air and four discharge sources in mineral oil. The time at which the discharge takes place, the apparent discharge in pC, and the phase angle are recorded for each PD event. In addition, 100 PD events constituted a sample. Superimposed patterns were created by using the single sources patterns, where 30 different combinations of samples with two class labels are formed. The authors proposed an LSTM model which is able to classify multiple and single sources of PDs, where the training was done just on single sources of PDs. The study reported the multi-label accuracy in addition to the single-label accuracy. The multi-label accuracy is defined as the proportion of the correctly predicted labels to the total number of labels for each sample. The model showed a 99% average accuracy for single PD sources and 43% for the average multi-label classification problem.
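The multi-label accuracy defined above can be sketched as follows, under the assumption that "the total number of labels" refers to the number of true labels of each sample and that the per-sample proportions are averaged over the dataset. The class names and predictions are invented for illustration.

```python
def multi_label_accuracy(true_labels, predicted_labels):
    """Mean over samples of (correctly predicted labels) / (true labels)."""
    ratios = [len(truth & predicted) / len(truth)
              for truth, predicted in zip(true_labels, predicted_labels)]
    return sum(ratios) / len(ratios)

# Hypothetical superimposed-source samples, each with two true class labels.
truth = [{"corona", "void"}, {"surface", "corona"}]
predicted = [{"corona", "surface"}, {"surface", "corona"}]
print(multi_label_accuracy(truth, predicted))  # 0.75
```

Under this definition a sample with one of its two sources identified contributes 0.5, so a multi-label score below the single-label accuracy does not by itself mean that both sources were missed.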
Ref. [22] outlined the same objective as the previous two papers. Single and multiple sources of corona discharge were simulated in a controlled lab environment. The four single sources were: sphere-plane, sphere-sphere, needle-plane, and needle-needle. The three multiple sources were: needle-needle and sphere-sphere, needle-plane and sphere-sphere, and sphere-plane and needle-needle. The PRPD patterns were collected, and pre-processing was performed by filtering out discharges of small magnitude. The inputs to the deep learning models were greyscale images of 75 by 75 pixels. The classes were labeled from 0 for the first single class to 6 for the last double-source configuration; that is, each multi-source configuration was treated as a new class. The authors proposed an optimized ResNet model which they compared with other DL models such as AlexNet, Inception-V3, residual network (ResNet), and DenseNet. The study reported enhanced classification accuracy and the lowest computational cost.

Dissolved Gas Analysis Using Deep Learning
The dissolved gas analysis (DGA) is an established method for detecting internal faults in transformers. In recent years, the application of deep learning in DGA has received more attention. This section reviews the literature where the researchers have used deep learning to improve the accuracy of the DGA.
The DGA of insulating oil was conducted for transformer fault diagnosis by Dai et al. in 2017 [91]. To improve the efficiency of diagnosis, the authors proposed a novel transformer fault diagnosis approach based on deep belief networks (DBN), which outperforms power transformer fault diagnosis using support vector machine (SVM), back-propagation neural network (BPNN), and ratio methods. A variety of sources were used to collect the input DGA data, including data provided by the State Grid Corporation of China and previous publications. The proposed model was trained using different combinations of DGA ratios associated with fault patterns (the so-called non-code ratios). Training and testing accuracies of 96.4% and 95.9%, respectively, were observed for the DBN with non-code ratios in this study.
In 2020, a semi-supervised autoencoder with an auxiliary task (SAAT) was introduced by Kim et al. to extract a health feature space for power transformer fault diagnosis considering DGA [92]. The DGA dataset was provided by Korea Electric Power Corporation. The proposed SAAT achieved an accuracy of over 90% in both fault detection and fault identification. The same group has also developed a framework that bridges Duval's method with a deep neural network (DNN) technique for power transformer fault diagnosis employing DGA [93]. The dataset employed contains 4000 unlabeled and 117 labeled DGA data. The obtained results emphasize the superiority of the proposed method compared with the existing AI-based methods in terms of accuracy.
Wu et al. introduced a CNN-LSTM deep parallel diagnostic method for transformer DGA employing its ability to extract nonlinear features [94]. The authors showed that this method has a better anti-interference ability compared to the other techniques studied in the paper. In this parallel CNN-LSTM based diagnostic method, the input was in the form of an image derived from the DGA numerical data. The issue of insufficient data was overcome by using the transfer learning technology. The results obtained in this study indicate that the diagnostic accuracy rate is 96.9% without complicated feature extraction.
In 2021, Taha et al. presented a CNN model to precisely diagnose a variety of transformer faults using DGA data and considering different noise levels [95]. The results obtained from applying the proposed method on 589 dataset samples, collected from utilities and literature with various noise levels up to ±20%, indicate that the CNN model with combined input ratios improves the prediction accuracy. The obtained accuracy was compared to traditional machine learning methods as well.
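Testing robustness against measurement noise can be sketched as follows. This is a minimal illustration that perturbs each gas concentration by a relative error drawn uniformly within ±level; the uniform noise model, the function name add_relative_noise, and the sample values are assumptions, since the noise model of [95] is not described here.

```python
import random
from typing import List


def add_relative_noise(values: List[float], level: float,
                       rng: random.Random) -> List[float]:
    """Scale each value by (1 + e), with e drawn uniformly from [-level, level].

    A uniform relative error is an assumed noise model for illustration.
    """
    return [v * (1.0 + rng.uniform(-level, level)) for v in values]


rng = random.Random(0)                  # fixed seed for repeatability
clean = [100.0, 50.0, 2.0, 20.0, 10.0]  # hypothetical ppm values
noisy = add_relative_noise(clean, 0.20, rng)  # noise level of ±20%
```

Evaluating a trained model on several noisy copies of the test set, one per noise level, gives a simple picture of how quickly its accuracy degrades.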
To improve fault diagnosis in transformers, Hu et al. [96] proposed a method based on a refined deep residual shrinkage network (DRSN). The input dataset was based on the amount of gas in the transformer oil, the temperature data, and the number of collected data points based on the timing sequence. The recognition results indicate that the average accuracy of the refined DRSN is around 99.67% for the training set and 97.82% for the test set. On average, the proposed method could improve the recognition accuracy by 2% compared to the existing fault diagnosis methods.
A probabilistic neural network (PNN)-based fault diagnosis model was presented by Zhou et al. for power transformers [97]. This model uses an improved gray wolf optimizer (IGWO) to optimize the smoothing factor of the pattern layer of the PNN and thereby enhance its classification accuracy. Data for different fault types were collected from a real transformer using smart sensors. The obtained results indicate a high diagnostic accuracy of 99.71% achieved by the IGWO-PNN model.
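To make the role of the smoothing factor concrete, the following minimal sketch implements a basic PNN with Gaussian kernels in the pattern layer. It is not the IGWO-PNN model of [97]: the optimizer is omitted, the smoothing factor sigma is fixed by hand, and the two-feature training patterns are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pnn_predict(x: Sequence[float],
                train: Dict[str, List[Sequence[float]]],
                sigma: float) -> str:
    """Classify x with a basic probabilistic neural network.

    Pattern layer: one Gaussian kernel per training pattern, width sigma.
    Summation layer: mean kernel response per class.
    Decision layer: class with the largest mean response.
    """
    scores = {}
    for label, patterns in train.items():
        total = 0.0
        for p in patterns:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        scores[label] = total / len(patterns)
    return max(scores, key=scores.get)


# Hypothetical, normalized two-feature training patterns.
train = {
    "normal": [(0.1, 0.2), (0.2, 0.1)],
    "fault": [(0.9, 0.8), (0.8, 0.9)],
}
label_a = pnn_predict((0.15, 0.15), train, sigma=0.3)
label_b = pnn_predict((0.85, 0.90), train, sigma=0.3)
```

A small sigma makes the decision depend almost entirely on the nearest training patterns, while a large sigma smooths the class boundaries, which is why this single parameter is a natural target for an optimizer.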

Detect Mechanical Defects in Winding
In 2021, Rucconi et al. [98] analyzed the vibration data measured by transformer sensors, such as accelerometers, installed on the transformer tank. These sensors record time series waveforms to build a dataset. An ensemble of fully-connected, feedforward deep neural networks was employed to classify the transformer winding condition (tight or loose). The robustness of the models was investigated by testing them with data collected by sensors at locations other than those used for training. The authors reported a high accuracy in the results.
Li et al. used frequency response analysis (FRA) for detecting the mechanical defects of power transformers in 2021 [99]. The authors employed a lumped-parameter transformer model since creating actual faults experimentally on a real transformer was not practical. The proposed deep learning approach was based on a decision tree classification model and a fully connected neural network that used the FRA data for training. Fifty-five FRA samples were generated as the input to the proposed model by simulating a variety of transformer fault types and levels. The mean absolute error (MAE) and mean square error (MSE) on the validation set, which were used to reflect the accuracy of the model, were both low.
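For reference, the two error measures mentioned above can be computed as in the following minimal sketch. The target and predicted values are hypothetical and are not taken from [99].

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAE: average of |y_true - y_pred| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MSE: average of (y_true - y_pred)^2 over all samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical fault-level targets and model outputs.
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.0, 1.5, 2.0, 2.0]
mae = mean_absolute_error(y_true, y_pred)  # 0.375
mse = mean_squared_error(y_true, y_pred)   # 0.3125
```

Because the MSE squares each error, it penalizes a few large deviations more heavily than the MAE does, so reporting both gives a fuller picture of the validation error.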
A fault diagnosis technique based on vibration analysis was proposed in [100] by Hong et al. The vibration samples were collected from more than 100 operating transformers and were divided into the categories of normal, degraded, and anomalous. Next, the vibration monitoring data were converted into images. A deep learning method based on a CNN was employed to classify the images of various input sizes, which indicated an overall accuracy of 98.3%.
A fault diagnosis method based on a deep learning model was presented by Wang et al. in 2018 and applied to a 110 kV three-phase oil-immersed transformer [101]. The model used self-powered radio-frequency identification (RFID) sensors and employed the stacked denoising autoencoder (SDA) to learn features. Based on experimental results, the proposed method achieved the highest accuracy in the shortest time in comparison with other existing methods.
In 2021, Moradzadeh et al. proposed a method for analyzing transformer FRA using image processing and a deep learning method, graph convolutional neural network (CNN) [102]. Using simulation data, the CNN achieved an accuracy of 97.28% in its normal mode (without considering visual images) and 98.33% when considering the visual images. Using experimental data, accuracies of 98.01% and 100% were reported, respectively.

Detect Electrical Faults in Winding
The authors of [103] proposed a CNN model for the identification and localization of faults in transformer winding. The dataset was collected by generating single/multiple disc-to-disc faults of winding insulation in a transformer model at different winding positions, where the current waveforms were recorded. The faults were generated in an analog model of a 33 kV winding of a 3 MVA transformer. The training dataset was generated using an EMTP (Electromagnetic Transients Program) based digital model, and the test dataset included the data collected from the analog model of the transformer. The results of the CNN model were compared with other methods such as self-organizing maps, fractal features aided SVM, and wavelet-aided SVM. The CNN showed improved classification results compared to the other methods.
In 2019, Duan et al. [104] presented a technique to diagnose 15 types of inter-turn short-circuit faults. A multi-channel signal matrix that contains voltage and current waveforms of a simulated transformer was generated as the input of a deep learning-based model. An autoencoder was employed for feature extraction followed by a classifier consisting of convolutional and pooling layers. The proposed model achieved a recognition accuracy of 99.5%.
An online continuously-operating fault monitoring system for cast-resin transformers was presented by Fanchiang et al. in [105]. They presented an overheating fault diagnosis approach with a maximum accuracy of 99.95%. The model used infrared thermography (IRT) images as input provided by a thermal camera monitoring system. The images were used to train a Wasserstein autoencoder reconstruction (WAR) model and a differential image classification (DIC) model to classify a number of faults such as inter-turn short circuit or poor contact of primary and secondary sides.
Wu et al. introduced an approach to obtain an optimal identification of the operation state of a converter transformer based on vibration detection technology and a deep belief network optimization algorithm [106]. The fused feature extraction technique considered in this study accurately extracted the eigenvectors of the vibration signals, and the deep belief network optimization algorithm offered a high classification accuracy.
As with the application of DL in any area, the prerequisite for developing a DL algorithm to assess the health of a transformer is the availability of reliable training data. Some data acquisition techniques for transformers are easy to implement even with the transformer in operation (such as DGA or acoustic signals), while others (such as FRA) require an outage. Equipping transformers with sensors that can collect internal data will enhance the application of DL techniques. Most of the papers in the literature employ either simulated lab data or numerically-generated data. A DL algorithm trained with such data requires proper validation before its performance in the field can be relied upon. Furthermore, research on the application of DL techniques that employ data from polarization methods, such as recovery voltage method (RVM), polarization and depolarization currents (PDC), and frequency-domain dielectric spectroscopy (FDS), will enhance the ageing diagnosis of the transformer dielectric materials.

Outdoor Insulators
Outdoor insulators play an important role in distribution and transmission overhead lines. They mechanically support the high voltage conductors and electrically insulate the high voltage lines from the grounded tower structure. Although they account for approximately 5-8% of the total capital cost of transmission lines, they are responsible for more than 70% of power line outages [107]. Therefore, it is crucial to continuously inspect them to avoid any risks of premature failure. A photo of a 220-kV overhead transmission line is shown in Figure 10.
Outdoor insulators are classified into ceramic and non-ceramic insulators. Despite the differences in their characteristics, both are prone to aging due to the combined effect of electrical, mechanical, and environmental stresses. The main problems associated with outdoor insulators can be categorized into physical defects and pollution-related issues. Physical defects, such as cracks in any part of the insulator, air voids in the housing material or at the interface between various insulator materials, and sharp metallic edges of insulator fittings, can cause localized partial discharge (PD) activities to occur. These discharge activities can contribute to the insulation degradation. On the other hand, the accumulation of pollutants on the surface of insulators in the presence of moisture can reduce the leakage resistance and allow leakage currents (LCs) to flow on the surface. The heat dissipated by these currents evaporates part of the moisture and forms dry bands. Since dry bands possess a relatively higher resistance compared to the wet surfaces, voltage stress concentrates across these bands, resulting in dry band arcing which may lead to complete flashover. As a result, estimating and forecasting pollution levels is critical for utilities to plan their washing schedule to avoid power interruption. Both the Equivalent Salt Deposit Density (ESDD) and the Non-Soluble Deposit Density (NSDD) are commonly used to assess pollution severity. It is estimated that approximately 150 million ceramic insulators are deployed in North American overhead transmission and distribution networks [108]. A significant portion of them have either approached or exceeded their expected lifetime. As a result, utilities are increasingly favoring defective insulator detection systems that are fast, reliable, and cost-effective. To achieve this objective, it is crucial to select the proper sensors and measurement techniques, as well as to use effective machine/deep learning tools [109,110].
Condition monitoring techniques of outdoor insulators can be classified into intrusive and non-intrusive techniques. Intrusive techniques are unsafe, costly, and may require the removal of the insulators from the field for further examination; they are therefore time consuming and poorly suited to field inspection. Non-intrusive techniques, on the contrary, are faster methods for assessing the health conditions of outdoor insulators and are therefore more favored in field inspections.
One of the most common non-intrusive inspection techniques deployed in the field involves the use of manned helicopters equipped with several sensors (such as cameras, IR cameras, and UV cameras) for the purpose of recording inspection data. However, this poses some risk since helicopters must hover very close to the electric power transmission line to obtain a better quality of the inspection data. To address this, several alternative solutions were proposed, and they fall mostly under two main categories: Unmanned Aerial Vehicles (UAVs) [111] and Rolling on Wire (ROW) robots [112]. Most of the work in the literature is moving towards UAVs because they have a slight advantage over ROW robots in the sense that their design does not need to adapt to different physical structures. However, UAVs are also restricted in terms of flying duration, which introduces some challenges when a large number of insulators must be inspected in the same trip. The massive volumes of data acquired throughout the inspection process are examined by an experienced crew of human inspectors; hence, the procedure may be time consuming, and the decisions made can be very subjective. Therefore, it is crucial to fuse artificial intelligence modules with UAVs for faster and better inspection performance.
In recent years, machine learning methodologies have evolved towards the use of deep learning techniques, which have proved to deliver excellent results for pattern recognition problems in a variety of applications. As a result, several authors have used deep learning models to assess the insulator condition. One of the drawbacks of deep learning models is their requirement of a large amount of training data. Proper use of augmentation techniques can mitigate this limitation by altering the existing data to create more data for the model training process.
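A minimal sketch of geometric augmentation is given below. It assumes images are stored as nested lists of pixel values and uses only flips and a 90-degree rotation; the function names are hypothetical, and practical pipelines typically add further transformations such as cropping, scaling, and brightness changes.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values


def hflip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img: Image) -> Image:
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rot90_clockwise(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img: Image) -> List[Image]:
    """Return the original image plus three altered copies."""
    return [img, hflip(img), vflip(img), rot90_clockwise(img)]


tiny = [[1, 2],
        [3, 4]]
variants = augment(tiny)  # four training images from one original
```

Each altered copy keeps the label of the original image, so one labeled photograph of a defective insulator yields several training samples.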
The aim of this section is to review deep learning applications along with non-intrusive condition monitoring techniques to assess both ceramic and non-ceramic insulators. In general, the work in the literature can be classified into two main categories, i.e., using deep learning models to detect different physical defects and/or predict the pollution severity level. Both categories rely primarily on either image processing or radiation-based approaches for classification. In the next sections, each category will be discussed along with the most recent research findings.

Physical Defect Detection
In ceramic insulators, researchers have focused on detecting cracks and broken or missing discs using UAVs. Most of the proposed methods based on deep learning algorithms share the same concept, i.e., the insulator in each image is located using object detection techniques; then, the defect is identified by a pre-trained deep neural network. The existing deep learning algorithms for object detection can be classified into two main categories: one-stage and two-stage networks [113]. Two-stage networks consist of one stage for object detection and another stage for classification, while one-stage networks are end-to-end methods that predict the position information and classification probability simultaneously in a rapid manner. Generally, two-stage networks possess a higher detection accuracy compared to one-stage networks; however, they have a relatively lower detection speed and thus may not be the best option for real-time operations. Examples of two-stage networks include regions with convolutional neural networks (R-CNN), Fast R-CNN, region-based fully convolutional network (R-FCN), and Mask R-CNN, while examples of one-stage networks include YOLO and the single shot multibox detector (SSD).
An example of a two-stage network is proposed in [114]. It is based on a novel deep CNN cascading architecture composed of two networks. The first network detects all the insulators in the images by confining them inside detection boxes and cropping them, while the rest of the image is discarded. The second network then detects the missing caps in the cropped images. The scarcity of defective images for training was addressed by different data augmentation methods. The precision and recall of the proposed method were found to be 91% and 96%, respectively.
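The two detection metrics quoted above follow directly from the counts of true positives (TP), false positives (FP), and false negatives (FN): precision = TP / (TP + FP) and recall = TP / (TP + FN). The minimal sketch below uses hypothetical counts that are not taken from [114].

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported defects that are real defects."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of real defects that were reported."""
    return tp / (tp + fn)


# Hypothetical counts: 9 defects found correctly, 1 false alarm, 3 missed.
p = precision(tp=9, fp=1)  # 0.9
r = recall(tp=9, fn=3)     # 0.75
```

For inspection tasks, a missed defect is usually more costly than a false alarm, which is why recall is often the metric of primary interest.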
To address the issue of slow detection speeds, the authors of [115] proposed a one-stage network using a YOLOv3 deep learning model to recognize and classify images. Moreover, their proposed system combines deep learning with the Internet of Things (IoT) through a Raspberry Pi. The work also considered the motion blur in aerial images by implementing a super resolution CNN to reconstruct the blurry images into high-resolution images before classification. The results show that the proposed system achieves rapid detection and a high accuracy of 95.6% in the identification and classification of insulator defects.
One of the early signs of surface damage of non-ceramic insulators is the loss of their hydrophobicity. Hence, measuring the hydrophobicity is crucial for assessing the insulator surface condition. According to IEC 62073, there are three methods to estimate the hydrophobicity level of insulators, i.e., the contact angle, the surface tension, and the spray methods [116]. Among these three methods, the spray method is the one that can be applied in the field. The method involves spraying distilled water on the non-ceramic insulator surface; then, the surface can be classified from HC1 (highly hydrophobic) to HC7 (highly hydrophilic) as shown in Figure 11. The classes are determined based on the size of the wetted area and the contact angle of the droplets. Unfortunately, the main drawback of this method is the subjectivity of human judgment. To overcome this issue, numerous researchers have proposed digital image processing methods to analyze and quantify the hydrophobicity class.
Figure 11. Hydrophobic classes from HC1 (highly hydrophobic) to HC6 (highly hydrophilic) [117].
In [118], the spray method was used to generate a large number of images, which were fed to a deep convolutional neural network model (AlexNet) for the purpose of wettability classification. Compared to other machine learning algorithms, deep learning removes the dependency on manual feature extraction and involves less training time due to the transfer learning approach that was used in the article. The algorithm's performance was very promising when compared to other networks like ResNet50, VGGNet16, VGGNet19 and GoogleNet, with an overall accuracy of approximately 96%. However, this method may require the removal of the insulator from the field, which can be impractical.
To resolve this problem, the authors of [119] proposed a method to detect the hydrophobicity of composite insulators using UAV technology. The drone is equipped with a camera and a water spray device in addition to an embedded artificial intelligence (AI) module for non-intrusive classification. First, You Only Look Once version 3 (YOLOv3) is used to locate the wet umbrella skirt area of the composite insulator in the complex aerial image; then, VGGNet16 is used to classify the hydrophobicity of the images. An overall classification accuracy of 92.57% was achieved.
Other approaches included using image data and deep learning algorithms to assess the material surface degradation in composite insulators. Tracking and erosion is one of the irreversible physical defects that occur in non-ceramic insulators and can lead to insulator failure. In [120,121], the authors used transfer learning to train a CNN to estimate the severity of erosion in silicone rubber insulators. The algorithm showed robust performance under different lighting conditions, which shows the potential of the proposed model in practical applications.
To the best of our knowledge, there is a gap in the literature regarding the training of deep learning models to detect internal and external physical defects using radiation-based measurements such as RF antennas and ultrasonic sensors. All the work that has been done on radiation-based techniques involves the use of feature extraction and machine learning techniques [122][123][124][125][126].

Contamination Diagnosis
Several methods have utilized image processing techniques to classify the pollution severity. In [127], for example, a total of 4500 images of ceramic and silicone rubber post insulators were captured under different surface conditions, i.e., clean dry surface, clean with water droplets, contaminated surfaces with cement, contaminated surfaces with soil, wet surface contaminated with soil and wet surface contaminated with cement. Deep CNNs were employed for classification, and a brute-force model selection was introduced to identify and optimize the structure of the CNN classifiers. It was demonstrated that this model selection achieved a highly accurate architecture. Furthermore, a complexity reduction technique was then applied to achieve lighter architectures, with a view to implementing the CNN classifier in resource-limited embedded devices, where memory usage and FLOP counts must be kept low. The results show that the proposed model reduction technique yields a three times lighter architecture at the expense of a slight reduction in the classification accuracy (6.5% only).
In [128], the pollution severity was estimated using UV images. First, insulator samples were uniformly contaminated with ESDD levels of 0.1 mg/cm², 0.2 mg/cm², and 0.4 mg/cm². After applying the voltage, a UV camera was used to capture the discharge activities on the surface of contaminated porcelain insulators. The images were then preprocessed by first graying the image, then changing the pixels to 0 or 255. When doing so, the light spot becomes white, while the rest of the image becomes black, thus highlighting the regional characteristics of the discharge spot in the image. Finally, a CNN was used to evaluate the pollution severity of the insulators. It was found that there is a positive correlation between the pollution level and the severity of the discharge activities under the same voltage level.
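The graying and binarization steps described above can be sketched as follows. This is a minimal illustration that assumes images stored as nested lists of RGB tuples, the common luma weights 0.299, 0.587, and 0.114 for graying, and a fixed threshold of 128; the actual weights and threshold used in [128] are not stated here.

```python
from typing import List, Tuple

RGBImage = List[List[Tuple[int, int, int]]]
GrayImage = List[List[int]]


def to_gray(img: RGBImage) -> GrayImage:
    """Convert RGB pixels to gray levels using assumed luma weights."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in img]


def binarize(gray: GrayImage, threshold: int = 128) -> GrayImage:
    """Set each pixel to 255 if it reaches the threshold, otherwise to 0."""
    return [[255 if v >= threshold else 0 for v in row] for row in gray]


# Hypothetical 2x2 UV frame: bright discharge spots in the left column.
frame = [[(255, 255, 255), (0, 0, 0)],
         [(200, 200, 250), (10, 20, 30)]]
mask = binarize(to_gray(frame))
```

In the resulting mask the discharge spots are white on a black background, so quantities such as the spot area can be read off by counting the pixels equal to 255.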
Other approaches used deep learning models to estimate the LC as an indirect method to estimate the contamination level. For example, the authors of [129] proposed an online monitoring system that uses real-time weather data to predict and classify the LC using a bidirectional long short-term memory (Bi-LSTM) model. The sequential weather data consist of parameters like humidity, temperature, rainfall, dew point, solar illumination, wind speed, air pressure and wind direction. They are measured hourly and transferred to data servers. Besides the meteorological data, the LCs are also measured using a current transformer for the purpose of training and validating the networks. The LC is classified into one of eight groups (levels), i.e., 100 µA-500 µA, 500 µA-1 mA, 1 mA-5 mA, 5 mA-10 mA, 10 mA-100 mA, 100 mA-1 A, 1 A-10 A, and greater than 10 A. Grid search is used to tune the hyperparameters involved in the Bi-LSTM model. The results show that the model achieved an improvement of 12.8% in accuracy compared to other models like LSTM, GRU and RNN.
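Assigning a measured LC to one of the eight levels listed above can be sketched as a simple lookup. In this minimal sketch the function name lc_level is hypothetical, each lower bound is assumed to be inclusive, and currents below 100 µA are assumed to fall outside the scheme; neither convention is stated in [129].

```python
from bisect import bisect_right
from typing import Optional

# Lower edges of levels 1 to 8, in amperes.
EDGES_A = [100e-6, 500e-6, 1e-3, 5e-3, 10e-3, 100e-3, 1.0, 10.0]


def lc_level(current_a: float) -> Optional[int]:
    """Return the level (1 to 8) of a leakage current given in amperes.

    Assumptions: lower edges are inclusive, and currents below 100 µA
    return None because they are outside the eight listed groups.
    """
    level = bisect_right(EDGES_A, current_a)  # number of edges <= current
    return level if level > 0 else None


level_small = lc_level(3e-3)   # 1 mA to 5 mA      -> level 3
level_large = lc_level(0.25)   # 100 mA to 1 A     -> level 6
level_none = lc_level(50e-6)   # below 100 µA      -> None
```

The integer levels can then serve as the class labels that a sequence model predicts from the weather data.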
Seven PD sources pertinent to artificially damaged insulator sheds were simulated in a controlled lab experiment in [42]. The first three sources corresponded to damage in one shed of an HV insulator. The other four corresponded to damage in two or all sheds of the HV insulator. A CNN was used for classification, and Bayesian optimization was used to tune the hyperparameters of the CNN architecture. To generate the training data, the PD signals were transformed into scalogram patterns using wavelets. Three different mother wavelets (Morse, Amor, and Bump) were used. In addition, different training optimizers (including stochastic gradient descent with momentum (SGDM), RMSprop, Adam, and Nadam) were used. The authors compared the Bayesian-CNN (B-CNN) with the traditional CNN with no Bayesian optimization, in addition to other off-the-shelf deep learning architectures such as VGG19, ResNet50, and GoogleNet. The average classification accuracy was used as the performance metric. The study reported enhanced results for the B-CNN compared to the other architectures with the Bump mother wavelet. The authors also tested the model on another 15-kV porcelain insulator dataset, and the average classification accuracy showed promising results, which reflect the generalization capabilities of the B-CNN.
In general, the literature shows that an inadequate amount of work has been devoted to training deep network architectures to classify and predict pollution levels non-intrusively. The work done is focused either on applying deep learning models to intrusive measurement techniques [130,131] or on applying classical machine learning using non-intrusive techniques [132][133][134][135]. More research is needed to combine deep learning models with non-intrusive approaches for monitoring, particularly those based on radiation type sensors. Furthermore, the majority of the publications focused on employing one type of sensor for their diagnostics. Although this may yield satisfactory results, the results could be further improved using multiple sensors. For example, ultrasonic sensors can detect both low and high frequency surface discharges, but may have difficulty detecting internal discharges. On the other hand, RF antennas can be utilized to detect internal and external high frequency discharges but cannot be used to detect low frequency discharges. Thus, combining ultrasonic sensors and RF antennas makes it possible to detect and classify a wider range of defects.

Conclusions
Monitoring of electrical insulation of high voltage apparatus is crucial for the reliable operation of power systems. Such apparatus includes, but is not limited to, gas-insulated switchgear (GIS), transformers, cables, rotating machines, and outdoor insulators. Extensive research has been done on the classification of sources of partial discharge (PD), detection and localization of faults that take place in such apparatus, and the quality and remaining lifetime of insulating material. Modern techniques have been based on machine learning methods, where the input to such methods is composed of manually-extracted features, i.e., feature extraction has required the intervention of human experts. Deep learning, which is a branch of machine learning, has been used to enhance the performance of PD classification, fault and defect detection, contamination diagnosis of outdoor insulators, etc. This enhancement is attributed to the capability of deep learning techniques to use raw data as the input to the classification model. In other words, instead of using manually-extracted features, raw data such as PRPD patterns, time-series waveforms, or images are used as the input to the deep learning systems. This allows the classification model to be fully automated, with the feature extraction stage integrated into the learning stage.
In this article, the potential of applying deep learning in assessing the health conditions of different power system assets is highlighted. The following shortcomings/future needs are identified:

1. Most published research employs training data generated in a laboratory environment or by computer simulation, which leads to high classification accuracy. Such data are used because acquiring real data is expensive, intrusive, and time-consuming, but data collected in a controlled lab environment always present a limitation. Hence, integrating research work into real online or offline systems is always desirable in order to incorporate all the uncertainties of such systems in the learning process of deep learning models. Moreover, future research should focus more on using generative adversarial networks to generate data that mimic real data, instead of relying on lab data that present their own limitations. In addition, future directions should focus more on DL techniques such as one-shot learning [136] to address the issue of small datasets, which is a typical restriction in HV applications.

2. Prior knowledge of the defect types and/or of the exact location of the defect is far from the reality of field conditions. Moreover, unknown sources and types of external noise may hinder the capability of deep learning algorithms to identify and/or localize the defect type. Hence, future research needs to focus more on unsupervised learning when it comes to high voltage applications.

3. More work should focus on the occurrence of multiple, simultaneous PDs or faults, since in real-life systems multiple sources of faults or PDs can take place at the same time.

4. One of the limitations of the reported research is the utilization of single sensors like ultrasonic sensors, RF antennas, or IR cameras. It is expected that the use of multiple sensors can improve the overall classification accuracy when sensor fusion is applied and different 1D and/or 2D signals are fed to the deep learning classifiers.

5. Integrating state-of-the-art deep learning algorithms with promising technologies like drones can improve the inspection efficiency of outdoor insulation systems. With the current improved computational power of micro-controllers, real-time condition monitoring and diagnostics of different defects are feasible using drones and deep learning algorithms.

6. Future research directions should focus on developing electrical insulation ageing models using DL techniques that employ data from polarization methods, such as RVM, PDC, and FDS.

7. The use of deep learning techniques in high voltage applications is still at an early stage. More work should be done on establishing standards for specifying the optimal architecture per application. One aspect of this is using already established hyperparameter optimization techniques such as Bayesian optimization. In addition, the industrial deployment of DL algorithms should be addressed, since each scenario requires a different approach. If the deployment takes place on a local server, the aim would be to maximize the performance of the algorithm while taking advantage of high-speed and high-end hardware resources. This is different when the deployment takes place on a portable monitoring device, where restrictions on space and speed apply.

8. With the emergence of digital twin technologies, deep learning should be utilized for different digital twins of assets, such as transformers or rotating machines. Digital twins are virtual representations of the interactions and behavior that assets can undergo in the physical world. More information on the application of digital twins in power system assets can be found in [137][138][139].