Partial Discharge Recognition with a Multi-Resolution Convolutional Neural Network

Partial discharge (PD) is not only an important symptom for monitoring the imperfections in the insulation system of a gas-insulated switchgear (GIS), but also the factor that accelerates the degradation. At present, monitoring ultra-high-frequency (UHF) signals induced by PDs is regarded as one of the most effective approaches for assessing the insulation severity and classifying the PDs. Therefore, in this paper, a deep learning-based PD classification algorithm is proposed and realized with a multi-column convolutional neural network (CNN) that incorporates UHF spectra of multiple resolutions. First, three subnetworks, as characterized by their specified designed temporal filters, frequency filters, and texture filters, are organized and then intergraded by a fully-connected neural network. Then, a long short-term memory (LSTM) network is utilized for fusing the embedded multi-sensor information. Furthermore, to alleviate the risk of overfitting, a transfer learning approach inspired by manifold learning is also present for model training. To demonstrate, 13 modes of defects considering both the defect types and their relative positions were well designed for a simulated GIS tank. A detailed analysis of the performance reveals the clear superiority of the proposed method, compared to18 typical baselines. Several advanced visualization techniques are also implemented to explore the possible qualitative interpretations of the learned features. Finally, a unified framework based on matrix projection is discussed to provide a possible explanation for the effectiveness of the architecture.


Introduction
Gas-insulated switchgears (GISs) are widely used as the major control and protection equipment in medium to ultra-high voltage substations, due to their superior compactness, high reliability, strong dielectric strength, and maintenance-free properties. However, unavoidable imperfections and deterioration in the insulation system pose serious threats to the reliability and safety of the GIS and whole power grids [1][2][3]. The occurrence of partial discharges (PDs) is not only one of the main characteristics that can effectively reflect the inner dielectric flaws, but is also the cause of the accelerated degradation [4,5]. Therefore, it is imperative to assess the potential correlation between the PD patterns and the defect types, so that corresponding maintenance activities can be taken before a complete breakdown.
The conventional monitoring methods of PD could be summarized into three main categories: chemical, electrical, and acoustic [6]. Among these, ultra-high-frequency (UHF) signals are becoming 1.
We propose a deep multi-column architecture that incorporates multi-resolution information for the recognition of UHF signals with specified temporal filters, frequency filters, and texture filters.
Multi-sensor fusion based on LSTM is also utilized to further improve the diagnostic accuracy.

2.
To alleviate the risk of overfitting induced by the expanded structure, a spectral embedding technique based on manifold learning is utilized for the pretraining. We demonstrate that the transfer learning-based training process is able to achieve remarkable performance with limited data.

3.
A unified matrix projection-based framework is proposed to enhance the interpretability of the proposed model. Table 1. Summary of the relevant features and classifiers that have been used for partial discharge (PD) recognition.

Reference PD Types Feature Extraction Classifier Recognition Accuracy
Umamaheswari and Sarathi [9] Four types of artificial insulation defects Partial power analysis SVM Average of 95.25% Darabad et al. [10] Ten types of PD sources Texture features + PCA SOM Grouped data visualization Mas'ud et al. [11] Six PD fault geometries in oil, air, and poly-ethylene-terephthalate Statistical parameters from the PRPD patterns ENN Average 95% Evagorou et al. [12] Four types of artificial PDs in oil and air.
Wavelet packet transform + statistical parameters PNN 97.49%, 91.9%, 100%, and 99.8% Wang et al. [13] Four types of artificial defect models in the oil and air. S-transform Affinity propagation clustering 99.67% Li et al. [14] Four kinds of typical defects in a 252 kV GIS Cumulative energy in time and frequency domain + mathematical morphological gradient Fuzzy maximum-likelihood algorithm 98% Li et al. [15] Four kinds of defects in GIS Statistical parameters of both the TRPD and PRPD patterns Dempster-Shafer evidence theory + Neural network 97.25% Albarracín et al. [16] Separation of PDs with electromagnetic noises Power ratios and the signal times Grouped data visualization The separation criteria was given Li et al. [17] Six types of artificial insulation defect models in the oil, air, and paper-fiber interface.
Wang et al. [18] Three typical PD defect patterns in a 252 kV GIS Fourier transform + Chromatic methodology SVM 86.67% Gu et al. [19] Three common defect types of 25 kV cross-linked polyethylene (XLPE) power cable HHT + Fractal parameters NN 100%, when 5% random white noise Dai et al. [20] Four kinds of artificial defects in GIS S-transform + Singular value decomposition (SVD) SVM 98.33% Majidi et al. [21] Air voids with dimensions of 1, 1.5, and 2 mm 1-norm, 2-norm, and infinity-norm of the statistical features Sparse representation vector 99.7%, 92.9%, 94.0%, and 81.6% for four different scenarios Khan et al. [22] Parallel-plane stainless steel electrodes in SF6 with 10 different particle lengths and positions Statistical features + PCA NN 88% The rest of this paper is organized as follows. Some related theories are introduced first in Section 2. Section 3 provides the details of both the model architectures and the transfer learning process. The platform presented in Section 4 provides the necessary dataset for verifying the proposed method, and Section 5 describes the recognition results. Based on the experimental results, Section 6 discusses the effectiveness of the proposed model. Finally, Section 7 concludes the paper.

Related Work
In deep learning, the most straightforward way to improve the model performance has shifted from feature engineering to network architecture designing. ResNet [30] proposed by Microsoft is a good example, which introduces an extra column with residual mapping, and has state-of-the-art performance. Although the early implementation of the multi-column neural network only aimed at accelerating the training process with multiple Graphics Processing Units (GPUs) [28], it has been realized that the several columns can become specified experts for different features. For example, the Inception model [31] is a successful multi-column deep neural network (DNN) to approximate the optimal sparse structure of the input with different convolutional kernels. The Xception model [32] goes even further by decoupling both the cross-channel and spatial correlations to obtain a multi-branch structure. In addition, Saining exposed a new dimension called cardinality to standardize the multi-branch structure [33]. Ling attempted to train a multi-column neural network with spectral clustering to embed the hierarchical manifolds into a compact vector [34].
Although sharing the similar structure of multiple columns, the multi-resolution theories focus more on the structures of different scales. Sander attempted to learn music features by extracting the mel-spectrograms with different windows [35]. Michael utilized CNN to predict future images from different resolutions and concluded that a lower resolution was able to capture the entire context, while a higher resolution recovered more details [36]. The similar idea was also shared by MMDenseNet [37], which enables the neural network to model both the fine-grained structures and the long contexts efficiently. In the Deep Belief Network, the multi-resolution structures also showed their superiority in image-related tasks [38]. Furthermore, the idea of multiple resolutions has long been integrated into the underlying design of neural networks. The pooling layers and the variants, such as subsequent spatial pyramid pooling all attempt to obtain the larger space by reducing the resolution of the intermediary layers [39].

Gabor Representations and Multi-Resolution Analysis
Fourier transform is a powerful tool to provide the frequency distributions of complex signals. However, we must simply remark that the one-dimensional solutions are delocalized in time and not comprehensive enough. The bidimensional functions that provide both the frequency and time localizations are preferred. To introduce the time resolution, a time-localized window is utilized to suppress the signal outside the neighborhood, which is the well-known short-time Fourier transform (STFT) [24]: where x(u) is the original signal, and h is the analysis window. t and v represent the time and frequency, respectively. In practical applications, the discrete STFT is preferred, which is given by: where m, n ∈ Z. t 0 and v 0 are the time and frequency steps. Specifically, the STFT with Gaussian windows is able to theoretically maximize the concentration in the TF plane by approaching the lower bound of the Heisenberg-Gabor inequality [24]: Here, T is the time duration, and B is the normalized bandwidth. The STFT of this particular case is also called the Gabor transform. The most important inference of the Heisenberg-Gabor inequality is the trade-off between time and frequency resolutions in accordance with the width of the analysis window. An illustrative example is shown in Figure 1. As shown in Figure 1, the long window gives rise to harmonic information, while the pulses emerge from the short time window. In this study, the spectrograms of different resolutions are all employed for the diagnosis of the UHF signals. It should be noted that the concept of multiple resolutions is different from the wavelet or S-transform, which still generate only a single resolution for each frequency band.
Furthermore, a semiquantitative method is proposed as an assistance criteria for choosing the proper analysis windows, considering the morphology characteristics of the STFT spectrums. As indicated by Figure 2, centering on the max value, the time and frequency ranges before reaching a certain threshold are calculated, which can be regarded as a rough description of the peak shape. Intuitively, the greater the time range is, the less the time resolution is retained, while the frequency isolation is more obvious. The frequency range varies in an opposite way. The proper window can be selected based on the trade-off between the time and frequency resolutions, depending on their priorities. As shown in Figure 1, the long window gives rise to harmonic information, while the pulses emerge from the short time window. In this study, the spectrograms of different resolutions are all employed for the diagnosis of the UHF signals. It should be noted that the concept of multiple resolutions is different from the wavelet or S-transform, which still generate only a single resolution for each frequency band.
Furthermore, a semiquantitative method is proposed as an assistance criteria for choosing the proper analysis windows, considering the morphology characteristics of the STFT spectrums. As indicated by Figure 2, centering on the max value, the time and frequency ranges before reaching a certain threshold are calculated, which can be regarded as a rough description of the peak shape. Intuitively, the greater the time range is, the less the time resolution is retained, while the frequency isolation is more obvious. The frequency range varies in an opposite way. The proper window can be selected based on the trade-off between the time and frequency resolutions, depending on their priorities.

Convolutional Neural Network and the Design of Convolutional Kernels
In recent years, CNN and its derivatives have been regarded as the state-of-the-art choices for computer vision tasks. The key ideas in CNN include the concepts of local receptive fields, shared weights, and spatial subsampling. Specifically, in each convolutional layer, the feature maps are first convolved with several kernels. After the biases, activation function and sometimes, the pooling operation, the outputs are then fed into the next convolutional layer, which can be summarized as: where * donates the convolutional operator. and are the input and output of the lth layer, respectively, and is the bias.
is a selection of the feature maps, and is the element of the kernel matrix.
The activation function is chosen as the Relu function for the convenience of the gradient calculation. A simple parameter analysis can be given as follows. Assuming that the volume of input is 1 × 1 × 1, after applying convolutions of filters (without zero padding), the new volume of 2 × 2 × 2 is: where 1 and 2 are the width and height of the filter cores, and 1 and 2 are the strides. One special characteristic of CNN is the shift invariance. However, although dramatically useful in image recognition, the invariance features pose an impediment for the recognition of Gabor representations that own a clear physical meaning. Any shift of a generally rectangular filter may distort the original TF distributions and confuse the CNN. To alleviate the restriction, three kinds of convolutional kernels are designed in accordance with the spectrograms of different resolutions, namely frequency filters, temporal filters, and texture filters. As shown by the red rectangles in Figure 3, the temporal filters are designed for maps of higher time resolutions, and they share the same widths with the maps, while the frequency filters are

Convolutional Neural Network and the Design of Convolutional Kernels
In recent years, CNN and its derivatives have been regarded as the state-of-the-art choices for computer vision tasks. The key ideas in CNN include the concepts of local receptive fields, shared weights, and spatial subsampling. Specifically, in each convolutional layer, the feature maps are first convolved with several kernels. After the biases, activation function and sometimes, the pooling operation, the outputs are then fed into the next convolutional layer, which can be summarized as: where * donates the convolutional operator. x l−1 j and x l i are the input and output of the lth layer, respectively, and b is the bias. M j is a selection of the feature maps, and k l ij is the element of the kernel matrix.
The activation function f is chosen as the Relu function for the convenience of the gradient calculation. A simple parameter analysis can be given as follows. Assuming that the volume of input is W1 × H1 × D1, after applying convolutions of k filters (without zero padding), the new volume of W2 × H2 × D2 is: where F1 and F2 are the width and height of the filter cores, and S1 and S2 are the strides. One special characteristic of CNN is the shift invariance. However, although dramatically useful in image recognition, the invariance features pose an impediment for the recognition of Gabor representations that own a clear physical meaning. Any shift of a generally rectangular filter may distort the original TF distributions and confuse the CNN. To alleviate the restriction, three kinds of convolutional kernels are designed in accordance with the spectrograms of different resolutions, namely frequency filters, temporal filters, and texture filters. As shown by the red rectangles in Figure 3, the temporal filters are designed for maps of higher time resolutions, and they share the same widths with the maps, while the frequency filters are suitable for higher frequency resolutions, and they are as high as the spectrograms. The texture filters are of medium size as supplements. suitable for higher frequency resolutions, and they are as high as the spectrograms. The texture filters are of medium size as supplements. The designing principles of the special convolutional filters may be explained qualitatively with orthogonality, since the convolutional operation is basically a template matching process. The first step is to make the filter cover either a complete frequency axis or a whole time axis, thereby allowing for perfect frequency or temporal matching. Second, by assuming that the filter is also a kind of special signal, it is assumed that orthogonality exists between the unfolding direction of the filters and the spectrograms, to maximize the diversity in a filter as much as possible. Following this idea, the frequency filters are applied to the Gabor maps of higher frequency resolutions, while the temporal filters are applied to the higher time resolutions. Another advantage of the frequency and temporal filters refers to Equations (5) and (6). The outputs of the frequency and temporal filters are only one-dimensional, which dramatically decreases the model's size.

Architecture of the Proposed Model
As has been mentioned before, TF maps of different resolutions can reveal different rhythms in the 2-dimensional space. It is hypothesized that it would be of great benefit to explicitly provide TF maps of different resolutions as inputs to a deep model. This combines the multiple views with deep learning's ability to extract intrinsic properties from raw inputs, so that the discriminative features could be learned while maintaining explanatory factors with physical meaning.
Accordingly, the proposed multi-resolution network is summarized in Figure 4. It mainly contains three parts, namely, single-resolution embedding, multi-resolution fusion, and multi-sensor fusion. Here, the first part is realized by three subnetworks; the second part is a fully-connected neural network attached at the end of the subnetworks, and Long Short-Term Memory (LSTM) [40] is implemented for multi-sensor fusion. The designing principles of the special convolutional filters may be explained qualitatively with orthogonality, since the convolutional operation is basically a template matching process. The first step is to make the filter cover either a complete frequency axis or a whole time axis, thereby allowing for perfect frequency or temporal matching. Second, by assuming that the filter is also a kind of special signal, it is assumed that orthogonality exists between the unfolding direction of the filters and the spectrograms, to maximize the diversity in a filter as much as possible. Following this idea, the frequency filters are applied to the Gabor maps of higher frequency resolutions, while the temporal filters are applied to the higher time resolutions. Another advantage of the frequency and temporal filters refers to Equations (5) and (6). The outputs of the frequency and temporal filters are only one-dimensional, which dramatically decreases the model's size.

Architecture of the Proposed Model
As has been mentioned before, TF maps of different resolutions can reveal different rhythms in the 2-dimensional space. It is hypothesized that it would be of great benefit to explicitly provide TF maps of different resolutions as inputs to a deep model. This combines the multiple views with deep learning's ability to extract intrinsic properties from raw inputs, so that the discriminative features could be learned while maintaining explanatory factors with physical meaning.
Accordingly, the proposed multi-resolution network is summarized in Figure 4. It mainly contains three parts, namely, single-resolution embedding, multi-resolution fusion, and multi-sensor fusion. Here, the first part is realized by three subnetworks; the second part is a fully-connected neural network attached at the end of the subnetworks, and Long Short-Term Memory (LSTM) [40] is implemented for multi-sensor fusion. 1. Single-resolution embeddings with the subnetworks Each branch of the subnetworks has its own micro-structure, which is basically a CNN with the convolutional operation. The main differences lie in the specified designed frequency filters, temporal filters, and texture filters. After the convolutional filter, an activation function, a pooling layer, and a dropout layer are arranged hierarchically as a basic building block, which repeats itself twice before connecting to a flattening layer. Finally, a fully-connected layer is attached at the end of each subnetwork to generate an embedding vector as the intermediate feature space.

Multi-resolution information fusion using the fully-connected neural network
The fully connected neural network is a normal multi-layer perceptron (MLP) containing two layers of 100 units and 50 units, respectively, that are attached after the concatenation of the three subnetworks. The model automatically chooses the best filters by assigning them higher weights, and the information of different resolutions is merged to offer comprehensive representations of compact forms.
3. Multi-sensor information fusion using the LSTM The propagation and reflections of UHF signals in the GIS can significantly influence the characteristics of the signals, especially at the L corner for the inconsistency of the wave impedance [41]. Therefore, the characteristics of sensors of different positions can reveal different patterns. By fusing the information from different sensors, it is possible to obtain a more comprehensive understanding of the PD patterns.
Recurrent neural network (RNNs) [42] is the most popular choice for sequence mining tasks such as speech recognition, natural language translation, and video recognition. As shown in Figure   Figure 4. Structure of the multi-resolution CNN and the LSTM fusion.

Single-resolution embeddings with the subnetworks
Each branch of the subnetworks has its own micro-structure, which is basically a CNN with the convolutional operation. The main differences lie in the specified designed frequency filters, temporal filters, and texture filters. After the convolutional filter, an activation function, a pooling layer, and a dropout layer are arranged hierarchically as a basic building block, which repeats itself twice before connecting to a flattening layer. Finally, a fully-connected layer is attached at the end of each subnetwork to generate an embedding vector as the intermediate feature space.

Multi-resolution information fusion using the fully-connected neural network
The fully connected neural network is a normal multi-layer perceptron (MLP) containing two layers of 100 units and 50 units, respectively, that are attached after the concatenation of the three subnetworks. The model automatically chooses the best filters by assigning them higher weights, and the information of different resolutions is merged to offer comprehensive representations of compact forms.

Multi-sensor information fusion using the LSTM
The propagation and reflections of UHF signals in the GIS can significantly influence the characteristics of the signals, especially at the L corner for the inconsistency of the wave impedance [41]. Therefore, the characteristics of sensors of different positions can reveal different patterns. By fusing the information from different sensors, it is possible to obtain a more comprehensive understanding of the PD patterns.
Recurrent neural network (RNNs) [42] is the most popular choice for sequence mining tasks such as speech recognition, natural language translation, and video recognition. As shown in Figure 5, in each time step t, the hidden state h t is updated by a function of the last hidden state h t−1 and the new input x t : where the hidden state h t is a d-dimensional vector, and b is the bias. W and U are the weights. Tanh is the activation function. LSTM is a branch of RNN, which introduces a memory cell to preserve the information for long sequences, and relieves the exploding or vanishing gradients problem in the original RNN [40]. The basic unit of LSTM can be explained as a collection of a block input x t , a memory cell c t , and a hidden state h t , controlled by an input gate i t , a forget gate f t , and an output gate o t as shown in Figure 5, with the following relationships: where σ is the sigmoid function and donates the point multiplication. As the net structure is very complex, more sophisticated techniques are required to relieve the risk of overfitting. Details of the training process will be presented in the next part.
Sensors 2018, 18, x 9 of 26 5, in each time step , the hidden state ℎ is updated by a function of the last hidden state ℎ and the new input : where the hidden state ℎ is a d-dimensional vector, and is the bias. and are the weights. Tanh is the activation function. LSTM is a branch of RNN, which introduces a memory cell to preserve the information for long sequences, and relieves the exploding or vanishing gradients problem in the original RNN [40]. The basic unit of LSTM can be explained as a collection of a block input , a memory cell , and a hidden state ℎ , controlled by an input gate , a forget gate , and an output gate as shown in Figure 5, with the following relationships: where σ is the sigmoid function and ⊙ donates the point multiplication. , , , , , , , and are the weights. , , , and are the biases. Intuitively, the input gate and the output gate control how much the old memories and new information are recombined as the new memory cell. The hidden state ℎ is a gated, partial view of the internal memory of the cell. By varying the gate values in different time steps, the LSTM can keep itself up-to-date while maintaining the most discriminable information acquired from different time steps. Finally, the features from different sensors are integrated for information fusion to give the final diagnosis results.
As the net structure is very complex, more sophisticated techniques are required to relieve the risk of overfitting. Details of the training process will be presented in the next part.

Model Training by Transfer Learning
Although the multi-resolution views and the multi-steam structures gain great potential for more comprehensive diagnosis, a larger number of parameters accompanies its bigger size, making

Model Training by Transfer Learning
Although the multi-resolution views and the multi-steam structures gain great potential for more comprehensive diagnosis, a larger number of parameters accompanies its bigger size, making the model more prone to overfitting. Thus, a transfer learning framework was proposed that classifies the training process into the three stages of single-resolution embedding, multi-resolution fusion, and multi-sensor fusion, in accordance with the network's structure.
Transfer learning is a technique focusing on transferring knowledge gained from one problem to a different but related task [43]. The parameters of the target network are partly or all initialized from another network that is pretrained from the source task, while the nontransferred weights are randomly initialized. In most cases, transfer learning is done by weight transfer.
The usage of transfer learning was more flexible in this study, where the source and target problem use the same dataset. The key idea is relative independence among columns of different resolutions. The same dataset is much less prone to overfitting when it is applied to a smaller problem. Therefore, it is beneficial if the single-column is trained first independently. The knowledge is then transferred to the higher levels for more sophisticated targets. The details of the three learning stages are shown in Figure 6.

1.
The single-column features extracted by each subnetwork are learned using manifold embedding.

2.
The weights of the MLP attached at the end of the three subnetworks are randomly initialized, and the whole structure is fine-tuned. 3.
The weights gained from step (2) are frozen as feature extractors. Only the weights of the LSTM are updated for multi-sensor fusion.
Sensors 2018, 18, x 10 of 26 the model more prone to overfitting. Thus, a transfer learning framework was proposed that classifies the training process into the three stages of single-resolution embedding, multi-resolution fusion, and multi-sensor fusion, in accordance with the network's structure. Transfer learning is a technique focusing on transferring knowledge gained from one problem to a different but related task [43]. The parameters of the target network are partly or all initialized from another network that is pretrained from the source task, while the nontransferred weights are randomly initialized. In most cases, transfer learning is done by weight transfer.
The usage of transfer learning was more flexible in this study, where the source and target problem use the same dataset. The key idea is relative independence among columns of different resolutions. The same dataset is much less prone to overfitting when it is applied to a smaller problem. Therefore, it is beneficial if the single-column is trained first independently. The knowledge is then transferred to the higher levels for more sophisticated targets. The details of the three learning stages are shown in Figure 6.
1. The single-column features extracted by each subnetwork are learned using manifold embedding. 2. The weights of the MLP attached at the end of the three subnetworks are randomly initialized, and the whole structure is fine-tuned. 3. The weights gained from step (2) are frozen as feature extractors. Only the weights of the LSTM are updated for multi-sensor fusion.

Single Column Embedding with Manifold Learning
It is worth noting that the subnetwork that embeds the high-dimensional data into a fixed vector is similar to a general dimensionality reduction problem that maps the high dimensional points onto a low dimensional manifold to make the similar points adjacent. Although some general dimensionality reduction frameworks exist [44,45], the same key point is to preserve the distances of

Single Column Embedding with Manifold Learning
It is worth noting that the subnetwork that embeds the high-dimensional data into a fixed vector is similar to a general dimensionality reduction problem that maps the high dimensional points onto a low dimensional manifold to make the similar points adjacent. Although some general dimensionality reduction frameworks exist [44,45], the same key point is to preserve the distances of the original vertex pairs. Given the data x 1,..., x U , the loss function parameterized by α and embedding function f is: where W ij is the element of a pairwise distance matrix in a class-dependent or class-independent way. For example, Multidimensional scaling (MDS) [46] can be used in this pattern by defining W ij as the Euclidean distance. For the specified problem, the distances are defined in a class-dependent way as follows. W ij = 0 if i and j belong to the same class, and 1 otherwise. Furthermore, to make the loss function learnable and more suitable for the subnetworks, a Siamese network structure [47] is implemented to realize the pairwise affinity calculation. A Siamese network is a special neural network with two symmetric branches, as shown in Figure 7. Each time, only one subnetwork is fabricated into the Siamese structure for pretraining with the following loss function: where m is the margin. The loss function is different from MDS in two ways. First, the distance is class-dependent. Second, the losses are chosen as the hinge losses for better generalization. After accomplishing the model construction, Adadelta optimizer is utilized for the weights update. the original vertex pairs. Given the data ,…, , the loss function parameterized by α and embedding function is: where is the element of a pairwise distance matrix in a class-dependent or class-independent way. For example, Multidimensional scaling (MDS) [46] can be used in this pattern by defining as the Euclidean distance. For the specified problem, the distances are defined in a class-dependent way as follows.
= 0 if and belong to the same class, and 1 otherwise.
Furthermore, to make the loss function learnable and more suitable for the subnetworks, a Siamese network structure [47] is implemented to realize the pairwise affinity calculation. A Siamese network is a special neural network with two symmetric branches, as shown in Figure 7. Each time, only one subnetwork is fabricated into the Siamese structure for pretraining with the following loss function: where is the margin. The loss function is different from MDS in two ways. First, the distance is class-dependent. Second, the losses are chosen as the hinge losses for better generalization. After accomplishing the model construction, Adadelta optimizer is utilized for the weights update.

Transfer Learning for Cascaded Training
For the other two steps, the weights of the MLP that were attached at the end of the three subnetworks were randomly initialized first, and then another layer with the same node number of the classes was connected to make a complete single-sensor diagnosis model. After fine-tuning with Adadelta, the weights gained from step (2) were frozen, and they only acted as feature extractors for the information fusion of different sensors, while the weights of the LSTM were made to be trainable. The convergence speeds of the two steps were likely to be much faster compared with the first step, due to the existing class-dependent knowledge. In addition, some advanced training techniques such as early stopping and heuristic learning rate adaption were also utilized.

PD Laboratory Setup
For the purpose of the experimental study, a simulative L-shaped GIS tank with a coaxial cylindrical conductor was implemented to simulate the realistic equipment, as shown in Figures 8

Transfer Learning for Cascaded Training
For the other two steps, the weights of the MLP that were attached at the end of the three subnetworks were randomly initialized first, and then another layer with the same node number of the classes was connected to make a complete single-sensor diagnosis model. After fine-tuning with Adadelta, the weights gained from step (2) were frozen, and they only acted as feature extractors for the information fusion of different sensors, while the weights of the LSTM were made to be trainable. The convergence speeds of the two steps were likely to be much faster compared with the first step, due to the existing class-dependent knowledge. In addition, some advanced training techniques such as early stopping and heuristic learning rate adaption were also utilized.

PD Laboratory Setup
For the purpose of the experimental study, a simulative L-shaped GIS tank with a coaxial cylindrical conductor was implemented to simulate the realistic equipment, as shown in Figures 8 and 9a, with the central conductor diameter being 90 mm, and the enclosure tank diameter being 320 mm. The tank was filled with SF6 gas of 0.1 MPa to simulate the propagation medium.
Four planar equiangular spiral antennas (PESAs) with impedance transformers were installed inside the GIS chamber in the hand holes parallel to the axis of the GIS bus-bar, as shown in Figures 8 and 9b. The PESA utilized in this study is an ultra-wideband antenna that can be installed internally in the GIS. Its outside and inside radii are 109 mm and 2 mm, respectively, with the substrate thickness being 1 mm. The other key parameters that can affect the performance of the PESA sensors include the relative dielectric constant, the spiral growth rate, and the rotation angle, which are chosen as 2.65, 0.364, and 3.5π, respectively. In practice, the most commonly quoted parameter to evaluate the performance of the antenna is the reflection coefficient S 11 , which represents how much power is reflected from the antenna. The range of the bandwidth that satisfies S 11 < −10 dB was 0.8 GHz to 3.5 GHz from the lower limit to the higher limit for the PESAs. Besides, the S 11 parameter ranged between −8 dB to −6 dB from 0.2 GHz and 0.7 GHz, and close to −10 dB in the 0.7 GHz to 0.8 GHz range, which satisfied the measurement requirements of the UHF signals. The detailed S 11 curve can be found in our previous study [48]. The geometrical dimensions of the GIS tank and the relative distances between the PESAs are also illustrated in Figure 8. and 9a, with the central conductor diameter being 90 mm, and the enclosure tank diameter being 320 mm. The tank was filled with SF6 gas of 0.1 MPa to simulate the propagation medium. Four planar equiangular spiral antennas (PESAs) with impedance transformers were installed inside the GIS chamber in the hand holes parallel to the axis of the GIS bus-bar, as shown in Figures  8 and 9b. The PESA utilized in this study is an ultra-wideband antenna that can be installed internally in the GIS. Its outside and inside radii are 109 mm and 2 mm, respectively, with the substrate thickness being 1 mm. The other key parameters that can affect the performance of the PESA sensors include the relative dielectric constant, the spiral growth rate, and the rotation angle, which are chosen as 2.65, 0.364, and 3.5 , respectively. In practice, the most commonly quoted parameter to evaluate the performance of the antenna is the reflection coefficient S11, which represents how much power is reflected from the antenna. The range of the bandwidth that satisfies S11 < −10 dB was 0.8 GHz to 3.5 GHz from the lower limit to the higher limit for the PESAs. Besides, the S11 parameter ranged between −8 dB to −6 dB from 0.2 GHz and 0.7 GHz, and close to −10 dB in the 0.7 GHz to 0.8 GHz range, which satisfied the measurement requirements of the UHF signals. The detailed S11 curve can be found in our previous study [48]. The geometrical dimensions of the GIS tank and the relative distances between the PESAs are also illustrated in Figure 8.   Five types of PDs were fabricated to simulate the various insulation defects, including the floating electrode, a metal protrusion on the conductor and the tank, surface contamination, and free metal particles. Besides, three relative angles between the defect position and the PESA antennas, which were 0°, 90°, and 180°, were integrated with the five defect types in the defect simulations.
As shown in Figure 10, the floating electrode defect was simulated by fixing two adjacent copper nuts to an insulated bolt that was attached on the conductor. The nuts were 3 mm away from the conductor, and the distance between the copper nuts was 1 mm. Two kinds of metal protrusions were replicated by a 30 mm needle adhered on the high-voltage conductor and tank, respectively. Surface contamination was replicated by a piece of aluminum foil (2 × 20 mm 2 ) adhered to the surface of the spacer. Finally, the free particles defect was produced by connecting an insulation tube that contained some aluminum particles inside the conductor, with the length of the insulation tube being 20 mm. The first four types of defects were arranged at three relative angles between the defect positions and the sensor position for 0°, 90°, and 180°, as shown in Figure 8, generating 13 PD conditions in total (the free metal particles were only simulated at 90°). Figure 11 shows some typical acquisitions of the UHF signals within 100 ns, and their corresponding frequency spectra.  Five types of PDs were fabricated to simulate the various insulation defects, including the floating electrode, a metal protrusion on the conductor and the tank, surface contamination, and free metal particles. Besides, three relative angles between the defect position and the PESA antennas, which were 0 • , 90 • , and 180 • , were integrated with the five defect types in the defect simulations.
As shown in Figure 10, the floating electrode defect was simulated by fixing two adjacent copper nuts to an insulated bolt that was attached on the conductor. The nuts were 3 mm away from the conductor, and the distance between the copper nuts was 1 mm. Two kinds of metal protrusions were replicated by a 30 mm needle adhered on the high-voltage conductor and tank, respectively. Surface contamination was replicated by a piece of aluminum foil (2 × 20 mm 2 ) adhered to the surface of the spacer. Finally, the free particles defect was produced by connecting an insulation tube that contained some aluminum particles inside the conductor, with the length of the insulation tube being 20 mm. The first four types of defects were arranged at three relative angles between the defect positions and the sensor position for 0 • , 90 • , and 180 • , as shown in Figure 8, generating 13 PD conditions in total (the free metal particles were only simulated at 90 • ). Figure 11 shows some typical acquisitions of the UHF signals within 100 ns, and their corresponding frequency spectra. Five types of PDs were fabricated to simulate the various insulation defects, including the floating electrode, a metal protrusion on the conductor and the tank, surface contamination, and free metal particles. Besides, three relative angles between the defect position and the PESA antennas, which were 0°, 90°, and 180°, were integrated with the five defect types in the defect simulations.
As shown in Figure 10, the floating electrode defect was simulated by fixing two adjacent copper nuts to an insulated bolt that was attached on the conductor. The nuts were 3 mm away from the conductor, and the distance between the copper nuts was 1 mm. Two kinds of metal protrusions were replicated by a 30 mm needle adhered on the high-voltage conductor and tank, respectively. Surface contamination was replicated by a piece of aluminum foil (2 × 20 mm 2 ) adhered to the surface of the spacer. Finally, the free particles defect was produced by connecting an insulation tube that contained some aluminum particles inside the conductor, with the length of the insulation tube being 20 mm. The first four types of defects were arranged at three relative angles between the defect positions and the sensor position for 0°, 90°, and 180°, as shown in Figure 8, generating 13 PD conditions in total (the free metal particles were only simulated at 90°). Figure 11 shows some typical acquisitions of the UHF signals within 100 ns, and their corresponding frequency spectra.

Experiment Evaluations
A detailed analysis was carried out to ascertain the model's discriminatory capability. Both the relative angles and defect types were taken into consideration, including the floating electrode (0°, 90°, and 180°), the metal protrusion on the conductor (0°, 90°, and 180°), the metal protrusion on the tank (0°, 90°, and 180°), the surface contamination (0°, 90°, and 180°), and the free particles (90° only), thus creating 13 defect modes. There were 1386 samples in total, of which 20% were used for testing. Two cases were designed for recognition with the same dataset, namely, (1) the combined diagnosis of positions and defect types together, which had 13 classes in total, and (2) diagnosis of the defect types only, which has five classes only.

Implementation Details
First, the lengths of the analysis windows were chosen based on the frequency decline range and the time decline range described in Figure 2, and 50% of the max value was chosen as the threshold. Instead of using the ranges directly, their proportions in the whole analysis time axis and frequency axis were calculated for ease of comparison. The time range limit and frequency range limit were chosen as 100 ns and 3 GHz, respectively. The sampling rate was 10 GHz, indicating that 0.1 ns is Figure 11. Typical UHF waveforms: (a) UHF signals of five different defect types; (b) UHF signals acquired from the four channels; (c) UHF signals of the three relative angles.

Experiment Evaluations
A detailed analysis was carried out to ascertain the model's discriminatory capability. Both the relative angles and defect types were taken into consideration, including the floating electrode (0 • , 90 • , and 180 • ), the metal protrusion on the conductor (0 • , 90 • , and 180 • ), the metal protrusion on the tank (0 • , 90 • , and 180 • ), the surface contamination (0 • , 90 • , and 180 • ), and the free particles (90 • only), thus creating 13 defect modes. There were 1386 samples in total, of which 20% were used for testing. Two cases were designed for recognition with the same dataset, namely, (1) the combined diagnosis of positions and defect types together, which had 13 classes in total, and (2) diagnosis of the defect types only, which has five classes only.

Implementation Details
First, the lengths of the analysis windows were chosen based on the frequency decline range and the time decline range described in Figure 2, and 50% of the max value was chosen as the threshold. Instead of using the ranges directly, their proportions in the whole analysis time axis and frequency axis were calculated for ease of comparison. The time range limit and frequency range limit were chosen as 100 ns and 3 GHz, respectively. The sampling rate was 10 GHz, indicating that 0.1 ns is added when the window length increases by 1. Figure 12 shows the variations of the percentages of the average time and frequency ranges. It is obvious that as the window length increased, the frequency resolution increases as well, while the time resolution showed an opposite trend. In order to retain enough resolution information for both the axis, 30% was selected as the threshold. It is observed that the window lengths were near 6 and 100 at this point. Therefore, the temporal window and frequency window were chosen as 6 and 100, respectively. Besides, the crossing point was obtained near 30, which was selected as the window for the texture channel.
added when the window length increases by 1. Figure 12 shows the variations of the percentages of the average time and frequency ranges. It is obvious that as the window length increased, the frequency resolution increases as well, while the time resolution showed an opposite trend. In order to retain enough resolution information for both the axis, 30% was selected as the threshold. It is observed that the window lengths were near 6 and 100 at this point. Therefore, the temporal window and frequency window were chosen as 6 and 100, respectively. Besides, the crossing point was obtained near 30, which was selected as the window for the texture channel. Besides, in order to further reduce the input dimension, a bigger stride was used along the time axis for the frequency channel, while a bigger frequency stride was utilized for the temporal channel in STFT, which could be seen as a subsampling. A cut-off of 2 GHz was also utilized in the temporal channel and the texture channel, considering that the details in the high-frequency band were trivial in their diagnosis. Finally, the windows and the input shapes of the three subnetworks are summarized in Table 2. Considering the limited amount of data, the neural network architecture was wide, but it had only a few convolutional maps per column. The detailed net structures are also shown in Table 2. The three columns are the time column, frequency column, and texture column, respectively.  Besides, in order to further reduce the input dimension, a bigger stride was used along the time axis for the frequency channel, while a bigger frequency stride was utilized for the temporal channel in STFT, which could be seen as a subsampling. A cut-off of 2 GHz was also utilized in the temporal channel and the texture channel, considering that the details in the high-frequency band were trivial in their diagnosis. Finally, the windows and the input shapes of the three subnetworks are summarized in Table 2. Considering the limited amount of data, the neural network architecture was wide, but it had only a few convolutional maps per column. The detailed net structures are also shown in Table 2. The three columns are the time column, frequency column, and texture column, respectively. Flatten-Concatenation-Dense100-Dense50 LSTM with 16 inner nodes-Flatten-Dense50-Output The learning algorithm setup was as follows: the Adadelta optimizer with a learning rate of 1.0 for the pretraining and MLP embedding stages, and a Rmsprop optimizer for LSTM with a learning rate of 0.001 and epsilon of 1 × 10 −6 . The pretraining was carried out for 1000, 800, and 400 iterations, with the patience being 100, 50, and 50. All networks were realized by Keras and NVIDIA 750TI GPUs (NVIDIA, Santa Clara, CA, USA). The tests averaged over five sets of initialization parameters were reported as the final results.

Diagnosis Accuracies
The recognition accuracies of the two cases of (1) diagnosis of both the defect positions and types, and (2) the defect types only, are shown in Tables 3 and 4. Overall, we achieved 97.51% testing accuracy for the recognition of both the angles and defect types, and 98.20% for the defect types only, based on the experimental dataset.
Furthermore, the diagnostic accuracies of single resolutions and single sensors with partial configurations are also presented as quantitative measures of the informativeness. The following points may be observed, based on the recognition accuracies.

1.
Multi-resolution diagnosis leads to a smaller misclassification rate than by using only the single resolution information.

2.
Sensor1 has the best accuracy among all the sensors with respect to the shortest distance to the defect location. The information loss can be quite significant after the L-shaped corner.

3.
A higher frequency resolution is more valuable for single resolution diagnosis.

4.
The multi-sensor combination is not certainly better than a single sensor without appropriate methods for the increasing the dimensionality and the confusing information carried by different sensors.

Loss Curves and Time Consumption
To diagnose the effectiveness of transfer learning, the loss curves and time consumptions of the single column training, multi-column training, and the LSTM training are shown in Figures 13 and 14. It is obvious that during the pretraining stage, the convergence speed of the texture channel was the fastest, while the frequency channel ranked second, and the temporal channel was the slowest. However, the time consumption was the other way around. The texture channel called for the longest training time. This was attributed to the number of dot calculations for each iteration being much more than the other two kinds of kernels. Besides, both the iteration number and the time consumption of the second and third training stages were much less, compared with their previous steps. For the final LSTM information fusion stage, only less than 10 s was needed.
The results in Tables 3 and 4, and Figures 13 and 14 show the exhaustive role played by transfer learning. It is significant that during the three training stages, the diagnostic accuracy increased, while both the training iterations and time decreased dramatically, even with a larger network. This dynamic indicates that the knowledge learned from earlier stages was successfully transferred to a higher level with less proneness to overfitting.
Sensors 2018, 18, x 17 of 26 more than the other two kinds of kernels. Besides, both the iteration number and the time consumption of the second and third training stages were much less, compared with their previous steps. For the final LSTM information fusion stage, only less than 10 s was needed. The results in Tables 3 and 4, and Figures 13 and 14 show the exhaustive role played by transfer learning. It is significant that during the three training stages, the diagnostic accuracy increased, while both the training iterations and time decreased dramatically, even with a larger network. This dynamic indicates that the knowledge learned from earlier stages was successfully transferred to a higher level with less proneness to overfitting.   more than the other two kinds of kernels. Besides, both the iteration number and the time consumption of the second and third training stages were much less, compared with their previous steps. For the final LSTM information fusion stage, only less than 10 s was needed. The results in Tables 3 and 4, and Figures 13 and 14 show the exhaustive role played by transfer learning. It is significant that during the three training stages, the diagnostic accuracy increased, while both the training iterations and time decreased dramatically, even with a larger network. This dynamic indicates that the knowledge learned from earlier stages was successfully transferred to a higher level with less proneness to overfitting.

Visualization of the Convolutional Filters
Neural networks are usually considered as black boxes. However, with the help of visualization methods, it is possible to obtain good qualitative interpretations of the black boxes of deep networks. Besides, visualization is also an effective way to diagnose the misclassifications and evaluate the degree of overfitting. In this study, the Activation Maximization [49] method was utilized to visualize the learned filters. The idea of Activation Maximization is quite simple: generating an image to maximize the activation of a hidden unit as: where x * is the visualization result, and θ donates the fixed neural network parameters. Thus, h ij (θ, x) is the activation values of unit j in layer i with fixed parameters θ and bounded norm ρ. For simplicity and the convenience of comparison, only the first layer of filters were visualized, as shown in Figure 15. It is clear that the learned convolutional filters were quite smooth in space, thus indicating that the training was sufficient. The temporal filters were sensitive to pulses along the time axis, while the frequency filters showed clear patterns of different frequency combinations. Besides, the diversity of patterns was richer in the visualization results of the texture filters, where both the temporal and frequency patterns revealed their importance. It is also encouraging that both the temporal filters and frequency filters gained significant resolution power over multiple scales, as shown by the different densities of the lines.

Visualization of the Convolutional Filters
Neural networks are usually considered as black boxes. However, with the help of visualization methods, it is possible to obtain good qualitative interpretations of the black boxes of deep networks. Besides, visualization is also an effective way to diagnose the misclassifications and evaluate the degree of overfitting. In this study, the Activation Maximization [49] method was utilized to visualize the learned filters. The idea of Activation Maximization is quite simple: generating an image to maximize the activation of a hidden unit as: where * is the visualization result, and donates the fixed neural network parameters. Thus, ℎ , is the activation values of unit in layer with fixed parameters and bounded norm . For simplicity and the convenience of comparison, only the first layer of filters were visualized, as shown in Figure 15.
It is clear that the learned convolutional filters were quite smooth in space, thus indicating that the training was sufficient. The temporal filters were sensitive to pulses along the time axis, while the frequency filters showed clear patterns of different frequency combinations. Besides, the diversity of patterns was richer in the visualization results of the texture filters, where both the temporal and frequency patterns revealed their importance. It is also encouraging that both the temporal filters and frequency filters gained significant resolution power over multiple scales, as shown by the different densities of the lines. Besides, a direct visualization of the outputs of the filters was helpful in illustrating the automatic feature extraction and the information compression of the filters. An illustrative example is shown in Figure 16. The outputs of the frequency filter and the temporal filter were only onedimensional, while the texture filter transferred the two-dimensional input to a smaller feature map. That is also why the texture filter calls for the longest training time. Besides, a direct visualization of the outputs of the filters was helpful in illustrating the automatic feature extraction and the information compression of the filters. An illustrative example is shown in Figure 16. The outputs of the frequency filter and the temporal filter were only one-dimensional, while the texture filter transferred the two-dimensional input to a smaller feature map. That is also why the texture filter calls for the longest training time.

Visualization of the Learned Embedding Features
Furthermore, by using a t-Distributed Stochastic Neighbor Embedding (t-SNE) projection algorithm, the 50-dimensional features learned by the subnetworks were reduced to three dimensions, as shown in Figures 17 and 18 for the cases of 13 and five classes, respectively. It was observed that the features learned automatically by the CNN of different defects fell into distinct clusters in the three-dimensional space, where each cluster showed clear manifold characteristics of a continuous distribution in the space.

Visualization of the Learned Embedding Features
Furthermore, by using a t-Distributed Stochastic Neighbor Embedding (t-SNE) projection algorithm, the 50-dimensional features learned by the subnetworks were reduced to three dimensions, as shown in Figures 17 and 18 for the cases of 13 and five classes, respectively. It was observed that the features learned automatically by the CNN of different defects fell into distinct clusters in the three-dimensional space, where each cluster showed clear manifold characteristics of a continuous distribution in the space.

Visualization of the Learned Embedding Features
Furthermore, by using a t-Distributed Stochastic Neighbor Embedding (t-SNE) projection algorithm, the 50-dimensional features learned by the subnetworks were reduced to three dimensions, as shown in Figures 17 and 18 for the cases of 13 and five classes, respectively. It was observed that the features learned automatically by the CNN of different defects fell into distinct clusters in the three-dimensional space, where each cluster showed clear manifold characteristics of a continuous distribution in the space.

Comparison with the Baselines
To further verify the effectiveness, some baselines were implemented. Instead of focusing on the specified methods, we first extracted the common framework utilized in the practice of UHF signal recognition, which were two-dimensionalization, matrix compression, decomposition, features extraction, feature selection (FS), and the final classifier. Second, by using typical models in certain steps, representative methods could be obtained. A description of some implemented techniques is as follows.
1. Two-dimensionalization: Wavelet spectrum and Hibert-Huang spectrum; 2. Matrix compression: NMF and 2DPCA; 3. Signal Decomposition: Wavelet transform and Hibert-Huang transform (HHT); 4. Features extraction: The extracted time and frequency (T&F) features include the max value, root mean square deviation, standard deviation, skewness, kurtosis, and the peak-to-peak value in the time domain. The frequency features include the mean frequency, frequency center, root mean square frequency, and the standard deviation frequency. Besides, the entropy is also calculated. 5. Feature selection: ExtraTrees Classifier and LinearSVC feature selection; 6. Classifier: Finally, both the SVM and DNN of dense layers are chosen as the final classifiers, due to their high representativeness in engineering practice.
Through the combination of the different techniques, 18 diagnosis models are summarized in Table 5 as the baselines, including both simple methods and complicated flows. However, the recognition accuracies are not certainly proportional to the complexity of the methods.
All of the hyperparameters are selected based on cross-validation to the best of our knowledge. The final recognition accuracies of (1) combined the diagnosis of positions and defect types together, 13 classes, and (2) diagnosis of the defect types only; five classes are shown in Tables 6 and 7, respectively.

Comparison with the Baselines
To further verify the effectiveness, some baselines were implemented. Instead of focusing on the specified methods, we first extracted the common framework utilized in the practice of UHF signal recognition, which were two-dimensionalization, matrix compression, decomposition, features extraction, feature selection (FS), and the final classifier. Second, by using typical models in certain steps, representative methods could be obtained. A description of some implemented techniques is as follows.
Features extraction: The extracted time and frequency (T&F) features include the max value, root mean square deviation, standard deviation, skewness, kurtosis, and the peak-to-peak value in the time domain. The frequency features include the mean frequency, frequency center, root mean square frequency, and the standard deviation frequency. Besides, the entropy is also calculated.
Classifier: Finally, both the SVM and DNN of dense layers are chosen as the final classifiers, due to their high representativeness in engineering practice.
Through the combination of the different techniques, 18 diagnosis models are summarized in Table 5 as the baselines, including both simple methods and complicated flows. However, the recognition accuracies are not certainly proportional to the complexity of the methods.
All of the hyperparameters are selected based on cross-validation to the best of our knowledge. The final recognition accuracies of (1) combined the diagnosis of positions and defect types together, 13 classes, and (2) diagnosis of the defect types only; five classes are shown in Tables 6 and 7, respectively.  The accuracy comparisons may provide some insights into the fundamental differences between the traditional feature-based methods and the deep learning approaches, as well as some essential flaws of the feature engineering methods.

1.
The misclassification rate of the proposed multi-resolution CNN clearly indicates its superior capability compared with the baseline methods. STFT's performance is slightly better than the Wavelet spectrum in certain resolutions, and much better than the HHT map.

2.
The best performance in the baseline methods is gained by the combination of STFT, NMF, and SVM, which is 92.09% for the 13 classes, and 93.88% for the five classes. Firstly, it is observed that NMF slightly outperforms 2DPCA, which may be explained as follows. In the experiments, the NMF gains the best performance with four dimensions, compared with the 12 dimensions of the 2DPCA, indicating that the redundancy is huge in the STFT maps. Thus, searching for a global projection in the redundancy data may be not a good idea. Second, the features extraction usually does not increase the model performance, which reflects the eternal contradiction among the learning ability of the model, the input dimensions, the distribution of the data, and the discernibility of the features. 3.
The performances of SVM and DNN are similar when utilizing the same input features. Therefore, for traditional feature-based recognition methods, the recognition accuracy relies much on the discriminative features.

4.
Similar to the conclusion drawn from Tables 3 and 4, the performance of multiple sensors is not certainly better than the single sensor, especially when the nearest single sensor can gain good diagnostic accuracy.

Discussion
Although the deep networks have gained tremendous performance in many areas, low interpretability is always an obstacle to applying them in some highly regulated environments, due to the uncontrollability that comes with the unclear mechanism. Sometimes, the training of the neural networks becomes a trial-and-error problem. In recent years, a great number of studies have focused on the interpretability of deep learning, and great progress has been made in both the theories and experiments. In the discussion, instead of pursuing a general solution, we only concentrate attention on the special CNN structures that are implemented here, and try to explain its effectiveness from the perspective of manifold learning.
A comprehensive comparison of the similarities between the filters learned by the CNN and the matrix projection-based methods, such as 2DPCA, is presented to provide some insights into the advantages of deep learning methods, compared with the feature extraction solutions. It is quite interesting to notice the similarity between the projection operation in 2DPCA and the convolutions in CNN. In 2DPCA, an image A of m × n matrix is projected into a lower dimension by: where X k is a vector of n dimensions, and k is the index. A visualization example of the 2DPCA's compressed result of the STFT spectrum is shown in Figure 19. It is easy to notice its similarity with the outputs in Figure 16, while the information concentration in Figure 16 is better. on the special CNN structures that are implemented here, and try to explain its effectiveness from the perspective of manifold learning. A comprehensive comparison of the similarities between the filters learned by the CNN and the matrix projection-based methods, such as 2DPCA, is presented to provide some insights into the advantages of deep learning methods, compared with the feature extraction solutions. It is quite interesting to notice the similarity between the projection operation in 2DPCA and the convolutions in CNN. In 2DPCA, an image of × matrix is projected into a lower dimension by: is a vector of dimensions, and is the index. A visualization example of the 2DPCA's compressed result of the STFT spectrum is shown in Figure 19. It is easy to notice its similarity with the outputs in Figure 16, while the information concentration in Figure 16 is better. where is the compression dimension. Therefore, in 2DPCA, it is assumed that any input can be approximately reconstructed as a weighted sum of smaller collections. In summary, the key points in 2DPCA include: (1) The projection operation for compressing, (2) The assumption that complex images are composed of eigenimages.
It is quite interesting that similar characteristics are shared by the proposed temporal and frequency filters.
(1) The dot production between the frequency filters and the spectrum, and the similar operation between the temporal filters and the spectrum can be seen as a projection operation, (2) In the CNN, it is also assumed that images are formed from elements of the lower level, like pixels to more composite representations [28].
Therefore, a common framework that distinguishes different categories by matrix projection may be extracted to unify the deep structure and 2DPCA. The CNN and 2DPCA both try to classify different categories by using matrix projections, and the main differences lies in their supervision methods, where the CNN is supervised by labels, while 2DPCA is supervised by the inner product in a high-dimensional space. Furthermore, a special case of NMF that aims at reducing the reconstruction errors by projecting could also be defined to fit this framework. The main specialty lies in that the loss function is defined at the single sample, and it uses the mean square error between the original images and the reconstructed ones. Moreover, it is also possible to reconstruct the original image from the compressed information by: (19) where d is the compression dimension. Therefore, in 2DPCA, it is assumed that any input can be approximately reconstructed as a weighted sum of smaller collections. In summary, the key points in 2DPCA include: (1) The projection operation for compressing, (2) The assumption that complex images are composed of eigenimages.
It is quite interesting that similar characteristics are shared by the proposed temporal and frequency filters.
(1) The dot production between the frequency filters and the spectrum, and the similar operation between the temporal filters and the spectrum can be seen as a projection operation, (2) In the CNN, it is also assumed that images are formed from elements of the lower level, like pixels to more composite representations [28].
Therefore, a common framework that distinguishes different categories by matrix projection may be extracted to unify the deep structure and 2DPCA. The CNN and 2DPCA both try to classify different categories by using matrix projections, and the main differences lies in their supervision methods, where the CNN is supervised by labels, while 2DPCA is supervised by the inner product in a high-dimensional space. Furthermore, a special case of NMF that aims at reducing the reconstruction errors by projecting could also be defined to fit this framework. The main specialty lies in that the loss function is defined at the single sample, and it uses the mean square error between the original images and the reconstructed ones.
Moreover, based on this framework, the 2DPCA can also be reformulated and approximated as a special case of the CNN with the following setups: (1) Use only a single layer of convolutions, and the width of the convolutional kernels should be restricted at one. (2) The loss function is defined as the sum of the pairwise Euclidean distance losses to force the model to maintain its relative relations in higher space. Some new algorithms may be developed based on this idea. Therefore, the superiority of the proposed may be explained as follows. Based on the same framework, only the CNN is goal-oriented at the final classifying task, while the 2DPCA and NMF aim at the other specified goals. It is usually hard to determine the effectiveness of these specified goals until obtaining the final results. Is it better to maintain the global information as 2DPCA, or to keep the local completeness as in NMF? In addition, the multiple projections also equip the CNN with a better learning ability to handle more complex problems.

Conclusions
Inspired by the phenomenon that the PD spectrograms of different resolutions can capture the information of different patterns, both multi-resolution and multi-sensor fusion algorithms are proposed. They are realized by a multi-column deep CNN characterized by specified filters and a sequential LSTM network. A multi-stage transfer learning framework is also presented to train the model with limited data. Several conclusions may be drawn based on the detailed analysis.

1.
The proposed multi-resolution model gains accuracies of 97.51% and 98.20% on the tasks of diagnosing (1) the positions and defect types together, and (2) the defect types only; thus indicating its clear superiority compared with the baselines. 2.
The loss curves, time consumptions, and diagnostic accuracies show the effectiveness of transfer learning, which successfully transfers knowledge from a lower level to a higher level with less proneness to overfitting.

3.
The comparisons with the baseline methods reveal several fundamental flaws of the feature based methods, such as the difficulties in choosing the most suitable approach, and the curse of dimensionality when encountering too many features from multiple sensors.

4.
A matrix projection framework is proposed to enhance the interpretability of the proposed deep network structure.
Author Contributions: G.L. and X.W. conceived and designed the experiments; X.L. provided the experimental data; G.L. wrote the paper; A.Y. revised the contents and reviewed the manuscript; M.R. provided theoretical guidance and supported the study.
Funding: This work was supported by the National Natural Science Foundation of China (No. 51521065).