A Deep Parallel Diagnostic Method for Transformer Dissolved Gas Analysis

Abstract: With the development of Industry 4.0, large-capacity power transformers, as a pivotal part of the power system, require fault diagnostic methods with higher intelligence, accuracy, and anti-interference ability. Considering the powerful capability of deep learning methods to extract non-linear features, and their differing sensitivities to those features, this paper proposes a deep parallel diagnostic method for transformer dissolved gas analysis (DGA). In view of the insufficient and imbalanced dataset of transformers, adaptive synthetic oversampling (ADASYN) was implemented to augment the fault dataset. The newly constructed dataset was normalized and input into the LSTM-based diagnostic framework; it was also converted into images as the input of the CNN-based diagnostic framework, where the problem of still-insufficient data was compensated by the introduction of transfer learning. Finally, the diagnostic models were trained and tested respectively, and the Dempster–Shafer (DS) evidence theory was introduced to fuse the diagnostic confidence matrices of the two models to achieve deep parallel diagnosis. The results of the proposed deep parallel diagnostic method show that, without complex feature extraction, the diagnostic accuracy rate could reach 96.9%. Even when the dataset was superimposed with 3% random noise, the rate only decreased by 0.62%.


Introduction
As the pivotal equipment of the power system, a power transformer's health directly affects the safety and stability of the entire power grid. Therefore, it is of great significance to study the fault diagnostic method of a transformer [1]. With the continuous development of computer storage and sensor technology, the online monitoring DGA data of power transformers will show an explosive growth trend [2,3], which has put forward higher requirements on the learning ability, feature extraction ability, and adaptability of transformer diagnostic methods.
Dissolved gas analysis (DGA) is an online monitoring technology that analyzes the composition and content of the gases dissolved in transformer oil [4]. By studying the correlation between the dissolved gases and the fault states of the transformer, the health status of the transformer can be effectively diagnosed and latent faults can be eliminated in time. Many studies have been conducted on traditional diagnostic methods based on DGA data. Traditional DGA interpretation methods such as the IEC Ratio, Rogers Ratio, and Doernenburg Ratio are simple, easy to implement, and widely used in engineering practice. Yet there are still problems, such as absolute diagnostic boundaries and missing codes [5,6]. In addition, although traditional intelligent methods have achieved certain effects, they still have limitations in popularization and application. A common solution to the problem of imbalanced data is oversampling, which has become the simplest and most reliable way to address the imbalance and deficiency of a specific dataset [35]. The adaptive synthetic oversampling (ADASYN) algorithm is an improvement on the synthetic minority over-sampling technique (SMOTE). According to the learning difficulty of the insufficient samples, the algorithm assigns an adaptive weight to each insufficient sample instead of a homogeneous weight; that is, the harder a specific minority sample is to learn, the more synthetic data is generated around it. For the separate problem of insufficient data, transfer learning can be introduced into the CNN diagnostic framework. Transfer learning maps the knowledge learned in one domain to another [36-38]; with it, a small amount of data can still achieve a high diagnostic accuracy [39,40].
Generally speaking, the following problems still exist in current transformer DGA fault diagnostic methods:
• The diagnostic accuracy, adaptability, and anti-interference ability of traditional methods still need to be improved;
• The DGA fault dataset is in fact imbalanced and insufficient, which may adversely affect fault diagnosis modeling;
• The sensitivity differences of deep learning models to fault features, as well as the fusion of such models, have not been considered in recent research.
In order to further improve the current transformer DGA fault diagnostic effect, this paper proposes a deep parallel diagnostic method for transformers. The main work and innovations of this article are as follows:
• A deep parallel diagnostic method for power equipment is proposed. It makes full use of the capabilities of different deep learning methods to extract complex nonlinear features. Without a complex feature extraction and selection process, a higher diagnostic accuracy can be achieved, which means the model has high generalizability;
• This article uses a data visualization method to convert DGA numerical data into images for the CNN-based diagnostic framework, which can extract different features compared with the LSTM-based framework and significantly enhances the anti-interference ability of the proposed method;
• In view of the imbalance and inadequacy of the DGA fault dataset, this paper proposes to use ADASYN and transfer learning to solve the problem of imbalanced and insufficient data in transformer fault diagnosis.

Materials and Methods
CNN and LSTM are two important frameworks in the deep learning field. CNN relies on convolution operations to process two-dimensional images directly, which allows it to learn the corresponding features from a large number of samples and avoid complex feature extraction processes. LSTM inherits the characteristics of most RNN models while solving the problem of gradient dissipation. At present, feed-forward networks represented by CNNs still hold performance advantages in the diagnosis field, while LSTMs have fewer applications there. In the long run, however, the potential of LSTMs for more complex tasks that CNNs cannot match will gradually be accentuated, because the LSTM more realistically characterizes or simulates the cognitive processes of human behavior, logical development, and neural organization. Therefore, research on the combination of CNN and LSTM is meaningful and worth noting.
The deep parallel diagnostic framework proposed in this paper is shown in Figure 1. The materials used in this paper and the implementation of ADASYN data augmentation technique are discussed in Section 2.1. The CNN-based diagnostic model is introduced in Section 2.2.1. The LSTM-based diagnostic model is introduced in Section 2.2.2. Finally, the deep parallel fusion (DPF) method is introduced in Section 2.2.3. This paper fully considers the sensitivity differences of deep learning methods to different features, constructs different deep learning methods to model on the same diagnostic problem, and introduces the Dempster-Shafer evidence theory (DS evidence theory) to perform DPF and obtain a higher diagnostic accuracy. This paper applies the whole framework to the field of transformer fault diagnosis based on DGA.

Distribution and Preprocessing of Dataset
In this paper, 528 cases of transformer DGA fault samples were collected from State Grid companies as well as relevant papers. Each sample contains the contents of five major gases dissolved in transformer oil: H2, CH4, C2H6, C2H4, and C2H2. The data distribution is shown in Table 1. As can be seen from Table 1, the number of samples of serious faults such as high temperature overheating, low energy discharge, and high energy discharge is much larger than that of minor faults such as medium temperature overheating, low temperature overheating, and partial discharge. This seems to contradict the common assumption that serious faults should theoretically occur less frequently than minor faults. The reason is that the current sensitivity of sensors and the accuracy of diagnostic methods still need to be strengthened: it remains difficult to effectively diagnose latent faults, which often go undetected until they develop into serious faults. From this perspective, it is necessary to study more accurate transformer diagnostic methods.

Table 1. Distribution of the dataset (partial).
Low temperature overheating: 14
Partial discharge: 20
Low energy discharge: 239
High energy discharge: 113
Total: 528

Each DGA data point was normalized according to Formula (1) into the fault characteristic gas indexes shown in Formula (2).

where X_i,j represents the fault samples; E(X) is the mean; D(X) is the variance; p_i,j represents the normalized value of the j-th fault sample of the i-th fault type; and a^n_i,j represents the n-th normalized fault characteristic gas index of p_i,j. In this paper, n = 5.
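Assuming Formula (1) is the standard z-score normalization implied by the definitions of E(X) and D(X) above, the preprocessing step can be sketched as follows (the sample values here are hypothetical):

```python
import numpy as np

def normalize_dga(samples):
    """Z-score normalize DGA samples per characteristic gas.

    Sketch of Formula (1), assumed to be p = (X - E(X)) / sqrt(D(X)),
    applied column-wise to an (n_samples, 5) array of gas contents.
    """
    X = np.asarray(samples, dtype=float)
    mean = X.mean(axis=0)            # E(X) per characteristic gas
    std = np.sqrt(X.var(axis=0))     # sqrt(D(X)) per characteristic gas
    return (X - mean) / std

# Three hypothetical DGA samples (ppm); columns: H2, CH4, C2H6, C2H4, C2H2
p = normalize_dga([[30, 12, 5, 20, 1],
                   [120, 40, 9, 60, 3],
                   [60, 25, 7, 35, 2]])
print(p.shape)  # (3, 5): each column now has zero mean and unit variance
```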

ADASYN's Principle and Implementation
The occurrence frequency of different fault types in the given dataset is significantly different. If the imbalanced dataset is used directly to train a deep learning framework, the results will not be ideal. In view of the adverse effects of the imbalanced dataset, the task of balancing the original dataset should be given priority.
The basic idea of the SMOTE algorithm is to artificially synthesize new minority-class samples from the given dataset in order to balance and expand it [41]. One obvious limitation is that the minority-class objects are treated homogeneously, whereas in practice different fault samples do not contribute equally to the further learning of the diagnostic models. Moreover, if the fault samples in the original dataset are mixed with noise, assigning the same oversampling weight to all minority-class fault samples may be counterproductive. Many studies have also shown that, during training, fault samples closer to the decision boundary contribute more to establishing the classification boundary than other fault samples. In order to overcome this limitation for the imbalanced transformer DGA dataset, we propose to use an adaptive form of SMOTE, the ADASYN algorithm, to enhance the dataset.
The steps of the ADASYN algorithm are as follows. First, define multiple data spaces Γ_i (i = 1, 2, ..., m − 1) and Γ_m, which represent the spaces of minority-class fault samples and the space of majority-class fault samples, respectively. In this paper, m = 6: Γ1 is the space of samples with a high temperature overheating fault; Γ2, medium temperature overheating; Γ3, low temperature overheating; Γ4, partial discharge; Γ5, high energy discharge; and Γ6, which has the largest number of samples, low energy discharge. Each space has five dimensions because there are five fault characteristic indexes; for convenience of illustration, however, the spaces are drawn as two- or three-dimensional diagrams in this paper.
Next, suppose the number of neighbors of the j-th fault sample p_i,j in the i-th minority-class data space Γ_i is z; then calculate its local reachability density with respect to the majority fault-sample cluster and to the minority fault-sample cluster. The results can be noted as D^maj_i,j and D^min_i,j (j = 1, 2, ..., n_i), respectively. Local reachability density is defined by Equation (3), where Z is the set of the z nearest neighbors of the fault sample p_i,j and q ranges over the neighbors of p_i,j in Z.
The meaning of the local reachability density is illustrated in Figure 2. The pink balls represent fault samples of the i-th minority class and the blue prisms represent fault samples of the majority class. When z is equal to 5, D^min_i,j of the selected minority fault sample p_i,j, positioned at the origin of coordinates, is large, while its D^maj_i,j is small, indicating that there are more fault samples of the same minority class in the neighborhood. Thus, learning this selected fault sample is not too difficult. The learning difficulty can be quantified by the ratio of D^maj_i,j to D^min_i,j, as shown in Equation (4). In this way, the learning difficulty of each minority fault sample can be quantified and each fault sample can be assigned a tailor-made oversampling weight according to its learning difficulty.
Then, the oversampling weights of the minority-class fault samples are normalized to obtain the difficulty coefficient ∂_i,j, as shown in Equation (5).
The number of synthetic fault samples generated from each minority class can be calculated according to the difficulty coefficient as shown in Equation (6).
where N_o is the expected number of newly synthesized samples. Finally, new samples were synthesized using the procedures introduced in [41].
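The steps above can be sketched in code. This is a simplified illustration, not the authors' implementation: it approximates the learning-difficulty measure of Equations (3) and (4) by the fraction of majority-class samples among the z nearest neighbours, normalizes the weights as in Equation (5), allocates the synthetic counts as in Equation (6), and synthesizes samples by SMOTE-style interpolation as in [41].

```python
import numpy as np

def adasyn_oversample(X_min, X_maj, n_new, z=5, rng=None):
    """Simplified ADASYN sketch for one minority class.

    Harder-to-learn minority samples (more majority neighbours) receive
    larger oversampling weights; synthetic samples are interpolated
    between a minority sample and one of its minority-class neighbours.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.array([False] * len(X_min) + [True] * len(X_maj))

    # Learning difficulty: share of majority samples among z neighbours.
    diff = np.empty(len(X_min))
    for j in range(len(X_min)):
        d = np.linalg.norm(X_all - X_min[j], axis=1)
        nbrs = np.argsort(d)[1:z + 1]        # skip the sample itself
        diff[j] = is_maj[nbrs].mean()

    # Normalized oversampling weights (difficulty coefficients, Eq. (5)).
    if diff.sum() > 0:
        w = diff / diff.sum()
    else:
        w = np.full(len(X_min), 1.0 / len(X_min))
    counts = np.round(w * n_new).astype(int)  # Eq. (6)

    # SMOTE-style synthesis between minority-class neighbours.
    synth = []
    for j, c in enumerate(counts):
        d = np.linalg.norm(X_min - X_min[j], axis=1)
        min_nbrs = np.argsort(d)[1:z + 1]
        for _ in range(c):
            q = X_min[rng.choice(min_nbrs)]
            synth.append(X_min[j] + rng.random() * (q - X_min[j]))
    return np.array(synth) if synth else np.empty((0, X_min.shape[1]))

# Hypothetical 5-dimensional minority and majority fault clusters.
rng0 = np.random.default_rng(0)
X_min = rng0.normal(0.0, 1.0, size=(12, 5))
X_maj = rng0.normal(2.0, 1.0, size=(40, 5))
new_samples = adasyn_oversample(X_min, X_maj, n_new=24, rng=1)
print(new_samples.shape)
```

Because the per-sample counts are rounded, the number of synthetic samples is approximately, not exactly, N_o.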

CNN-Based Transformer DGA Diagnostic Method
As shown in Table 2, transforming measured data such as waveforms into pictures through data visualization techniques and inputting them to a CNN for diagnosis is a commonly used approach in current research on diagnostic methods [42]. DGA fault data contain rich information on transformer fault states. The reason for the relatively low accuracy of traditional methods is that it is difficult to find accurate mathematical models to characterize the complex nonlinear relationship between DGA data and transformer fault states. In order to fully mine the transformer state information contained in the DGA data, the normalized numerical DGA data are converted into more intuitive images, which are convenient for a CNN to process.

Table 2. Data visualization methods.

Monitoring image: Trim away the extra parts and highlight the detected or pivotal parts.
Waveform graph: Perform decomposition, spectrum analysis, and other processing to convert the curve into a characteristic spectrum.
Parameters: Construct a suitable data graph according to the parameter value size, value range, etc.
Text: Use text visualization technologies such as Generative Adversarial Networks (GAN).
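The paper does not reproduce its exact image-construction scheme, so the following is a hypothetical sketch of the idea for the "Parameters" case: each of the five normalized gas indexes becomes a vertical bar in a grayscale matrix that a CNN can consume.

```python
import numpy as np

def dga_to_image(indexes, size=32):
    """Render five normalized gas indexes as a grayscale bar image.

    A hypothetical visualization (not the authors' exact method): each gas
    becomes a vertical bar whose height is proportional to its min-max
    rescaled index, yielding a size x size matrix.
    """
    v = np.asarray(indexes, dtype=float)
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)  # rescale to [0, 1]
    img = np.zeros((size, size))
    bar_w = size // len(v)
    for i, h in enumerate(v):
        hpx = int(round(h * size))                   # bar height in pixels
        if hpx > 0:
            img[size - hpx:, i * bar_w:(i + 1) * bar_w] = 1.0
    return img

# Five hypothetical normalized gas indexes for one fault sample.
img = dga_to_image([0.2, -1.3, 0.8, 1.5, -0.2])
print(img.shape)  # (32, 32)
```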
The continuous improvement of equipment manufacturing technologies and fault monitoring methods may leave DGA fault samples insufficient. Transfer learning can map the knowledge learned in one domain to another. When using transfer learning, the CNN parameters pre-trained on a public dataset are taken as initial values and part of the network is frozen during training. Transfer learning enables the network to capture the fine characteristics of the equipment to be diagnosed, making up for the lack of fault samples of power equipment. With transfer learning, a small amount of data can still achieve a high training accuracy [43].
At present, there are dozens of CNN image recognition models suitable for large image databases. Among them, MobileNet-V2 [44] (its basic module is shown in Figure 3) is a lightweight neural network for limited-computing-resource environments proposed by the Google team in 2018. MobileNet-V2 builds on the basic concepts of MobileNet-V1 [45], which innovatively used depth-wise separable convolution instead of traditional convolution operations. Although this reduces the number of parameters and operations, it also causes a loss of features and a decrease in accuracy. MobileNet-V2 made relevant improvements against the shortcomings of MobileNet-V1 and proposed two innovative design ideas: inverted residuals and linear bottlenecks.

• Inverted residuals: the depth-wise convolution of MobileNet-V2 is preceded by a 1 × 1 "expansion" layer. Its purpose is to expand the number of channels before the data enter the depth-wise convolution, enriching the features and improving accuracy.
• Linear bottlenecks: MobileNet-V2 replaces the ReLU activation function with a linear activation function after layers with few channels. Owing to the "expansion" layers, the large number of features output by the convolution layers must be "compressed" to reduce the amount of computation. As the number of channels decreases, keeping ReLU as the activation function would destroy features: ReLU outputs zero for every negative input, so when the already "compressed" features pass through it, many of them are lost.
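The information loss argued above can be illustrated numerically with a toy example (a hypothetical 8-channel bottleneck output, not taken from the paper):

```python
import numpy as np

# Hypothetical low-channel bottleneck output after "compression".
rng = np.random.default_rng(0)
compressed = rng.standard_normal(8)

relu_out = np.maximum(compressed, 0.0)   # ReLU zeroes every negative feature
linear_out = compressed.copy()           # a linear activation keeps all of them

lost = int(np.sum(compressed < 0))
print(f"{lost} of 8 features zeroed by ReLU, 0 by the linear activation")
```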
Due to the large cost of accidents in large power transformers, the efficiency requirements on the fault diagnosis algorithm are stringent. MobileNet-V2 can obtain a high accuracy without excessive computing time and resources. Therefore, the MobileNet-V2 model was adopted in this diagnostic scenario as the pre-trained model for transfer learning.

LSTM-Based Transformer DGA Diagnostic Method
Due to length limitations, the principle of LSTM, which can be found in [46], is not described in detail here; we focus instead on the LSTM-based deep learning framework built in this paper. It is a five-layer network, as shown in Figure 4, comprising the input layer, the LSTM layer, the fully connected layer, the softmax layer, and the classification output layer. The state activation function of the LSTM network node is tanh and the gate activation function is sigmoid.
Linear bottlenecks: MobileNet-V2 proposed to replace the ReLU activation function with a linear activation function after layers with fewer channels. Due to the introduction of the "expansion layer", a large number of features output by the convolution layer need to be "compressed" to reduce the amount of calculation. As the number of channels decreases, if the activation function is still ReLU, the features will be destroyed. This is because ReLU's outputs for negative inputs are all zeros; the original features are already "compressed" and if they pass through ReLU, a lot of them will be lost.
Due to the large cost of accidents in large power transformers, the efficiency requirements for the fault diagnosis algorithm are numerous. MobileNet-V2 can obtain a higher accuracy without too much calculating time and computing resource. Therefore, the MobileNet-V2 model was introduced in this diagnostic scenario as the pre-trained model for transfer learning.

LSTM-Based Transformer DGA Diagnostic Method
Due to the length limitation, the principle of LSTM, which can be found in the [46], will not be described in detail. Let us focus on the deep learning framework built based on LSTM in this paper. It is a five-layer network, as shown in Figure 4, which includes the input layer, the LSTM layer, the fully connected layer, the softmax layer, and the classification output layer. The state excitation function of the LSTM network node is 'tanh' and the gate excitation function is 'sigmoid'.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 8 of 18 shortcomings of MobileNet-V1 and proposed two innovative design ideas: inverted residuals and linear bottlenecks. Inverted residuals: the depth-wise convolution of MobileNet-V2 was preceded by a 1 × 1 "expansion" layer. Its purpose is to expand the number of channels in the data before the data enters deep convolutions, enrich the number of features, and improve accuracy.
Linear bottlenecks: MobileNet-V2 proposed to replace the ReLU activation function with a linear activation function after layers with fewer channels. Due to the introduction of the "expansion layer", a large number of features output by the convolution layer need to be "compressed" to reduce the amount of calculation. As the number of channels decreases, if the activation function is still ReLU, the features will be destroyed. This is because ReLU's outputs for negative inputs are all zeros; the original features are already "compressed" and if they pass through ReLU, a lot of them will be lost.
Because accidents in large power transformers are very costly, the efficiency requirements on the fault diagnosis algorithm are demanding. MobileNet-V2 achieves high accuracy without excessive computation time or computing resources. Therefore, the MobileNet-V2 model was adopted in this diagnostic scenario as the pre-trained model for transfer learning.
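As an illustration of these two ideas, the following is a minimal NumPy sketch of one inverted residual block. The expansion ratio, feature-map sizes, and random weights are illustrative, not MobileNet-V2's actual parameters; note that the final 1 × 1 projection is deliberately left linear, as the text explains:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu6(x):
    return np.clip(x, 0.0, 6.0)

def inverted_residual(x, w_expand, w_depth, w_project):
    """One inverted residual block on an (H, W, C) feature map:
    1x1 expansion (ReLU6) -> 3x3 depthwise (ReLU6) -> 1x1 LINEAR projection."""
    h = relu6(x @ w_expand)                     # 1x1 conv == per-pixel matmul
    # depthwise 3x3 with zero padding: each expanded channel filtered separately
    pad = np.pad(h, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(h)
    H, W, C = h.shape
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + 3, j:j + 3, :]    # (3, 3, C) neighborhood
            out[i, j, :] = np.einsum('ijc,ijc->c', patch, w_depth)
    h = relu6(out)
    y = h @ w_project                           # linear bottleneck: NO ReLU here
    return y + x if y.shape == x.shape else y   # residual when shapes match

x = rng.normal(size=(8, 8, 16))                 # toy input feature map
w_e = rng.normal(size=(16, 96)) * 0.1           # expansion ratio t = 6
w_d = rng.normal(size=(3, 3, 96)) * 0.1
w_p = rng.normal(size=(96, 16)) * 0.1
y = inverted_residual(x, w_e, w_d, w_p)
print(y.shape)
```

Because the projection is linear, the output `y` keeps negative feature values that a trailing ReLU would have zeroed out.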

LSTM-Based Transformer DGA Diagnostic Method
Due to length limitations, the principle of LSTM, which can be found in [46], will not be described in detail here. Instead, we focus on the LSTM-based deep learning framework built in this paper. It is a five-layer network, as shown in Figure 4, comprising the input layer, the LSTM layer, the fully connected layer, the softmax layer, and the classification output layer. The state activation function of the LSTM network node is 'tanh' and the gate activation function is 'sigmoid'.
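To make the gate equations behind Figure 4 concrete, here is a minimal NumPy sketch of a single LSTM step with sigmoid gates and a tanh state activation, followed by a fully connected layer and softmax. All sizes and weights are illustrative stand-ins, not the trained network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: sigmoid gate activations, tanh state activation.
    W: (4*hid, in), U: (4*hid, hid), b: (4*hid,), stacked as [i, f, o, g]."""
    z = W @ x + U @ h + b
    hid = h.size
    i = sigmoid(z[0 * hid:1 * hid])   # input gate
    f = sigmoid(z[1 * hid:2 * hid])   # forget gate
    o = sigmoid(z[2 * hid:3 * hid])   # output gate
    g = np.tanh(z[3 * hid:4 * hid])   # candidate cell state (tanh)
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# toy forward pass: 5 gas features, hidden size 32, FC + softmax over 6 fault labels
rng = np.random.default_rng(1)
n_in, n_hid, n_out = 5, 32, 6
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
x = rng.normal(size=n_in)                       # one normalized DGA sample
h, c = lstm_step(x, h, c, W, U, b)
W_fc = rng.normal(scale=0.1, size=(n_out, n_hid))
logits = W_fc @ h
probs = np.exp(logits) / np.exp(logits).sum()   # softmax confidence vector
print(probs.shape)
```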

A Deep Parallel Fusion Diagnostic Method
There is usually a softmax layer in CNN- and LSTM-based deep learning frameworks when they are used for diagnosis. After training is complete and the testing dataset is input into the diagnostic network, the softmax layer outputs a confidence matrix over the fault categories for the testing dataset. The softmax function is shown in Equation (7):

ξ(H = η | x; θ) = exp(θ_η^T x) / Σ_{i=1}^{N} exp(θ_i^T x), η = 1, 2, ..., N, (7)

where x represents the input of the softmax layer, which is exactly the output of the previous layer; θ is the weight parameter matrix, of which θ_i (i = 1, 2, ..., N) is the i-th element; H represents the fault label; N is the number of fault labels; and ξ(H = η | x; θ) is the probability that H = η.
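A minimal sketch of Equation (7) applied to a batch, producing the confidence matrix described above (layer sizes and weights are illustrative):

```python
import numpy as np

def softmax_confidence(Z, theta):
    """Confidence matrix per Equation (7): entry (eta, j) is the probability
    xi(H = eta | x_j; theta) for the j-th test sample (columns of Z)."""
    logits = theta @ Z                  # theta: (N, d), Z: (d, n_samples)
    logits -= logits.max(axis=0)        # subtract max for numerical stability
    E = np.exp(logits)
    return E / E.sum(axis=0)            # each column sums to 1

rng = np.random.default_rng(0)
theta = rng.normal(size=(6, 10))        # N = 6 fault labels
Z = rng.normal(size=(10, 4))            # 4 test samples from the previous layer
conf = softmax_confidence(Z, theta)
print(conf.shape)
```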
Using the DS evidence theory algorithm to fuse the fault confidence matrices output by the softmax layers of the two diagnostic frameworks, based respectively on CNN and LSTM, makes full use of the diagnostic advantages of the two deep learning methods and ultimately improves the diagnostic accuracy.
The calculation process of DPF is as follows: (1) Obtain the confidence matrices output by the softmax layers of the deep learning diagnostic frameworks, that is, a group of diagnostic support vectors of each framework for the different fault labels. For each set of fault data, the k-th framework's diagnostic support vector over the fault labels can be noted as ξ_k = [ξ_{k,1}, ξ_{k,2}, ..., ξ_{k,γ}, ..., ξ_{k,l}], where k indexes the methods (in this paper, k = 1, 2 for CNN and LSTM) and l is the total number of fault labels (γ = 1, 2, ..., l; in this paper, l = 6). (2) Use the different methods k as rows and the diagnostic supports ξ_{k,γ} as columns to form a support matrix {ξ_{k,γ} | k = 1, 2; γ = 1, ..., l}. Each element of the support matrix indicates that the k-th diagnostic method's support for the fault label H_γ is ξ_{k,γ}.
(3) Treat each column in the support matrix as belonging to the identification framework Θ of DS evidence theory, so Θ = {H_γ | γ = 1, 2, ..., l} = {H_1, H_2, ..., H_l}. (4) Integrate the diagnostic support information of the different diagnostic methods into the same recognition framework Θ and calculate the basic probability assignment (BPA), where ω_k is the weight of each diagnostic method k (k = 1, 2); m_{k,γ} represents the basic probability assignment of the k-th method for the evaluation target, i.e., the transformer fault state H_γ; m_{k,H} represents the remaining probability that is assigned to the entire fault set H rather than to a specific fault state; and m̄_{k,H}, m̃_{k,H} are two intermediate variables for calculating the basic probability assignment.
(5) Calculate the composite probability assignment (CPA), where m_γ represents the composite probability assignment for the fault label H_γ, and K, m̄_H, m̃_H are three intermediate variables for calculating the composite probability assignment. The results are then normalized to obtain comprehensive diagnostic results, where ξ_γ (γ = 1, ..., l) is the fused and normalized confidence for each fault label and ξ_H is the normalized value of the uncertainty assigned to the whole fault set.
Finally, the fault label corresponding to the maximum confidence is output as the final diagnostic result.
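Steps (1)–(5) can be sketched as follows. This is an illustrative reading of the procedure, assuming the BPA is formed as m_{k,γ} = ω_k ξ_{k,γ} with the remaining mass 1 − ω_k Σ_γ ξ_{k,γ} assigned to the whole set H; the paper's exact BPA and CPA formulas remain the authoritative version, and the confidence vectors and weights below are hypothetical:

```python
import numpy as np

def ds_fuse(xi, weights):
    """Fuse K diagnostic support vectors (shape (K, l)) via a DS-style
    combination over the frame {H_1..H_l, H}, following steps (1)-(5)."""
    xi = np.asarray(xi, dtype=float)
    w = np.asarray(weights, dtype=float)
    K, l = xi.shape
    m = w[:, None] * xi                  # BPA for each specific fault label
    m_H = 1.0 - m.sum(axis=1)            # remaining mass on the whole set H
    fused, fused_H = m[0].copy(), m_H[0]
    for k in range(1, K):                # Dempster's rule, source by source
        conflict = sum(fused[g] * m[k, j]
                       for g in range(l) for j in range(l) if g != j)
        norm = 1.0 - conflict            # renormalize away the conflict mass
        new = (fused * m[k] + fused * m_H[k] + fused_H * m[k]) / norm
        fused_H = fused_H * m_H[k] / norm
        fused = new
    return fused / fused.sum()           # normalized confidence over labels

cnn_conf  = [0.70, 0.10, 0.05, 0.05, 0.05, 0.05]   # hypothetical softmax outputs
lstm_conf = [0.60, 0.20, 0.05, 0.05, 0.05, 0.05]
fused = ds_fuse([cnn_conf, lstm_conf], weights=[0.5, 0.5])
print(int(np.argmax(fused)))  # both methods agree, so label 0 wins
```

The fault label at `argmax(fused)` is then output as the final diagnostic result, as described above.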

Results of the Data Augmentation
In order to intuitively observe the effectiveness of the ADASYN method, the distribution of fault samples before and after the data augmentation was visualized by the t-SNE method, as shown in Figure 5. t-SNE can visualize high-dimensional data by assigning each sample a location in a 2-D or 3-D map [47]. As can be seen from Figure 5, the ADASYN technology does not simply copy the fault samples but generates entirely new fault samples. In addition, it can be seen that these fault samples are not easy to separate. The new dataset distribution is shown in Table 3.

Table 3. Distribution of the new dataset.

Fault Label Code | Fault Label | Total Number of Samples After Data Augmentation
1 | High temperature overheating | 232
2 | Medium temperature overheating | 240
3 | Low temperature overheating | 245
4 | Partial discharge | 237
5 | Low energy discharge | 240
6 | High energy discharge | 243
Total | | 1437

Since the data augmentation was implemented and achieved satisfying results, the new dataset is ready for the study of the deep parallel diagnostic framework.
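To make the "not simply copying" point concrete, here is a simplified ADASYN-style sketch (not the full algorithm from the original ADASYN paper): minority samples whose neighborhoods contain more majority samples receive proportionally more synthetic points, each generated by interpolation toward a minority neighbor rather than by duplication. All data here are synthetic stand-ins for the DGA set:

```python
import numpy as np

def adasyn_like(X_min, X_maj, n_new, k=5, rng=None):
    """Simplified ADASYN-style oversampling: harder-to-learn minority samples
    (more majority points among their k-NN) get more synthetic neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    X_all = np.vstack([X_min, X_maj])
    n_min = len(X_min)
    # ratio of majority samples among each minority point's k nearest neighbors
    r = np.empty(n_min)
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        r[i] = np.mean(nn >= n_min)           # indices >= n_min are majority
    r = r / r.sum() if r.sum() > 0 else np.full(n_min, 1.0 / n_min)
    g = np.round(r * n_new).astype(int)       # synthetics per minority sample
    synth = []
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X_min - x, axis=1)
        nn = np.argsort(d)[1:k + 1]           # minority-only neighbors
        for _ in range(g[i]):
            z = X_min[rng.choice(nn)]
            lam = rng.random()
            synth.append(x + lam * (z - x))   # interpolate: a NEW sample
    return np.array(synth)

rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(200, 5))   # abundant (e.g. normal-state) samples
X_min = rng.normal(1.0, 1.0, size=(20, 5))    # scarce fault samples
X_new = adasyn_like(X_min, X_maj, n_new=180, rng=rng)
print(X_new.shape)
```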

Data Visualization and Deep Transfer Learning
After the data augmentation and normalization process, the fault characteristic indices are converted into images by data visualization techniques to provide effective inputs for the CNN framework. This paper uses two visualization methods. One uses changes in height to represent differences in values, which yields dataset A, as shown in Figure 6a. The other uses changes in color to represent differences in values, which yields dataset B, as shown in Figure 6b. Each dataset is divided into a training dataset and a testing dataset at a ratio of 4:1. Next, the MobileNet-V2 network is imported and its last layer is replaced with a fully connected layer (FL) that has learnable weights and an output size equal to the number of fault labels. The hyperparameters are adjusted as shown in Table 4; the computer configuration is also listed in that table. Then, the two datasets A and B are respectively input into the modified MobileNet-V2 network for transfer training and testing.
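A minimal sketch of the height-based visualization that produces dataset A (the image size, bar layout, and the gas ordering are illustrative assumptions, not the paper's exact rendering):

```python
import numpy as np

def vector_to_height_image(v, size=32):
    """Render a normalized feature vector (values in [0, 1]) as a white-on-black
    'bar chart' image, so value differences become height/texture differences."""
    img = np.zeros((size, size), dtype=np.uint8)
    bar_w = size // len(v)                        # equal-width vertical bars
    for i, val in enumerate(v):
        h = int(round(val * size))                # bar height in pixels
        if h > 0:
            img[size - h:, i * bar_w:(i + 1) * bar_w] = 255  # fill from bottom
    return img

# hypothetical normalized concentrations of H2, CH4, C2H6, C2H4, C2H2
sample = np.array([0.8, 0.3, 0.5, 0.1, 0.9])
img = vector_to_height_image(sample)
print(img.shape)
```

A color-based variant for dataset B would instead map each value to a hue or grayscale intensity over a fixed-size patch.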

After the training process, the diagnostic accuracy based on dataset A reached 92.1%, while that based on dataset B was only 80.3%. From the mechanism of CNN, it is known that CNN mainly discriminates based on image texture, while its ability to distinguish colors is relatively poor. Therefore, representing data differences by height differences is more appropriate in the transformer DGA fault diagnostic scenario.

Construction and Training of the LSTM-Based Network
After the data augmentation and normalization process, a five-layer network is constructed, as shown in Figure 4, including the input layer, LSTM layer, fully connected layer, softmax layer, and classification output layer. The state activation function of the LSTM network node is 'tanh' and the gate activation function is 'sigmoid'. The hyperparameters of the network and the computer configuration are shown in Table 5. Then, the numerical dataset is input into the LSTM-based network for training and testing. After the training process, the diagnostic accuracy reached 93.6%. It seems that the LSTM-based network performs better than the CNN-based network in this transformer DGA diagnosis; however, this is not the case in every diagnostic scenario. It is therefore still worth finding a way to combine the advantages of the two widely used deep networks.

Implementation of the DPF
After finishing the training of CNN-based and LSTM-based deep networks, respectively, the testing dataset was input into the two networks for testing. Then, the output confidence matrices of the softmax layers in the two frameworks were extracted to implement the DPF according to the procedures described in detail in Section 2.2.3. Because the amount of testing data is too large to illustrate in a graph, we just drew a schematic diagram of the fusion as shown in Figure 7. The output of the softmax layer was digital data. In order to visualize the data intuitively, this paper uses color to distinguish the value of data. As shown in Figure 7, each column in the output matrix of the softmax layer represents a set of testing data and its confidence level for the six different fault labels. The DPF diagnostic result and the diagnostic result of the single deep learning method are compared in Table 6.
As can be seen from Table 6, the effectiveness of the deep parallel diagnostic method is further improved, which means that it can effectively extract the complex transformer fault information contained in the transformer DGA data.

Comparison with the State of the Art
We used eight types of models, including SVM, KNN, gradient boosting decision tree (GBDT), NN, the fuzzy c-means algorithm (FCM), CNN, LSTM, and the deep parallel diagnosis, to diagnose the same dataset before and after the data augmentation in the same computer environment. In the SVM model, the kernel function is Gaussian, the kernel scale is 0.56, and the box constraint level is 1. In the KNN model, the number of neighbors is 1, the distance metric is Euclidean, and the distance weight is equal. In the GBDT model, the maximum number of splits is 20, the number of learners is 30, and the learning rate is 0.1. As for the structure of the NN model, it contains an input layer of five neurons, a hidden layer of 10 neurons, and an output layer of six neurons; the training epoch is 1000 and the learning rate is 0.01. In the FCM model, the exponent for the fuzzy partition matrix is 2.0, the maximum number of iterations is 1000, and the minimum improvement in the objective function between two consecutive iterations is 1 × 10^-5.
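As a hedged illustration, three of the baselines with the stated hyperparameters could be set up as follows in scikit-learn (the paper's actual toolbox is not specified; the MATLAB-style "kernel scale" s is mapped to gamma = 1/s², and "maximum number of splits 20" is approximated by max_leaf_nodes = 21). The data below are synthetic stand-ins for the DGA dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier

# stand-in data: 5 gas features, 6 fault labels (the real DGA set is not public here)
X, y = make_classification(n_samples=600, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=6, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    # kernel scale s = 0.56 mapped to gamma = 1 / s**2 (an assumption)
    "SVM":  SVC(kernel="rbf", gamma=1.0 / 0.56**2, C=1.0),
    "KNN":  KNeighborsClassifier(n_neighbors=1, metric="euclidean",
                                 weights="uniform"),
    # "maximum number of splits 20" approximated by max_leaf_nodes = 21
    "GBDT": GradientBoostingClassifier(n_estimators=30, learning_rate=0.1,
                                       max_leaf_nodes=21, random_state=0),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(scores)
```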
The comparison results are shown in Table 7. Due to the randomness in the results of some methods, the diagnostic accuracies reported in this article are the averages of five independent experiments. The diagnostic accuracy of all eight diagnostic methods increased after the data augmentation, which shows that the imbalance of the original dataset limited the effectiveness of the fault diagnostic methods to a certain extent. Moreover, the deep parallel diagnosis performs better than the traditional intelligent algorithms and the single deep learning methods in the transformer diagnostic scenario, without requiring a complex feature extraction and selection process.

The Anti-Interference Ability of the Deep Parallel Diagnostic Method
In order to test the anti-interference ability of the models, 3% random noise was superimposed on the augmented dataset. The processed dataset was then used to train and test the same models. The results are shown in Table 8. As can be seen from Table 8, the anti-interference performance of the traditional machine learning methods is relatively worse than that of the deep learning methods; in the application scenario of this article, only the anti-interference performance of KNN can compete with the deep learning methods. Of the two deep learning methods, CNN has the better anti-interference performance. In addition, the deep parallel diagnosis inherits the powerful anti-interference ability of CNN while retaining a high diagnostic accuracy.
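One plausible reading of the 3% noise superimposition, sketched in NumPy (the paper does not state whether the noise is multiplicative or additive; bounded multiplicative noise is assumed here):

```python
import numpy as np

def add_relative_noise(X, level=0.03, rng=None):
    """Superimpose +/- 'level' (e.g. 3%) multiplicative random noise on each
    measurement: one plausible reading of the paper's robustness test."""
    if rng is None:
        rng = np.random.default_rng(42)
    return X * (1.0 + level * rng.uniform(-1.0, 1.0, size=X.shape))

# small positive stand-in for normalized gas concentrations
X = np.abs(np.random.default_rng(0).normal(size=(4, 5))) + 1.0
Xn = add_relative_noise(X)
print(np.abs(Xn / X - 1.0).max())  # always <= 0.03 by construction
```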

Conclusions
In this paper, a deep parallel diagnostic method was proposed for the field of transformer DGA fault diagnosis. ADASYN and transfer learning were introduced to deal with the problems of data imbalance and insufficiency. The conclusions are as follows: • In view of the fact that the actual fault samples of transformer DGA are imbalanced and inadequate, which undermines the effectiveness of fault diagnostic methods, ADASYN and transfer learning technology were used to effectively improve the diagnostic ability, especially the ability to troubleshoot minor latent faults. The accuracy of the deep parallel diagnostic method improved by 3.64% after data augmentation.

•
The complex non-linear mapping relationship between dissolved gas in transformer oil and transformer fault states was modeled with the help of two important deep learning methods, LSTM and CNN. The diagnostic accuracy of the LSTM network was 93.6%, while that of the CNN network was 92.1%.

•
Using DPF, which fully considers the different sensitivities of deep learning methods to different fault features, retains the independence of the individual deep learning diagnostic methods to a large extent. As a result, the final diagnostic accuracy reached 96.9%.

•
The anti-interference ability of the proposed method is better than that of most of the methods compared in this paper. When the dataset was superimposed with 3% random noise, the diagnostic accuracy of the proposed method decreased by only 0.62%, which shows strong adaptability and generalization ability.