An Inspired Machine-Learning Algorithm with a Hybrid Whale Optimization for Power Transformer PHM

Abstract: Prognostic and health management (PHM) technology has recently received extensive attention in the academic community. Nevertheless, the wide variety of fault types in power transformers often leads to inaccurate predictions and to instability of the power system. To address these problems, this paper proposes a power transformer PHM model based on a hybrid machine learning approach. The model uses intelligent sensors to obtain dissolved gas analysis (DGA) data for fault diagnosis of the power transformer system, so as to reduce the complexity of the features (gas types) in the power transformer. In particular, to enhance the robustness of the model, we adopt a modified differential evolution whale optimization algorithm (MDE-WOA) to optimize the probabilistic neural network (PNN); that is, the classification performance of the model is improved by updating the smoothing factor (σ) of the PNN. In addition, compared with other optimization algorithms, MDE-WOA has a lower complexity and a more stable optimization process. Finally, we evaluate the model with real-world data from power transformer sensors in Jiangxi province, China. The results indicate that the proposed algorithm achieves its highest diagnostic accuracy, 98.86%, in the fourth iteration. Therefore, the proposed metaheuristic algorithm for PNN parameter optimization can effectively enhance the accuracy and efficiency of power transformer fault diagnosis.


Introduction
As a major piece of equipment in the power system in an era of increasing power demand [1][2][3][4][5][6], the power transformer is indispensable to the transmission of electric energy and to the connection between the main system and each sub-system of the power grid, so its performance directly affects the reliable operation of the grid [7]. With the rapid development of high-voltage and ultra-high-voltage transmission technologies, the capacity of the power grid is continuously increasing and its coverage is steadily expanding. Consequently, if a power transformer fault is not detected promptly and accurately, it can paralyze the power grid and harm the normal development of the social economy [8]. Therefore, the study of power transformer fault diagnosis is highly significant to the development of the power system.
At present, oil-immersed power transformers are used primarily in the power grid. Under electrical, mechanical, and chemical influences, the mineral oil and insulating cellulose paper inside a traditional power transformer gradually degrade, producing carbon monoxide (CO), carbon dioxide (CO₂), hydrogen (H₂), a series of low-molecular-weight hydrocarbons such as methane (CH₄) and ethane (C₂H₆), and other gases. When potential faults exist in a power transformer, the contents of the various gases change significantly and the gases gradually dissolve into the oil. Therefore, the composition of the dissolved gas in transformer oil reflects the operating state of a power transformer to a great extent [9]. Currently, dissolved gas analysis (DGA) is used extensively as an effective approach to power transformer fault diagnosis [10][11][12]. Based on DGA data, traditional power transformer fault diagnosis methods [13], for instance the characteristic gas method, the three-ratio method, and an improved three-ratio method, have been developed by the International Electrotechnical Commission (IEC). However, because the gas generation mechanism of the power transformer is complex, there is no clear correspondence between the gas contents and ratios in oil and the fault type. The traditional fault detection methods therefore often rely on the diagnostic experience of experts and are difficult to implement as a program. When the tested sample data are too few or contain abnormal values, the accuracy of the test cannot meet the requirements of industrial production [14].
Recently, to correct the flaws of the traditional fault diagnosis methods described above, more and more scholars have applied artificial intelligence (AI) methods to DGA-based power transformer fault detection, such as fuzzy theory [15], the support vector machine (SVM) [16], and the artificial neural network (ANN) [17][18][19][20]. The ANN-based approach to power transformer fault diagnosis has been widely applied. It learns from sampled data of the power transformer under different working conditions, continuously adjusting the connection weights and biases (the significant parameters) of the network model, and thereby establishes a mapping between specific fault characteristics and fault types, which serves as the fault diagnosis model. The application of ANN therefore improves the accuracy of fault diagnosis. However, it has the disadvantages of a quite small convergence region and a tendency to fall into local optima. In addition, many intelligent optimization algorithms with excellent diagnostic performance are also widely applied in power transformer fault diagnosis, such as the particle swarm optimization (PSO) algorithm [21] and the cuckoo search (CS) algorithm [22].
Among studies on fault detection and prediction for power transformers, Zhang, Y. et al. [23] presented a novel two-step neural network approach. They enhanced the accuracy of fault detection by using two artificial neural networks to detect the fault type and the condition of the cellulose, respectively. Dong, M. et al. [14] proposed a power transformer fault diagnosis model that uses SVM for hierarchical decision making. The experimental results indicate that this model can resolve the parameter selection problem of the support vector classifier and generalizes well.
Because deep learning can extract system features from a small amount of sample data and represent complex relationships, fault diagnosis based on deep learning has attracted the attention of many scholars. For example, Zhang, C. et al. [24] developed a deep learning method for fault diagnosis of rotating equipment. By setting appropriate network parameters, the time spent extracting fault feature data can be reduced. In addition, this method classifies faults accurately even when the number of samples is small. On the other hand, its convergence is slow. Zhang, L. et al. [25] used a deep belief network for fault classification and identification in a vehicle transmission system. They first compute the spectrum of the original signal, then carry out data fusion, and finally establish a pattern recognition model based on deep learning. The classification results indicate that the method has good recognition accuracy. Its main advantages are as follows. First, based on deep learning, it can extract features from the spectrum of the sample data. Second, it can combine the sample data of multiple sensors to extract features; compared with the data of a single sensor, the fused data structure is more complete, which makes the classification more accurate. However, the model structure is complex, so training takes a long time. Ji, X. et al. [26] proposed a new method for power transformer fault diagnosis using deep learning and softmax classification. The method stacks encoders with softmax regression to form the fault detection and prediction model. The parameters of the model are first optimized by unsupervised training on a large number of samples using k-step contrastive divergence and are then adjusted with a supervised algorithm; finally, softmax regression determines the fault type of the power transformer. A comparative analysis shows that the accuracy and adaptability of this method for fault detection and prediction are superior to those of the back-propagation neural network and SVM. In short, deep learning has gradually been applied to the field of fault diagnosis.
It can be seen from the previous paragraphs that ANN has the advantages of high classification accuracy and strong parallel distributed processing ability. As a branch of ANN, the probabilistic neural network (PNN) shares these advantages and additionally offers easy training, fast convergence, and arbitrary nonlinear approximation. Therefore, we use PNN to build the basic model for power transformer fault detection and prediction.
In view of the current state of power transformer fault diagnosis technology described above, this paper proposes a new power transformer fault diagnosis method to improve the efficiency and accuracy of diagnosis. A further purpose of this paper is to offer a new way of thinking for research on diagnosis methods that incorporate artificial intelligence technology. The main contributions of this paper are as follows. First, we embed a modified differential evolution (MDE) operator into the whale optimization algorithm (WOA) by means of a lifespan mechanism, to overcome the tendency of WOA to fall into local optima. Second, the structural parameter of the PNN is optimized with the combined algorithm, which raises the fault detection rate of power transformer fault diagnosis. Finally, a fault diagnosis model of the power transformer is constructed, which provides a new idea for the development of fault diagnosis technology.
Following this introduction, Section 2 introduces the proposed method and describes the prognostic and health management (PHM) model of the power transformer based on it; Section 3 describes the experimental process; Section 4 presents and analyzes the experimental results; Section 5 discusses our research work; and Section 6 draws the conclusions.

Whale Optimization Algorithm
WOA is a swarm optimization algorithm, developed by Mirjalili et al. [27], that imitates the special hunting behavior of humpback whales. It simulates how humpback whales search for, encircle, and attack prey in order to carry out an optimization search. Many previous studies have shown that WOA has the advantages of a simple principle, easy implementation, and few parameter settings [28][29][30]. The algorithm consists of three stages: random search for prey, encircling prey, and bubble-net attack. First, whales hunt for prey at random. In this process, the whales search for better prey by moving away from each other. The mathematical model of this process is described as:

$$D = |C \cdot X_{rand}(t) - X(t)|, \qquad X(t+1) = X_{rand}(t) - A \cdot D \tag{1}$$

Here, X_rand(t) is the position of a randomly selected whale in the population, X(t) is the position of the current whale, t is the current iteration number, and A and C are coefficient vectors. At this stage, the algorithm sets |A| ≥ 1 to force the search agent to move away from the reference whale, so as to explore a broader region.
The coefficient vectors A and C are computed as:

$$A = 2a \cdot r - a, \qquad C = 2r, \qquad a = 2 - \frac{2t}{t_{max}} \tag{2}$$

Here, r is a random vector between 0 and 1, and t_max is the maximum number of iterations. From the above equation, a decreases linearly from 2 to 0 as the number of iterations increases. Then, the whales move closer to the prey after they have found it. This is the shrinking encircling mechanism of WOA, in which the position of each whale is updated as follows:

$$D = |C \cdot X_{g\_best}(t) - X(t)|, \qquad X(t+1) = X_{g\_best}(t) - A \cdot D \tag{3}$$

Note that here |A| < 1 and X_g_best is the current best search agent; if X(t + 1) is a better position, X_g_best is updated to it so that the whales surround the prey. Finally, the whales hunt the prey along a spiral, shrinking path. The mathematical model of this process is:

$$X(t+1) = D' \cdot e^{bl} \cos(2\pi l) + X_{g\_best}(t), \qquad D' = |X_{g\_best}(t) - X(t)| \tag{4}$$

Here, D' is the distance between the individual whale and the prey before the position update, b is a constant that determines the shape of the spiral, and l is a random value between 0 and 1. In WOA, whales encircle the prey and spiral toward it simultaneously; to model this, the two behaviors are assumed to be equally likely, i.e., each is chosen with probability 0.5 according to a random number p.
From the above description, the WOA algorithm can be summarized as follows:
Step 1: Set the algorithm parameters, namely the whale population size N, the maximum number of iterations t_max, and the dimension dim;
Step 2: Generate the initial individuals randomly and record their current positions;
Step 3: Calculate the fitness value f(X_i) of each individual and preserve the current optimal solution together with its position;
Step 4: Check whether the updating process has finished: if t = t_max, output the optimal solution and stop; if t < t_max, update a, A, and C according to Equation (2);
Step 5: Generate a random number p in [0, 1]. If p ≥ 0.5, update the individual position according to Equation (4) and return to Step 3. If p < 0.5, compare |A| with 1: if |A| ≥ 1, update the individual position via Equation (1) and return to Step 3; if |A| < 1, update the individual position by Equation (3) and return to Step 3.
Note: Fitness value f (X i ) refers to the objective function value calculated during iteration.
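The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function name `woa_minimize`, the default parameter values, the clipping of positions to the search bounds, and the use of one scalar A and C per whale (instead of a coefficient vector) are all assumptions made for brevity.

```python
import math
import random


def woa_minimize(fitness, dim, bounds, n_whales=20, t_max=100, b=1.0, seed=0):
    """Minimise `fitness` with a basic WOA loop; returns (best_position, best_fitness)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Steps 1-2: random initial population inside the search bounds
    whales = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_whales)]
    # Step 3: evaluate the population and keep the best individual found so far
    best = list(min(whales, key=fitness))
    best_fit = fitness(best)
    for t in range(t_max):
        a = 2.0 - 2.0 * t / t_max              # Equation (2): linear decrease from 2
        for i in range(n_whales):
            x = whales[i]
            p, l = rng.random(), rng.random()  # l drawn from [0, 1], as in the text
            if p >= 0.5:
                # Equation (4): spiral move around the current best agent
                new = [abs(best[j] - x[j]) * math.exp(b * l) * math.cos(2 * math.pi * l)
                       + best[j] for j in range(dim)]
            else:
                A = 2.0 * a * rng.random() - a  # scalar A and C per whale (assumption)
                C = 2.0 * rng.random()
                # |A| >= 1: explore around a random whale, Equation (1);
                # |A| < 1: encircle the best agent, Equation (3)
                ref = whales[rng.randrange(n_whales)] if abs(A) >= 1 else best
                new = [ref[j] - A * abs(C * ref[j] - x[j]) for j in range(dim)]
            whales[i] = [min(max(v, lo), hi) for v in new]  # keep inside the bounds
            f = fitness(whales[i])
            if f < best_fit:
                best, best_fit = list(whales[i]), f
    return best, best_fit


def sphere(x):
    """Toy objective used only for this illustration."""
    return sum(v * v for v in x)


best, best_fit = woa_minimize(sphere, dim=3, bounds=(-5.0, 5.0), t_max=50)
print(best_fit)
```

The sphere function stands in for the real objective; in the model described later, the fitness would be the MSE of the PNN for a candidate smoothing factor.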

Hybrid Whale Optimization Algorithm with Modified Differential Evolution Operators
In WOA, Equation (1) requires the whales to move away from the prey and to move randomly relative to different individuals at the beginning of the iteration. This process gives WOA a good global optimization ability. However, like other swarm intelligence optimization algorithms, WOA has common shortcomings. As the number of iterations increases, the population keeps moving closer to the region of one optimal individual, thus losing the opportunity to explore other locations in the space and reducing the diversity of the population. According to Equation (2), a decreases linearly as the number of iterations increases. This linear reduction leads to |A| < 1 in the later period of iteration, so that the positions of all search agents can only be updated by Equations (3) and (4), and the algorithm easily falls into a local optimum. Therefore, we propose a whale optimization algorithm based on the modified differential evolution (MDE) operator to address the problem of easily falling into local optima (the pseudocode for MDE-WOA is shown in Algorithm 1).

Algorithm 1: MDE-WOA
Initialize the whale population X_i (i = 1, 2, ..., NP), scaling factor F, crossover rate Cr, lifespan S;
Compute the fitness of each search agent (solution); X_g_best = the best search agent; s = 0;
while t < maximum iterations do
    for each search agent do
        Update a, A, C, l, and p;
        if p < 0.5 then
            if (|A| < 1) & (s < S) then
                Update the position of the current search agent by Equation (3);
            else if (|A| ≥ 1) & (s < S) then
                Select a random search agent;
                Update the position of the current search agent by Equation (1);
            else if (s = S) then
                Local and global neighborhood-based mutations;
                Generate a donor vector by Equation (8);
                Crossover;
                Generate a trial vector by Equation (9);
                Selection by Equation (10);
            end if
        else
            Update the position of the current search agent by Equation (4);
        end if
    end for
    Recompute the fitness and update X_g_best if there is a better solution;
    Update s by Equation (5); t = t + 1;
end while
return X_g_best

In MDE-WOA, MDE shares a population with WOA, and the improved differential evolution operator is used as a component of WOA through the lifespan mechanism, which determines when the operator is embedded in WOA. In this paper, S denotes the lifespan of the individual and s its current age. The age s is updated by Equation (5), where t is the number of iterations and δX_g_best = f(X_g_best, t) − f(X_g_best, t − 1). When s = S, X_g_best has not been updated for S consecutive iterations, and the position update of Equation (3) is then carried out by the modified differential evolution strategy instead.
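The lifespan mechanism can be illustrated with a small counter. The sketch below assumes one plausible reading of Equation (5): the age s grows by one whenever the best fitness does not improve (δX_g_best ≥ 0 for a minimization problem) and resets to zero when it does. The function names `update_age` and `use_mde` are hypothetical.

```python
def update_age(s, best_fit_now, best_fit_prev):
    """Assumed form of Equation (5): reset on improvement, otherwise age by one."""
    delta = best_fit_now - best_fit_prev   # corresponds to δX_g_best in the text
    return 0 if delta < 0 else s + 1


def use_mde(s, lifespan):
    """The MDE operator takes over once the age reaches the lifespan S."""
    return s >= lifespan


# Best fitness per iteration in a toy run: two stagnant steps, one improvement
history = [0.9, 0.9, 0.9, 0.7, 0.7]
s, ages = 0, []
for prev, now in zip(history, history[1:]):
    s = update_age(s, now, prev)
    ages.append(s)
print(ages)                                   # [1, 2, 0, 1]
print([use_mde(age, lifespan=2) for age in ages])
```

With S = 2 in this toy run, the MDE branch would be triggered only after the second stagnant iteration.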
In this paper, the MDE operator is embedded into WOA using the concept of a neighborhood mutation operator. Like the traditional differential evolution algorithm, MDE consists mainly of mutation, crossover, and selection. In the mutation operation, MDE combines a local model with a global model and adds a weighting factor to obtain the desired donor vector. The local donor vector is composed of the best solution in the neighborhood of X_i,t and two randomly selected vectors, as given by Equation (6), where X_l_best,t is the best solution in the neighborhood of X_i,t; p, q ∈ [i − k, i + k] (p ≠ q ≠ i); and k is a non-zero integer in the range [1, (NP − 1)/2] (NP is the population size). α_l and β_l are disturbances selected randomly around the fixed scaling factor F, α_l = β_l = λ × rand(NP, D) + F. Increasing the disturbance reduces the chance of the solution falling into a local optimum.
Similarly, the global donor vector is given by Equation (7). Here, X_g_best,t is the best vector found up to the t-th iteration, and r_1 and r_2 are indices chosen randomly from the whole population. The first term of the model uses the global optimal vector in place of X_i,t to improve convergence. Finally, the local donor vector is combined with the global donor vector to obtain the final donor vector by Equation (8), where w is the weighting factor, which ranges from 0 to 1. To reduce the number of parameters and keep the two models balanced, w is set to the middle of its range, namely 0.5. After the donor vector has been obtained by the mutation operation, the crossover operation is carried out to further increase the diversity of the population. In existing differential evolution algorithms, exponential crossover and binomial crossover are widely used. We adopt the binomial crossover, given by Equation (9). Here j_rand ∈ {1, 2, ..., D} is a random dimension index, ensuring that the trial vector U_i,j takes at least one element from the donor vector V_i,j, and Cr controls the crossover probability.
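As a point of reference, Equations (6) to (9) can be written in the following form, which follows the standard neighborhood-based differential evolution scheme and is consistent with the definitions above. The coefficient names α_g and β_g in the global model and the assignment of w to the global term are assumptions; with w = 0.5 the latter makes no difference.

```latex
\begin{align}
L_{i,t} &= X_{i,t} + \alpha_l \,(X_{l\_best,t} - X_{i,t}) + \beta_l \,(X_{p,t} - X_{q,t}) \tag{6}\\
G_{i,t} &= X_{g\_best,t} + \alpha_g \,(X_{g\_best,t} - X_{i,t}) + \beta_g \,(X_{r_1,t} - X_{r_2,t}) \tag{7}\\
V_{i,t} &= w \, G_{i,t} + (1 - w)\, L_{i,t} \tag{8}\\
U_{i,j,t} &=
\begin{cases}
V_{i,j,t}, & \text{if } rand_j \le Cr \ \text{or}\ j = j_{rand}\\
X_{i,j,t}, & \text{otherwise}
\end{cases} \tag{9}
\end{align}
```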
The selection operation compares the trial individual generated by mutation and crossover with the target individual, and the better of the two enters the next generation of the population. The selection process can be described as:

$$X_{i,t+1} = \begin{cases} U_{i,t}, & f(U_{i,t}) \le f(X_{i,t}) \\ X_{i,t}, & \text{otherwise} \end{cases} \tag{10}$$

According to the equation, if the fitness value of the trial individual U_i,t is less than or equal to that of the corresponding target individual, U_i,t replaces the target individual and enters the next generation of the population; otherwise, the individual X_i,t remains unchanged.
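The mutation, crossover, and selection operations described above can be sketched as a single update of one individual. This is a hedged illustration: the function name `mde_step`, the ring-shaped neighborhood, the reuse of the same disturbance for the local and global models, and the default values of k, F, Cr, and λ (`lam`) are assumptions and are not taken from the experimental settings.

```python
import random


def mde_step(pop, fitness, i, g_best, k=2, F=0.5, Cr=0.9, w=0.5, lam=0.1, rng=random):
    """One MDE update of individual i; requires 2*k + 1 <= len(pop)."""
    n, dim = len(pop), len(pop[0])
    x = pop[i]
    # Ring neighbourhood [i - k, i + k] and its best member
    hood = [(i + off) % n for off in range(-k, k + 1)]
    l_best = min((pop[h] for h in hood), key=fitness)
    p, q = rng.sample([h for h in hood if h != i], 2)        # local difference pair
    r1, r2 = rng.sample([h for h in range(n) if h != i], 2)  # global difference pair
    j_rand = rng.randrange(dim)  # guarantees at least one donor component
    trial = []
    for j in range(dim):
        alpha = beta = lam * rng.random() + F  # random disturbance around F
        local = x[j] + alpha * (l_best[j] - x[j]) + beta * (pop[p][j] - pop[q][j])
        glob = g_best[j] + alpha * (g_best[j] - x[j]) + beta * (pop[r1][j] - pop[r2][j])
        donor = w * glob + (1 - w) * local     # weighted combination of both models
        # Binomial crossover between donor and target
        trial.append(donor if (rng.random() <= Cr or j == j_rand) else x[j])
    # Greedy selection: keep the trial only if it is at least as good
    return trial if fitness(trial) <= fitness(x) else x


def sphere(x):
    """Toy objective used only for this illustration."""
    return sum(v * v for v in x)


rng = random.Random(1)
pop = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(6)]
g_best = min(pop, key=sphere)
updated = mde_step(pop, sphere, 0, g_best, rng=rng)
print(sphere(pop[0]), sphere(updated))
```

Because of the greedy selection, the returned individual is never worse than the target, which is what lets the operator perturb a stagnant population without losing the best solution.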

Overview of the Probabilistic Neural Network (PNN)
PNN is a feedforward neural network based on the radial basis function (RBF), presented by Dr. Specht in 1989 [31]. Its use of Bayesian decision theory and RBF, together with its consideration of the interaction between different pattern classes, gives it a competitive advantage over other neural network models. As the amount of training data grows, PNN converges to the Bayesian classifier without falling into local minima. Additionally, PNN is popular in pattern classification and in fault detection and prediction.
Unlike the back propagation (BP) neural network, PNN typically has a parallel four-layer structure, as shown in Figure 1. The function of each layer is described as follows. The input layer receives the pre-processed training samples and passes their features to the network, so the number of its neurons equals the dimension of the samples.
The pattern layer uses the Euclidean distance between the input feature vector X and the radial center x_ij, passed through a radial basis function with smoothing factor σ, to match the input feature vector against the various classes of the training set. Here, X = [x_1, x_2, x_3, ..., x_n]^T, n = 1, 2, ..., l, where l covers all training classes, d is the dimension of the feature vector, x_ij is the j-th center of the i-th training class, and σ is the smoothing factor. The summation layer computes a weighted average of the outputs of the pattern-layer neurons that belong to the same class, where v_i is the output for class i and L is the number of class-i neurons. The output layer returns the class corresponding to the maximum output of the summation layer.
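The layer-by-layer description above corresponds to a very small amount of code. The sketch below assumes the common Gaussian form of the PNN kernel, exp(−‖X − x_ij‖² / (2σ²)) divided by (2π)^(d/2) σ^d; the exact normalization is an assumption, and the function name `pnn_predict` and the toy data are hypothetical.

```python
import math


def pnn_predict(x, train_by_class, sigma):
    """Return (predicted_label, class_scores) for the feature vector x."""
    d = len(x)
    norm = (2 * math.pi) ** (d / 2) * sigma ** d  # assumed Gaussian normalisation
    scores = {}
    for label, centers in train_by_class.items():
        # Pattern layer: Gaussian kernel of the squared Euclidean distance
        acts = [
            math.exp(-sum((a - c) ** 2 for a, c in zip(x, center)) / (2 * sigma ** 2)) / norm
            for center in centers
        ]
        # Summation layer: average the pattern outputs of the same class
        scores[label] = sum(acts) / len(acts)
    # Output layer: the class with the largest summation output
    return max(scores, key=scores.get), scores


# Toy two-class training set (illustrative values, not DGA data)
train = {
    "PD": [[0.0, 0.0], [0.1, 0.0]],
    "AD": [[1.0, 1.0], [0.9, 1.1]],
}
label, scores = pnn_predict([0.05, 0.0], train, sigma=0.3)
print(label)
```

The only free parameter is `sigma`, which is why the rest of the model concentrates on choosing it well.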

PNN Optimized by MDE-WOA Power Transformer PHM Model
A known weakness of PNN is that the computation in the hidden layer is strongly influenced by the smoothing factor (σ). If σ is set too large or too small, the network converges too quickly or falls into a local optimum easily. As an improved intelligent optimization algorithm, MDE-WOA has strong global optimization ability and rich population diversity. It can therefore be used to select a suitable value of σ and thus improve the performance of PNN.
The input data of the model are the DGA data of the power transformer. The flow chart of the PNN network model optimized by MDE-WOA is shown in Figure 2, and the specific steps can be summarized as follows:
Step 1: Randomly generate the initial sample X;
Step 2: Initialize the parameters and structure of the PNN and define a random initial smoothing factor σ;
Step 3: Set the current age s = 0 and the current iteration number t = 1. Initialize the whale population size (NP), the scaling factor (F), the crossover control parameter (Cr), the lifespan (S), and the fitness function f(x). In our study, the mean square error (MSE) is taken as the value of the fitness function:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - O_i)^2$$

Here, Y_i is the actual result, O_i is the expected result, and n is the number of samples;
Step 4: Compute the fitness value of each search agent and record the position of the optimal individual;
Step 5: Update the algorithm parameters a, A, C, l, and p;
Step 6: Compare the random number p in [0, 1] with 0.5. If p ≥ 0.5, the search agent updates its position along the spiral through Equation (4). If p < 0.5, compare s with S: if s < S, the current search agent searches for or encircles the prey and updates its position via Equation (1) or Equation (3), respectively; if s = S, the MDE operator is introduced to optimize the search strategy;
Step 7: Compute the fitness value of each search agent again, and update the best search agent if there is a better solution;
Step 8: Update the current age according to Equation (5);
Step 9: When the iteration number t reaches the maximum number of iterations t_max and the other parameters of the algorithm reach the preset conditions, the algorithm proceeds to the next step; otherwise, it returns to Step 5;
Step 10: Use the optimal search agent as the smoothing factor σ of the PNN to obtain a better fault diagnosis model;
Step 11: Feed the test samples into the network to obtain the corresponding analysis data.
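The fitness evaluation used in these steps can be sketched as follows. The sketch assumes that fault classes are encoded as integer codes and that the MSE is taken between predicted and expected codes; this encoding, the function names `mse` and `sigma_fitness`, and the `classify` callback (standing in for a trained PNN) are assumptions made for illustration.

```python
def mse(actual, expected):
    """Mean square error between actual outputs Y_i and expected outputs O_i."""
    if not actual or len(actual) != len(expected):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - o) ** 2 for y, o in zip(actual, expected)) / len(actual)


def sigma_fitness(sigma, classify, samples, expected_codes):
    """Fitness of a candidate smoothing factor: MSE of the resulting predictions."""
    predicted = [classify(x, sigma) for x in samples]
    return mse(predicted, expected_codes)


# Stand-in classifier for the demonstration: class 1 below the threshold, else class 2
def toy_classify(x, sigma):
    return 1 if x[0] < sigma else 2


samples, expected = [[0.1], [0.9]], [1, 2]
print(sigma_fitness(0.5, toy_classify, samples, expected))   # 0.0
print(sigma_fitness(0.05, toy_classify, samples, expected))  # 0.5
```

An optimizer such as MDE-WOA would call `sigma_fitness` once per search agent and iteration, with each agent's position interpreted as a candidate σ.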

Data Collection
Considering the effects of transformer model capacity and of environmental humidity and temperature on transformer performance, this paper collected various gas data from the oil of real power transformers operated by power supply companies (PSC) in Jiangxi province, China, in 2019 as experimental samples. After screening, 555 characteristic gas content samples were obtained, including 65 cases of partial discharge (PD), 361 cases of low temperature overheating (LT) (<150 °C), 40 cases of low temperature overheating (LT) (150 °C-300 °C), and 89 cases of arc discharge (AD). Of these samples, 400 were used as the training set, and the remaining 155 were used as the test set.
The power transformers studied are units whose dissolved gas data in oil are used for fault diagnosis. The corresponding diagram is shown in Figure 3. The 555 collected sets of power transformer fault data were processed in MATLAB (R2019a), and the results can be seen in Figure 4. The three subgraphs clearly show that each fault type has its own distribution of ratio data. In addition, LT (150 °C-300 °C) and AD differ obviously from the other fault types in the distribution of the ratio data, whereas for LT (<150 °C) and PD the distributions of the three ratios are close, which is a challenge for power transformer fault classification. The fault type can be obtained from the ratios of the dissolved gas contents in transformer oil. Table 1 shows part of the original data, with the power transformer fault types judged by the China Electric Power Research Institute using the DGA method. PSC denotes the power supply company and TH the total hydrocarbon of the transformer oil; during the judgment, the normal operating temperature was 25 °C and the humidity was set to 50%.

Algorithm Setting
To further assess the stability and performance of the MDE-WOA-PNN algorithm for power transformers, we compare our model with other methods, including BA-BP, MCS-BP [32], and GA-BP. The settings of these algorithms are listed in Table 2.
Table 2. Parameter settings of the various approaches in our study.

Experimental Results
To assess the effectiveness of our approach in power transformer fault diagnosis, we compared its classification accuracy with that of four methods (BA-BP, CS-BP, GA-BP, and PNN). We used MATLAB for the simulation experiments. The classification accuracy obtained in the experiment is shown in Table 3. Table 3 demonstrates that the accuracy of fault detection and prediction of the MDE-WOA-PNN model was the best among all of the diagnostic models, which further indicates that MDE-WOA clearly improved the PNN model. Regarding the results of MDE-WOA-PNN for the four fault types, the accuracy for LT (<150 °C) was 100% (106/106), for LT (150 °C-300 °C) 100% (13/13), for partial discharge (PD) 100% (14/14), and for arc discharge (AD) 95.46% (21/22). Therefore, compared with the other diagnostic models, this model is more suitable for power transformer fault detection and prediction.
As another important diagnostic index of the model, the MSE directly expresses the error between the PNN output and the ideal output of the model. Therefore, to explore the superiority of the presented method, we compare its MSE with that of the above four methods. As can be seen from Table 4, the MSE of the MDE-WOA-PNN model on the test set was the smallest. When MDE-WOA was used to optimize PNN and the test samples were used as the input of PNN, the MSE on the test set was only 0.058, far better than that of the other models. Owing to some noisy data, the MSE on the training samples was not as good. However, combined with Table 3, the model still obtained a competitive diagnostic accuracy, which indirectly shows that the model is highly robust. In addition, to learn more about the impact of the number of iterations on the MSE of the proposed model, we selected iteration counts of 2, 4, 6, and 8 for an exploratory experiment. The experimental results are given in Figure 5.
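The per-class figures quoted above can be checked with a few lines of arithmetic. The snippet uses only the counts given in the text; the variable names are illustrative. It shows that the 98.86% figure quoted in the abstract matches the unweighted mean of the four per-class accuracies, while the fraction of all test samples classified correctly is 154/155.

```python
# (correct, total) test samples per fault type, as quoted in the text
per_class = {
    "LT (<150 C)": (106, 106),
    "LT (150-300 C)": (13, 13),
    "PD": (14, 14),
    "AD": (21, 22),
}

class_acc = {name: c / n for name, (c, n) in per_class.items()}
macro_avg = sum(class_acc.values()) / len(class_acc)   # mean of per-class accuracies
n_correct = sum(c for c, _ in per_class.values())
n_test = sum(n for _, n in per_class.values())
overall = n_correct / n_test                           # share of all test samples

print(f"test samples: {n_test} (= 555 - 400)")
print(f"AD accuracy: {class_acc['AD']:.2%}")
print(f"macro average: {macro_avg:.2%}")
print(f"overall: {overall:.2%}")
```

Note that 21/22 evaluates to 95.45% at two decimals, marginally below the 95.46% quoted; the averaged figure is unaffected at this precision.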
To explore the impact of the number of iterations on the diagnostic accuracy of the presented algorithm, we calculated the accuracy of the model's fault diagnosis when the number of iterations was 2, 4, 6, and 8, respectively. The variation of the fault diagnosis accuracy for the various types of test samples is shown in Figure 6. Figure 7 shows the classification results of MDE-WOA-PNN on the training and test sets. As can be seen from the figure, the diagnostic accuracy on the test set was highest when the number of iterations was 4. The analysis therefore suggests that the network model was under-fitted when the number of iterations was 2 and over-fitted when the number of iterations was 6 or 8, which supports the efficiency of the developed approach. The variation of the average accuracy with the number of iterations is given in Figure 8. According to the comprehensive analysis, MDE-WOA-PNN has a very high average accuracy for the problem studied in this paper, up to 98.86%. Therefore, MDE-WOA-PNN is well suited to fault detection and prediction of power transformers. The value and variance of the optimal search agent for the MDE-WOA-PNN algorithm are given in Figure 9. Taken together with Figure 8, this indicates that the model had the highest diagnostic accuracy when the value of the best search agent was 0.047265. The fitness curve of the model in Figure 10 shows that the MDE-WOA-PNN algorithm converged very quickly. It can also be seen that the proposed method escaped from local optima quickly, which shows the high efficiency of the algorithm. Moreover, the initial error of the algorithm was small, indicating that the initial value of the algorithm was close to the global optimal value.

Figure 10. The adaptive (fitness) curves for different iteration counts (panels: Max_iteration = 4, 6, and 8; vertical axis: fitness).

Discussion
In this work, a PHM model for a power transformer was established via PNN. The intelligent optimization algorithm MDE-WOA was introduced to optimize the unknown parameter of the PNN model, i.e., the smoothing factor (σ), so as to enhance the performance of the PNN algorithm. With MDE-WOA optimization, the global convergence of the PNN network was significantly enhanced and the network could escape from local optima quickly, which made it more efficient. Compared with other optimization algorithms, the efficiency of this model was superior. In the process of optimizing PNN with MDE-WOA, the optimization effect was not affected when the initial parameters changed slightly, which many algorithms cannot achieve. Furthermore, this paper showed that the MDE-WOA algorithm could accelerate the convergence and improve the efficiency of the network, thereby raising the fault prediction rate for power transformers. Therefore, the MDE-WOA-PNN model has an obvious advantage in fault detection and prediction of the power transformer.

Conclusions
This study did not examine the influence of the number of samples on the diagnosis results of the model. In addition, MDE-WOA-PNN, as a novel power transformer fault diagnosis method, also has a limitation: the parameter settings have a considerable impact on the performance of the model. Compared with methods in similar publications, such as the power transformer fault diagnosis method combining hypersphere multiclass SVM and improved D-S evidence theory proposed by Shang, H. et al. [34] and the traditional BA-BP diagnosis method, MDE-WOA-PNN not only achieved a higher accuracy and faster convergence, but also mitigated the tendency of traditional fault diagnosis methods to fall into local optima.
For future work, our team will increase the experimental sample data to further study the impact of the number of experimental sample data on the accuracy of fault diagnosis. Meanwhile, we will improve the intelligent optimization algorithm to adapt to a more complex fault diagnosis. It is worth noting that the algorithm is also applicable to fault diagnosis of diesel engines and sensors.