Research on Transformer Voiceprint Anomaly Detection Based on Data-Driven

Abstract: Condition diagnosis of power transformers using acoustic signals is an uninterrupted, contactless method of equipment maintenance that can identify the type of abnormal condition of a transformer. To improve the accuracy and efficiency of sound-based abnormality diagnosis, a method for abnormal diagnosis of power transformers based on the Attention-CNN-LSTM hybrid model is proposed. The method collects the sound signals emitted by a real power transformer in the normal, overload, and discharge states and preprocesses them to obtain their MFCC features. The MFCC features are then combined with their first- and second-order differences into a set of sound feature vectors, which are fed into the Attention-CNN-LSTM hybrid model for training. The training results show that the Attention-CNN-LSTM hybrid model can be used for sound-based state detection of power transformers, and that recognition of the three states achieves an accuracy of more than 99%.


Introduction
The power transformer is one of the most important pieces of equipment in the power system. Its operating state has a direct impact on the power supply and the safety of the power system. As user electricity consumption increases, more and more transformers are put into service in the power grid, so transformer monitoring and fault detection technology plays a vital role in the grid's fault-prevention ability and its safe and stable operation.
Power transformer failures are mainly insulation failures, and some non-insulation primary faults can develop into insulation faults. A variety of factors lead to the insulation deterioration of transformers [1,2]. Currently, the primary methods for transformer abnormality and fault diagnosis are oil chromatography diagnosis, vibration diagnosis, infrared thermal imaging diagnosis, acoustic diagnosis, and spectral diagnosis [3][4][5][6][7][8][9][10][11]. Compared with the other methods, acoustic diagnosis has the advantages of easy installation, fast diagnosis, and no direct contact with the equipment. Traditionally, sound-based diagnosis of abnormalities and faults relies mainly on experienced personnel listening by ear. However, this approach is highly subjective and is suitable only for relatively obvious failures.
As machine learning has gained popularity, deep learning models based on neural networks have become the prevailing trend. Deep learning has been applied to fault diagnosis in many fields with excellent results [12][13][14][15]. In research on transformer voiceprint fault detection, Reference [16] proposes a transformer core voiceprint recognition model based on the Mel time spectrum and a convolutional neural network, which uses the vibration signal and sound data of the iron core under different operating states to identify three voltage conditions. Although the recognition accuracy of this method for the three working conditions reaches 99%, it requires both vibration sensors and sound sensors, which complicates practical application, and the installation position has a significant impact on the experimental results. Reference [17] proposes a backpropagation (BP) neural network diagnostic model based on transformer vibration and noise: vibration and noise signals are acquired, eigenvalues are obtained by fast Fourier transform, and the eigenvalues are fed into the BP neural network for fault diagnosis. This method is accurate for obvious mechanical faults, but its recognition accuracy for transformer discharge, overload, and other abnormal phenomena is low. Reference [18] proposes a transformer voiceprint recognition model based on improved Mel-frequency cepstral coefficients (MFCC) and the vector quantization (VQ) algorithm, in which recognition is computed by principal component analysis followed by the VQ algorithm; the recognition accuracy reaches 93%. Although this method retains most of the MFCC features, the differences between the sounds of different operating conditions may lie in the discarded MFCC components, so the method is less accurate in identifying abnormalities when the change in the transformer's sound is not obvious.
To address the above problems, a hybrid transformer abnormal voiceprint recognition model that combines MFCC with convolutional neural networks (CNNs) and long short-term memory (LSTM) is proposed. Sound is collected from a substation 10-kV oil-immersed transformer in normal operation and in abnormal states (overload and discharge as examples). All three states are sampled with the transformer under load; the two abnormal states, discharge and overload, are samples recorded by the substation during previous operation of the transformer, and the abnormal discharge state refers to partial discharge. MFCC is used to extract features from the collected sound, the extracted features are fed into the CNN-LSTM hybrid model, and an attention mechanism is introduced so that the three working conditions of the transformer are identified accurately.

Time Domain Analysis
Transformer vibrations propagate through the windings, core, insulating oil, and other accessories and radiate outward as sound through the transformer tank. The sound contains a wealth of equipment status information. For sound acquisition, a computer is connected to a DAC sound card and a microphone, the microphone is fixed close to the core of the transformer, and acquisition runs cyclically with 10 s as one sample. The microphone sensor is an electret condenser with a sensitivity of −30 dB ± 3 dB and a signal-to-noise ratio of 74 dB SPL. The sound card applies no noise reduction, the acquisition frequency band covers 0-22,000 Hz, the sampling frequency is 44,100 Hz, and the recording is mono. Figures 1-3 show the time domain waveforms of the transformer under normal, discharge, and overload conditions. When the transformer operates normally, the alternating current generates alternating magnetic flux through the winding. This flux is periodic and causes periodic vibration of the iron core [19]. The resulting sound is regular, as shown in the time domain waveform in Figure 1. When discharge occurs in the transformer, the sound of transformer operation is mixed with the sound of the discharge, and the regularity seen in the normal state is no longer obvious, as shown in Figure 2. In the event of an overload, the transformer hums louder than during normal operation [20], as shown in Figure 3.

Spectrogram Analysis
Figures 4-6 show the spectrograms of the transformer during normal operation, discharge, and overload. In contrast to the waveform, which is a pure time domain representation, the spectrogram represents sound in the time-frequency domain; it expresses deeper voiceprint characteristics while fully describing how the frequency content and energy of the sound evolve over time. This is advantageous for the model's learning process [21]. The color represents the intensity of the sound at a particular frequency and moment, with yellow representing high intensity and green representing low intensity. The spectrogram's horizontal and vertical axes represent time in seconds and frequency in Hz, respectively. The spectrogram thus conveys three dimensions of information (time, frequency, and intensity) in a two-dimensional image, combining the characteristics of sound data representation with image-style processing. Assuming that the time-domain signal of the sound waveform is x(l), the spectrogram is calculated as

$$X_n(e^{jw}) = \sum_{m} x_n(m)\, e^{-jwm}$$

$$X_n(k) = \sum_{m=0}^{P-1} x_n(m)\, e^{-j 2\pi k m / P}, \quad 0 \le k \le P - 1$$

$$T(n, k) = |X_n(k)|^2$$

The spectrograms demonstrate that the frequency range of the sound when the transformer is discharging extends into the high-frequency band, whereas the sound during normal operation is concentrated primarily in the low-frequency band. When the transformer is overloaded, the intensity of the sound in the low- and medium-frequency bands is greater than during normal operation. The time and frequency domain analysis therefore indicates that it is feasible to use sound signals for abnormal transformer diagnosis.
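As a rough illustration of how a spectrogram matrix of this kind can be computed, the sketch below frames a signal, windows each frame, and takes the squared magnitude of a naive discrete Fourier transform. The function name `spectrogram`, its parameters, and the choice of a Hamming window are assumptions made for this example; a practical implementation would use an FFT library rather than the direct sum shown here.

```python
import cmath
import math


def spectrogram(signal, frame_len, hop, n_fft):
    """Return rows T[n][k] = |X_n(k)|^2 for frame n and frequency bin k.

    Assumes n_fft is at least frame_len; shorter frames are zero-padded.
    """
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Apply a Hamming window to the frame (window choice is an assumption).
        windowed = [
            x * (0.54 - 0.46 * math.cos(2 * math.pi * m / (frame_len - 1)))
            for m, x in enumerate(frame)
        ]
        # Zero-pad the windowed frame up to n_fft points.
        windowed += [0.0] * (n_fft - frame_len)
        row = []
        # Keep only the non-negative frequency bins 0 .. n_fft/2.
        for k in range(n_fft // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * m / n_fft)
                for m, x in enumerate(windowed)
            )
            row.append(abs(acc) ** 2)
        rows.append(row)
    return rows


# Example: a pure tone that completes 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
T = spectrogram(tone, frame_len=64, hop=32, n_fft=64)
peak_bin = max(range(len(T[0])), key=lambda k: T[0][k])
```

Each row of `T` corresponds to one time frame and each column to one frequency bin, so plotting `T` as a heat map gives the kind of image described for Figures 4-6.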

Preprocessing of Sound Signals
By preprocessing the sound, the effects of aliasing, high-order harmonic distortion, high frequency, and other issues on the energy and frequency of the sound signal can be eliminated [21]. Additionally, high-quality parameters can be input for the subsequent feature extraction step, enhancing the effect of sound signal feature extraction.
Although the same device is used to collect all samples, the individual sound samples still differ from one another because of various factors. To reduce the impact of these differences on sound quality, the data are first normalized as

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

where $X$ is a value of the sound signal, $X'$ is its normalized value, and $X_{\min}$ and $X_{\max}$ are the minimum and maximum values of the sound signal.
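A minimal sketch of min-max normalization using the minimum and maximum of the signal is shown below. The function name `min_max_normalize` and the handling of a constant signal are assumptions made for this example.

```python
def min_max_normalize(samples):
    """Scale samples linearly so the minimum maps to 0 and the maximum to 1."""
    x_min, x_max = min(samples), max(samples)
    if x_max == x_min:
        # A constant signal has no range; return zeros to avoid dividing by zero.
        return [0.0 for _ in samples]
    return [(x - x_min) / (x_max - x_min) for x in samples]


normalized = min_max_normalize([-2.0, 0.0, 2.0])
```

After this step, recordings with different gain settings share the same amplitude range, which is the purpose the normalization serves here.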
Because a sound signal can be regarded as stationary only over a short interval, it must be analyzed and processed by "short-time analysis". The sound signal is therefore divided into frames. In voiceprint detection, too short a frame leads to poor representation by the feature vector, and too long a frame reduces the accuracy of the feature vector, so a frame length of 20-30 ms is generally used [18]. This paper uses a frame length of 25 ms and a frame shift of 10 ms. To ensure continuity between adjacent frames, adjacent frames overlap, and the relationship between the overlapping part and the frame signal is

$$M = \frac{l - bL}{(1 - b)L}$$

where $M$ is the number of frames, $l$ is the length of the sound signal, $L$ is the frame length, and $b$ is the overlap rate.
To simplify the calculation and keep the sound continuous, the overlap rate is set to 30% in this article. After framing, a discrete Fourier transform is required, but transforming the framed signal directly causes signal distortion. A Hamming window is therefore applied to each frame to increase continuity at both ends of the frame and make the low-pass characteristics smoother and less distorted. The Hamming function is

$$w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{Z - 1}\right), \quad 0 \le n \le Z - 1$$

where $Z$ is the window length.
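The framing and windowing steps can be sketched as follows. The helper names `frame_signal`, `hamming`, and `window_frames` are hypothetical, and the Hamming coefficients 0.54 and 0.46 are the commonly used values, assumed here rather than taken from the paper.

```python
import math


def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples, hop samples apart."""
    # Incomplete trailing frames are dropped; a too-short signal yields no frames.
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    return [signal[i * hop:i * hop + frame_len] for i in range(n_frames)]


def hamming(z):
    """Hamming window of length z (standard 0.54 / 0.46 coefficients)."""
    if z == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (z - 1)) for n in range(z)]


def window_frames(frames):
    """Multiply every frame by a Hamming window of matching length."""
    if not frames:
        return []
    w = hamming(len(frames[0]))
    return [[x * wn for x, wn in zip(frame, w)] for frame in frames]


# 25 ms frames with a 10 ms shift at 44,100 Hz, using integer arithmetic.
SAMPLE_RATE = 44100
FRAME_LEN = SAMPLE_RATE * 25 // 1000   # 1102 samples
HOP = SAMPLE_RATE * 10 // 1000         # 441 samples

two_seconds = [0.0] * (2 * SAMPLE_RATE)
frames = window_frames(frame_signal(two_seconds, FRAME_LEN, HOP))
```

The frame length and shift are passed in as sample counts, so the same helpers work for any overlap setting.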

Feature Extraction of Sound Signals
The MFCC is a cepstral parameter derived in the Mel-scale frequency domain. It reflects the nonlinear way in which the human ear perceives frequency [22]. The relationship between the Mel frequency and the frequency in Hz can be approximated as

$$B = 2595 \log_{10}\left(1 + \frac{h}{700}\right)$$

where $B$ is the Mel frequency and $h$ is the frequency. The first-order and second-order differences of the MFCC coefficients reflect the variation between adjacent frames, so this paper uses the MFCC coefficients combined with these differences as the feature vector of the sound signal. Because it is difficult to observe the signal's characteristics in the time domain, the signal is usually transformed into an energy distribution in the frequency domain by a fast Fourier transform (FFT). The FFT is applied to each preprocessed frame, and the calculation formula is

$$X(k) = \sum_{n=0}^{P-1} S(n)\, e^{-j 2\pi k n / P}, \quad 0 \le k \le P - 1$$

where $S(n)$ is the input sound signal and $P$ is the number of Fourier conversion points.
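The Mel-scale relationship can be sketched in a few lines of Python. The constants 2595 and 700 are the widely used approximation and are assumed here; the helper `mel_center_frequencies` and the choice of 26 filters in the usage line are hypothetical and only illustrate how filter centers end up evenly spaced on the Mel axis.

```python
import math


def hz_to_mel(h):
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + h / 700.0)


def mel_to_hz(b):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (b / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies in Hz of n_filters filters evenly spaced in Mel."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * i) for i in range(1, n_filters + 1)]


# Hypothetical bank of 26 filters covering 0 Hz to the 22,050 Hz Nyquist limit.
centers = mel_center_frequencies(26, 0.0, 22050.0)
```

Because the mapping is logarithmic, the resulting centers are dense at low frequencies and sparse at high frequencies.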
The number of FFT points $P$ is taken as 512. After the FFT of each framed signal, Mel filtering is applied by a filter bank composed of multiple triangular bandpass filters. Let the number of filters be $p$; Mel filtering of the sound signal then yields $p$ parameters $m_i$ ($i = 1, 2, ..., p$). In their calculation, $H_i(q)$ is the parameter of the ith filter, which is defined in terms of $f(c)$, the center frequency of the triangular filter. The logarithm of each $m_i$ is then taken and a discrete cosine transform is applied, which gives $c(i)$, the MFCC feature of the frame signal. The MFCC is combined with its first-order and second-order differences to form the feature vector of the frame signal.
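For reference, the filter bank and cepstrum steps are commonly written as below. This is the textbook form, not a transcription of the paper's own equations: the notation $f(i-1)$, $f(i)$, $f(i+1)$ for the lower edge, center, and upper edge of the ith filter (expressed as FFT bin indices), the use of the power spectrum $|X(q)|^2$, the cepstral index $r$, the frame index $t$, and the simple central differences in the last line are all assumptions made for illustration.

```latex
\begin{align}
m_i &= \sum_{q=0}^{P/2} |X(q)|^2 \, H_i(q), \qquad i = 1, 2, \ldots, p \\
H_i(q) &=
\begin{cases}
\dfrac{q - f(i-1)}{f(i) - f(i-1)}, & f(i-1) \le q \le f(i) \\[2ex]
\dfrac{f(i+1) - q}{f(i+1) - f(i)}, & f(i) \le q \le f(i+1) \\[2ex]
0, & \text{otherwise}
\end{cases} \\
c(r) &= \sum_{i=1}^{p} \ln(m_i) \cos\!\left(\frac{\pi r (i - 0.5)}{p}\right), \qquad r = 1, 2, \ldots \\
\Delta c_t(r) &= c_{t+1}(r) - c_{t-1}(r), \qquad
\Delta^2 c_t(r) = \Delta c_{t+1}(r) - \Delta c_{t-1}(r)
\end{align}
```

Stacking $c$, $\Delta c$, and $\Delta^2 c$ for each frame gives a feature vector three times the length of the static MFCC vector.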

Long Short-Term Memory
Long short-term memory (LSTM) is a special type of recurrent neural network (RNN). LSTM adds gating units, which allow it to remember information through the cell state. The forget gate prevents too many memories from affecting the network's processing of the current input: each time a new input arrives, the LSTM first selects which previous memories to erase, based on the current input and the output of the previous moment. The memory gate is a control unit that determines whether the data at the current moment t is included in the state. It filters out invalid data from the current input and extracts the valid data. The output gate is the neural layer that the LSTM unit uses to determine the current output value. It first combines the current input with the output of the previous moment through the sigmoid function to select the information to output, and then uses the tanh function to map the current cell state to the interval (−1, 1). By introducing the sigmoid function through its three gates and combining it with the tanh function, LSTM adds summation steps, reduces the possibility of vanishing and exploding gradients, and handles both short-term and long-term dependence [23][24][25][26]. The structure of the LSTM unit is shown in Figure 8, and its calculation is given by Equations (14)-(19), where $x_t$ is the network input matrix and $\sigma$ is the activation function. $V_{t-1}$ is the old cell state, which is updated to the new cell state $V_t$ by Equation (16). tanh is the hyperbolic tangent activation function.
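The gate description above corresponds to the standard LSTM update, which can be written as follows. This is a sketch of the conventional formulation using the paper's symbols for the cell state $V$ and the bias subscripts $g$, $i$, $V$, $o$; the weight names $W_g$, $W_i$, $W_V$, $W_o$, the candidate state $\tilde{V}_t$, the hidden output $h_t$, and the ordering of the equations are assumptions and may not match the paper's Equations (14)-(19) one to one.

```latex
\begin{align}
g_t &= \sigma\left(W_g [h_{t-1}, x_t] + b_g\right) && \text{forget gate} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{memory (input) gate} \\
\tilde{V}_t &= \tanh\left(W_V [h_{t-1}, x_t] + b_V\right) && \text{candidate cell state} \\
V_t &= g_t \odot V_{t-1} + i_t \odot \tilde{V}_t && \text{cell state update} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(V_t) && \text{hidden output}
\end{align}
```

The additive form of the cell state update is what lets gradients flow across many time steps without vanishing as quickly as in a plain RNN.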

Convolutional Neural Networks
CNN is one of the most widely used neural networks for image recognition, pattern recognition, feature extraction, and natural language processing. A CNN is a feedforward neural network with a deep structure and convolution operations, and its network structure consists of the convolutional layer, pooling layer, fully connected layer, and softmax layer [24]. The functions of these layers are as follows.
The convolutional layer is the heart of the convolutional neural network. It uses convolution kernel matrices to abstract the implicit correlations in the input data and extract features. Each layer's convolution operation is followed by a rectified linear unit (ReLU) activation function [27]:

$$f(x) = \max(0, x)$$

After the activation function is applied, the filter generates the feature

$$y_j^l = f\left(\sum_i y_i^{l-1} * w_{ij}^l + b_j^l\right)$$

where $y_j^l$ is the jth output map of convolutional layer $l$, $f$ is the nonlinear function, the operator $*$ denotes convolution, $w_{ij}^l$ is the lth-layer convolution kernel between the ith input map and the jth output map, and $b_j^l$ is the bias.
The convolutional layer extracts a large number of features from the input data, which makes subsequent feature operations relatively inefficient; the pooling layer addresses this problem. The pooling layer screens the features within each receptive field and extracts the most representative feature of the region. This effectively reduces the dimension of the output features and the number of model parameters required. Pooling is divided into average pooling and maximum pooling. Average pooling keeps more background information about the object and reduces the excess variance in the estimated value caused by the limited neighborhood size. Maximum pooling, on the other hand, keeps more texture information about the object while reducing the shift in the estimated mean caused by convolutional layer parameter error. This article uses voiceprint information for transformer condition monitoring, so maximum pooling is used.
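The difference between the two pooling modes can be seen on a small feature map. The sketch below uses plain Python lists; the function names and the 4 x 4 example values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def _pool_2d(feature_map, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window.

    Trailing rows or columns that do not fill a whole window are dropped.
    """
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            reduce_fn([
                feature_map[r * size + dr][c * size + dc]
                for dr in range(size)
                for dc in range(size)
            ])
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def max_pool_2d(feature_map, size=2):
    """Keep the largest value of each window."""
    return _pool_2d(feature_map, size, max)


def avg_pool_2d(feature_map, size=2):
    """Keep the mean value of each window."""
    return _pool_2d(feature_map, size, lambda window: sum(window) / len(window))


feature_map = [
    [1, -2, 3, 0],
    [4, 5, -6, 7],
    [0, 1, 2, 3],
    [-1, -2, -3, -4],
]
pooled_max = max_pool_2d(feature_map)
pooled_avg = avg_pool_2d(feature_map)
```

With a pool size of 2, both modes halve each spatial dimension; maximum pooling keeps the strongest response in each window, whereas average pooling blends it with its neighbors.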
The model's final layer is the fully connected layer, which connects each of its neurons to the neurons of the preceding layer and applies weights and biases to the features to produce the output feature information.

Attention Mechanism
The core of the attention mechanism is the ability to select the significant information from a large amount of information: it captures the information that is useful for the current task, highlights the features that strongly affect the result, and reduces the influence of useless features, so that the model makes the optimal choice and its accuracy improves. In essence, it learns a weight distribution over the input features and then applies this distribution to the original features, so that the task focuses mainly on a few key features, ignores irrelevant ones, and becomes more efficient [28]. The structure of the attention mechanism is shown in Figure 9 [29][30][31]. In Figure 9, $x_1, x_2, ..., x_i$ are the input feature values, $h_1, h_2, ..., h_i$ are the hidden layer state values corresponding to the input features, and $a_t$ is the attention weight of the hidden layer state of a historical input with respect to the current input. $h_t$ is the hidden layer state value output by the final node. In the attention calculation, $w$ and $b$ are the weight parameters and biases, $e_i$ is the attention probability distribution value determined by the input vector $h_i$ at moment $i$, and $s_i$ is the feature of the final output.
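One common way to realize such a weighting is additive attention followed by a softmax, sketched below. The scoring function e_i = tanh(w · h_i + b), the softmax normalization, and the weighted sum used for the output are assumptions for illustration and may differ in detail from the formulation used in the paper; the function name `attention` is hypothetical.

```python
import math


def attention(hidden_states, w, b):
    """Return (weights, context) for a list of hidden-state vectors.

    hidden_states: list of equal-length vectors h_i
    w: weight vector with the same length as each h_i (assumed form)
    b: scalar bias
    """
    # Score each hidden state: e_i = tanh(w . h_i + b).
    scores = [
        math.tanh(sum(wj * hj for wj, hj in zip(w, h)) + b)
        for h in hidden_states
    ]
    # Softmax turns the scores into weights a_i that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    context = [
        sum(a * h[d] for a, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


states = [[1.0, 0.0], [0.0, 1.0]]
uniform_weights, uniform_context = attention(states, w=[0.0, 0.0], b=0.0)
skewed_weights, _ = attention(states, w=[2.0, 0.0], b=0.0)
```

With zero weights every hidden state scores the same and the output is a plain average; once `w` favors the first component, the first hidden state receives the larger share.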

CNN-LSTM Hybrid Model Based on Attention Mechanism
The feature vector formed from the voiceprint signals after feature extraction cannot by itself reflect the potential relationships between features, so the CNN is used to mine these relationships, extract the patterns between continuous and discontinuous data, and form vectors, which are then passed into the LSTM layer in chronological order to capture long-term components. However, the CNN-LSTM model may lose information when the input time series is too long. In addition, the CNN-LSTM model considers only the selection of input features and does not take into account the impact of individual features on the results. The attention mechanism is therefore used in this paper to assign different weights to the model's input features, enhancing the features that have a greater impact on the results and suppressing those that have a small impact.
As can be seen in Figure 10, the CNN-LSTM hybrid model based on the attention mechanism consists mainly of the input layer, the CNN layer, the LSTM layer, the attention layer, and the output layer. The hybrid model structure and flow are as follows:

1. Input layer: The MFCC features of the sound samples obtained by feature extraction are passed into the model through the input layer. If the input length is t, the input vector can be written as $X = [x_1, x_2, ..., x_t]$.
2. CNN layer: The CNN layer mainly comprises the convolutional layer and the pooling layer. It further extracts features from the feature vectors supplied by the input layer and screens out the important feature vectors, which are passed to the LSTM layer. In keeping with the data structure of the voiceprint samples, this paper uses two-dimensional convolution with a convolution kernel size of 9 and the ReLU activation function. In order to retain more features, maximum pooling is used with a pool size of 2. After the CNN layer processes the input vector, the fully connected layer transforms the result into a new feature vector (26). In the calculation of the CNN layer's output, $C$ is the convolutional layer's output, $W_C$ and $b_C$ are the weights and biases of the convolutional layer, respectively, ⊗ is the convolution operator, $P$ is the pooling layer's output, max denotes maximum pooling, and $b_P$ is the bias of the pooling layer. The activation function of the fully connected layer is $f$, and its weights and biases are $W_H$ and $b_H$.
3. LSTM layer: To learn how the data features are related over time, the CNN layer passes the extracted feature vectors on to the LSTM layer. In this paper, a bidirectional LSTM structure is adopted, and the number of hidden units in each layer is 120. The activation function is the ReLU function.
4. Attention layer: Following the weight distribution principle, the output vector of the LSTM layer is fed into the attention layer, which assigns different weights to the different feature parameters to form the optimal weight parameter matrix. The output of this layer is $S = [s_1, s_2, ..., s_k]^T$.
5. Output layer: The output of the attention layer is passed to the output layer, which produces the transformer state through the fully connected layer. The output is $Y$, calculated with $W_Y$ and $b_Y$, the weights and biases of the output layer.
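The symbol definitions given for the layers suggest calculations of the following form. This is a sketch assembled from those definitions, not a transcription of the paper's equations: the name $H$ for the fully connected output and the output activation $g$ (for example, a softmax over the transformer states) are assumptions introduced here for illustration.

```latex
\begin{align}
C &= \mathrm{ReLU}\left(X \otimes W_C + b_C\right) \\
P &= \max(C) + b_P \\
H &= f\left(W_H P + b_H\right) \\
Y &= g\left(W_Y S + b_Y\right)
\end{align}
```

Read top to bottom, the equations follow the data flow of the model: convolution, pooling, and the fully connected transform inside the CNN layer, then (after the LSTM and attention layers produce $S$) the final mapping to the state output $Y$.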

Real-Time Transformer Condition Monitoring Process
In this paper, the Attention-CNN-LSTM hybrid model is used to diagnose the normal operation, discharge, and overload states of a transformer running in real time. The overall diagnostic flow chart is shown in Figure 11, and the specific steps for detection are as follows.

1. The sound of the transformer in operation is collected in real time through the microphone and converted into data.
2. The data collected by the microphone are preprocessed, and MFCC features are extracted to form a feature vector.
3. The feature vectors are input into the trained Attention-CNN-LSTM model for discrimination.
4. If the discrimination result is normal, monitoring continues. If the discrimination result is an abnormal state (discharge or overload), the abnormal information and the time of occurrence are pushed, and monitoring continues.
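The four steps above can be sketched as a simple loop. The function `monitor` and its callback parameters (`extract_features`, `classify`, `notify`) are hypothetical placeholders for the acquisition, MFCC extraction, trained model, and alerting components; the label encoding follows the state numbering used for Figure 13.

```python
from datetime import datetime, timezone

# Label encoding following the state numbering used for Figure 13.
STATES = {0: "ambient noise", 1: "normal", 2: "partial discharge", 3: "overload"}
ABNORMAL = {2, 3}


def monitor(segments, extract_features, classify, notify):
    """Classify each recorded segment and push an alert for abnormal states.

    segments: iterable of raw sound segments (step 1, already acquired)
    extract_features: callable turning a segment into a feature vector (step 2)
    classify: callable returning a state label for a feature vector (step 3)
    notify: callable receiving (state_name, timestamp) for each alert (step 4)
    Returns the number of alerts that were pushed.
    """
    alerts = 0
    for segment in segments:
        features = extract_features(segment)
        label = classify(features)
        if label in ABNORMAL:
            notify(STATES[label], datetime.now(timezone.utc))
            alerts += 1
        # In every case the loop moves on to the next segment,
        # which corresponds to "continue monitoring".
    return alerts


# Stand-in components for demonstration only.
log = []
fake_labels = {0.1: 1, 0.9: 2, 0.5: 3}
n_alerts = monitor(
    segments=[[0.1], [0.9], [0.5]],
    extract_features=lambda segment: segment,
    classify=lambda features: fake_labels[features[0]],
    notify=lambda state, when: log.append(state),
)
```

Passing the components in as callables keeps the loop independent of any particular model or audio library.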

Analysis of Experimental Results
To evaluate the superiority and accuracy of the Attention-CNN-LSTM model, it is compared with three other prevalent detection and classification models: CNN, LSTM, and CNN-LSTM. The results are further analyzed with the confusion matrix, which shows intuitively how well each of the four models detects the normal, discharge, and overload states.

Sample Settings
The sound samples are divided into 2-s units. The samples of the three states recorded in a quiet environment and those recorded under loud ambient sound are shuffled and then fed into the model for training in turn. The data are randomly divided into a training set and a test set at a ratio of 8:2, and the number of samples in each environment is shown in Table 1. The confusion matrix M, precision (P), recall (R), and F1-score (F1) are used to assess the model's detection performance. Precision is the proportion of the samples predicted as a given label that actually carry that label. Recall is the proportion of the samples that actually carry a given label that are predicted as that label. F1 is the harmonic mean of precision and recall. The evaluation formulas are as follows [32,33]:

$$P = \frac{y_{TP}}{y_{TP} + y_{FP}}$$

$$R = \frac{y_{TP}}{y_{TP} + y_{FN}}$$

$$F1 = \frac{2PR}{P + R}$$

where $y_{TP}$ is the number of data points in the actual abnormal state that are detected as abnormal, $y_{FN}$ is the number of data points in the actual abnormal state that are detected as normal, $y_{FP}$ is the number of data points in actual normal operation that are detected as abnormal, and $y_{TN}$ is the number of data points in actual normal operation that are detected as normal.
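These metrics can be computed per class directly from a confusion matrix, as in the sketch below. The function names are hypothetical, and the confusion matrix in the usage example contains illustrative counts only; it is not the data behind Table 2.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def per_class_scores(confusion):
    """Per-class (P, R, F1), where confusion[i][j] counts true class i predicted as class j."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                         # class c predicted as something else
        fp = sum(confusion[r][c] for r in range(n)) - tp    # other classes predicted as c
        scores.append(precision_recall_f1(tp, fp, fn))
    return scores


# Illustrative counts only: rows and columns are normal, discharge, overload,
# with a single overload sample predicted as normal.
example = [
    [25, 0, 0],
    [0, 25, 0],
    [1, 0, 24],
]
normal, discharge, overload = per_class_scores(example)
```

In this example the single error lowers the precision of the normal class and the recall of the overload class, while the discharge class is unaffected.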
P and R express the quality of the hybrid model's detection of normal and abnormal points as percentages. Higher values of P and R indicate better detection performance, and when both P and R are high, F1 is also high. F1 reaches its best value of 1 at perfect precision and recall, and its worst value is 0.
In Figure 13, (a) shows the status diagnostic results of the CNN-LSTM model, (b) those of the Attention-CNN-LSTM model, (c) those of the CNN model, and (d) those of the LSTM model. In this experiment, 25 test samples were randomly selected from the samples of the four states: 0 represents the ambient noise state when the transformer is not started, 1 represents the normal operation of the transformer, 2 represents the partial discharge state of the transformer, and 3 represents the overload state of the transformer. The experimental results show that the CNN-LSTM model identifies one sample in the normal operation state as overload and one sample in the overload state as normal operation. The Attention-CNN-LSTM model misidentifies only one overload sample as normal operation. The CNN model identifies two samples in normal operation as overload, two overload samples as normal operation, and one partial discharge sample as normal operation. The LSTM model identifies three samples in normal operation as overload, one partial discharge sample as normal operation, and two overload samples as normal operation.
Table 2 demonstrates that, in terms of precision, recall, and F1 for detecting the three states of the transformer, the Attention-CNN-LSTM hybrid model performs best, with an accuracy that can exceed 99%, followed by the CNN-LSTM hybrid model. Among the four models, the CNN model and the LSTM model perform worst. Because voiceprint features are continuous in both time and space, a single model has limitations in voiceprint detection, which leads to unsatisfactory detection results.
In summary, the Attention-CNN-LSTM hybrid model has the highest detection performance. It can support decision-making in the real-time monitoring of transformers and provides a reference for reducing the losses caused by transformer faults and for the anomaly detection of other electrical equipment.

Discussion and Conclusions
In the era of big data, production and daily life generate a significant amount of data. Neural networks and artificial intelligence have been used in equipment monitoring to ensure work efficiency, effectively reducing the need for human resources and increasing the accuracy of equipment diagnosis [34]. Sound, as one of the most critical characteristics of equipment operation, contains a great deal of information about the operating state of the equipment. By combining transformer sound collected in a real scene with deep learning in the voiceprint field, this paper proposes a transformer anomaly diagnosis method based on the Attention-CNN-LSTM hybrid model. The feature vectors of the sound samples are fed into the Attention-CNN-LSTM hybrid model for feature learning and training, and high accuracy is achieved. Therefore, combining sound and deep learning to monitor equipment operating status may become a future research direction in the field of voiceprint recognition.

Figure 1. Normal state time domain waveform plot.

Figure 2. Discharge state time domain waveform plot.

Figure 3. Overload state time domain waveform plot.
In the spectrogram calculation, $x_n(m)$ is the nth frame of the sound signal obtained after framing and windowing, $W(m)$ is the window function, $X_n(e^{jw})$ is the short-time Fourier transform of the framed signal, $w$ is the angular frequency, $P$ is the number of Fourier conversion points, $|X_n(k)|$ is the short-time amplitude spectrum estimate of $x_n(m)$, and $T(n, k)$ is the spectral energy density function at time $n$. $T(n, k)$ is a nonnegative real matrix, with time $n$ as the abscissa and $k$ as the ordinate. Drawn as a heat map with a color mapping, this matrix gives the color spectrogram. The ordinate of the spectrogram represents the frequency in Hz, and the abscissa represents the time in s.
Figure 7 depicts the process flow of feature extraction.

Figure 7. The process of feature extraction.
In the LSTM equations, the weight matrices are the parameters of the network model, and $(b_g, b_i, b_V, b_o)$ is the offset vector of the network. The model updates the weights and biases by minimizing the objective function.

Figure 11. Overall flow chart of real-time transformer monitoring.

Detection Performance Analysis
After the samples and the evaluation indicators are set, the model is trained. The number of iterations is set to 800, the batch size to 64, the loss function to MSE, and the optimizer to Adam. The training samples are then fed into the model for training and the test samples for testing. The training results of the four models are shown in Figure 12.
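To make the loss function concrete, the sketch below computes the mean squared error between predicted class probabilities and one-hot targets, and records the stated training settings in a plain dictionary. The dictionary keys, the function names, and the assumption that labels are one-hot encoded are illustrative choices, not details given in the paper.

```python
# Training settings as stated in the text; the key names are hypothetical.
TRAINING_CONFIG = {
    "iterations": 800,
    "batch_size": 64,
    "loss": "mse",
    "optimizer": "adam",
}


def one_hot(label, n_classes):
    """Vector of zeros with a one at position label."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def mse_loss(predictions, labels, n_classes):
    """Mean squared error between predicted probabilities and one-hot targets.

    predictions: list of probability vectors, one per sample
    labels: list of integer class labels, one per sample
    """
    total = 0.0
    for probs, label in zip(predictions, labels):
        target = one_hot(label, n_classes)
        # Average the squared error over the classes of one sample.
        total += sum((p - t) ** 2 for p, t in zip(probs, target)) / n_classes
    # Average over the samples in the batch.
    return total / len(predictions)


perfect = mse_loss([[1.0, 0.0, 0.0]], [0], n_classes=3)
wrong = mse_loss([[0.0, 1.0, 0.0]], [0], n_classes=3)
```

A confident correct prediction gives a loss of zero, while a confident wrong prediction is penalized on both the missed class and the wrongly chosen one.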

Figure 12. Four model training results.
In Figure 12, (a) shows the training results of the CNN-LSTM model, (b) those of the Attention-CNN-LSTM model, (c) those of the CNN model, and (d) those of the LSTM model. It can be seen that the accuracy of all four models on the training set (train_acc) can reach 100%, but their accuracy on the test set (val_acc) differs considerably. The Attention-CNN-LSTM hybrid model reaches 99.7% accuracy on the test set, the CNN-LSTM hybrid model reaches 97%, and the CNN model and the LSTM model reach 90% and 92%, respectively.

Figure 13. Four model status diagnostic results.
The above experimental results show that the Attention-CNN-LSTM model has a smaller loss on the test set and the highest accuracy on the random test samples, so the training effect and accuracy of the Attention-CNN-LSTM hybrid model are the best among the four models. The evaluation parameters for the detection performance of the four models are shown in Table 2.

Table 1. Number of sound samples for each condition of 110-kV transformer.

Table 2. Parameters of performance evaluation.