Research on Unstructured Text Data Mining and Fault Classification Based on RNN-LSTM with Malfunction Inspection Report

Abstract: This paper addresses the condition-based maintenance (CBM) of power transformers, the analysis of which relies on two basic data groups: structured (e.g., numeric and categorical) and unstructured (e.g., natural language text narratives), with the latter accounting for roughly 80% of the required data. Unstructured data in the form of malfunction inspection reports, recorded during the operation and maintenance of the power grid, constitute an abundant but largely untapped source of insight into the power system. This paper proposes a method for processing malfunction inspection reports with deep learning, which combines text-data-mining-oriented recurrent neural networks (RNN) with long short-term memory (LSTM). The effectiveness of the RNN-LSTM network for modeling inspection data is established with a straightforward training strategy in which targets are replicated at each sequence step. Fault labels are then assigned in the datasets so that the accuracy of fault classification can be calculated by comparing the output samples with the original data labels. The experimental results show how key parameters may be selected in the configuration of the key variables to achieve optimal results. The accuracy of the fault recognition demonstrates that the proposed method provides a more effective way for grid inspection personnel to deal with unstructured data.


Introduction
With the increasing requirements on the economy and safety of the power grid, online condition-based maintenance (CBM) of power transformers without power outages is an inevitable trend in equipment maintenance [1,2]. Transformer CBM analysis relies on two basic data groups: structured (e.g., numeric and categorical) and unstructured (e.g., natural language text narratives). Using structured data analysis, researchers have proposed a variety of transformer fault diagnosis algorithms, such as the Bayesian method [3-5], the evidence reasoning method [6], the grey target theory method [7], the support vector machine (SVM) method [8-10], the artificial neural network method [11,12], and the extension theory method [13]. These algorithms have achieved good results in engineering practice. However, for the unstructured data found in practice [14-17] (i.e., the malfunction inspection report), traditional manual annotation and classification of massive original documents is not only time-consuming but also unable to achieve the desired results, and has therefore been unable to keep pace with the grid's growing information-processing needs. Hence, compared with structured data processing, the ability to effectively identify the massive unstructured text in malfunction inspection reports is of greater practical value for grid inspection personnel.
Deep learning has achieved great success in speech recognition, natural language processing (NLP), machine vision, multimedia, and other fields in recent years. The most famous example is the face recognition test set Labeled Faces in the Wild [18], on which the final recognition rate of a non-deep-learning algorithm was 96.33% [19], whereas deep learning could reach 99.47% [20].
Deep learning also consistently exhibits an advantage in NLP. After Mikolov et al. [21] presented language modeling using recurrent neural networks (RNNs) in 2010, they proposed two novel model architectures (Continuous Bag-of-Words and Skip-gram) for computing continuous vector representations of words from very large data sets in 2013 [22]. However, these models are not designed to capture fine-grained sentence structure. When the back-propagation algorithm is used to learn the model parameters, the RNN must be unrolled into a parameter-sharing multi-layer feed-forward neural network whose number of layers corresponds to the length of the historical information. Many layers not only make training very slow; the most critical problems are vanishing and exploding gradients [23-25].
Long short-term memory (LSTM) networks were developed in [26] to address the difficulty of capturing long-term memory in RNNs. They have been successfully applied to speech recognition, achieving state-of-the-art performance [27,28]. In text analysis, the LSTM-RNN treats a sentence as a sequence of words with internal structure, i.e., word dependencies. Tai et al. [29] introduced the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences and sentiment classification. Li et al. [30] explored an important step toward this generation task: training an LSTM auto-encoder to preserve and reconstruct multi-sentence paragraphs. They introduced an LSTM model that hierarchically builds a paragraph embedding from those of sentences and words, and then decodes this embedding to reconstruct the original paragraph. Chanen [31] showed how to use ensembles of word2vec (word-to-vector) models to automatically find semantically similar terms within safety report corpora, and how to use a combination of human expertise and these ensemble models to identify sets of similar terms with greater recall than either method alone. In that paper, an unsupervised method was presented for comparing several word2vec models trained on the same data in order to estimate reasonable ranges of vector sizes for individual word2vec models, based on measuring inter-model agreement on common word2vec similar terms [31]. Palangi et al. [32] developed a model that addresses sentence embedding, a hot topic in current NLP research, using an RNN with LSTM cells.
In the above papers, the RNN-LSTM is continuously improved and developed as deep learning is applied to NLP. At present, RNN-LSTM has many applications in NLP, but it has not been applied to unstructured data in the grid. Unstructured data accounts for about 80% of the data held by power grid enterprises and contains important information about the operation and management of the grid. For the malfunction inspection report, an unstructured data processing method is urgently needed to extract the effective information.
The primary objective of this paper is to provide insight into how to apply the principles of deep learning via NLP to unstructured data analysis in grids based on RNN-LSTM. The remaining parts of the paper are organized as follows. In Sections 2 and 3, the text-data-mining-oriented RNN and the LSTM model are described. In Section 4, the malfunction inspection report analysis method based on RNN-LSTM is proposed. Experimental results are provided to demonstrate the proposed method in Section 5. Conclusions are drawn in Section 6.

Text Model Representation
In the vector representation of voice and text in the mathematical model, each word is represented as a vector, and each dimension of the vector corresponds to a single word in the vocabulary. If the word appears in the text, the corresponding dimension is set to 1; otherwise, it is set to 0. The dimension of each vector is therefore equal to the size of the vocabulary.
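The one-hot scheme described above can be sketched as follows (a minimal illustration; the vocabulary and sample words are invented for this sketch, not taken from the paper's corpus):

```python
# Minimal sketch of the one-hot text representation described above.
# The vocabulary and sample words are invented for illustration.

def one_hot_encode(words, vocabulary):
    """Return one 0/1 vector per word; vector dimension = vocabulary size."""
    index = {w: i for i, w in enumerate(vocabulary)}
    vectors = []
    for w in words:
        v = [0] * len(vocabulary)
        if w in index:        # the word appears in the vocabulary -> set its dimension to 1
            v[index[w]] = 1
        vectors.append(v)
    return vectors

vocabulary = ["transformer", "fault", "trip", "overload", "normal"]
vecs = one_hot_encode(["transformer", "fault"], vocabulary)
# vecs[0] -> [1, 0, 0, 0, 0]; vecs[1] -> [0, 1, 0, 0, 0]
```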

Recurrent Neural Networks
In the past, voice and text processing usually combined a neural network with a hidden Markov model. Taking advantage of improved algorithms and computer hardware, acoustic models built on deep forward-propagation networks have made considerable progress in recent years. Like sound, text processing is an internally dynamic process, and an RNN can be used as one of its candidate models. Dynamic means that the currently processed text vector is associated with the context of the content: the current sample cannot be analyzed independently, but must be combined with memory units holding the preceding and following text information for a comprehensive analysis of the semantics. This approach provides a larger state space and richer model dynamics.
In a neural network, each neuron is a processing unit that takes the outputs of the nodes connected to it as its input. Before its output is issued, each neuron first applies a nonlinear activation function. It is precisely because of this activation function that neural networks have the ability to model nonlinear relationships. However, the general neural model cannot explicitly model temporal relationships. All data points are assumed to be fixed-length vectors. When there is a strong correlation among the input vectors, the model's processing effectiveness is greatly reduced. Therefore, we introduce the recurrent neural network, which is given the ability to model time explicitly: by adding feedback connections between hidden layers across time points, the hidden state at each time step feeds not only into the output, but also into the hidden layer at the next time step.
The traditional neural network has no recurrent process in its middle layer. Given the inputs x_0, x_1, x_2, ..., x_t, after processing by the neurons there are corresponding outputs h_0, h_1, h_2, ..., h_t.
In each training step, no information needs to be transferred between the neurons. The difference between RNNs and traditional neural networks is that in every training step of an RNN, the neurons do need to transfer information: each neuron uses the state information produced by the previous neuron, similar to a recursive function. The basic structure of the RNN is shown in Figure 1, and its unrolled form is shown in Figure 2, where A is the hidden layer, x_i is the input vector, and h_i is the output of the hidden layer. As can be seen from Figure 2, the output of each hidden layer is fed as an input vector to the next hidden layer. In this way, the influence of the preceding text on the following text can be taken into account for the next segment of unstructured data. The algorithm of the model is analyzed in Appendix A.
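The recurrence just described, in which each hidden state depends on both the current input and the previous hidden state, can be sketched as follows (a toy sketch with invented dimensions and weights; the paper's actual formulation is the one given in Appendix A):

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One RNN time step: h_t = tanh(W_xh * x_t + W_hh * h_{t-1} + b_h)."""
    n = len(b_h)
    h = []
    for i in range(n):
        s = b_h[i]
        s += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        s += sum(W_hh[i][j] * h_prev[j] for j in range(n))
        h.append(math.tanh(s))
    return h

# Invented toy dimensions and weights: 3-dim input vectors, 2-dim hidden state.
W_xh = [[0.1, 0.0, -0.2], [0.05, 0.1, 0.0]]
W_hh = [[0.3, 0.0], [0.0, 0.3]]
b_h = [0.0, 0.0]

h = [0.0, 0.0]                        # initial hidden state
for x in ([1, 0, 0], [0, 1, 0]):      # process the input sequence x_0, x_1
    h = rnn_step(x, h, W_xh, W_hh, b_h)   # h is carried forward between steps
```

The key point is that `h` is carried forward between steps, which is exactly the hidden-layer feedback connection shown in Figure 2.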


Long Short-Term Memory Model
Although the RNN performs the transformation from a sentence to a vector in a principled manner, it is generally difficult for it to learn long-term dependencies within the sequence due to the vanishing gradient problem. The RNN has two limitations: first, text analysis is in fact associated with the surrounding context, while the RNN connects only to the preceding text, not the following text; second, the RNN has difficulty learning correlations over long time spans. A bidirectional LSTM (BLSTM) network can be used for the first problem, while the LSTM model can be used for the second. The repeating module of the RNN is shown in Figure 3; it contains only one neuron. The LSTM model is an improvement of the traditional RNN model: on top of the RNN model, a cellular control mechanism is added to solve the long-term dependence problem of the RNN and the gradient explosion problem caused by long sequences. The model enables the RNN to memorize long-term information by designing a cell with a special structure. In addition, through the design of three kinds of "gate" structures, the forget gate layer, the input gate layer, and the output gate layer, information can be selectively added or removed as it passes through the cell. These three "gates" act on the cell to form the hidden layer of the LSTM, also known as the block. The LSTM repeating module is shown in Figure 4; it contains four neurons.


Core Neuron
In the LSTM, gates are used to control the transmission of information; they are usually expressed by a sigmoid function.
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions, so it is very easy for information to flow along it unchanged. The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. The state of the LSTM core neuron is shown in Figure 5.
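Putting the cell state and the three gates together, one LSTM time step can be sketched as follows (scalar dimensions and invented weights, used purely for illustration; the individual gates are described in the following subsections):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step in scalar (1-dimensional) form for readability.
    Each gate applies its weights to the pair (h_{t-1}, x_t) plus a bias."""
    f = sigmoid(w["wf"][0] * h_prev + w["wf"][1] * x + w["bf"])    # forget gate
    i = sigmoid(w["wi"][0] * h_prev + w["wi"][1] * x + w["bi"])    # input gate
    g = math.tanh(w["wg"][0] * h_prev + w["wg"][1] * x + w["bg"])  # candidate state
    o = sigmoid(w["wo"][0] * h_prev + w["wo"][1] * x + w["bo"])    # output gate
    c = f * c_prev + i * g        # cell state: the "conveyor belt", lightly modified
    h = o * math.tanh(c)          # hidden output: a gated view of the cell state
    return h, c

# Invented weights for illustration only.
w = {"wf": (0.5, 0.5), "bf": 0.0, "wi": (0.5, 0.5), "bi": 0.0,
     "wg": (0.5, 0.5), "bg": 0.0, "wo": (0.5, 0.5), "bo": 0.0}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):        # a short input sequence
    h, c = lstm_step(x, h, c, w)
```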


Forget Gate Layer
The role of the forget gate layer is to determine which of the information input from the previous layer may be discarded; it controls the historical information stored in the hidden layer nodes at the previous moment. The forget gate computes a value between 0 and 1 from the state of the hidden layer at the previous time step and the input at the current time step, and applies it to the cell state at the previous time step to determine what information is retained and what is discarded. The value "1" represents "completely keep", while "0" represents "completely discard". Through the processing of the forget gate, the output of the hidden layer cell (the historical information) can be selectively processed.
The forget gate layer is shown in Figure 6, where h_{t-1} is the output of the hidden layer at the previous time step and x_t is the current input.
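In the standard LSTM formulation, which the description above follows, the forget gate is commonly written as:

```latex
f_t = \sigma\big(W_f \cdot [h_{t-1}, x_t] + b_f\big)
```

where \sigma is the sigmoid function, W_f and b_f are the weights and bias of the forget gate, and [h_{t-1}, x_t] denotes the concatenation of the previous hidden state and the current input; the paper's own equation numbering is not reproduced here.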



Input Gate Layer
The input gate layer is used to control the input to the cell state of the hidden layer. Through a number of operations on the input information, it determines what needs to be retained to update the current cell state. First, the input gate layer is established through a sigmoid function to determine which information should be updated. The output of the input gate layer is a sigmoid value between 0 and 1, which is then applied to the input information to determine whether the corresponding value of the cell state is updated: 1 indicates that the information is allowed to pass and the corresponding value needs to be updated, while 0 indicates that it is not allowed and the corresponding value does not need to be updated. It can be seen that the input gate layer can remove some unnecessary information. Then, a tanh layer is established to produce a vector of candidate states for the neuron, and the two jointly compute the updated value. The input gate layer is shown in Figure 7.
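In the standard LSTM formulation, the sigmoid input gate and the tanh candidate layer described above are commonly written as:

```latex
i_t = \sigma\big(W_i \cdot [h_{t-1}, x_t] + b_i\big), \qquad
\tilde{C}_t = \tanh\big(W_C \cdot [h_{t-1}, x_t] + b_C\big)
```

where i_t is the input gate activation and \tilde{C}_t is the vector of candidate state values; the paper's own equation numbering is not reproduced here.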


Update the State of Neurons
It is now time to update the old cell state, C_{t-1}, into the new cell state C_t. The previous steps have already decided what to do; now we just need to actually do it. We multiply the old state by f_t, forgetting the information we decided to forget earlier, and then add i_t × C̃_t, the new candidate values scaled by how much we decided to update each state value. In the case of a language model, this is where we would actually drop the information about the old subject's gender and add the new information, as decided in the previous steps. The neurons' state update process is shown in Figure 8.
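The update just described is, in the standard LSTM formulation:

```latex
C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t
```

where f_t is the forget gate output, i_t the input gate output, and \tilde{C}_t the candidate state; the paper's own equation numbering is not reproduced here.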


Output Gate Layer
The output gate layer is used to control the output of the current hidden layer node and to determine whether it is output to the next hidden layer or to the output layer. Through this control of the output, we can determine which information needs to be output. The value of its state is "0" or "1": "1" represents a need to output, and "0" represents that no output is required. Applying this output control information to a transformed version of the current cell state yields the final output value.
The output of the neurons is determined as in Equations (6) and (7), and the output gate layer is shown in Figure 9.
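Equations (6) and (7) are not reproduced in this extract; in the standard LSTM formulation, the output gate and the neuron output are:

```latex
o_t = \sigma\big(W_o \cdot [h_{t-1}, x_t] + b_o\big), \qquad
h_t = o_t \times \tanh(C_t)
```

where o_t is the output gate activation and h_t is the output of the hidden layer at the current time step.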


Malfunction Inspection Report Analysis Method Based on RNN-LSTM
To learn a good semantic representation of the input sentence, our objective is to make the embedding vectors for sentences of similar meanings as close as possible, and to make sentences of different meanings as far apart as possible. This is challenging in practice since it is hard to collect a large amount of manually labeled data that give the semantic similarity signal between different sentences. Nevertheless, a widely used commercial web search engine is able to log massive amounts of data with some limited user feedback signals. For example, given a particular query, the click-through information about the user-clicked document among many candidates is usually recorded and can be used as a weak (binary) supervision signal to indicate the semantic similarity between two sentences (on the query side and the document side). We try to explain how to leverage such a weak supervision signal to learn a sentence embedding vector that achieves the aforementioned training objective. The objective of making sentences with similar meanings as close as possible is similar to machine translation tasks, where two sentences belong to two different languages with similar meanings, and we want to make their semantic representations as close as possible.
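The "as close as possible / as far apart as possible" objective is typically measured with cosine similarity between embedding vectors; a minimal sketch follows (the embedding values are invented for illustration, not outputs of the model in this paper):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two embedding vectors: near 1 for similar meanings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented embeddings for illustration.
query_vec = [0.9, 0.1, 0.2]
clicked_doc_vec = [0.8, 0.2, 0.25]   # weak supervision: a user-clicked document
other_doc_vec = [-0.1, 0.9, -0.3]    # an unrelated document

sim_pos = cosine_similarity(query_vec, clicked_doc_vec)
sim_neg = cosine_similarity(query_vec, other_doc_vec)
# Training would push sim_pos toward 1 and sim_neg toward lower values.
```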
In this paper, the algorithm, parameter settings, and experimental environment are introduced as follows. The process of the training method for RNN-LSTM is presented in Appendix B.

Malfunction Inspection Report Analysis Method Based on RNN-LSTM
To learn a good semantic representation of the input sentence, our objective is to make the embedding vectors for sentences of similar meanings as close as possible, and to make sentences of different meanings as far apart as possible.This is challenging in practice since it is hard to collect a large amount of manually labeled data that give the semantic similarity signal between different sentences.Nevertheless, a widely used commercial web search engine is able to log massive amounts of data with some limited user feedback signals.For example, given a particular query, the click-through information about the user-clicked document among many candidates is usually recorded and can be used as a weak (binary) supervision signal to indicate the semantic similarity between two sentences (on the query side and the document side).We try to explain how to leverage such a weak supervision signal to learn a sentence embedding vector that achieves the aforementioned training objective.The above objective to make sentences with similar meaning as close as possible is similar to machine translation tasks where two sentences belong to two different languages with similar meanings, and we want to make their semantic representation as close as possible.
In this paper, the algorithm, parameter settings, and experimental environment are introduced as follows. The process of the training method for RNN-LSTM is presented in Appendix B.

Experimental Verification Based on Malfunction Inspection Report
The full learning formula for all the model parameters was elaborated on in the previous section. The training method was described in Section 4. In this section, we take the power grid malfunction inspection report of a regional power grid in the China Southern Power Grid as the object of analysis. Through RNN-LSTM processing, we can use machine learning to classify and analyze the unstructured data in different situations.

Database
The description of the corpus is as follows. The malfunction inspection report records come from the grid personnel's inspection of the power grid equipment, lines, and protection devices during daily maintenance. The accumulated fault-by-fault statements constitute the main body of the report. The information in the malfunction inspection report is mainly composed of six main information bodies, "DeviceInfo", "TripInfo", "FaultInfo", "DigitalStatus", "DigitalEvent", and "SettingValue", together with other common information. The TripInfo information body can contain multiple optional FaultInfo bodies. The FaultInfo information body indicates the current and voltage of the action, and it can clearly reflect and display the fault condition and the operation process through the report. The content source of DeviceInfo information can be a fixed value or a configuration file. The information of FaultInfo, DigitalStatus, DigitalEvent, and SettingValue can differ according to the type of protection or the manufacturer. FaultInfo can be used as the auxiliary information of a single action message or as a fault parameter of the whole action group. The contents of each message are as follows:
(1) DeviceInfo: Description information portion of the recording apparatus.
(2) TripInfo: Records the protection action events during the failure process.
(3) FaultInfo: Records the fault current, fault voltage, fault phase, fault distance, and other information in the process of recording the fault.
(4) DigitalStatus: Records the device's self-test signal status before the fault.
(5) DigitalEvent: Records the changes of events such as the self-test signal during the process of fault protection; all the switches are sorted according to the action time, and the action time and return time are recorded at the same time.
(6) SettingValue: Records the actual value of the device settings at the fault time.
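A record with these six information bodies might be parsed as follows; this is a minimal sketch assuming a hypothetical XML-like layout (the tag and attribute names follow the body names above, but the real report format may differ):

```python
import xml.etree.ElementTree as ET

# Hypothetical example record; the real report layout may differ.
SAMPLE = """
<Record>
  <DeviceInfo model="PT-220kV" station="Substation-A"/>
  <TripInfo time="2015-06-01T10:32:05">
    <FaultInfo current="3.2kA" voltage="18.7kV" phase="B" distance="12.4km"/>
  </TripInfo>
  <DigitalStatus selfTest="OK"/>
  <DigitalEvent time="2015-06-01T10:32:05" signal="breaker_open"/>
  <SettingValue zone1="0.8"/>
</Record>
"""

def parse_record(xml_text):
    """Collect the information bodies of one inspection record by tag name."""
    root = ET.fromstring(xml_text)
    bodies = {}
    for elem in root.iter():
        if elem is root:
            continue  # skip the enclosing <Record> element itself
        bodies.setdefault(elem.tag, []).append(dict(elem.attrib))
    return bodies

record = parse_record(SAMPLE)
print(sorted(record))  # the information bodies present in this record
```

Grouping by tag name keeps multiple optional FaultInfo bodies under one key, matching the note above that a TripInfo body may contain several of them.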
According to the dynamic fault records of the power system, we divide all the faults into the following five categories and give the corresponding labels after each record: mechanical fault; electrical fault; secondary equipment fault; fault caused by external environment; fault caused by human factors.
We selected the malfunction inspection reports of the China Southern Power Grid from nearly 10 years as the data set of this paper. In the data set used, the specific types of faults, the causes of faults, and their percentages are shown in Table 1. A single sample's data size ranges between 21 kB and 523 kB. Training samples and test samples were randomly selected to ensure the generality of the model test.
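The five fault categories defined above can be attached to each record as integer labels for training; a minimal sketch (the miniature demo records are hypothetical, not from the paper's data set):

```python
# The five fault categories defined above, mapped to integer labels.
FAULT_LABELS = {
    "mechanical fault": 0,
    "electrical fault": 1,
    "secondary equipment fault": 2,
    "fault caused by external environment": 3,
    "fault caused by human factors": 4,
}

def label_records(records):
    """Attach the integer label to each (text, category) record pair."""
    return [(text, FAULT_LABELS[category]) for text, category in records]

# Hypothetical miniature dataset; real reports run 21 kB to 523 kB each.
demo = [("oil leakage at tank weld", "mechanical fault"),
        ("winding insulation breakdown", "electrical fault")]
labeled = label_records(demo)
print(labeled)
```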
In the semantic analysis, we also analyzed the data sets used. In this paper, nine categories were selected to cover the semantic relations among most pairs of entities, with no overlap between them. However, there are some very similar relationships that can cause difficulties in recognition tasks, such as Entity-Origin (EO), Entity-Destination (ED), and Content-Container (CC), which often appear in one sample at the same time. Similarly, there are Component-Whole (CW) and Member-Collection (MC). The relationship profiles and examples are as follows:
(1) Cause-Effect (CE): Those cancers were caused by radiation exposures.
(2) Instrument-Agency (IA): Phone operator.
The specific distribution of the number of samples in each category is shown in Table 2.

Result and Analysis
Based on the RNN-LSTM trained with massive single-fault samples, the test set is imported for fault-type accuracy testing. In this paper, three variables were selected. By comparing the fault recognition results under the three different variables with the fault report, we obtained the fault recognition accuracy. With the other two variables fixed, the test samples are verified using different numbers of traversals. The three variables are: the number of LSTM units, the type of activation unit, and the batch size. The batch size, the amount of data in each processing batch, is particular to deep learning training methods. Proper adjustment of the batch size not only reduces the number of weight adjustments to prevent over-fitting, but also speeds up training.
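The three-variable comparison can be organized as a one-factor-at-a-time sweep around the fixed baseline used below (128 units, sigmoid activation, batch size 20); a sketch assuming a hypothetical run_experiment training routine:

```python
# The three experimental variables compared in this section, with the
# values reported in the experiments below.
lstm_units  = [32, 64, 128, 256, 512]
activations = ["softmax", "relu", "tanh", "sigmoid"]
batch_sizes = [10, 20, 50]

def run_experiment(units, activation, batch_size):
    """Placeholder for one train/test run; would return test accuracy."""
    raise NotImplementedError  # hypothetical hook, not the paper's code

# Vary one factor at a time while holding the other two fixed, as in the
# experiments below.
baseline = {"units": 128, "activation": "sigmoid", "batch_size": 20}
trials = ([{**baseline, "units": u} for u in lstm_units]
          + [{**baseline, "activation": a} for a in activations]
          + [{**baseline, "batch_size": b} for b in batch_sizes])
print(len(trials))  # 12 configurations in total
```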

Fault Recognition Accuracy and Number of LSTM Units
This experiment keeps the activation unit and batch size constant while the number of LSTM units gradually increases, and increases the number of traversals under the same number of LSTM units. The number of LSTM training samples is 10,000, and the number of test samples is 3000; the activation unit is sigmoid; the batch size is 20. The relationship between the accuracy rate and the number of LSTM units is shown in Table 3, and its trend is shown in Figure 10. As can be observed from Table 3 and Figure 10, when the number of LSTM units remains constant, the fault recognition accuracy rises with the number of traversals. When the number of traversals is the same, the greater the number of LSTM units, the better the performance; however, a significant decline in the accuracy rate emerges when the number of LSTM units reaches 512. The reason for this decrease is that the required data volume increases; if more than 512 LSTM units are needed, the parameters must be adjusted and optimized.
To further analyze the data, receiver operating characteristic (ROC) curves were added to the results. Because performance differs with the number of LSTM units, we repeated the experiments under different epoch conditions and chose three unit counts worth analyzing: 64, 128, and 256. The area under the curve (AUC) reflects the ability of the recognition algorithm to correctly distinguish the two classes of targets: the larger the AUC, the better the performance of the algorithm. False negatives (FN), false positives (FP), true negatives (TN), and true positives (TP) are the key quantities behind the ROC curve. Specificity is defined as the true negative rate (TNR), and sensitivity as the true positive rate (TPR). In the following experiment, the threshold was set to 0.5: a test was counted as positive if the fault recognition score exceeded the threshold value. As can be seen from Table 4 and Figure 11, the performance of the proposed algorithm tends to improve within a certain interval as the number of LSTM units increases.
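The threshold-based counts and the AUC described above can be computed directly; a minimal pure-Python sketch using hypothetical recognition scores (not the paper's data):

```python
def confusion_at_threshold(scores, labels, threshold=0.5):
    """TP/FP/TN/FN counts when a score above the threshold counts as positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    return tp, fp, tn, fn

def auc(scores, labels):
    """AUC as the probability a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical recognition scores for six test reports.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
tp, fp, tn, fn = confusion_at_threshold(scores, labels)
tpr = tp / (tp + fn)   # sensitivity (TPR)
tnr = tn / (tn + fp)   # specificity (TNR)
print(tpr, tnr, auc(scores, labels))
```

The rank-statistic form of AUC avoids building the full ROC curve, which is convenient for quick comparisons across unit counts.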

Fault Recognition Accuracy and Activation Unit Type
This experiment keeps the number of LSTM units and the batch size constant while selecting four different activation units, and increases the number of traversals under the same activation unit. The number of LSTM training samples is 10,000, and the number of test samples is 3000; the number of LSTM units is 128; the batch size is 20. The relationship between the accuracy rate and the different activation units is shown in Table 5, and its trend is shown in Figure 12.

As can be observed from Table 5 and Figure 12, under the same activation unit, the fault recognition accuracy rises with the number of traversals. With the same number of traversals, the softmax and sigmoid activation units obtain better accuracy, with ReLU's performance following. The greater the number of traversals, the closer the ReLU and sigmoid performances become, while the change obtained using tanh is not obvious. Thus, in the choice of activation function, softmax and sigmoid are more suitable for text processing. We also selected the above four activation functions for ROC analysis through repeated experiments under different epoch conditions. In the following experiment, the threshold was set to 0.5: a test was counted as positive if the fault recognition score exceeded the threshold value. As can be seen from Table 6 and Figures 13 and 14, softmax and sigmoid activation performed best, confirming the above conclusions.

Accuracy of Fault Recognition and Batch Size
This experiment keeps the number of LSTM units and the activation unit constant while the single-batch data size gradually increases, and increases the number of traversals under the same batch size. The number of LSTM training samples is 10,000, and the number of test samples is 3000; the number of LSTM units is 128; the activation unit is sigmoid. The relationship between the accuracy rate and the different batch sizes is shown in Table 7, and its trend is shown in Figure 15.

As can be observed from Table 7 and Figure 15, under the same batch size, the fault recognition accuracy rises with the number of traversals. With the same number of traversals, a batch size of 20 gave higher accuracy than the other two sizes. With a batch size of 10, the accuracy rate increased with the number of traversals but failed to keep improving, an under-fitting state. With a batch size of 50, the accuracy rate decreased significantly compared to the previous two sizes, as too much data in each batch caused over-fitting.
We also selected the above three batch sizes for ROC analysis through repeated experiments under different epoch conditions. In the following experiment, the threshold was set to 0.48: a test was counted as positive if the fault recognition score exceeded the threshold value. As can be seen from Table 8 and Figure 16, a batch size of 20 performed best; however, with a batch size of 50, the overall ROC curve tended to be smoother. It should be noted that different datasets show different characteristics, so the batch size should not have a fixed range of selection. As the inspection report requires a certain word length to express the corresponding characteristics, the best value of the batch size here is 20.
In this paper, three key parameters were selected as experimental variables, and the experimental results show that the unstructured data processing method based on RNN-LSTM proposed in this paper reaches the current research level in machine learning when applied to malfunction inspection reports. This means the method can provide a more effective way for grid inspectors to deal with unstructured text.

Conclusions
How to efficiently handle large numbers of unstructured text documents such as malfunction inspection reports is a long-standing problem faced by operation engineers. This paper proposes a deep learning method for malfunction inspection report processing using the text data mining-oriented RNN-LSTM. An effective training strategy for an RNN-LSTM network that models inspection data is presented. The obtained results and an effectiveness analysis demonstrate that RNN-LSTM, especially with target replication, can successfully classify the diagnoses of labeled malfunction inspection reports given unstructured data. The experiments over different variables show how the key variables should be configured to achieve optimal results, including the selection of the maximum number of LSTM units, the processing capacity of the activation unit, and the prevention of over-fitting. In this paper, we used fault labels without timestamps, but we are obtaining timestamped diagnoses, which will enable us to train models for early fault diagnosis by predicting future conditions on a larger inspection data set.
The matrix form of Equation (A21) is $u_i^{t+1} = u_i^t + z_i^t$, i.e., $u^{t+1} = W_{hh} h^t + W_{hx} x$; $d_{\text{next}}$ is defined as the partial derivative of the error with respect to the hidden-layer input at the next moment.

Appendix B. The Process of the Training Method for RNN-LSTM
The forward pass process of the training method is shown in Table A1.
Table A1. Forward pass process.

Figure 3. The repeating module in a standard RNN contains a single layer.

Figure 4. The repeating module in a long short-term memory (LSTM) contains four interacting layers.


It is now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do, and now we just need to actually do it. We multiply the old state by $f_t$, forgetting the information we decided to forget earlier, and then add $i_t \times \tilde{C}_t$. These are the new candidate values, scaled by how much we decided to update each state value.
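The cell-state update above can be written out numerically; a minimal numpy sketch of the single step $C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$ (the gate and candidate values here are arbitrary illustrative numbers):

```python
import numpy as np

def update_cell_state(C_prev, f_t, i_t, C_tilde):
    """C_t = f_t * C_{t-1} + i_t * C~_t: forget, then add scaled candidates."""
    return f_t * C_prev + i_t * C_tilde

C_prev  = np.array([1.0, -2.0, 0.5])   # old cell state C_{t-1}
f_t     = np.array([1.0,  0.0, 0.5])   # forget gate: keep / drop / halve
i_t     = np.array([0.0,  1.0, 0.5])   # input gate: how much candidate to add
C_tilde = np.array([9.9,  3.0, 1.0])   # candidate values
print(update_cell_state(C_prev, f_t, i_t, C_tilde))  # [1.  3.  0.75]
```

Note how a forget gate of 0 erases the old value entirely, while an input gate of 0 blocks the candidate, which is exactly the selective-memory behavior the text describes.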
Input: training samples and test samples (including various types of reports and their labels). Number of truncated words: maxlen = 200. Minimum word count: min_count = 5. Model parameters: Dropout = 0.5, Dense = 1. Activation function types: softmax, ReLU, tanh, and sigmoid. Number of LSTM units: 32; 64; 128; 256; 512. Output: classification results and fault recognition accuracy of the test samples. Initialization: set all parameters of the model to small random numbers. Experimental operating environment: operating system: Windows 10; RAM: 32 GB; CPU: Xeon E5 8-core; graphics: NVIDIA 980M; program computing environment: Theano and Keras.
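The truncation and rare-word settings above (maxlen = 200, min_count = 5) can be applied in preprocessing roughly as follows; a simplified sketch with whitespace tokenization and hypothetical demo reports (the real pipeline may tokenize differently):

```python
from collections import Counter

MAXLEN, MIN_COUNT = 200, 5   # settings stated in the algorithm input above

def build_vocab(reports):
    """Keep only words occurring at least MIN_COUNT times; index 0 is
    reserved for out-of-vocabulary words and padding."""
    counts = Counter(w for text in reports for w in text.split())
    kept = sorted(w for w, c in counts.items() if c >= MIN_COUNT)
    return {w: i for i, w in enumerate(kept, start=1)}

def encode(text, vocab):
    """Truncate a report to MAXLEN words and pad with zeros to fixed length."""
    ids = [vocab.get(w, 0) for w in text.split()[:MAXLEN]]
    return ids + [0] * (MAXLEN - len(ids))

# Hypothetical miniature reports for illustration.
reports = ["breaker trip " * 6, "relay trip alarm " * 3]
vocab = build_vocab(reports)
x = encode(reports[0], vocab)
print(len(x), sorted(vocab))
```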

Figure 10. Accuracy of fault recognition under different LSTM units.


Figure 12. Accuracy of fault recognition under different activation units.

Figure 13. Comparison of ROC curves under different activation units.


Figure 15. Accuracy of fault recognition under different batch sizes.


Process 1: Forward Pass
Input units: $y =$ current external input; roll over the activations: $\hat{y} = y$; and the cell states: $\hat{s}_{c_j^v} = s_{c_j^v}$. Loop over the memory blocks, indexed $j$:
Step 1a, input gates (1): $net_{in_j} = \sum_m w_{in_j m}\,\hat{y}_m + \sum_{v=1}^{S_j} w_{in_j c_j^v}\,\hat{s}_{c_j^v}$; $y_{in_j} = f_{in_j}(net_{in_j})$.
Step 1b, forget gates (2): $net_{\varphi_j} = \sum_m w_{\varphi_j m}\,\hat{y}_m + \sum_{v=1}^{S_j} w_{\varphi_j c_j^v}\,\hat{s}_{c_j^v}$; $y_{\varphi_j} = f_{\varphi_j}(net_{\varphi_j})$.
Step 1c, cell states (3): loop over the $S_j$ cells in block $j$, indexed $v$: $net_{c_j^v} = \sum_m w_{c_j^v m}\,\hat{y}_m$; $s_{c_j^v} = y_{\varphi_j}\,\hat{s}_{c_j^v} + y_{in_j}\,g(net_{c_j^v})$.
Step 2, output gates (4): $net_{out_j} = \sum_m w_{out_j m}\,\hat{y}_m + \sum_{v=1}^{S_j} w_{out_j c_j^v}\,s_{c_j^v}$; $y_{out_j} = f_{out_j}(net_{out_j})$.
Cell outputs (5): loop over the $S_j$ cells in block $j$, indexed $v$: $y_{c_j^v} = y_{out_j}\,s_{c_j^v}$. End the loop over memory blocks.
Output units (6): $net_k = \sum_m w_{km}\,y_m$; $y_k = f_k(net_k)$.
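Process 1 can be traced numerically; a compact numpy sketch of one forward step through a single memory block with $S_j$ cells, following the gate equations above (the weights are random illustrative values, with sigmoid gates and a tanh cell input $g$ as assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_block_forward(y_hat, s_prev, W_in, W_phi, W_c, W_out):
    """One forward step of a single memory block, per Process 1.

    y_hat  : previous-step activations fed into the block, shape (m,)
    s_prev : previous cell states of the block, shape (S_j,)
    The gate weight matrices stack the input weights with the peephole
    weights from the cell states, as in Steps 1a/1b and Step 2.
    """
    x = np.concatenate([y_hat, s_prev])            # inputs + peepholes
    y_in  = sigmoid(W_in  @ x)                     # Step 1a: input gate
    y_phi = sigmoid(W_phi @ x)                     # Step 1b: forget gate
    g = np.tanh(W_c @ y_hat)                       # Step 1c: cell input
    s = y_phi * s_prev + y_in * g                  #          new cell states
    y_out = sigmoid(W_out @ np.concatenate([y_hat, s]))  # Step 2: output gate
    return y_out * s, s                            # cell outputs (5), states

rng = np.random.default_rng(0)
m, S_j = 4, 2                                      # inputs, cells per block
y_hat, s_prev = rng.normal(size=m), np.zeros(S_j)
W_in, W_phi = rng.normal(size=(1, m + S_j)), rng.normal(size=(1, m + S_j))
W_c, W_out = rng.normal(size=(S_j, m)), rng.normal(size=(1, m + S_j))
out, s = lstm_block_forward(y_hat, s_prev, W_in, W_phi, W_c, W_out)
print(out.shape, s.shape)
```

As in the source, the gate activations are shared across all $S_j$ cells of a block, and the cell output multiplies the output gate directly into the cell state.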

Table 1. Different fault type statistics in the dataset.

Table 2. Statistical distribution of relationship categories in samples.

Table 3. Accuracy of fault recognition under different LSTM units.


Table 4. Area under the curve (AUC) analysis of receiver operating characteristic (ROC) curves under different LSTM unit numbers.


Table 5. Accuracy of fault recognition under different activation units.

Table 6. AUC analysis of ROC curves under different activation units.


Table 7. Accuracy of fault recognition under different batch sizes.


Table 8. AUC analysis of ROC curves under different batch sizes.


Table 8 columns: Batch Size; AUC; Standard Error; Lower Bound (95%); Upper Bound (95%).
As known, $u_i = \sum_{j=1}^{V} W_{xh}(i,j)\,x_j$; taking the partial derivative of the error with respect to $W_{hh}(i,j)$, the corresponding matrix is obtained. By calculating the partial derivatives of the error with respect to all the parameter matrices, the parameters can be updated by gradient descent along these partial derivatives.
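For the linear map $u_i = \sum_{j=1}^{V} W_{xh}(i,j)\,x_j$, the partial derivative $\partial u_i / \partial W_{xh}(i,j)$ is simply $x_j$; a small numpy finite-difference check confirms this (the dimensions and random values are illustrative):

```python
import numpy as np

# For u = W_xh x, the derivative of u_i with respect to W_xh(i, j) is x_j.
rng = np.random.default_rng(1)
V = 3
W = rng.normal(size=(V, V))
x = rng.normal(size=V)
i, j, eps = 1, 2, 1e-6

W_plus = W.copy()
W_plus[i, j] += eps                               # perturb one weight
numeric = ((W_plus @ x)[i] - (W @ x)[i]) / eps    # finite difference
analytic = x[j]                                   # d u_i / d W_xh(i, j)
print(abs(numeric - analytic))
```

The same check pattern applies to each parameter matrix before the gradient-descent update described above.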