Two-Level Fault Diagnosis of SF6 Electrical Equipment Based on Big Data Analysis

As the operating time of sulphur hexafluoride (SF6) electrical equipment increases, discharges of varying severity may occur inside the equipment. These discharges degrade the insulation performance and can seriously damage the equipment. It is therefore of practical significance to diagnose faults and assess the state of SF6 electrical equipment. In recent years, monitoring data for SF6 electrical equipment have been acquired more frequently and over a wider scope, so massive amounts of data have accumulated in substation databases. To process these condition monitoring data quickly, we built a two-level fault diagnosis model for SF6 electrical equipment on the Hadoop platform and used the MapReduce framework to parallelize the fault diagnosis algorithm, which further improves the speed of fault diagnosis.


Introduction
SF6 electrical equipment refers to electrical equipment that uses sulphur hexafluoride (SF6) as the insulating or arc-extinguishing medium. It has the advantages of small size, low maintenance, long life and good insulation. With the widespread adoption of intelligent substations, the amount of power equipment keeps growing [1]. Traditional oil-filled equipment is gradually being replaced by SF6 electrical equipment, which has unique advantages and accounts for an increasing proportion of new equipment. As the amount of SF6 electrical equipment grows, the reliability requirements have also increased [2]. Internal discharge in SF6 electrical equipment decomposes the SF6 gas molecules. The resulting derivatives are chemically very active and can corrode the equipment, which can easily degrade the insulation performance and seriously threaten the safe and stable operation of the equipment [3,4].
Reference [5] first describes the problem that diagnostic accuracy is not high when the decision tree model or the radial basis function (RBF) neural network model is used alone. The RBF neural network and the decision tree model are then both used to diagnose the SF6 electrical equipment. This method requires two diagnostic passes over all data, which takes a lot of time.
The continuous creation of data poses new research challenges because of its complexity, diversity and volume. Consequently, big data has increasingly become a fully recognized scientific field [6]. Big data analytics is now increasingly used to solve real-world problems [7], and cloud platforms provide an open architecture for big data analytics that can improve the utilization of data resources [8].
The era of big data brings massive amounts of data, which makes screening for valuable information a core step in the widespread application of big data. Hadoop is an open source distributed computing platform. Its distributed file system (HDFS) and distributed computing framework (MapReduce) solve the problems of data storage and programming, respectively [9]. Hadoop's bit-level storage and processing of data is considered reliable: HDFS uses a backup mechanism to maintain multiple copies of data, and MapReduce uses a task monitoring mechanism. Hadoop can also move data dynamically between nodes and keep the nodes balanced, so processing is very fast [10].
MapReduce is a programming model and an associated implementation for processing and generating large data sets, and it is suitable for a wide variety of practical tasks [11]. The advantages of MapReduce over parallel databases include storage-system independence and fine-grain fault tolerance for large jobs [12]. At the same time, MapReduce can process data in parallel and greatly improve the efficiency of processing monitoring data of SF6 electrical equipment [13].
To solve the above problems, this paper proposes a two-level fault diagnosis model for SF6 electrical equipment that takes the content of SF6 gas derivatives in the monitoring data as its input. The first-level model first determines whether a record is fault data. If so, the record enters the second-level model, which identifies the specific fault category. If not, no further judgment is performed, which improves diagnostic efficiency. To diagnose faults quickly over massive condition monitoring data, this paper implements the fault diagnosis on the Hadoop platform and parallelizes the fault diagnosis algorithm.

Data Acquisition of SF6 Electrical Equipment
The two-level fault diagnosis model diagnoses the equipment from the content of SF6 gas derivatives. Therefore, SF6 gas derivatives must first be selected as the characteristic attributes. When partial discharge occurs in SF6 electrical equipment, the SF6 gas decomposes to produce a variety of derivatives. The formation mechanism of the derivatives is very complex and mainly includes two processes. First, the SF6 gas decomposes to generate low fluorides; these low fluorides then react with impurities, electrode materials and insulating materials to generate other, more stable derivatives. Different discharge energies produce different derivatives. According to the magnitude of the released energy, discharge can be divided into three forms: arc, spark and corona discharge. Several references [14][15][16] report that a large amount of SOF2 and a small amount of SO2F2 are generated under arc discharge. The main derivative of spark discharge is also SOF2, with the derivative contents ordered as SOF2 > SOF4 > SO2F2 > SO2 > S2F10/S2OF10 [17]. During corona discharge, SOF2 is still the most abundant derivative, and the contents of SO2F2, S2OF10 and S2F10 are higher than under the other two discharge types. Therefore, SOF2, SOF4, SO2F2, SO2, S2OF10 and S2F10 were selected as the characteristic attributes of the algorithm [18]. Accordingly, the faults in SF6 electrical equipment are divided into three main types: arc, spark and corona discharge.
Infrared spectroscopy is a method for the quantitative and qualitative analysis of compounds that absorb infrared light. The composition is determined from differences in how substances absorb infrared radiation. One of the great advantages of infrared spectroscopy is that virtually any sample may be studied in any state [19]. It can be used to analyze the composition of SF6 gas in electrical equipment to determine the state of the equipment [20], and it can detect large quantities of material at relatively low cost [21]. Therefore, an infrared spectrum analyzer can be installed in SF6 electrical equipment for gas composition analysis and data acquisition.

Construction of two-level fault diagnosis model
The first-level model is constructed using the random forest algorithm and is used to filter out the fault data. Because it only determines whether the equipment is faulty, the decision trees in the forest are shallow, which reduces the diagnosis time. The second-level model is built with a neural network algorithm, and its input is the fault data filtered out by the random forest model. The second-level model can not only diagnose known fault types, but also identify new fault types through consultation with experts. The model is continuously improved by updating the structure and weights of the neural network. Figure 1 is a block diagram of the SF6 electrical equipment fault diagnosis system.

The fault diagnosis process of SF6 electrical equipment is shown in Figure 2, and the main steps are as follows.
(1) Read the monitoring data, extract the feature components and category components, and normalize the feature components.
(2) Feed the normalized data into the first-level random forest model. If the data are diagnosed as normal, output the result directly. Otherwise, go to step 3.
(3) Feed the data diagnosed as fault data in step 2 into the second-level neural network model. If the fault type of the equipment has been trained, output the diagnosis result directly. Otherwise, go to step 4.
(4) Submit the fault type that cannot be correctly identified in step 3 to experts, who determine the type. The neural network is then retrained to update its structure and weights so that the model is continuously improved.
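The four steps above can be sketched as a small cascade. This is a minimal illustration under stated assumptions, not the authors' implementation: `min_max_normalize`, `two_level_diagnose` and the callables passed to them are hypothetical names, min-max scaling is an assumed normalization scheme (the text does not specify one), and the 0.8 confidence threshold is the value quoted in the experiments.

```python
from typing import Callable, List, Sequence, Tuple

CONFIDENCE_THRESHOLD = 0.8  # threshold value quoted in the experiment section


def min_max_normalize(x: Sequence[float],
                      lo: Sequence[float],
                      hi: Sequence[float]) -> List[float]:
    """Step 1: scale each feature to [0, 1] (assumed normalization scheme)."""
    return [(v - a) / (b - a) if b > a else 0.0 for v, a, b in zip(x, lo, hi)]


def two_level_diagnose(x: Sequence[float],
                       first_level: Callable[[Sequence[float]], bool],
                       second_level: Callable[[Sequence[float]], Tuple[str, float]],
                       threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Steps 2 to 4: cheap fault filter first, fault classifier only on suspects."""
    # Step 2: the first-level model only answers "fault or not".
    if not first_level(x):
        return "normal"
    # Step 3: the second-level model returns a fault label and a confidence.
    label, confidence = second_level(x)
    # Treating a confidence equal to the threshold as accepted is an assumption.
    if confidence >= threshold:
        return label
    # Step 4: low confidence suggests an untrained fault type; defer to experts.
    return "expert_review"
```

Because normal records return at step 2, the more expensive second-level model runs only on the small fraction of records flagged as possible faults, which is the efficiency argument made for the two-level design.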

For the specific algorithm, see our paper "The two-level fault diagnosis model of SF6 electrical equipment" [18].
Figure 3 shows the architecture of the two-level fault diagnosis system for SF6 electrical equipment.

Monitoring Data Preprocessing
Fault diagnosis is one of the methods for verifying the operating status of SF6 electrical equipment, but most detection models assume an ideal data set [22]. In actual collection, data redundancy, data loss and data inconsistency inevitably occur because of problems with the collection equipment, external environmental disturbances and human error, and they eventually affect later data mining. Therefore, before fault diagnosis is performed, this section preprocesses the monitoring data by filling in the missing values. The diagnosis algorithm is implemented afterwards.
There are currently few studies on preprocessing SF6 gas derivative monitoring data. This section takes the SF6 gas derivative content data as the processing object. Analysis of the data shows that missing monitoring data may fall into three cases, and no single approach applies to all of them. Therefore, the missing values are classified, and a different processing method is adopted for each case.
Missing values in the SF6 gas derivative content monitoring data fall into three cases, described as follows.
(1) The missing value is in the same range as the monitoring values recorded before and after it. There are two cases. Case 1: The monitoring data over a period of time, including the missing values, are within normal limits. Case 2: Because of equipment failure, the monitoring data over a period of time, including the missing values, are in an abnormal range.
(2) The missing value is not in the same range as the monitoring values recorded before and after it. There are also two cases. Case 1: The equipment has just failed and the SF6 gas derivative content rises from the normal range to the abnormal range. Case 2: The equipment has been repaired and the SF6 gas derivative content drops from the abnormal range to the normal range.
(3) The monitoring values are lost over a period of time.
For the above three cases, the process is as follows.
(1) For the first case, the weighted interpolation method is used, that is, a weighting factor is introduced into the mean interpolation method.
The basic mean interpolation method takes the arithmetic mean of the n values recorded before and after the missing value as the substitute value, as shown in Equation (1).
A weighting factor is then introduced into the mean interpolation method, as shown in Equation (2).
Here w is the weight: the closer a record is in time to the missing value, the larger its weight.
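The two interpolation rules can be illustrated with a short sketch. The weighting scheme here is an assumption: the text only requires that records closer in time to the missing value receive larger weights, so the sketch uses weights proportional to the inverse of the time distance, normalized to sum to 1. The function names and the sample values are hypothetical.

```python
from typing import Sequence


def mean_interpolate(neighbours: Sequence[float]) -> float:
    """Plain mean of the n values recorded around the missing one."""
    return sum(neighbours) / len(neighbours)


def weighted_interpolate(neighbours: Sequence[float],
                         distances: Sequence[int]) -> float:
    """Weighted mean; weight is proportional to 1 / distance (assumed scheme).

    distances[k] is the number of sampling steps between neighbours[k]
    and the missing value, so nearer records count more.
    """
    raw = [1.0 / d for d in distances]
    total = sum(raw)
    return sum((w / total) * v for w, v in zip(raw, neighbours))


# Hypothetical SOF2 readings (in µL/L): two before and two after the gap.
values = [4.0, 5.0, 7.0, 10.0]
steps = [2, 1, 1, 2]
plain = mean_interpolate(values)                # 6.5
weighted = weighted_interpolate(values, steps)  # 19/3, pulled toward 5.0 and 7.0
```

With four neighbours, as in the experiments, the far reading of 10.0 has half the influence of the adjacent readings, which is why the weighted estimate lies below the plain mean in this example.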
Six kinds of SF6 gas derivatives are selected in this paper, and SOF2 is used as the example for preprocessing. To verify the validity of the algorithm, some SOF2 historical data were read from the database and missing values were created artificially. In the experiment, four monitoring values were chosen to estimate each missing value. When SF6 electrical equipment is in normal condition, its SOF2 content does not exceed 10 µL/L.
Multiple sets of experiments were performed for each type of data loss. For reasons of space, only part of the results is presented.
Experiment 1: The historical data of SOF2 are within the normal range. The experimental data are shown in Table 1. The values of serial numbers 3, 4 and 5 are removed in turn and estimated with the mean and weighted interpolation methods; the interpolation results for SOF2 are shown in Table 2. According to Table 2, the average error of the mean interpolation method is 5.3%, and that of the weighted interpolation method is 4.8%. In this case, the weighted interpolation method works well.
Experiment 2: The historical data of SOF2 are within the abnormal range. The experimental data are shown in Table 3. The values of Nos. 3, 4 and 5 are removed in turn, and the interpolation results for SOF2 are shown in Table 4. According to Table 4, the average error of the mean interpolation method is 4.12%, and that of the weighted interpolation method is 3.86%. The weighted interpolation method again works well.
(2) SOF2 monitoring data over a period of time are read from the database, and the missing value is not in the same range as the monitoring values recorded before and after it.
Experiment 3: When the monitoring value before the missing SOF2 value is within 10 µL/L and the monitoring value after it is greater than 10 µL/L, the SOF2 content has moved from the normal range into the abnormal range. The experimental data are shown in Table 5. The weighted interpolation method and the mean interpolation method are applied again, with the values of serial numbers 3 and 4 removed in turn; the experimental results are shown in Table 6. Table 6 shows that although the interpolation error can be reduced by adjusting the weights, a suitable weight has to be found. For serial number 3, linear interpolation is not suitable.
(3) When SOF2 monitoring data over a period of time are read from the database and values are missing continuously, the linear interpolation method cannot be used.
Considering cases 2 and 3, this paper uses the gray correlation degree to interpolate the data. The gray-correlation-based method for completing SF6 gas derivative content data applies gray processing to the component data other than the missing attribute, and calculates the correlation degree to find the historical tuples closest to the tuple containing the missing value, which are then used to fill in the missing value. The main process is shown in Figure 4. The specific steps are as follows:

Step 1: Determine the main sequence and the subsequences.
A tuple containing missing values is taken as the main sequence X_0. A tuple in the historical record with at least one attribute value outside the normal range for safe operation is taken as a subsequence. The subsequences are marked X_1 ∼ X_m, with m = 5000.
Step 2: Standardize the matrix formed by the main sequence and the subsequences to obtain a normalized matrix X.
Step 3: Calculate the correlation coefficient.
The correlation coefficient rel(i) of X_i for X_0 is computed from absValue(i), the absolute difference sequence of X_i and X_0, from minAbsValue(i) and maxAbsValue(i), the minimum and maximum of absValue(i), and from the resolution coefficient defC ∈ (0, 1), which is set to defC = 0.5.
Step 4: Calculate the relevance.
The relevance p(i) of X_i for X_0 is computed from the correlation coefficients using the weight w of each SF6 gas derivative. There are six kinds of derivatives, so w is set to 1/6.
Step 5: Interpolate the missing values.
The relevance values obtained above are sorted, and the first 10 tuples whose relevance is greater than 0.9 are taken to interpolate the missing values. If fewer than 10 tuples satisfy the condition, all of them are taken. The missing value X_0j is then interpolated from these tuples, where j is the column of the missing value in the tuple and n is the number of tuples taken.
Experiment 3 was re-examined using the gray correlation degree. The experimental results are shown in Table 7. From Table 7, the interpolated values for Nos. 3 and 4 are 8.51 and 13.32, with errors of 0.01 and 0.033 respectively. Therefore, gray correlation can effectively handle missing values at abrupt changes in the data.
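The gray correlation steps can be sketched as follows. The formulas in this sketch are assumptions, since the equations themselves are not reproduced in this text: the correlation coefficient uses the common gray relational form (min + defC·max) / (absValue + defC·max) with the per-sequence minimum and maximum described above, the relevance is an equally weighted mean over the compared columns, and the missing value is filled with the plain mean of the selected tuples. The function names are hypothetical, and the rows are assumed to be already standardized on the known columns.

```python
from typing import List, Optional, Sequence


def gray_relevance(main: Sequence[float], sub: Sequence[float],
                   def_c: float = 0.5) -> float:
    """Relevance of one subsequence to the main sequence (assumed formula)."""
    diffs = [abs(a - b) for a, b in zip(main, sub)]   # absValue(i)
    lo, hi = min(diffs), max(diffs)                   # minAbsValue, maxAbsValue
    if hi == 0.0:
        return 1.0  # identical on every compared column
    coeffs = [(lo + def_c * hi) / (d + def_c * hi) for d in diffs]
    return sum(coeffs) / len(coeffs)                  # equal weights (assumed)


def gray_interpolate(main: Sequence[Optional[float]],
                     history: Sequence[Sequence[float]],
                     missing_col: int,
                     top_k: int = 10,
                     min_relevance: float = 0.9) -> Optional[float]:
    """Fill main[missing_col] from the most relevant historical tuples."""
    known = [k for k, v in enumerate(main)
             if k != missing_col and v is not None]
    main_known = [main[k] for k in known]
    scored: List[tuple] = []
    for row in history:
        p = gray_relevance(main_known, [row[k] for k in known])
        if p > min_relevance:
            scored.append((p, row[missing_col]))
    scored.sort(key=lambda t: t[0], reverse=True)
    picked = scored[:top_k]          # at most top_k tuples above the threshold
    if not picked:
        return None                  # no sufficiently similar tuple found
    return sum(v for _, v in picked) / len(picked)    # plain mean (assumed)
```

Because the estimate comes from similar historical tuples rather than from neighbouring time points, the same routine applies when the value jumps between ranges and when several consecutive readings are missing.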
The gray correlation method interpolates missing data from the closest tuples in the historical records rather than from neighbouring values in time. Therefore, continuously missing data pose no problem for this method.
After the monitoring data of the SF6 electrical equipment are preprocessed, the two-level fault diagnosis model can be constructed. First, we train the model on the Hadoop platform. Second, we implement the parallelization of the diagnostic algorithms on the Hadoop platform.

Implementation of the Two-Level Fault Diagnosis of SF6 Electrical Equipment Based on Hadoop
Hadoop is an open source distributed computing platform. Its distributed file system (HDFS) and distributed computing framework (MapReduce) solve the data storage and programming problems, respectively. Therefore, we implemented the parallelization of the above fault diagnosis algorithm on the Hadoop platform, which improves the processing speed of massive monitoring data and accelerates the diagnosis of SF6 electrical equipment.

Implementation of Random Forest Algorithm Based on MapReduce
After the SF6 electrical equipment monitoring data are preprocessed, the two-level fault diagnosis model can be constructed on the Hadoop platform, and the diagnostic algorithm can be parallelized through the Hadoop platform.
In the conventional process of building a random forest, the decision trees are created serially: the next tree is created only after the current one has been generated. However, the decision trees are independent of each other, and creating one tree does not rely on any other tree, so their construction can be parallelized.
Parallelizing the random forest algorithm with the MapReduce framework involves two main stages: Map and Reduce. Each Map task builds one decision tree, and the Reduce task finally uploads the trees to HDFS to form the forest. The parallel construction process of the random forest is shown in Figure 5, and the tree construction process of a Map task is shown in Figure 6.

Specific steps are as follows:
Step 1: Create the sample subsets. The Bagging method is used to draw subsets from the original sample as the sample subset of each decision tree. Each subset is recorded as <treeID, dataset>, where treeID is the number of the decision tree and dataset is its corresponding sample subset. In this paper, the number of trees is set to 7, so treeID ranges from 1 to 7.
Step 2: Use the sample subsets to create the decision trees, and initialize the number of Map tasks according to the number of decision trees.
The input of the Map function is <treeID, dataset>. This function mainly builds the decision tree, and its output is <treeID, list<feature>>. MapReduce parallelism is also used to select the splitting attribute of a node. Each time a non-leaf node selects its split attribute, the Gini values of the remaining feature attributes are calculated, and the best split attribute and its split value are returned by comparison. The pseudo code is shown in Figures 7 and 8.
Step 3: Once all Map tasks are completed, the construction of the decision trees is finished. The Reduce task is then executed, and the split rules of each decision tree are written into HDFS to obtain the random forest classifier.
Pseudocode fragment from Figures 7 and 8:
    Return list<feature>;      % record split attributes and split values
    ///////// Calculate the Gini value /////////
    function Calculate()       % j: split attribute, tmp: split value
        For j = 1:n-1
            tmp = (X_{i,j} + X_{i,j+1}) / 2;
            Gini(j, tmp);
        End for
        Return min(Gini);
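The three steps and the Gini calculation can be sketched in plain Python, with ordinary functions standing in for Hadoop Map and Reduce tasks. This is an illustrative sketch under stated assumptions, not the paper's code: the names `gini`, `best_split`, `bagging`, `map_task` and `reduce_task` are hypothetical, a single root split stands in for a full decision tree, the Reduce step returns a dictionary instead of writing to HDFS, and the sample data are made up.

```python
import random
from collections import Counter
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], int]   # (feature vector, class label)


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(dataset: Sequence[Sample]) -> Optional[Tuple[float, int, float]]:
    """Return (weighted Gini, attribute index, split value) with the lowest Gini."""
    best = None
    for j in range(len(dataset[0][0])):
        values = sorted({x[j] for x, _ in dataset})
        # Candidate split values are midpoints of adjacent sorted values.
        for a, b in zip(values, values[1:]):
            tmp = (a + b) / 2
            left = [y for x, y in dataset if x[j] <= tmp]
            right = [y for x, y in dataset if x[j] > tmp]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(dataset)
            if best is None or score < best[0]:
                best = (score, j, tmp)
    return best


def bagging(samples: Sequence[Sample], n_trees: int, seed: int = 0):
    """Step 1: one bootstrap sample per tree, emitted as <treeID, dataset>."""
    rng = random.Random(seed)
    return [(t, [rng.choice(samples) for _ in samples])
            for t in range(1, n_trees + 1)]


def map_task(tree_id: int, dataset: Sequence[Sample]):
    """Step 2: build one (single-split) tree; emit <treeID, list<feature>>."""
    split = best_split(dataset)
    return tree_id, ([] if split is None else [(split[1], split[2])])


def reduce_task(pairs: Iterable[Tuple[int, List[Tuple[int, float]]]]) -> Dict:
    """Step 3: gather the split rules of every tree into one forest."""
    return dict(pairs)


# Made-up one-feature samples: label 0 = normal, 1 = fault.
data = [([1.0], 0), ([2.0], 0), ([8.0], 1), ([9.0], 1)]
forest = reduce_task(map_task(t, d) for t, d in bagging(data, n_trees=7))
```

Each `map_task` call touches only its own bootstrap sample, which is the independence property that lets Hadoop run the seven tree-building tasks on different nodes.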

① Map stage
In the Map stage, the setup() function reads the initial network weights from HDFS and initializes the neural network. The Map() function reads the sample data and trains the network on that node. Training ends when the stopping condition is met, that is, when the set number of iterations is reached or the output error reaches the set value; this section uses the number of iterations. The pseudo code of Map is shown in Figure 10.

② Combine stage
Combine is used to merge the results of the Map. Its input is the output of the Map, and its output is the input of Reduce. Its output type is the same as the output type of Map.

③ Reduce stage
In the Reduce stage, the <key, weightWritable> output of the Combine stage is used as the input. The Reduce() function calculates the average of the values for each key, compares it with the network weights stored on HDFS to determine whether to perform the next iteration, and finally writes the updated weights back to HDFS. The pseudo code of the Reduce stage is shown in Figure 11.
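The Combine and Reduce stages described above can be sketched as plain functions. This is a hedged illustration, not the paper's pseudo code: `combine` and `reduce_weights` are hypothetical names, each key is assumed to identify one network weight, an in-memory dictionary stands in for the weights stored on HDFS, and the rule "stop when no averaged weight moved by more than `tol`" is an assumed form of the comparison, whose exact criterion the text does not give.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def combine(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group the <key, weight> pairs emitted by the Map tasks by key."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for key, weight in pairs:
        grouped[key].append(weight)
    return grouped


def reduce_weights(grouped: Dict[str, List[float]],
                   stored: Dict[str, float],
                   tol: float = 1e-3) -> Tuple[Dict[str, float], bool]:
    """Average the weights per key and compare with the stored weights.

    Returns the averaged weights and a flag saying whether training can
    stop (assumed rule: every weight changed by at most tol).
    """
    averaged = {k: sum(v) / len(v) for k, v in grouped.items()}
    converged = all(abs(averaged[k] - stored.get(k, 0.0)) <= tol
                    for k in averaged)
    return averaged, converged
```

In a driver loop, the averaged weights would replace the stored ones after each job, and another Map/Reduce round would start while `converged` is false.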

MR1 is mainly divided into Map and Reduce stages, namely:

Fault Diagnosis Experiment of SF6 Electrical Equipment
(1) Map stage
① Setup() loads the random forest file from HDFS.
② Input the sample to be diagnosed for the first-level diagnosis and tally the diagnosis results of the decision trees.
③ Output <key, value>, where the key is the flag indicating whether the sample may be fault data.
As can be seen from Table 8, one piece of normal data is classified as possible fault data, while all fault data are identified. The diagnosis results show that the random forest model can distinguish between normal and fault data. 70% of the samples were used to train the neural network model, and 30% were used to test it. Some diagnostic results are shown in Table 9. The confidence of each sample diagnosis shown in Table 9 is greater than the set confidence threshold (0.8), so the neural network model is considered able to determine the diagnosis. In practice, samples 58 and 61 are monitoring data recorded during corona discharge, and samples 64 and 99 are monitoring data recorded during arc discharge, so the diagnosis results are correct.
After the model was trained, 10,010 groups of data to be diagnosed (10,000 groups of normal data and 10 groups of fault data) were fed into the diagnostic model. The random forest model identified all 10 groups of fault data, which were then passed to the neural network model for diagnosis. The results are shown in Table 10. As Table 10 shows, the confidence of samples 5 and 6 is less than the threshold (0.8), so they were sent to experts for evaluation and were identified as a new fault type, spark discharge. The neural network model was then retrained and the network weights were updated so that the model can identify spark discharge. The diagnosis results of the updated neural network model for samples 5 and 6 are shown in Table 11. Table 11 shows that the confidence of the diagnosis results for samples 5 and 6 is now greater than the threshold, that is, the updated neural network model can accurately diagnose the new fault type.
(2) Experiment 2: Comparison of fault diagnosis performance between stand-alone and cluster modes.
The speedup S is commonly used to measure the performance of parallel algorithms. It is defined as follows:

$$S = \frac{T_s}{T_m}$$

where $T_s$ is the time taken for diagnosis by one node and $T_m$ is the time taken when multiple nodes perform the operation in parallel. Datasets of different sizes are run in the cluster mode of the Hadoop platform. The diagnostic times under different numbers of slave nodes are shown in Table 12. To compare the speed of the fault diagnosis algorithm more intuitively, Table 12 is converted into a histogram, as shown in Figure 13. Figure 13 shows that when the size of the data to be diagnosed is less than 500 MB, the time consumed by a single node and by multiple nodes is similar. As the size increases, however, the fault diagnosis system takes longer in stand-alone mode than in cluster mode. Therefore, once the data to be diagnosed reaches a certain size, using Hadoop's cluster mode can greatly improve the speed of fault diagnosis.
The calculated speedup of fault diagnosis under different numbers of nodes is shown in Table 13. When the amount of data to be diagnosed is small (less than 500 MB in the experiment), the speedup is less than 1, indicating that cluster mode is less efficient than stand-alone mode at this scale. When the amount of data is larger, however, the speedup increases with the number of nodes.
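The speedup comparison can be sketched in a few lines of Python, assuming the usual definition of speedup as the single-node time divided by the multi-node time. The function names and the timings in the usage note are placeholders for illustration and are not values from Table 12 or Table 13.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup S = T_s / T_m for one dataset size and one cluster size."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("times must be positive")
    return t_single / t_parallel


def cluster_pays_off(t_single: float, t_parallel: float) -> bool:
    """True when cluster mode is faster than stand-alone mode (S > 1)."""
    return speedup(t_single, t_parallel) > 1.0
```

For example, with placeholder timings of 30 s on one node and 40 s on the cluster, `cluster_pays_off(30.0, 40.0)` is false, which corresponds to the small-data case in which the speedup is below 1.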
To further examine how the number of nodes affects the diagnosis speed, the experiment was carried out with a data amount of 5 GB, and the result is shown in Figure 14.
As can be seen from Figure 14, when the amount of data is 5 GB, the speedup does not increase linearly with the number of nodes. This is because the time spent on communication between nodes grows as the cluster size increases. In practice, therefore, the size of the cluster can be chosen according to the amount of data.
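The effect of communication time on speedup can be illustrated with a toy model in Python. This is only an assumed model for illustration: the run time is taken to be the compute work divided evenly across nodes plus a fixed overhead for each additional node. The model, the function names, and the constants used in the usage note are hypothetical and are not fitted to the measurements in Figure 14.

```python
def toy_parallel_time(work: float, nodes: int, overhead_per_node: float) -> float:
    """Toy run time: evenly split compute plus overhead that grows with the cluster."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return work / nodes + overhead_per_node * (nodes - 1)


def toy_speedup(work: float, nodes: int, overhead_per_node: float) -> float:
    """Speedup of the toy model relative to a single node."""
    return work / toy_parallel_time(work, nodes, overhead_per_node)


def efficiency(speedup_value: float, nodes: int) -> float:
    """Parallel efficiency: speedup divided by the number of nodes."""
    return speedup_value / nodes
```

With `work=100.0` and `overhead_per_node=2.0`, the toy speedup keeps rising from 1 to 8 nodes but stays well below the node count, and the efficiency falls at every step, which is the qualitative behaviour described above.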

Conclusions
In this paper, a two-level fault diagnosis model for SF6 electrical equipment is designed and implemented on the Hadoop platform. First, the monitoring data are preprocessed before the fault diagnosis model is trained, and different data filling methods are adopted for different kinds of missing values. Second, the fault diagnosis algorithms are parallelized on the Hadoop platform. Finally, the time consumption of fault diagnosis of SF6 electrical equipment in stand-alone mode and cluster mode is compared by simulation, which verifies the advantages of cluster mode in processing massive data.
The first-level diagnostic model can quickly diagnose monitoring data and filter out the problematic data from a large amount of data. The second-level diagnostic model provides an in-depth analysis of the fault types of SF6 electrical equipment and can learn in real time to update the fault type library.
In the future, smart substations can be equipped with detection equipment for SF6-derived gas composition (such as infrared spectrum analyzers) that uploads data to cloud databases in real time. Engineers can then check the operating status of SF6 electrical equipment in real time on the client side, which saves time on field inspections, and can quickly and accurately identify problems to avoid safety incidents.

Figure 1. System block diagram for fault diagnosis of SF6 electrical equipment.

The fault diagnosis process of SF6 electrical equipment is shown in Figure 2, and the main steps are as follows.

Figure 2. Flow chart of the two-level fault diagnosis algorithm.

Figure 3. Architecture of the two-level fault diagnosis system for SF6 electrical equipment.

Figure 4. The data interpolation process based on the grey correlation method.

Figure 6. The specific tree construction process in the Map task.

Figure 7. The pseudo code to construct a decision tree in the Map task.

Figure 8. The pseudo code to calculate the Gini value.

Figure 9. The process of training the back propagation (BP) neural network model.

Figure 10. The pseudo code of the Map stage.

Figure 11. The pseudo code of the Reduce stage.

Figure 12. The diagnosis process of SF6 electrical equipment.

2
Neural network model results.

3
Monitoring data diagnosis results and analysis.

Figure 13. The diagnostic times under different slave nodes.

Figure 14. The speedup of fault diagnosis under different slave nodes.
Put the normalized data into the first-level random forest model. If the data is diagnosed as normal, output the result directly. Otherwise, go to step 3.

Table 1. The historical data of SOF2.

Table 2. The interpolation results of SOF2.
Note: MIM means mean interpolation method, WIM means weighted interpolation method.

Table 3. The historical data of SOF.

Table 4. The interpolation results of SOF2.

Table 5. The historical data of SOF2.

Table 6. The interpolation results of SOF2.

Table 7. The results of interpolating missing values.
Id of the decision tree; Dataset = {X, Y}, where X is the sample subset, an n × m matrix; $X_{ij}$ is the j-th attribute value of the i-th row of X; Y is the decision attribute vector; and $Y_i$ is the output class of the i-th row of data.
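As a companion to the Gini pseudo code referred to in Figure 8, the following Python sketch computes the standard Gini impurity of a decision attribute vector Y and the weighted Gini value of a two-way split. It assumes the usual CART-style definition, 1 minus the sum of squared class proportions; the function names are hypothetical, and the paper's own pseudo code may differ in detail.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini(labels: Sequence[Hashable]) -> float:
    """Gini impurity of a label vector: 1 - sum over classes of p_k squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left: Sequence[Hashable], right: Sequence[Hashable]) -> float:
    """Size-weighted Gini value of a candidate split into two subsets."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

When a tree is grown, the attribute and split point with the lowest weighted Gini value would normally be chosen at each node.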
Two MapReduce tasks are set up on the Hadoop platform to diagnose SF6 electrical equipment, as shown in Figure 12. In the first task, MR1, the random forest model is used for the initial diagnosis. In the second task, MR2, the neural network model is used to diagnose the fault category.
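The chaining of the two tasks can be sketched as a small in-memory simulation in Python. It does not use the Hadoop API. The key names, the tie-breaking rule, and the stand-in `trees` and `classify` callables are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Sample = Tuple[int, Sequence[float]]     # (sample id, normalized features)
Tree = Callable[[Sequence[float]], int]  # 1 = possible fault, 0 = normal


def mr1_map(sample: Sample, trees: Sequence[Tree]) -> Tuple[str, Sample]:
    """MR1 map step: the majority vote of the trees becomes the output key."""
    _, features = sample
    votes = Counter(tree(features) for tree in trees)
    # Assumption: a tie is treated as normal.
    key = "possible_fault" if votes[1] > votes[0] else "normal"
    return key, sample


def shuffle(pairs: Iterable[Tuple[str, Sample]]) -> Dict[str, List[Sample]]:
    """Group values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[Sample]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def run_two_tasks(samples: Iterable[Sample],
                  trees: Sequence[Tree],
                  classify: Callable[[Sequence[float]], str]) -> Dict[int, str]:
    """Run MR1 on every sample, then MR2 only on samples flagged as possible faults."""
    grouped = shuffle(mr1_map(sample, trees) for sample in samples)
    flagged = grouped.get("possible_fault", [])
    return {sample_id: classify(features) for sample_id, features in flagged}
```

Only the samples keyed as possible faults reach the second-level classifier, so the more expensive model sees a small fraction of the monitoring data.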

Table 8. The test results of random forest model.

Table 9. Some diagnostic results of neural network model.

Table 10. The results of neural network model.

Table 11. The diagnosis results of the updated neural network model.

Table 12. The diagnostic times under different slave nodes.

Table 13. The speedup of fault diagnosis under different node numbers.