Feature Selection of Power Quality Disturbance Signals with an Entropy-Importance-Based Random Forest

Abstract: Power quality signal feature selection is an effective method to improve the accuracy and efficiency of power quality (PQ) disturbance classification. In this paper, an entropy-importance (EnI)-based random forest (RF) model for PQ feature selection and disturbance classification is proposed. Firstly, 35 kinds of signal features extracted from the S-transform (ST) of signals with random noise are used as the original input feature vector of the RF classifier to recognize 15 kinds of PQ signals, including six kinds of complex disturbance. During the RF training process, the classification ability of different features is quantified by EnI. Secondly, without considering the features with zero EnI, the optimal disturbance feature subset is obtained by applying the sequential forward search (SFS) method, which considers both the classification accuracy and the feature dimension. Then, the reconstructed RF classifier is applied to identify disturbances. According to the simulation results, the classification accuracy is higher than that of other classifiers, and the feature selection effect of the new approach is better than that of SFS and sequential backward search (SBS) without EnI. With the same feature subset, the new method can maintain a classification accuracy above 99.7% under the condition of 30 dB or above, and the accuracy under 20 dB is 96.8%.


Introduction
Power quality (PQ) is the main control target of the smart grid, and PQ signal recognition is the foundation of PQ problem management [1]. With the wide access of distributed generators (DGs) to the power system, many renewable power sources with random output characteristics, such as distributed solar energy and wind power, have a negative impact on the PQ of the power system [2]. It is therefore necessary to carry out in-depth monitoring and analysis of the PQ at all points accessed by DGs [3]. Consequently, the massive PQ data collected from a large number of monitors imposes a higher real-time requirement on any PQ signal classification system [4].
Features extracted from time-frequency analysis (TFA) results are commonly used as the input of the classifier for PQ disturbance identification. Previous studies have carried out extensive research on TFA of PQ signals, including the Hilbert-Huang transform (HHT) [5,6], the S-transform (ST) [7][8][9] and the discrete wavelet transform (DWT) [10][11][12]. In current research results, environmental noise is the main factor affecting PQ classification accuracy, especially in the distribution network. ST has been shown to have good anti-noise ability among the TFA methods [7][8][9], and feature extraction of PQ signals using ST and its improved forms has received more and more attention. Nevertheless, existing methods extract a large number of features from the ST results, while the ability of these features to identify disturbances lacks effective analysis. A high feature vector dimension increases the complexity of the PQ disturbance classifier and reduces its classification efficiency and accuracy. Moreover, the feature vectors used in different studies are diverse, which makes it harder to construct a unified PQ signal classifier. To simplify the classifier and enhance the classification efficiency, it is essential to add a feature selection step to the PQ disturbance recognition process.
In past studies, feature selection either followed the filter method, which is based on the statistical characteristics of the features and makes it difficult to analyze the classification ability of a feature combination [13,14], or used the wrapper method combined with particle swarm optimization [15], a genetic algorithm [16], rough set theory [17] or other intelligent algorithms, choosing the optimal or sub-optimal feature subset according to the classification results. However, the efficiency of such search algorithms is unsatisfactory. Meanwhile, existing feature selection methods have to select different feature subsets under different noise conditions, which limits their applicability in practical engineering.
From the perspective of classifier design, the neural network (NN) [18][19][20], support vector machine (SVM) [21][22][23], fuzzy rule (FR) [24], decision tree (DT) [25][26][27] and extreme learning machine (ELM) [28] are commonly applied to the classification of PQ signals, and all achieve good results. However, NN and SVM require more parameters to be set, which makes the classifier harder to design and more prone to over-fitting. FR and DT have simple structures with higher classification accuracy and efficiency than NN and SVM [24][25][26][27], but it is difficult to choose optimal classification thresholds for FR and DT.
Random forest (RF) is an excellent classifier model with advantages such as good anti-noise performance, fewer parameters and less susceptibility to over-fitting. Moreover, RF has better generalization ability than DT [29]. In evaluations on multiple public data sets, RF achieved the highest classification accuracy among the compared methods [30]. Furthermore, RF integrates well with feature selection. During the training process, the classification ability of each feature can be obtained from the training results at every node of the RF, and the optimal feature subset can then be selected on this basis. This analysis is analogous to the filter method. At the same time, like the wrapper approach but more efficiently, RF can adjust the optimal feature subset based on the classification accuracy of different feature subsets on new testing sets. The feature selection process of RF thus takes both the statistical characteristics of the features and the classification results of the classifier into consideration, combining the virtues of the filter method and the wrapper method. Therefore, RF is well suited to feature selection.
To find the optimal feature subset and increase the classification accuracy of PQ disturbances, a new method for PQ disturbance feature selection and classification using an entropy-importance (EnI)-based RF is proposed in this paper. Firstly, 15 kinds of PQ signals, including six kinds of complex disturbance, are simulated by a mathematical model. Then, the simulated signals are processed by ST to extract 35 kinds of commonly used features for PQ classification. Secondly, an RF classifier for recognizing PQ signals is constructed with the original feature set as the input vector. According to the EnI scores obtained from the RF training process, the features are sorted by classification ability to construct the optimal subset. Features with zero EnI score are not selected. On this basis, a sequential forward search (SFS) strategy is adopted to determine the optimal feature subset, and an RF-based classifier is reconstructed with this subset. Finally, the optimized RF classifier is used to recognize PQ signals. Simulation results show that the new method is valid.
The remainder of the paper is organized as follows: Section 2 presents the basic theory and the classification process of RF. Section 3 describes the details of the new approach, including the segmentation of non-leaf nodes, the calculation of the EnI of each feature, and the feature selection strategy based on EnI. The results of different simulation experiments are shown and discussed in Section 4. Finally, Section 5 presents the conclusions of this paper.

Classification by Random Forest
RF combines DT with ensemble learning to form a new kind of tree classifier:

$$\{ f(x, \delta_k),\ k = 1, 2, \ldots \} \qquad (1)$$

where $f(x, \delta_k)$ is a meta classifier, a tree-structured classifier that can be built by several algorithms; $x$ is the input vector; and the $\delta_k$ are random vectors, independent of each other and identically distributed, each of which determines the growth of a single decision tree. RF generates a random feature subset at each non-leaf node of a DT and splits the node using the feature in this subset that gives the best classification result. Finally, RF aggregates the classification results of the different DTs to obtain the final classification result. Compared to DT, RF overcomes the weakness in generalization ability and improves the classification accuracy without significantly increasing the amount of computation.

RF Classification Capability Analysis
Generalization error is an important index for measuring the extrapolation ability of a classifier. The classification ability of RF can be measured by analyzing its generalization error [29]. Consider a classifier set $F(x) = \{ f_1(x), f_2(x), \ldots, f_k(x) \}$, where the training set of each classifier is obtained from the original data set $(X, Y)$ by random sampling. The margin function is:

$$marg(X, Y) = ave_k\, I_N(f_k(X) = Y) - \max_{j \neq Y} ave_k\, I_N(f_k(X) = j) \qquad (2)$$

Here, $I_N(\cdot)$ is an indicator function, $ave_k(\cdot)$ is the average over the classifiers, $Y$ is the correct class of the vector, and $j$ is an incorrect class. The margin function measures the extent to which the average number of votes for the correct class exceeds the average number of votes for any other class. The larger the margin, the better the classification performance. The generalization error is calculated by:

$$PE^{*} = P_{X,Y}(marg(X, Y) < 0) \qquad (3)$$

Here, the subscripts $X, Y$ indicate the space over which the probability is defined.
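The margin and the empirical generalization error described above can be illustrated with a minimal Python sketch. The function names and the toy votes are assumptions made for illustration only; they are not taken from the paper. The margin is computed as the share of trees voting for the correct class minus the largest share voting for any single wrong class, and the error is the fraction of samples whose margin is negative.

```python
from collections import Counter


def margin(votes, true_label):
    """Share of votes for the true class minus the largest share for any other class."""
    counts = Counter(votes)
    k = len(votes)
    correct = counts.get(true_label, 0) / k
    # Largest vote share among the wrong classes (0.0 if every tree is correct).
    wrong = max((c / k for label, c in counts.items() if label != true_label),
                default=0.0)
    return correct - wrong


def generalization_error(all_votes, labels):
    """Fraction of samples whose margin is negative (empirical estimate)."""
    margins = [margin(v, y) for v, y in zip(all_votes, labels)]
    return sum(m < 0 for m in margins) / len(margins)


# Hypothetical votes of five trees on two samples whose true class is "sag".
votes_per_sample = [
    ["sag", "sag", "sag", "swell", "sag"],          # positive margin
    ["sag", "swell", "swell", "flicker", "swell"],  # negative margin
]
labels = ["sag", "sag"]
error = generalization_error(votes_per_sample, labels)
```

In this toy case one of the two samples is outvoted, so the estimated error is one half.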
In RF, $f_k(X) = f(X, \delta_k)$. As the number of trees in the RF grows, it follows from the Strong Law of Large Numbers and the tree structure that:

$$PE^{*} \rightarrow P_{X,Y}\left( P_\delta(f(X, \delta) = Y) - \max_{j \neq Y} P_\delta(f(X, \delta) = j) < 0 \right) \qquad (4)$$

where $P_\delta(f(X, \delta) = Y)$ is the probability that the classification result is the correct class and $\max_{j \neq Y} P_\delta(f(X, \delta) = j)$ is the maximum probability that the classification result is any other class.
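Equation (4) shows that $PE^{*}$ tends to a constant as the number of trees increases, so RF does not easily over-fit. The margin function of RF is given as:

$$marr(X, Y) = P_\delta(f(X, \delta) = Y) - \max_{j \neq Y} P_\delta(f(X, \delta) = j) \qquad (5)$$

Then, the strength of $\{ f(x, \delta_k) \}$ is the mathematical expectation of $marr(X, Y)$:

$$str = E_{X,Y}\, marr(X, Y) \qquad (6)$$

Assuming $str \geq 0$, according to Chebychev's inequality:

$$PE^{*} \leq \frac{var(marr)}{str^2} \qquad (7)$$

where $var(marr)$ is the variance of $marr(X, Y)$. To describe $var(marr)$ in more detail, let:

$$\hat{j}(X, Y) = \arg\max_{j \neq Y} P_\delta(f(X, \delta) = j) \qquad (8)$$

Then:

$$marr(X, Y) = P_\delta(f(X, \delta) = Y) - P_\delta(f(X, \delta) = \hat{j}(X, Y)) = E_\delta\left[ I_N(f(X, \delta) = Y) - I_N(f(X, \delta) = \hat{j}(X, Y)) \right] \qquad (9)$$

The margin function of the meta classifier is defined as:

$$rmarg(\delta, X, Y) = I_N(f(X, \delta) = Y) - I_N(f(X, \delta) = \hat{j}(X, Y)) \qquad (10)$$

Therefore, $marr(X, Y)$ is the expectation of $rmarg(\delta, X, Y)$ with respect to $\delta$. For any function $h$:

$$\left[ E_\delta h(\delta) \right]^2 = E_{\delta, \delta'}\, h(\delta) h(\delta') \qquad (11)$$

where $\delta$ and $\delta'$ are independent of each other and identically distributed, so:

$$marr(X, Y)^2 = E_{\delta, \delta'}\, rmarg(\delta, X, Y)\, rmarg(\delta', X, Y) \qquad (12)$$

According to Equation (12), it can be obtained that:

$$var(marr) = E_{\delta, \delta'}\left( cov_{X,Y}\, rmarg(\delta, X, Y)\, rmarg(\delta', X, Y) \right) = E_{\delta, \delta'}\left( \rho(\delta, \delta')\, sd(\delta)\, sd(\delta') \right) \qquad (13)$$

where, with $\delta$ and $\delta'$ held fixed, $\rho(\delta, \delta')$ is the correlation between $rmarg(\delta, X, Y)$ and $rmarg(\delta', X, Y)$, and $sd(\delta)$ and $sd(\delta')$ are the standard deviations of $rmarg(\delta, X, Y)$ and $rmarg(\delta', X, Y)$, respectively. The condition that $var(marr)$ must satisfy is then obtained:

$$var(marr) = \bar{\rho} \left( E_\delta\, sd(\delta) \right)^2 \leq \bar{\rho}\, E_\delta\, var(\delta) \qquad (14)$$

where $\bar{\rho}$ is the mean correlation value. Then we have:

$$E_\delta\, var(\delta) \leq E_\delta \left( E_{X,Y}\, rmarg(\delta, X, Y) \right)^2 - str^2 \leq 1 - str^2 \qquad (15)$$

Putting (7), (14) and (15) together yields:

$$PE^{*} \leq \frac{\bar{\rho} (1 - str^2)}{str^2} \qquad (16)$$

When the strength of the individual classifiers increases or the correlation between classifiers decreases, this loose upper bound on the generalization error becomes smaller, so RF has good generalization ability. Meanwhile, because the generalization error converges as the forest size increases, RF does not easily fall into over-fitting.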
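As a small numerical illustration of how the bound behaves, the sketch below evaluates the upper bound from Breiman's analysis [29], mean correlation times (1 - str^2) divided by str^2, for a few values. The function name and the numbers are hypothetical and chosen only to show the trend; they are not results from the paper.

```python
def generalization_bound(mean_correlation, strength):
    """Upper bound on the generalization error: rho_bar * (1 - str^2) / str^2."""
    if strength <= 0:
        # The bound divides by str^2, so it is only usable for positive strength.
        raise ValueError("strength must be positive")
    return mean_correlation * (1.0 - strength ** 2) / strength ** 2


# Hypothetical forests: same correlation, different strength.
weak_trees = generalization_bound(mean_correlation=0.3, strength=0.6)
strong_trees = generalization_bound(mean_correlation=0.3, strength=0.8)

# Same strength as the weak forest, but less correlated trees.
decorrelated = generalization_bound(mean_correlation=0.1, strength=0.6)
```

Both stronger trees and less correlated trees lower the bound, which matches the qualitative conclusion of the derivation.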

The Classification Process of RF
RF has a simple structure, good generalization ability and anti-noise performance [31]. Compared to other classifiers, the time complexity of RF is lower, and RF can achieve higher classification accuracy. Thus, RF can meet the application needs of massive PQ signal classification. The steps for the classification of RF are described as follows:
1. The bootstrap resampling technique is used to extract the training set for every tree in RF, and the size of each training set is equal to that of the original data set. The samples that have not been extracted compose the out-of-bag data set. k training sets and k out-of-bag data sets are extracted by repeating this process k times.
2. k decision trees are built from the k training sets to construct an RF.
3. During the training process, $m_{try}$ features are randomly selected from the original feature space to construct the candidate segmentation feature subset for each non-leaf node. Most studies let $m_{try} = \sqrt{t}$, where $t$ is the number of original features.
4. Each feature in the candidate segmentation feature subset is used to split the node, and the feature with the best segmentation performance is finally chosen as the segmentation feature of the node.
5. Repeat steps 3 and 4 until all non-leaf nodes have been split; the training process is then over.
6. When using RF to classify PQ signals, a simple majority voting method is used to output the final classification result according to the classification results of the individual trees.
The steps of the classification process are presented in Figure 1.
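The randomized parts of the procedure above (bootstrap sampling, candidate feature selection and majority voting) can be sketched in Python as follows. This is a minimal sketch that deliberately omits tree growing. The helper names are hypothetical, and rounding the square root of t down to an integer is an assumption, since the text only states that most studies use the square root of t.

```python
import math
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Step 1: draw n_samples indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_features(n_features, rng):
    """Step 3: pick m_try distinct features for one node (floor of sqrt(t), assumed)."""
    m_try = max(1, math.isqrt(n_features))
    return rng.sample(range(n_features), m_try)


def majority_vote(tree_predictions):
    """Step 6: return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)                      # fixed seed for repeatability
in_bag, out_of_bag = bootstrap_sample(10, rng)
features = candidate_features(35, rng)      # 35 original features give m_try = 5
winner = majority_vote(["C1", "C9", "C1"])  # two of three trees vote C1
```

With the 35 original features used in this paper, the assumed rounding gives five candidate features per node.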
Entropy 2016, 18, 44

EnI Calculation and Node Segmentation
When using a feature to split a non-leaf node of a decision tree, there are two kinds of indicators to measure the segmentation effect: information gain [32] and the Gini index [33]. Like mutual information, which is widely used for feature selection [34], information gain is calculated from entropy. Applying these two indicators to RF, the EnI and the Gini-importance (GiI) of each feature can be obtained, respectively. During the training process, the EnI method sets the importance of features that contribute little or nothing to the classification to zero. As a result, the EnI method can greatly reduce the feature selection workload compared to the existing GiI method, because the features with zero EnI do not have to be considered. Therefore, the EnI-based feature selection method is able to meet the actual needs of massive PQ signal classification.
Entropy is a quantitative measure of the information carried by data. The more uniform the data distribution, the greater its entropy. Assume that the node to be split consists of a set $S$ that contains $s$ samples and $n$ classes. The entropy of the node is given as:

$$Entropy(S) = -\sum_{k=1}^{n} P_k \log_2 P_k \qquad (17)$$

where $s_k$ is the number of samples in class $k$ ($k = 1, 2, \cdots, n$), and $P_k = s_k / s$ expresses the probability that a sample belongs to class $k$. When $S$ contains only one class, its entropy is zero; when all the classes in $S$ are evenly distributed, the entropy takes its maximum value.
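The two limiting cases of the node entropy can be checked with a short Python sketch. The function name is hypothetical and a base-2 logarithm is assumed, so the entropy is measured in bits.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of the class labels at a node, using a base-2 logarithm (assumed)."""
    s = len(labels)
    # Sum -P_k * log2(P_k) over the classes that actually occur at the node.
    return sum(-(c / s) * math.log2(c / s) for c in Counter(labels).values())


pure_node = entropy(["C1", "C1", "C1", "C1"])   # a single class
even_node = entropy(["C1", "C2", "C3", "C4"])   # four equally likely classes
```

A pure node has zero entropy, and four equally likely classes give two bits. Under the same assumption, a node holding all 15 PQ signal classes in equal proportion would reach the maximum of log2(15), roughly 3.91 bits.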
Assume that when RF uses a feature $A$ to split the node, $S$ is divided into $m$ subsets $S_j$, where $j = 1, 2, \cdots, m$. Then the entropy of the node after splitting on $A$ is defined as:

$$Entropy_A(S) = \sum_{j=1}^{m} \frac{s_{1j} + s_{2j} + \cdots + s_{nj}}{s}\, Entropy(S_j) \qquad (18)$$

where $s_{ij}$ is the number of samples of class $i$ in subset $S_j$. According to Equations (17) and (18), the information gain of splitting the node on $A$ is obtained as:

$$Gain(A) = Entropy(S) - Entropy_A(S) \qquad (19)$$

The information gain of each feature in the candidate segmentation feature subset can be calculated according to Equations (17)-(19). According to the new feature selection method, the feature with the highest Gain value is chosen as the segmentation feature of this node, and the information gain of the other features (all features in the original feature space except feature $A$) is set to zero:

$$Gain(A) = \begin{cases} Gain(A), & \text{feature } A \text{ has the highest information gain} \\ 0, & \text{else} \end{cases} \qquad (20)$$

After the RF has been built, the EnI of a feature is obtained by linear superposition of all its information gain values:

$$EnI(A) = \sum_{i=1}^{n} Gain_i(A) \qquad (21)$$

where $Gain_i(A)$ is the information gain of feature $A$ at the $i$th non-leaf node and $n$ is the total number of non-leaf nodes in the RF.
Finally, the importance of each feature can be analyzed by sorting all features in descending order according to their EnI.The feature with higher EnI will be used to construct the optimal feature set.
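The node-level information gain and the accumulation of EnI over the forest can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the base-2 logarithm, the toy labels and the per-node gain values are all hypothetical, and the per-node gains are passed in directly rather than produced by a trained forest.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of the class labels at a node (base-2 logarithm assumed)."""
    s = len(labels)
    return sum(-(c / s) * math.log2(c / s) for c in Counter(labels).values())


def information_gain(labels, subsets):
    """Node entropy minus the size-weighted entropy of the subsets after the split."""
    s = len(labels)
    split_entropy = sum(len(sub) / s * entropy(sub) for sub in subsets)
    return entropy(labels) - split_entropy


def accumulate_eni(node_gains):
    """Sum, per feature, the gains at the nodes where that feature won the split.

    node_gains holds one dict per non-leaf node, mapping each candidate
    feature to its gain there. Losing candidates contribute zero.
    """
    eni = {}
    for gains in node_gains:
        for feature in gains:
            eni.setdefault(feature, 0.0)
        best = max(gains, key=gains.get)
        eni[best] += gains[best]
    return eni


labels = ["sag"] * 4 + ["swell"] * 4
perfect = information_gain(labels, [["sag"] * 4, ["swell"] * 4])
useless = information_gain(labels, [["sag", "sag", "swell", "swell"]] * 2)

# Hypothetical candidate gains at three non-leaf nodes.
eni = accumulate_eni([
    {"F4": 0.9, "F7": 0.2},
    {"F5": 0.6, "F7": 0.5},
    {"F4": 0.3, "F22": 0.1},
])
```

In this toy forest F7 and F22 never win a split, so their EnI stays at zero even though their raw gains are positive, which is the behaviour that allows zero-EnI features to be skipped during the search.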

Forward Search Strategy of PQ Feature Selection Based on EnI
In this paper, an SFS algorithm [35] based on EnI is proposed for PQ feature selection. Firstly, the features are added to the selected feature subset Q one by one, in descending order of importance. Each time a new feature is added, Q is used as the input vector of RF to retrain a classifier, and the classification accuracy is recorded. This process is repeated until all features have been added to Q. Finally, the optimal feature subset is determined by taking both the classification accuracy and the dimension of the selected feature subset into consideration. The feature selection process needs to be performed only once to train the RF classifier. The flow diagram of the new method is shown in Figure 2.
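The search loop described above can be sketched in Python as follows. Several parts are assumptions made for illustration: `evaluate` is a hypothetical callback that would retrain an RF on the given subset and return its accuracy, the rule "smallest subset within a tolerance of the best accuracy" is only one possible way to weigh accuracy against dimension (the text does not state an exact rule), and the EnI scores and toy accuracies are made up rather than taken from the experiments.

```python
def forward_search(eni_scores, evaluate, tolerance=0.002):
    """Add features in descending EnI order and record the accuracy of each prefix.

    Returns the smallest prefix whose accuracy is within `tolerance` of the
    best recorded accuracy (an assumed selection rule), plus the full history.
    """
    ranked = [f for f, score in sorted(eni_scores.items(),
                                       key=lambda kv: kv[1], reverse=True)
              if score > 0]                      # zero-EnI features are skipped
    if not ranked:
        raise ValueError("no feature has a non-zero EnI score")
    history = []
    for size in range(1, len(ranked) + 1):
        subset = ranked[:size]
        history.append((subset, evaluate(subset)))
    best_accuracy = max(acc for _, acc in history)
    for subset, acc in history:
        if acc >= best_accuracy - tolerance:
            return subset, history


# Hypothetical EnI scores; F9 has zero EnI and is never evaluated.
scores = {"F4": 3.0, "F5": 2.5, "F22": 2.0, "F25": 1.5, "F1": 0.5, "F9": 0.0}


def toy_accuracy(subset):
    """Stand-in for retraining RF: accuracy saturates once four features are used."""
    return min(1.0, 0.5 + 0.15 * len(subset))


selected, history = forward_search(scores, toy_accuracy)
```

Here only five subsets are evaluated instead of six, and the search settles on the first four ranked features because adding a fifth does not improve the toy accuracy.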

Experimental Results and Analysis
Through the simulation contrast experiment, the new method is analyzed and validated in aspects of feature selection methods, classifier performance, and signal processing methods.


Feature Extraction of PQ Signals
Referring to [13,15], 15 kinds of PQ signals are generated by simulation, including normal (C0), sag (C1), swell (C2), interruption (C3), flicker (C4), transient (C5), harmonic (C6), notch (C7), spike (C8), harmonic with sag (C9), harmonic with swell (C10), harmonic with flicker (C11), sag with transient (C12), swell with transient (C13) and flicker with transient (C14). The sampling frequency is 3.2 kHz, and the fundamental frequency is 50 Hz. To improve the capability of the features extracted from ST, following [25], different values of the window width factor are used in different frequency bands. The original features extracted from the ST modular matrix (STMM) are described as follows [13]:
Feature 1 (F1): the maximum value of the maximum amplitude of each column in STMM ($A_{max}$).
Feature 2 (F2): the minimum value of the maximum amplitude of each column in STMM ($A_{min}$).
Feature 3 (F3): the mean value of the maximum amplitude of each column in STMM (Mean).
Feature 4 (F4): the standard deviation (STD) of the maximum amplitude of each column in STMM (STD).
Feature 5 (F5): the amplitude factor ($A_f$) of the maximum amplitude of each column in STMM, defined as $A_f = \frac{A_{max} + A_{min} - 1}{2}$, in the range $0 < A_f < 1$.
Feature 6 (F6): the STD of the maximum amplitude in the high frequency area above 100 Hz.
Feature 7 (F7): the maximum value of the maximum amplitude in the high frequency area above 100 Hz ($A_{HFmax}$).
Feature 8 (F8): the minimum value of the maximum amplitude in the high frequency area above 100 Hz ($A_{HFmin}$).
The amplitude of voltage of a sampling point is $x_i$, where $1 \leq i \leq M$, and $M$ is the number of all sampling points. Then the relevant calculation formulas of the features are described as follows:
Skewness:
And the calculation formulas of F19 and F20 are given by:
where $Rms(m)$ is the root mean square (RMS) of each 1/4 cycle of the original signal, and $R_0$ is the RMS of the standard PQ signal with no noise.
Moreover, the element in the $i$th row and $j$th column of the matrix is $x_{ij}$, and the submatrix required for the calculation of the energy-related features is delimited by its starting row, ending row, starting column and ending column, the last of which is denoted $M_2$. The calculation formula of the energy-related features is described as follows:
The calculation methods of these features mainly refer to [13,15]. Among them, the calculation methods of features F1 to F24 refer to [13], and the calculation methods of features F26 to F35 refer to [15]. Moreover, six kinds of complex disturbances need to be classified, and the classification of complex disturbances with a transient is easily disturbed by noise and by the time-frequency energy at the starting and ending points of a voltage sag. Therefore, F25 is introduced for the identification of transient oscillation components.
The calculation method of F25 is described as follows: (1) The maximum of the summation of amplitudes of each row in the oscillation frequency domain and the maximum of the summation of amplitudes of each column in the full time domain are used to locate the possible time-frequency center point of the oscillation. (2) The local energy of the final 1/4 cycle and the ±150 Hz range of this time-frequency center point is calculated as F25.
The above features reflect the disturbance characteristics of different types of PQ disturbances from four aspects: disturbance amplitude, disturbance frequency, high-frequency energy and abrupt changes in the original signal energy. When a disturbance occurs, the values of some features differ greatly between disturbance types, so the features that reflect these disturbance indices can be used to recognize disturbances. Eleven features distinguish disturbances according to disturbance amplitude: F1 to F5, F21 and F26 to F30. Nineteen features distinguish disturbances according to disturbance frequency: F6 to F18, F22 and F31 to F35; these features reflect the main frequency components of disturbances and the differences in amplitude spectrum. Three features, F23 to F25, distinguish higher harmonics from transient oscillations according to the energy in the high-frequency area. Finally, because the original signal amplitude changes abruptly when a sag, interruption or swell occurs, two features, F19 and F20, distinguish these three kinds of disturbances by calculating the energy of each 1/4 cycle of the original signal.
According to the new method, features with non-zero EnI are added to the selected feature subset one after another in descending order of EnI. Whenever a feature is added, RF is used to verify the classification performance of the resulting feature subset. Using information gain and the Gini index, respectively, as the basis of node partition, the two resulting feature importances are shown in Figure 3a,b. Figure 3a shows that 20 features have an EnI of 0, which means that these features have no or very little effect on node segmentation. Therefore, when searching the feature space, the new method needs to iterate only 15 times, while the GiI method needs to iterate 35 times. The new method is thus more efficient in feature selection than the GiI-based method.
The values of the STD of steady-state disturbances such as normal voltage, flicker and spike are small, so F4 can divide all kinds of disturbances into two categories. F5 represents the amplitude factor of the maximum amplitude of each column in STMM. Because the values of F5 for swell, sag and other types of disturbances lie in different intervals, F5 can distinguish swell and sag from the others. F22 represents the maximum value of the intermediate frequency area, and it can distinguish harmonics from other disturbances. F25 represents the energy of a local matrix. Because the disturbance frequency of a transient is high, F25 can distinguish transients from other disturbances.
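To see why F5 places sag, swell and other signals in different intervals, the amplitude factor can be evaluated on a few idealized cases. The sketch below assumes the definition of F5 reads as (A_max + A_min - 1) / 2 with amplitudes in per unit; the function name and the amplitude values are hypothetical examples, not simulation data from the paper.

```python
def amplitude_factor(a_max, a_min):
    """Amplitude factor A_f = (A_max + A_min - 1) / 2, amplitudes in per unit (assumed)."""
    return (a_max + a_min - 1.0) / 2.0


# Idealized per-unit amplitudes (hypothetical): nominal amplitude is 1.0.
normal = amplitude_factor(a_max=1.0, a_min=1.0)  # no amplitude change
sag = amplitude_factor(a_max=1.0, a_min=0.5)     # amplitude drops to 0.5
swell = amplitude_factor(a_max=1.5, a_min=1.0)   # amplitude rises to 1.5
```

Under these assumptions a sag falls below 0.5, a swell rises above 0.5, and a signal without amplitude change sits at 0.5, which is consistent with F5 separating sag and swell from the other disturbances.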
Figure 4a-c illustrates the classification performance of combinations of the first four features in Figure 3a in the condition of SNR = 8. Figure 4a shows the scatter plot of the combination of F5, F22 and F25. It can be seen that the samples of C1 and C5, C2 and C4, C7 and C12, and C6 and C15 overlap, while the other types of disturbance are clearly separated. F4 and F5 are then used for further segmentation, as Figure 4b shows. Although C2 and C4 still overlap in Figure 4b, the number of overlapping samples is sharply reduced, and C7, C12, C6 and C15 are completely separated. As shown in Figure 4c, C1 and C5 can be clearly separated by the combination of F4 and F22. Therefore, the four features with the highest EnI can distinguish the 15 types of PQ signal effectively, which supports the validity of the new method. Figures 5 and 6 present the classification performance and the training error of different feature subsets under different SNRs, respectively. As the features are added one by one, the classification accuracy increases and the training error decreases. As shown in Figures 5 and 6, the classification accuracy and the training error tend to be stable when the feature subset dimension of the new method exceeds four, while the GiI method needs at least ten features to achieve satisfactory classification results.
When the number of selected feature is 4 or 10, respectively, the details of the classification accuracy of EnI method and GiI method are listed in Tables 1-4.From these four tables, it can be seen that EnI method can achieve higher classification accuracy with the same feature subset under the high noise environment (the SNR of PQ signals is 20 dB).
Table 1. Classification of new method (the number of features is 4, SNR is 20 dB).
Comprehensive accuracy: 95.9%
Table 2. Classification of GiI method (the number of features is 4, SNR is 20 dB).
Comprehensive accuracy: 66.3%
Table 3. Classification of new method (the number of features is 10, SNR is 20 dB).
Comprehensive accuracy: 97.1%
Table 4. Classification of GiI method (the number of features is 10, SNR is 20 dB).

Comparison Experiment and Analysis
The feature selection result of the new method is compared with those of the GiI method, the SFS algorithm [35] and sequential backward search (SBS) [36] to verify the validity of the new approach. The numbers of features selected by the GiI, SFS and SBS methods are 10, 13 and 15, respectively. For the new method, two cases are considered, in which the dimension of the feature subset is 4 and 10, respectively. Moreover, the original feature set is used for comparison as well.
To verify the validity of the feature selection results of the new method, four kinds of classifier, RF, SVM [14], PNN [13] and DT, are used to classify the 15 kinds of PQ signals under different noise environments and with different feature subsets. The DT classifier is constructed with the rpart software package in the R project. The classification results are shown in Table 5. The feature selection methods based on EnI and GiI are compared according to Table 5. When RF is used as the classifier and the EnI method selects 4 features, the classification accuracy is close to that of the GiI method with 10 features. When the EnI method selects the same number of features as the GiI method, the two methods have the same accuracy when the SNR is higher than 30 dB, but at an SNR of 20 dB the accuracy of the EnI method is 0.3% higher than that of the GiI method. This indicates that, with an RF-based classifier, the new EnI-based method performs better than the GiI-based method. Meanwhile, when the SNR is 20 dB, the SBS method achieves a classification accuracy of 98.5%. However, when the classification accuracy under all conditions and the efficiency of feature selection and extraction are taken into consideration, the EnI method is still considered better than the SBS method. It can also be seen that the new method can use the same feature subset to achieve satisfactory classification accuracy under different noise environments. This overcomes the disadvantage that existing research [15] needs to select different feature subsets under different noise environments. Meanwhile, when RF is used as the classifier and the dimension of the selected feature subset increases from 4 to 10, the classification accuracy in high-SNR environments does not improve, but the classification accuracy at an SNR of 20 dB improves by 1.2%. Therefore, different feature subsets can be selected according to the demands on classification accuracy and efficiency in practical work.
The classification ability of different classifiers can also be analyzed using Table 5. As shown in Table 5, RF performs better on the new test sets than the other three classifiers. The best classification accuracy is achieved only when RF is used as the classifier, whatever the noise level. When the SNR is 50 dB and the feature selection methods are EnI + SFS (10 selected features), GiI + SFS (10 selected features) and ALL, RF achieves a classification accuracy of 99.9%. When the SNR is 40 dB and the feature selection methods are EnI + SFS (10 selected features) and GiI + SFS (10 selected features), RF achieves a classification accuracy of 100%. When the SNR is 30 dB and the feature selection methods are EnI + SFS (4 selected features), EnI + SFS (10 selected features), GiI + SFS (10 selected features) and ALL, RF achieves a classification accuracy of 99.7%. When the noise level is high (SNR is 20 dB) and the feature selection method is SBS, the classification accuracy of RF is 9.9% higher than that of SVM, and 3.7% and 4.3% higher than those of the other two classifiers, respectively. These results indicate that RF has better anti-noise ability and is more suitable for application in high-noise environments. Moreover, the classification accuracy of RF is higher than that of DT under every condition, which indicates that RF has better generalization ability than DT.
Besides classification accuracy, the impact of feature selection on classification efficiency is also analyzed. In practical applications, the original PQ signals must be processed by ST after they are collected. The corresponding features are then extracted from the ST results. Finally, the extracted features are fed to the trained classifier, which outputs the disturbance type. Feature selection can therefore effectively reduce both the feature computation time and the complexity of the classifier. Figure 7 shows the normalized time consumed by 50 new test sets of original disturbance signals, from ST processing to disturbance type output, when the number of selected features is 4, 10, 13, 15 and 35. The total time for recognizing the signals with 35 features is taken as the standard time (1 pu).
Figure 7 shows that the total classification time decreases significantly as the number of features decreases. When the number of selected features decreases from 35 to 4, the total classification time is reduced by 39.3%. When it decreases from 35 to 10, the total classification time is reduced by 27.3%. This shows that feature selection effectively improves the efficiency of the classifier.

The Determination of Tree Number of RF Classifier
The number of trees determines the scale of the RF. As the number of trees increases, the generalization error becomes smaller and the EnI analysis of features becomes more accurate, so the classification performance improves. The number of trees in the RF is set to 300 during the feature selection process. However, too many trees reduce the efficiency of classification. It is therefore necessary to analyze the influence of the number of trees on the classification error in order to determine the optimal RF scale for the optimized feature subset. Figure 8a,b show the relationship between the number of trees and the classification error for the different feature selection methods.
In Figure 8a, the classification error tends to be stable once the number of trees exceeds 10, while the GiI method needs at least 100 trees in Figure 8b. The new method therefore has a simpler structure than the GiI-based RF at the same classification accuracy. Meanwhile, the time spent on RF classification increases with the number of trees. Finally, the number of trees in the RF is set to 10 for the classification process.
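One way to read a forest size off an error curve such as those in Figure 8 is to take the smallest number of trees after which the error stays within a small margin of its final value. The sketch below assumes this rule; the function name, the `tolerance` value and both error curves are made up for illustration and are not taken from the paper.

```python
def smallest_stable_forest(errors_by_size, tolerance=0.002):
    """Smallest tree count whose error stays within `tolerance` of the error
    reached by the largest forest tested.

    errors_by_size: dict mapping number of trees -> classification error.
    tolerance:      hypothetical stability margin (absolute error).
    """
    sizes = sorted(errors_by_size)
    reference = errors_by_size[sizes[-1]]
    for i, size in enumerate(sizes):
        # Require every larger forest to be stable too, not just this one.
        if all(abs(errors_by_size[s] - reference) <= tolerance for s in sizes[i:]):
            return size
    return sizes[-1]


# Made-up error curves shaped like the trends described for Figure 8a and 8b.
curve_a = {1: 0.080, 5: 0.020, 10: 0.004, 50: 0.003, 100: 0.003, 300: 0.003}
curve_b = {1: 0.090, 5: 0.040, 10: 0.020, 50: 0.010, 100: 0.004, 300: 0.003}

print(smallest_stable_forest(curve_a), smallest_stable_forest(curve_b))
```

Checking all larger sizes, rather than only the current one, avoids picking a size where the error dips briefly before rising again.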

Effect of Signal Processing Method on Classification Accuracy
The influence of the signal processing method applied to the PQ signals is also considered, since different signal processing methods affect the classification accuracy of PQ disturbance signals. Therefore, after the new feature selection method and RF classifier have been shown to be effective, the classification accuracies obtained with the discrete wavelet transform (DWT) [37] and the wavelet packet transform (WPT) [38] are compared with that of ST. The new method is used for feature selection and classification in all cases.
In the comparison experiment, the features of the DWT-based method are extracted following [37]. The fourth-order Daubechies wavelet (db-4) was chosen as the mother wavelet function. A 9-level multiresolution decomposition is then performed on the original signals. From the detail coefficients at each level and the approximation coefficients at the last level, 90 features are extracted. The feature extraction methods of DWT are shown in Table 6.

Table 6. Feature extraction methods based on DWT [37].

In Table 6, i = 1, 2, ..., l represents the multiresolution level, and N stands for the number of detail or approximation coefficients at each multiresolution level.
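The structure of a 9-level multiresolution decomposition can be illustrated with a short sketch. To stay dependency-free it uses the orthonormal Haar wavelet as a stand-in for db-4, so the coefficient values differ from those of the paper; only the band structure (nine detail bands plus one approximation band) carries over. The test signal, the sampling rate and the use of band energy as a feature are assumptions for illustration.

```python
import math


def haar_step(x):
    """One orthonormal Haar analysis step: returns (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail


def multilevel_dwt(x, levels):
    """Return [D1, ..., D_levels, A_levels]; len(x) must be divisible by 2**levels."""
    bands = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands


def band_energy(coeffs):
    """Sum of squared coefficients, one simple per-band feature."""
    return sum(c * c for c in coeffs)


# Illustrative signal: 1024 samples of a 50 Hz sine sampled at 3.2 kHz.
signal = [math.sin(2 * math.pi * 50 * n / 3200) for n in range(1024)]
bands = multilevel_dwt(signal, levels=9)
energies = [band_energy(b) for b in bands]
print(len(bands), [len(b) for b in bands])
```

Because the transform is orthonormal, the band energies sum to the energy of the original signal, which gives a quick sanity check on any implementation.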
The features extracted from WPT follow [38]. The fourth-order Daubechies wavelet (db-4) was again chosen as the mother wavelet function. A 4-level decomposition yields 16 sets of wavelet coefficients, from which 96 features are extracted. The feature extraction methods of WPT are shown in Table 7. In Table 7, j = 1, 2, ..., k indexes the nodes at the fourth decomposition level, and M is the number of coefficients in each decomposed data set.
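Unlike the DWT, a wavelet packet decomposition splits both the low-pass and the high-pass branch at every level, which is why four levels give 2^4 = 16 terminal nodes. The sketch below shows this with the Haar wavelet as a stand-in for db-4; the signal is illustrative, the nodes are in natural (not frequency) order, and the figure of six features per node is only an inference from 96 / 16, not something stated in the text.

```python
import math


def haar_split(x):
    """Split a node into its low-pass and high-pass children (orthonormal Haar)."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return low, high


def wavelet_packet_nodes(x, levels):
    """Return the 2**levels terminal nodes of a full wavelet packet tree."""
    nodes = [list(x)]
    for _ in range(levels):
        # Every node is split, so the node count doubles at each level.
        nodes = [child for node in nodes for child in haar_split(node)]
    return nodes


# Illustrative signal: 640 samples of a 50 Hz sine sampled at 3.2 kHz.
signal = [math.sin(2 * math.pi * 50 * n / 3200) for n in range(640)]
nodes = wavelet_packet_nodes(signal, levels=4)
features_per_node = 96 // len(nodes)  # 96 is the feature count reported above
print(len(nodes), len(nodes[0]), features_per_node)
```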
After the original feature sets are obtained, the new feature selection strategy proposed in this paper is applied to them as well. The numbers of features selected from the original feature sets of DWT and WPT are 23 and 27, respectively, and the two optimal feature subsets are described in Tables 8 and 9, respectively. Finally, the two optimal feature subsets are used as the input of the RF to train the classifier. The classification accuracy of the classifier is shown in Table 10.
From Table 10, it can be clearly seen that the ST-based method achieves higher classification accuracy than the other signal processing methods under all conditions. When the SNR is 20 dB and no feature selection is performed, the accuracy of the ST-based method is 14.1% and 14.7% higher than that of DWT and WPT, respectively. With feature selection, it is 11.3% and 14.5% higher, respectively. These results indicate that ST has good noise immunity and that it is reasonable to use ST as the signal processing method in the new approach.

Conclusions
This paper proposes a PQ signal feature selection and classification approach based on an EnI-based RF. The innovations of this article are as follows:
(1) The EnI-based feature selection method calculates the EnI values during the RF training process. These values provide the theoretical basis for the SFS search strategy and make the feature search more efficient than with the GiI-based method.
(2) RF is used for disturbance identification. While maintaining the classification accuracy and efficiency of the DT method, RF also increases the generalization ability of the PQ classifier.
(3) The new method has good noise immunity. It can use the same feature subset and RF structure to achieve satisfactory classification accuracy under different noise environments.

Figure 1. Flow diagram of the RF-based classification.

Entropy 2016, 18, 44
consideration. The process of feature selection needs to be performed only once to train the RF classifier. The flow diagram of the new method is shown in Figure 2.

Figure 2. Flow diagram of the new feature selection method.

Feature 9 (F9): $A_{HFmax} - A_{HFmin}$.
Feature 10 (F10): the skewness of the high frequency area.
Feature 11 (F11): the kurtosis of the high frequency area.
Feature 12 (F12): the standard deviation of the maximum amplitude of each frequency.
Feature 13 (F13): the mean value of the maximum amplitude of each frequency.
Feature 14 (F14): the mean value of the standard deviation of the amplitude of each frequency.
Feature 15 (F15): the STD of the STD of the amplitude of each frequency.
Feature 16 (F16): the STD of the STD of the amplitude of the low frequency area below 100 Hz.
Feature 17 (F17): the STD of the STD of the amplitude of the high frequency area above 100 Hz.
Feature 18 (F18): the total harmonic distortion (THD).
Feature 19 (F19): the energy drop amplitude of 1/4 cycle of the original signal.
Feature 20 (F20): the energy rising amplitude of 1/4 cycle of the original signal.
Feature 21 (F21): the standard deviation of the amplitude of fundamental frequency.
Feature 22 (F22): the maximum value of the intermediate frequency area.
Feature 23 (F23): energy of the high frequency area from 700 Hz to 1000 Hz.
Feature 24 (F24): energy of the high frequency area after morphological de-noising.
Feature 25 (F25): energy of local matrix.
Feature 26 (F26): the summation of maximum value and minimum value of the amplitude of STMM.
Feature 27 (F27): the summation of the maximum value and minimum value of the maximum amplitude of each column in STMM.
Feature 28 (F28): the root mean square of the mean value of the amplitude of each column in STMM.
Feature 29 (F29): the summation of the maximum value and minimum value of the standard deviation of the amplitude of each column in STMM.
Feature 30 (F30): the STD of the STD of the amplitude of each column in STMM.
Feature 31 (F31): the mean value of the minimum value of the amplitude of each line in STMM.
Feature 32 (F32): the STD of the minimum value of the amplitude of each line in STMM.
Feature 33 (F33): the root mean square of the minimum value of the amplitude of each line in STMM.
Feature 34 (F34): the STD of the STD of the amplitude of each line in STMM.
Feature 35 (F35): the root mean square of the standard deviation of the amplitude of each line in STMM.

4.2. Feature Selection and Classification Effect Analysis of the New Method
Fifteen types of PQ disturbances with random disturbance parameters and a signal-to-noise ratio (SNR) between 20 dB and 50 dB were simulated in Matlab 7.2. Five hundred samples of each type are generated to train the RF classifier for feature selection. In addition, 100 samples of each type, with random disturbance parameters and SNRs of 50, 40, 30 and 20 dB, respectively, are generated to verify the feature selection effect and classification ability of the new method under different noise environments.
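Generating a disturbance sample at a prescribed SNR amounts to scaling white Gaussian noise so that the ratio of signal power to noise power matches the target. The following sketch assumes that definition; the sag parameters, the 3.2 kHz sampling rate and the function names are illustrative choices, not the settings used in the paper.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, rng):
    """Return signal plus white Gaussian noise scaled to a target SNR in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR from the clean signal and its noisy version."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical test signal: 50 Hz, 10 cycles, 3.2 kHz sampling, with a sag
# to 50% amplitude between 0.06 s and 0.12 s (all values are illustrative).
fs, f0 = 3200, 50
clean = []
for n in range(fs * 10 // f0):
    t = n / fs
    amplitude = 0.5 if 0.06 <= t < 0.12 else 1.0
    clean.append(amplitude * math.sin(2 * math.pi * f0 * t))

rng = random.Random(0)
for target in (50, 40, 30, 20):
    noisy = add_noise_at_snr(clean, target, rng)
    print(target, round(measured_snr_db(clean, noisy), 1))
```

With a finite number of samples the measured SNR scatters slightly around the target, so the printed values are close to, but not exactly, 50, 40, 30 and 20 dB.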

Figure 3. (a) EnI value of features; (b) GiI value of features.

According to Figure 3a, F4, F5, F22 and F25 have the highest EnI values. As explained in Section 4.1, F4 represents the standard deviation of the maximum amplitude of each column in STMM. The values of this standard deviation are large for disturbances such as sag, swell and interruption, and small for steady-state disturbances such as normal voltage, flicker and spike, so F4 can divide the disturbances into two categories. F5 represents the amplitude factor of the maximum amplitude of each column in STMM. Because the F5 values of swell, sag and the other types of disturbance lie in different intervals, F5 can distinguish swell and sag from the others. F22 represents the maximum value of the intermediate frequency area, and it can distinguish harmonics from other disturbances. F25 represents the energy of the local matrix. Because the disturbance frequency of transients is high, F25 can distinguish transients from other disturbances.
Figure 4a-c illustrates the classification performance of combinations of the first four features in Figure 3a in the condition of SNR =  . Figure 4a shows the scatter plot of the combination of F5, F22 and F25. It can be seen that samples overlap between C1 and C5, C2 and C4, C7 and C12, and C6 and C15, while the other types of disturbance are clearly separated. F4 and F5 are then used for further segmentation, as Figure 4b shows. Although C2 and C4 still overlap in Figure 4b, the number of overlapping samples is sharply reduced, and C7, C12, C6 and C15 are completely separated. As shown in Figure 4c, C1 and C5 can be clearly separated by the combination of F4 and F22. Therefore, the four features with the highest EnI values can distinguish the 15 types of PQ signal effectively, which supports the validity of the new method.
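The idea that a feature which splits the disturbance types into clean groups earns a high importance can be illustrated by the impurity decrease of a single threshold split. The sketch below is not the paper's EnI definition, which is given in an earlier section; it only computes the entropy decrease (information gain) of one split and, for comparison, the Gini decrease. The feature values, class labels and threshold are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def impurity_decrease(values, labels, threshold, impurity):
    """Impurity drop obtained by splitting the samples at `threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = sum(len(part) / n * impurity(part) for part in (left, right) if part)
    return impurity(labels) - weighted


# Made-up values of one feature for four disturbance classes, two samples each.
values = [0.02, 0.03, 0.04, 0.05, 0.30, 0.35, 0.40, 0.45]
labels = ["normal", "normal", "flicker", "flicker", "sag", "sag", "swell", "swell"]

gain_entropy = impurity_decrease(values, labels, 0.1, entropy)
gain_gini = impurity_decrease(values, labels, 0.1, gini)
print(gain_entropy, gain_gini)
```

In this toy case the split separates the four classes into two groups of two, so the entropy falls from 2 bits to 1 bit and the Gini impurity from 0.75 to 0.5. In a forest, such decreases would be accumulated over every node that splits on the feature.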

Figures 5 and 6 present the classification accuracy and training error of different feature subsets at different SNRs, respectively. As features are added one by one, the classification accuracy increases and the training error decreases. As shown in Figures 5 and 6, the classification accuracy and the training error tend to be stable once the feature subset dimension of the new method exceeds four, while the GiI method needs at least ten features to achieve satisfactory classification results.

Figure 5. (a) Classification accuracy of different feature subsets obtained from EnI method; (b) Classification accuracy of different feature subsets obtained from GiI method.

Figure 6. (a) Training error of different feature subsets obtained from EnI method; (b) Training error of different feature subsets obtained from GiI method.

Figure 7. The normalized time of different selected feature number.

Figure 8. (a) Classification error of different scale of RF of EnI method; (b) Classification error of different scale of RF of GiI method.

Table 1. Classification of new method (the number of features is 4, SNR is 20 dB).

Table 2. Classification of GiI method (the number of features is 4, SNR is 20 dB).

Table 5. Comparison of feature selection methods.

Table 8. The selected features extracted from DWT method.

Table 9. The selected features extracted from WPT method.

Table 10. Effect of different signal processing methods for PQ classification.