Multi-scale analysis based ball bearing defect diagnostics using Mahalanobis distance and support vector machine. Entropy 2013

The objective of this research is to investigate the feasibility of utilizing multi-scale analysis and a support vector machine (SVM) classification scheme to diagnose bearing faults in rotating machinery. For complicated signals, the characteristics of a dynamic system may not be apparent at a single scale, particularly the fault-related features of rotating machinery. In this research, multi-scale analysis is employed to extract possible fault-related features at different scales: the multi-scale entropy (MSE), multi-scale permutation entropy (MPE), multi-scale root-mean-square (MSRMS) and multi-band spectrum entropy (MBSE). A subset of these features is then selected as the input of the SVM classifier through Fisher score (FS) and Mahalanobis distance (MD) evaluations. The vibration signals of the bearing test data from Case Western Reserve University (CWRU) are used as illustrative examples. The analysis results demonstrate that an accurate bearing defect diagnosis can be achieved by using the machine features extracted at different scales. The diagnostic results can be further enhanced through the FS and MD feature selection procedures.


Introduction
Machine health monitoring and fault diagnosis have attracted considerable attention in both academic research and industrial applications. Among the enormous number of components and devices, bearings are one of the most crucial parts in rotating machinery. Due to improper initial assembly, manufacturing inaccuracy, repetitive applied stress or impurity contamination, the occurrence of bearing damage is inevitable during long-term operation. Therefore, bearing fault detection in the early stages is an important task in the fields of mechanical and aerospace engineering. An accurate detection of bearing faults can reduce maintenance costs and assure safe machine operation. The bearing components are generally inaccessible for inspection because of the difficulty of disassembly and other possible hazards. Therefore, vibration analysis is one of the most commonly used techniques for detecting bearing malfunctions in rotating systems, since mechanical vibration signals carry plentiful information related to the dynamic characteristics of the system.
Bearing fault diagnosis has been studied over the past decades, and a large volume of prior literature reports promising results. Among the previous investigations, Randall [1,2] provided a complete and systematic review summarizing the measurement techniques as well as the fault diagnosis of rotating mechanical systems. In addition to the conventional spectral analysis of faulty bearing vibration signals [3,4], envelope analysis and the demodulation technique are also widely utilized for the diagnosis of different bearing defects [5-8]. On the other hand, because of the non-stationarity of the vibration signals in a damaged machine, time-frequency analysis methods such as the wavelet transform [9-13] and the Hilbert-Huang transform are also employed to analyze complicated signals with faulty features [10,14-17].
Even though the aforementioned traditional methods have demonstrated the capability of diagnosing bearing defects and shown promising results, the bearing fault-related features are generally clouded by massive uncorrelated signals, and thus the fault features of bearings are not easy to observe with these methods. Therefore, some filtering process is usually used to remove the uncorrelated components and preserve the signals correlated with the fault features. Nevertheless, owing to the lack of rigorous rules for selecting the central frequency and bandwidth of the filters, these filter parameters are generally decided on the basis of empirical experience. Moreover, most filtering processes inevitably cause some phase shift or waveform distortion of the signals, which may interfere with the analysis and diagnosis of the vibration signals.
As presented in the previous literature, entropy can be used to measure the regularity or orderliness of signals, and hence different entropy measures, such as the approximate entropy [18], spectral entropy [19], pattern spectrum entropy [20], Hilbert-Huang entropy [21], and energy entropy [22], have been defined to characterize the system dynamics and disorderliness related to machinery defects and malfunctions. In addition, Costa et al. [23,24] indicated that the heartbeat time series of different pathological patients cannot be classified when the sample entropy of the heartbeat time series is analyzed at only one scale. Therefore, they proposed the idea of multi-scale entropy (MSE) for separating the human heartbeat signals of healthy and pathological groups. In light of the effectiveness of MSE in physiological and biological signal analysis, different MSE-based algorithms and improved methods were proposed to reduce the influence of noise and were employed in the vibration analysis of rotating machinery [25-28]. Furthermore, the concept of multi-scale permutation entropy (MPE) was applied to diagnose bearing faults and demonstrated enhanced performance compared with the single-scale permutation entropy method [29].
In general, classification schemes such as the artificial neural-fuzzy network [22,25], support vector machine (SVM) [8,12,26] and decision tree [13,30] are utilized to identify the different types of machine faults. In order to increase the computational efficiency of the classification procedure and enhance the accuracy of the diagnosis results, the principal component analysis (PCA) method is used to select features of high priority [30,31]. However, the selected features lose their original physical representations through the feature space transformation of PCA. Alternatively, Chen et al. [32] proposed the calculation of a distinction index (DI) to extract the significant features among numerous features. The DI is also known as the Fisher score (FS) and has been applied to select signatures for face recognition [33]. The Mahalanobis distance (MD) can be used to measure the distinction between characteristic vectors and has proved effective in identifying component looseness at different locations of rotating machinery [16,34].
To advance the state of the art of bearing defect diagnostics, a new approach is proposed in this research to investigate the effectiveness of multi-scale analysis and feature selection methods for the diagnosis of different bearing faults. The MSE, MPE, multi-scale root-mean-square (MSRMS) and multi-band spectrum entropy (MBSE) of the machine vibration signals are determined as the features at different scales. Through the FS evaluation, the features of high distinguishability are selected for the diagnosis of different bearing defects. On the other hand, combinations of features with high distinguishability can also be extracted by the MD evaluation procedure. The selected features are fed into the one-against-one SVM structure to classify the different classes of bearing faults. With the feature selection process, either the FS or the MD evaluation, both the accuracy of the classification results and the computational efficiency can be enhanced. Moreover, the features selected by these methods preserve their original physical representations, since both the FS and the MD evaluation consist of statistical calculations without a space transformation such as that of PCA.
The research in this paper utilizes the data of the Bearing Center at Case Western Reserve University (CWRU) [35] to validate the feasibility of the proposed method and evaluate the effectiveness of this approach. The analysis results show that the different classes of bearing defects can be diagnosed accurately through the multi-scale analysis and SVM classification. The results demonstrate that the diagnostic accuracy can be enhanced by imposing the feature selection process. It is also noted that the MD evaluation can achieve more accurate diagnostic results than the FS evaluation.

Entropy and Multi-Scale Analysis
The concept of entropy is conventionally applied in the fields of information theory as well as thermodynamics. In 1948, Shannon proposed the method of entropy quantification to characterize the information buried in the data, and thus the method is broadly employed to measure the regularity or orderliness of a time series [36]. In general, the entropy increases with the degree of disorder and is at its maximum for a completely random system [23]. Since entropy analysis has been utilized for feature extraction of pathological and biological systems and has proved effective in disease diagnosis, the concept of entropy quantification can also be employed for machine fault detection [25-28].
Suppose a single discrete random series S has N outcomes that fall into n classes. The entropy En(S) of S is defined as:

$$En(S) = -\sum_{i=1}^{n} p(s_i)\,\log p(s_i) \tag{1}$$

where $p(s_i)$ is the probability density function of the random series S, and log represents either the natural logarithm or the logarithm to base 2. The entropy is generally applied to indicate the complexity of a random variable series and can be utilized to predict the system behavior of the time series. Depending on how the probability density function is computed, the entropy can be formulated in the following different representations.
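As a minimal sketch of the Shannon entropy definition, the snippet below estimates $p(s_i)$ by the relative frequency of each class in a discrete series. The function name `shannon_entropy` and the default base-2 logarithm are illustrative assumptions, not part of the original study.

```python
import math
from collections import Counter


def shannon_entropy(series, base=2.0):
    """Entropy of a discrete series, with p(s_i) estimated by relative frequency."""
    counts = Counter(series)
    n = len(series)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


# Two equally likely classes carry one bit of entropy.
fair = shannon_entropy([0, 1, 0, 1, 1, 0, 0, 1])
# A series with a single class is perfectly ordered, so its entropy is zero.
constant = shannon_entropy([3, 3, 3, 3])
```

The two calls bracket the behaviour described in the text: entropy grows with disorder and vanishes for a completely regular series.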

Sample Entropy
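Consider a time series with data length N, $S = \{x_1, x_2, \ldots, x_N\}$. Let m sequential points of the time series be a pattern. For instance, $X_i = [x_i, x_{i+1}, \ldots, x_{i+m-1}]$ represents the i-th pattern. Therefore, this time series consists of N-m+1 patterns. Define the pattern space X as:

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_{N-m+1} \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & \cdots & x_m \\ x_2 & x_3 & \cdots & x_{m+1} \\ \vdots & \vdots & & \vdots \\ x_{N-m+1} & x_{N-m+2} & \cdots & x_N \end{bmatrix} \tag{2}$$

The sample entropy can be determined by the following steps:
(1) Measure the mean self-similarity value of the patterns of length m, $\varphi^{m}(r)$, where r is the tolerance.
(2) Expand the pattern length from m to m+1, and measure the mean self-similarity value $\varphi^{m+1}(r)$.
(3) The sample entropy (SEn) is then determined as [37]:

$$SEn(S, m, r) = \ln\frac{\varphi^{m}(r)}{\varphi^{m+1}(r)} \tag{3}$$

In Equation (3), the self-similarity value is formulated as:

$$\varphi^{m}(r) = \frac{1}{N-m}\sum_{i=1}^{N-m}\left[\frac{1}{N-m-1}\sum_{j=1,\,j\neq i}^{N-m} G\left(r - d_{ij}^{m}\right)\right] \tag{4}$$

where G(·) represents the Heaviside function (G(z) = 1 for z ≥ 0 and 0 otherwise) and $d_{ij}^{m}$ is the distance between the patterns $X_i$ and $X_j$, taken as the maximum absolute difference of their corresponding elements:

$$d_{ij}^{m} = \max_{0 \le k \le m-1}\left|x_{i+k} - x_{j+k}\right| \tag{5}$$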
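The three steps can be sketched as follows. This is a minimal illustration that assumes the common Richman-Moorman conventions: the distance between two patterns is their maximum absolute element difference, two patterns match when that distance does not exceed r, self-matches are excluded, and the same N-m starting points are used for both pattern lengths. The function name and defaults are hypothetical.

```python
import math


def sample_entropy(x, m=2, r=0.2):
    """Sample entropy ln(B/A), where B and A count matching pattern pairs
    of length m and m + 1 (assumed convention: match if distance <= r)."""
    n = len(x)

    def count_matches(length):
        # The first n - m starting points are used for both lengths,
        # so the two counts are directly comparable.
        templates = [x[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):  # j > i excludes self-matches
                d = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if d <= r:
                    count += 1
        return count

    b = count_matches(m)
    a = count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # undefined when no matches are found
    return math.log(b / a)


# A strictly periodic series is perfectly self-similar: every length-2 match
# extends to a length-3 match, so the sample entropy is 0.0.
regular = sample_entropy([1, 2] * 10, m=2, r=0.1)
```

Because both normalisation factors are identical for lengths m and m+1, the ratio of raw match counts equals the ratio of the mean self-similarity values.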

Spectral Entropy
The concept of spectral entropy was first proposed by Powell et al. [38] and has been applied in fault diagnosis of gearboxes and bearings [39]. Like the Shannon entropy, the spectral entropy uses the normalized power spectrum density of the signal as the probability density function in Equation (1). In this research, the Discrete Fourier Transform (DFT) is utilized to calculate the spectrum of the signals. Let $Y(f_i)$ be the magnitude of the signal at the frequency $f_i$; the power spectrum density $\hat{Y}(f_i)$ is then defined as:

$$\hat{Y}(f_i) = \frac{\left|Y(f_i)\right|^2}{\sum_{f_j=0}^{f_s/2}\left|Y(f_j)\right|^2} \tag{6}$$

where $f_s$ represents the sampling frequency. The spectral entropy SpEn is thus formulated as:

$$SpEn = -\sum_{f_i=0}^{f_s/2} \hat{Y}(f_i)\,\log \hat{Y}(f_i) \tag{7}$$
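A minimal sketch of the spectral entropy calculation is given below. It assumes that the power spectrum is taken over the one-sided frequency range from 0 to $f_s/2$ and normalised to unit sum before being used as a probability distribution; a direct DFT is used only to keep the example dependency-free, and the function names are illustrative.

```python
import cmath
import math


def power_spectrum(x):
    """One-sided power |Y(f_i)|^2 for bins 0 .. n/2, computed by a direct DFT."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def spectral_entropy(x, base=2.0):
    """Shannon entropy of the normalised power spectrum."""
    power = power_spectrum(x)
    total = sum(power)
    if total == 0.0:
        return 0.0  # an all-zero signal has no spectrum to characterise
    probs = [p / total for p in power if p > 0.0]
    return -sum(p * math.log(p, base) for p in probs)


# A pure tone concentrates its power in one bin, giving entropy close to zero.
tone = spectral_entropy([math.sin(2 * math.pi * 4 * t / 64) for t in range(64)])
# A unit impulse has a flat spectrum, giving the maximum entropy log2(33).
impulse = spectral_entropy([1.0] + [0.0] * 63)
```

For signals of realistic length, an FFT routine would replace the direct DFT; the entropy step itself is unchanged.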

Permutation Entropy
Permutation entropy, an alternative measure of irregularity or complexity, was first proposed by Bandt et al. in 2002 and applied to speech signal processing and analysis [40]. The concept of permutation entropy was also employed to analyze electroencephalographic signals [41,42] as well as to characterize the working status of rotary machines [43]. Based on the Shannon entropy, the permutation entropy is defined by calculating the probability density function of the permutations of a specified order in a time series. Consider a time series $\{x(t)\}$, t = 1, 2, ..., N. We consider an order-m segment $X_i^m$ of the time series, $X_i^m = [x(i), x(i+1), \ldots, x(i+m-1)]$, and hence there are m! possible permutations π of order m. For each permutation π, the relative frequency is determined by the following formula:

$$p(\pi) = \frac{\#\{\, i \mid 1 \le i \le N-m+1,\; X_i^m \text{ has type } \pi \,\}}{N-m+1} \tag{8}$$

where # represents the number of elements. The permutation entropy is thus expressed in terms of the probability density function represented by the relative frequency, i.e.:

$$PEn(m) = -\sum_{\pi} p(\pi)\,\log p(\pi) \tag{9}$$

where the sum runs over the m! permutations of order m. Let us take an example of a seven-value series [40]:

$$X = [4, 7, 9, 10, 6, 11, 3] \tag{10}$$

For order m = 3, the series X can be segmented as:

$$X_1^3 = [4, 7, 9],\quad X_2^3 = [7, 9, 10],\quad X_3^3 = [9, 10, 6],\quad X_4^3 = [10, 6, 11],\quad X_5^3 = [6, 11, 3] \tag{11}$$

Since the segments (4, 7, 9) and (7, 9, 10) are in increasing order ($x_t < x_{t+1} < x_{t+2}$), they represent the permutation 012. The segments (9, 10, 6) and (6, 11, 3) correspond to permutation 201 since $x_{t+2} < x_t < x_{t+1}$, while the segment (10, 6, 11) is categorized as permutation 102 with $x_{t+1} < x_t < x_{t+2}$. Therefore, the permutation entropy of order m = 3 is determined as:

$$PEn(3) = -2\cdot\frac{2}{5}\log\frac{2}{5} - \frac{1}{5}\log\frac{1}{5} \tag{12}$$

which evaluates to approximately 1.522 when the logarithm is taken to base 2.

Multi-Scale Analysis
The concept of multi-scale analysis was proposed by Costa et al. and utilized to investigate the feasibility of identifying pathological human physiological signals [23,24]. As depicted in their study, it is generally difficult to distinguish the inter-beat interval time series of different diseased and healthy systems if only a single-scale sample entropy is analyzed. Therefore, they proposed the concept of MSE to transform the original signal into different scales of data through the coarse-grained process. Their analysis results demonstrated that the MSE analysis is capable of classifying the different diseased systems.
For a given time series signal, $S = \{x_1, x_2, \ldots, x_N\}$, the data are segmented into consecutive sets of length τ. The new time series $\{y_j^{(\tau)}\}$ is then constructed by taking the mean value of each segment, according to the equation:

$$y_j^{(\tau)} = \frac{1}{\tau}\sum_{i=(j-1)\tau+1}^{j\tau} x_i, \qquad 1 \le j \le \frac{N}{\tau} \tag{13}$$

where τ is called the scale factor. The coarse-grain process is equivalent to sliding a window of length τ along the original signal without overlap and taking the average of the signal within each window. From the signal processing point of view, the coarse-grain process includes the following two steps: (1) using a moving average filter to remove the high frequency components; (2) downsampling the filtered signal by the scale factor τ.
Through the coarse-grain process, the multi-scale analysis can be utilized to characterize the dynamics of the signals in different scales.
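The coarse-grain process can be sketched in a few lines. In this illustration a trailing window shorter than τ is simply dropped, which is an assumption about how incomplete windows are handled; the function name `coarse_grain` is hypothetical.

```python
def coarse_grain(x, tau):
    """Average consecutive, non-overlapping windows of length tau (the scale factor)."""
    if tau < 1:
        raise ValueError("tau must be a positive integer")
    n_windows = len(x) // tau  # a trailing partial window is dropped (assumption)
    return [sum(x[j * tau:(j + 1) * tau]) / tau for j in range(n_windows)]


signal = [1, 3, 5, 7, 9, 11, 13]
scale_1 = coarse_grain(signal, 1)  # the original samples, as floats
scale_2 = coarse_grain(signal, 2)  # [2.0, 6.0, 10.0]
scale_3 = coarse_grain(signal, 3)  # [3.0, 9.0]
```

At scale 1 the series is unchanged, so any multi-scale feature evaluated at τ = 1 reduces to its single-scale counterpart.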

Multi-Scale Entropy
According to the coarse-grain process for multi-scale analysis, the MSE is defined by calculating the SEn of $y^{(\tau)}$ at different scales τ, i.e.:

$$MSE(S, \tau, m, r) = SEn\left(y^{(\tau)}, m, r\right) \tag{14}$$

where $y^{(\tau)}$ represents the series obtained through the coarse-grain process with the scale τ. It is noted that the tolerance r used in calculating the SEn is fixed and does not vary with the scale τ. In general, the tolerance r is selected to be 0.15 times the standard deviation of the original time series signal $S = \{x_1, x_2, \ldots, x_N\}$ [44]. Therefore, the MSE analysis is not affected by the original signal amplitude.

Multi-Scale Permutation Entropy
The concept of multi-scale permutation entropy (MPE) was proposed by Aziz and Arif in 2005 and applied in measuring the entropy of physiological signals [45]. Their study demonstrated that the MPE is more robust to signal noise than the MSE. The MPE is defined as:

$$MPE(S, \tau, m) = nPEn\left(y^{(\tau)}, m\right) \tag{15}$$

where nPEn represents the normalized PEn that is defined by:

$$nPEn\left(y^{(\tau)}, m\right) = \frac{PEn\left(y^{(\tau)}, m\right)}{\log(m!)} \tag{16}$$

in which $PEn(y^{(\tau)}, m)$ is the order-m permutation entropy of the coarse-grained series $y^{(\tau)}$. Since the MSE and MPE correspond to different features of dynamical systems, the two entropy measurements can be utilized to characterize the signals of a rotary system with faulted bearings.
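The following sketch reproduces the seven-value permutation entropy example and extends it to multiple scales. The pattern label is taken to be the list of within-segment indices sorted by value, which matches the 012 / 201 / 102 labels used in the text; normalising by log(m!) and breaking ties by position are assumptions, and all function names are illustrative.

```python
import math
from collections import Counter


def ordinal_pattern(segment):
    """Indices of the segment sorted by value, e.g. (9, 10, 6) -> (2, 0, 1)."""
    return tuple(sorted(range(len(segment)), key=lambda k: segment[k]))


def permutation_entropy(x, m=3, base=2.0, normalize=False):
    """Permutation entropy of order m from the relative pattern frequencies."""
    n_segments = len(x) - m + 1
    counts = Counter(ordinal_pattern(x[i:i + m]) for i in range(n_segments))
    pen = -sum((c / n_segments) * math.log(c / n_segments, base)
               for c in counts.values())
    return pen / math.log(math.factorial(m), base) if normalize else pen


def coarse_grain(x, tau):
    """Non-overlapping window averages; a trailing partial window is dropped."""
    return [sum(x[j * tau:(j + 1) * tau]) / tau for j in range(len(x) // tau)]


def multiscale_permutation_entropy(x, m=3, max_scale=5):
    """Normalised permutation entropy of the coarse-grained series at each scale."""
    return [permutation_entropy(coarse_grain(x, tau), m, normalize=True)
            for tau in range(1, max_scale + 1)]


example = [4, 7, 9, 10, 6, 11, 3]
pen_3 = permutation_entropy(example, m=3)  # about 1.522 with the base-2 logarithm
mpe_ramp = multiscale_permutation_entropy(list(range(30)), m=3, max_scale=3)
```

A monotonically increasing series such as `range(30)` contains only the pattern 012 at every scale, so each entry of `mpe_ramp` is zero.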

Multi-Scale Root-Mean-Square
The root-mean-square (RMS) value is a traditional statistical measure of the signal energy and has been broadly utilized for machine condition monitoring. Based on the aforementioned concept of multi-scale analysis, the RMS measurement can be extended to quantify the energy distributions at different scales or frequency bands. The multi-scale RMS (MSRMS) is defined as:

$$MSRMS(S, \tau) = \sqrt{\frac{1}{N_\tau}\sum_{j=1}^{N_\tau}\left(y_j^{(\tau)}\right)^2} \tag{17}$$

where $y_j^{(\tau)}$ represents the series obtained through the coarse-grain process with the scale τ, and $N_\tau$ is the length of that series. When the machine dynamics change, particularly upon the occurrence of component faults, the vibration energy distributions at different scales change correspondingly. Therefore, the MSRMS values can also be used as indices for bearing defect diagnosis of rotary systems.
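A short sketch of the MSRMS calculation follows, assuming that the RMS is taken directly over the coarse-grained samples at each scale and that a trailing partial window is dropped. The names `msrms` and `coarse_grain` are illustrative.

```python
import math


def coarse_grain(x, tau):
    """Non-overlapping window averages; a trailing partial window is dropped."""
    return [sum(x[j * tau:(j + 1) * tau]) / tau for j in range(len(x) // tau)]


def msrms(x, max_scale=20):
    """RMS value of the coarse-grained series for scales 1 .. max_scale."""
    values = []
    for tau in range(1, max_scale + 1):
        y = coarse_grain(x, tau)
        values.append(math.sqrt(sum(v * v for v in y) / len(y)))
    return values


# An alternating signal holds all of its energy at the highest frequency:
# it has unit RMS at scale 1 and is averaged away completely at scale 2.
alternating = [1.0, -1.0] * 20
rms_by_scale = msrms(alternating, max_scale=3)
```

The example illustrates why the energy distribution over scales is informative: components faster than the averaging window contribute little to the RMS at that scale.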

Multi-Band Spectrum Entropy
Through the coarse-grain process as expressed in Equation (13), the calculation of multi-band spectrum entropy is employed to measure the disorderliness of the signal spectrum within the different scales. Let $y^{(\tau)}$ represent the time series obtained through the coarse-grain process with the scale τ. The multi-band spectrum entropy (MBSE) is defined as:

$$MBSE(S, \tau) = -\sum_{i=1}^{n^{(\tau)}} \hat{X}^{(\tau)}(f_i)\,\log \hat{X}^{(\tau)}(f_i) \tag{18}$$

where $\hat{X}^{(\tau)}(f_i)$ represents the power spectrum density of $y^{(\tau)}$ at the frequency $f_i$, and $n^{(\tau)}$ is the length of the power spectrum density. The normalized MBSE (nMBSE) can thus be formulated as:

$$nMBSE(S, \tau) = \frac{MBSE(S, \tau)}{\log n^{(\tau)}} \tag{19}$$

Feature Selection
As introduced in Section 2, multi-scale analysis can be utilized to extract the features of different scales in the signals, such as MSE, MPE, MSRMS and MBSE. However, superfluous features at different scales may interfere with the diagnosis of machine faults, and inevitably reduce the identification accuracy while increasing the computational effort. Therefore, it is beneficial to select the high-priority features among the large number of features before the feature classification process. In this research, two indices, FS and MD, are employed to evaluate the significance of the extracted features.

Fisher Score
FS is used to evaluate the distinguishability of the k-th feature for classifying the data sets between the i-th and the j-th classes. Suppose there are $N_i$ data sets in the i-th class and $\{v_{ik}(1), v_{ik}(2), \ldots, v_{ik}(N_i)\}$ represent the k-th feature of all the $N_i$ data sets in the i-th class. The mean value and standard deviation of the k-th feature are then determined as:

$$\mu_{ik} = \frac{1}{N_i}\sum_{n=1}^{N_i} v_{ik}(n), \qquad \sigma_{ik} = \sqrt{\frac{1}{N_i}\sum_{n=1}^{N_i}\left(v_{ik}(n) - \mu_{ik}\right)^2} \tag{20}$$

The FS of the k-th feature between the i-th and the j-th classes is defined as [32]:

$$FS_k(i, j) = \frac{\left|\mu_{ik} - \mu_{jk}\right|^2}{\sigma_{ik}^2 + \sigma_{jk}^2} \tag{21}$$

As Equation (21) shows, a large FS value indicates that the k-th feature is highly distinct between the i-th and the j-th classes. Namely, the k-th feature has high priority for classifying the data sets between the i-th and the j-th classes.
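A minimal sketch of feature ranking by Fisher score is shown below. It assumes the common form in which the squared difference of the class means is divided by the sum of the class variances, with population (1/N) variances; the toy data and function names are hypothetical and unrelated to the CWRU measurements.

```python
from statistics import fmean, pvariance


def fisher_score(values_i, values_j):
    """Squared mean difference over summed variances for one feature (assumed form)."""
    mu_i, mu_j = fmean(values_i), fmean(values_j)
    spread = pvariance(values_i, mu_i) + pvariance(values_j, mu_j)
    if spread == 0.0:
        return float("inf") if mu_i != mu_j else 0.0
    return (mu_i - mu_j) ** 2 / spread


def rank_features(class_i, class_j):
    """Return feature indices sorted by decreasing Fisher score, and the scores."""
    n_features = len(class_i[0])
    scores = [
        fisher_score([row[k] for row in class_i], [row[k] for row in class_j])
        for k in range(n_features)
    ]
    ranking = sorted(range(n_features), key=lambda k: scores[k], reverse=True)
    return ranking, scores


# Hypothetical data: rows are data sets, columns are features.
class_a = [[1.0, 5.0], [1.2, 3.0], [0.8, 7.0]]
class_b = [[3.0, 5.5], [3.1, 2.5], [2.9, 6.5]]
ranking, scores = rank_features(class_a, class_b)
```

In the toy data the first feature separates the two classes cleanly while the second overlaps heavily, so the first feature is ranked highest.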

Mahalanobis Distance
Mahalanobis [46] first proposed the MD, a statistical method for measuring the distance between the data sets of two different classes. For the data sets of two different classes, let $\mu_i$ and $\mu_j$ represent the mean values of the characteristic vectors in the i-th and the j-th classes; the MD between the characteristic vectors of the i-th and the j-th classes is defined as:

$$MD_{ij} = \sqrt{(\mu_i - \mu_j)^T\, C_{ij}^{-1}\, (\mu_i - \mu_j)}, \qquad C_{ij} = \frac{N_i C_i + N_j C_j}{N_i + N_j} \tag{22}$$

where $N_i$ and $N_j$ are the numbers of data sets in the i-th and the j-th classes, and $C_i$ and $C_j$ represent the covariance matrices of the characteristic vectors in the i-th and the j-th classes, respectively. Because the characteristic vectors consist of multiple features of the data sets and the corresponding covariance matrices are included, the MD measurement can statistically depict the distance between the two classes of characteristic vectors while accounting for their variations. It is noted that the FS formulates the distinguishability of an individual feature for the data sets of two classes, while the MD reflects the distinguishability of different combinations of features for the data sets of two classes. Therefore, even though the features of highest FS are selected, the combination of the selected features may not be the most distinguishable for the data sets of two classes if that combination does not have the highest MD value.
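The sketch below computes a Mahalanobis distance between two classes of characteristic vectors. It assumes that the two class covariance matrices are pooled with weights $N_i$ and $N_j$ and that population (1/N) covariances are used; a small Gauss-Jordan solver is included only to keep the example dependency-free. All names and data are hypothetical.

```python
import math


def mean_vector(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]


def covariance(rows):
    """Population covariance matrix of the rows (one row per data set)."""
    n, d = len(rows), len(rows[0])
    mu = mean_vector(rows)
    return [[sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / n
             for b in range(d)] for a in range(d)]


def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gauss-Jordan elimination with partial pivoting."""
    d = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(d):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][d] / aug[i][i] for i in range(d)]


def mahalanobis_distance(class_i, class_j):
    """Distance between class means under the pooled covariance (assumed weighting)."""
    n_i, n_j = len(class_i), len(class_j)
    c_i, c_j = covariance(class_i), covariance(class_j)
    d = len(c_i)
    pooled = [[(n_i * c_i[a][b] + n_j * c_j[a][b]) / (n_i + n_j)
               for b in range(d)] for a in range(d)]
    diff = [p - q for p, q in zip(mean_vector(class_i), mean_vector(class_j))]
    z = solve(pooled, diff)  # z = pooled^{-1} diff, without forming the inverse
    return math.sqrt(sum(u * v for u, v in zip(diff, z)))


# Two hypothetical classes with identity covariance and means (1, 1) and (4, 5).
class_i = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
class_j = [[3.0, 4.0], [5.0, 4.0], [3.0, 6.0], [5.0, 6.0]]
md = mahalanobis_distance(class_i, class_j)  # equals the Euclidean distance 5.0 here
```

Evaluating this distance for each candidate subset of feature columns and keeping the subset with the largest value is one way to realise the combination-based selection described above; the number of subsets grows quickly, which is consistent with the higher computational effort reported for the MD evaluation.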

Support Vector Machine
The SVM aims to find the optimal separating plane, termed the hyperplane, such that the data sets of two different classes can be separated by it. The concept of the SVM is illustrated in Figure 1 [47]. The figure shows that the hyperplane is placed between the data sets of two different classes, circles and triangles, and is oriented in such a way that the margin between the data sets of the two classes is maximized. Namely, the distance between the hyperplane and the nearest data point in each class is maximal. The nearest data points that define the margin form the support vectors. Let $\{(x_k, y_k),\ k = 1, 2, \ldots, N\}$ be the given data sets used as the training sample. Each sample $x_k \in R^d$ is assigned to class i or class j according to the value of $y_k$ ($y_k \in \{+1, -1\}$). The hyperplane can be formulated as:

$$w^T x + b = 0 \tag{23}$$

where w represents the weighting vector and b is the bias. The data point $x_k$ is thus distinguished as class i or j according to the value of the decision function, that is:

$$f(x_k) = \mathrm{sign}\left(w^T x_k + b\right) \tag{24}$$

Therefore, the SVM is equivalent to finding the solution of the constrained optimization problem:

$$\min_{w,\,b}\ \frac{1}{2}\left\|w\right\|^2 \tag{25}$$

which is subject to the constraint:

$$y_k\left(w^T x_k + b\right) \ge 1, \qquad k = 1, 2, \ldots, N \tag{26}$$

This standard optimization problem can be solved through the quadratic programming process. When a linear hyperplane is incapable of separating the data sets effectively in the original space, the nonlinear SVM can be applied to classify the data sets in a higher-dimensional space through the space transformation achieved by the kernel function mapping procedure. The commonly used kernel functions include the homogeneous polynomial function, inhomogeneous polynomial function, Gaussian radial basis function, and sigmoid function [8].
The fundamental theory of the SVM as stated above is based on the classification of data sets in two classes. Since most practical applications require the classification of data sets of more than two classes, three SVM structures, the one-against-one SVM, the one-against-all SVM and the hierarchical SVM, are frequently utilized to separate the data sets of N different classes. Among the three structures, the hierarchical SVM requires the least number of classifiers: at most N-1 classifiers are needed to separate the data sets of N different classes. Although the one-against-all SVM requires N classifiers to distinguish the data sets of N classes, parallel computation can be employed to execute the N classifiers concurrently, and hence the computing time of classification is not affected by the number of classes. However, the drawback of these two SVM structures is that they inevitably require relatively more time and data sets for the training procedures. The one-against-one SVM, on the other hand, requires more classifiers, N!/[2!(N-2)!] = N(N-1)/2, than the above two structures. Nevertheless, the classification accuracy of the one-against-one SVM is more reliable than that of the above two SVMs based on test experience [48]. Therefore, the diagnosis results in this research are based on the classification of the one-against-one SVM.
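The one-against-one structure can be sketched as a majority vote over N(N-1)/2 pairwise decision functions. In the illustration below, each pairwise decision function is a simple linear stand-in (the perpendicular bisector between two hypothetical class centroids), not a trained SVM; in practice each entry would be a binary SVM trained on the data of its two classes. The class names follow the four bearing conditions of the paper, while the centroid coordinates are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def make_linear_decision(c_i, c_j):
    """Stand-in for a trained binary classifier: f(x) = w.x + b, with f(x) >= 0
    meaning class i. Here w and b define the bisector between two centroids."""
    w = [p - q for p, q in zip(c_i, c_j)]
    bias = -(sum(p * p for p in c_i) - sum(q * q for q in c_j)) / 2.0
    return lambda x: sum(wk * xk for wk, xk in zip(w, x)) + bias


def one_against_one_predict(x, pairwise):
    """Each pairwise classifier casts one vote; the class with most votes wins."""
    votes = Counter()
    for (i, j), decide in pairwise.items():
        votes[i if decide(x) >= 0 else j] += 1
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature centroids for the four bearing conditions.
centroids = {
    "normal": (0.0, 0.0),
    "inner race": (4.0, 0.0),
    "rolling element": (0.0, 4.0),
    "outer race": (4.0, 4.0),
}
pairwise = {
    (i, j): make_linear_decision(centroids[i], centroids[j])
    for i, j in combinations(centroids, 2)
}
n_classifiers = len(pairwise)  # 4 * 3 / 2 = 6 pairwise classifiers
prediction = one_against_one_predict((3.6, 0.5), pairwise)
```

For the sample point, the three classifiers involving the inner race class all vote for it, so it collects the most votes.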

Experimental Validation
In order to verify the proposed approach for identifying bearing defects, the experimental vibration data of the Bearing Center at Case Western Reserve University (CWRU) [35] are utilized in this research to evaluate the effectiveness of the proposed method. The test rig of this experiment is shown in Figure 2. The fault conditions of the bearing consist of four classes: (1) normal condition; (2) inner race defect; (3) rolling element defect; (4) outer race defect. The defects are artificially produced on the drive-end bearing of the test rig by drilling holes of 7, 14 and 21 mils at the corresponding defective locations to simulate different levels of bearing defect. The associated rotating speed of the driving motor is set to 1,730, 1,750, 1,772 and 1,797 RPM, respectively. The accelerometer is placed at the driving motor to measure the vibration signals in the different test conditions. The sampling rate of the data acquisition device is set to 48 kHz and the recorded data length is set to 2048 data points. The numbers of data sets corresponding to the different faulted classes, defective levels and rotation speeds are shown in Table 1. The collected vibration data sets are processed through the coarse-grain procedure for multi-scale analysis. The MSE, MPE, MSRMS and MBSE of the vibration signals are then calculated for different scales. In this research, these four quantities at twenty scales are selected as the features for bearing fault diagnosis. The FS and MD of the corresponding features are determined to evaluate the distinguishability of the features between data sets of different classes. According to the values of FS and MD, the features of high distinguishability are selected and fed into the SVM for fault classification, so that the computational effort can be decreased and the classification accuracy can be enhanced simultaneously.
In this research, the one-against-one SVM model with a radial basis function kernel is utilized to classify the faulted features. For comparison, ten and twenty percent of the collected vibration data sets are randomly selected for training the SVM model, respectively. The remaining ninety (eighty) percent of the data sets are then utilized for the classification test. The statistical results are obtained by repeating the above procedure 100 times.
Figure 3 shows the diagnostic accuracy obtained by selecting different features through the FS and MD evaluations, respectively, in which ten percent of the collected vibration data sets are used for training and the remaining ninety percent are used for fault diagnosis. It is observed that the highest classification accuracy is achieved when the fifteen most distinguishable features are selected through the FS evaluation. On the other hand, the most accurate diagnosis result is obtained by selecting a combination of 10 features through the MD evaluation. Tables 2 and 3 show the contingency tables of the classification results obtained with the FS and MD evaluations for feature selection, respectively. The classification results in which all 80 features (20 scales each for MSE, MPE, MSRMS and MBSE) are utilized for fault diagnosis are also displayed in the tables for comparison. As observed in these two tables, the accuracy of bearing fault diagnosis can be enhanced by evaluating the feature distinguishability. Concurrently, the computational effort is reduced because fewer features, selected by the FS or the MD evaluation, are used. Additionally, the classification results in the tables indicate that the diagnostic accuracy obtained with the MD evaluation outperforms that obtained with the FS evaluation. However, this presents a trade-off, since the computational effort of the MD evaluation is much higher than that of the FS evaluation.
In order to further investigate the effect of utilizing different numbers of data sets for training the SVM model, twenty percent of the collected vibration data sets are randomly selected for training the SVM model and the remaining eighty percent are utilized for the fault classification test. Figure 4 shows the diagnostic accuracy results when the FS and MD evaluations are used to select the different features, respectively. Similarly, the highest classification accuracy is achieved when the twenty most distinguishable features are selected through the FS evaluation and when the combination of ten features is utilized through the MD evaluation. The classification results of utilizing all eighty features for bearing fault diagnosis are also displayed for comparison. Tables 4 and 5 show the contingency tables of the classification results. A similar conclusion can be inferred from the tables: the diagnostic accuracy can be enhanced when either the FS or the MD evaluation is executed before the SVM classification process. It is also noted from Tables 2 to 5 that the diagnostic accuracy of the bearing faults can be improved if more data sets are utilized for SVM model training. In other words, a more accurate SVM model can achieve more accurate bearing fault classification results. In conclusion, the feature selection procedure is beneficial for bearing fault diagnosis, both in increasing the classification accuracy and in decreasing the computational effort. The MD evaluation renders more accurate bearing fault diagnosis results than the FS evaluation, although it requires a higher computational effort. Additionally, an accurate diagnostic result of bearing faults can be obtained if the trained SVM model is precise.

Conclusions
In this paper, multi-scale analysis is utilized to extract the features for detecting different bearing defects. Since the vibration signals of a machine normally carry plentiful information related to its dynamic characteristics, the MSE, MPE, MSRMS and MBSE can be used to represent the fault features of the machine at different scales. The FS and MD evaluations are proposed as feature selection procedures to enhance both the accuracy of bearing fault classification and the computational efficiency. The analysis results demonstrate that the proposed approach can achieve accurate bearing fault diagnosis.

Figure 2. Picture of the bearing fault test rig.

Figure 3. Diagnostic accuracy with FS and MD evaluations (10% data sets for SVM training).

Figure 4. Diagnostic accuracy with FS and MD evaluations (20% data sets for SVM training).

Table 1. Numbers of data sets corresponding to different faulted classes, defective levels and rotation speeds.

Table 2. Contingency table of classification results with FS evaluations (10% data sets for SVM training).

Table 3. Contingency table of classification results with MD evaluations (10% data sets for SVM training).

Table 4. Contingency table of classification results with FS evaluations (20% data sets for SVM training).

Table 5. Contingency table of classification results with MD evaluations (20% data sets for SVM training).