Article

Fault Detection Based on Multi-Dimensional KDE and Jensen–Shannon Divergence

1
College of Liberal Arts and Sciences, National University of Defense Technology, Changsha 410073, China
2
Beijing Institute of Spacecraft System Engineering, China Academy of Space Technology, Beijing 100094, China
*
Author to whom correspondence should be addressed.
Entropy 2021, 23(3), 266; https://doi.org/10.3390/e23030266
Submission received: 30 January 2021 / Revised: 20 February 2021 / Accepted: 22 February 2021 / Published: 24 February 2021

Abstract:
Weak fault signals, highly coupled data, and unknown faults commonly arise in fault diagnosis systems, degrading the detection and identification performance of methods based on $T^2$ statistics or cross entropy. This paper proposes a new fault diagnosis method based on optimal-bandwidth kernel density estimation (KDE) and the distribution of the Jensen–Shannon (JS) divergence to improve fault detection performance. KDE addresses weak-signal and coupled-fault detection, and JS divergence addresses unknown-fault detection. Firstly, the formula and algorithm for the optimal bandwidth of multidimensional KDE are presented, and the convergence of the algorithm is proved. Secondly, the JS divergence between data distributions is obtained from the optimal KDE and used for fault detection. Finally, a fault diagnosis experiment is conducted on the bearing data from the Case Western Reserve University Bearing Data Center. The results show that, for known faults, the proposed method achieves a detection rate 10% and 2% higher than the $T^2$ statistics and the cross entropy method, respectively. For unknown faults, $T^2$ statistics cannot detect faults effectively, and the proposed method achieves a detection rate approximately 15% higher than the cross entropy method. Thus, the proposed method effectively improves the fault detection rate.

1. Introduction

The development of industrial informatization has given rise to large amounts of data in various fields, making data processing a difficult problem in industry, especially for fault diagnosis. Although the explosive growth of data provides more information, typical data analysis theories often fail to achieve the necessary results. The main reason is that typical data analysis theory sets the data distribution type from prior information and performs analyses based on this assumption. Once the distribution type is set, subsequent analysis can only perform estimation and parametric analysis for that distribution type; however, as more data arrive and more information becomes available, the assumed distribution type may need to be modified. As a nonparametric estimation method, kernel density estimation (KDE) is well suited to the massive amounts of data available today. KDE makes no a priori assumption about the overall data distribution and starts directly from the sample data; when the sample size is sufficient, KDE can approximate different distributions. Furthermore, Sheather and Jones [1] provide the optimal bandwidth estimation formula for one-dimensional KDE and prove that the kernel estimator is asymptotically unbiased and consistent. However, as the dimension grows, multidimensional KDE becomes more complex, and its optimal bandwidth formula has not been provided. The distribution of multidimensional data has been described to a certain extent by estimating the kernel density of data reduced to different dimensions (Muir [2], Laurent [3]). In fact, the optimal KDE of multidimensional data is a problem that needs further study.
In the field of fault diagnosis, an essential problem is measuring the difference between samples. A frequency histogram has been used to indicate the distribution difference between two samples (Sugumaran and Ramachandran [4], Scott [5]); however, this method has three shortcomings: (1) the large number of discretization operations is time-consuming; (2) the result depends on the choice of intervals, which is subjective; (3) there is no intuitive index reflecting the difference. Based on KDE, the JS divergence can be used to measure the difference in data distributions, which overcomes the above shortcomings to a certain extent. For example, the failure of a rolling bearing, a key component of mechanical equipment, seriously affects the safe and stable operation of the equipment; incipient fault detection of rolling bearings helps avoid running equipment with faults and prevents serious safety accidents and economic losses, which has important practical and engineering significance.
In Saruhan et al. [6], vibration analysis of rolling element bearing (REB) defects is studied. REBs are the most widely used mechanical parts in rotating machinery under high loads and high rotational speeds. In addition, the characteristics of bearing faults are analyzed in detail in Razavi-Far et al. [7] and Harmouche et al. [8]. Compared with traditional fault diagnosis, the fault diagnosis of rolling bearings is more complex:
  • The fault signal is weak: Bearing data are high-frequency data, and the fault signal is often buried in these high-frequency components, causing traditional fault diagnosis methods to fail. KDE describes the data distribution with high accuracy, so it can identify weak signals.
  • The data are highly coupled: Bearing data take the form of vibration signals with strong coupling between different dimensions, making fault diagnosis difficult. Multidimensional KDE plays an important role in depicting the correlation of the data and can characterize the relationship between its dimensions.
  • The data set is incomplete: Most bearings work under normal conditions, and collected fault data are often scarce, which makes the fault data set incomplete and increases the difficulty of fault detection. The fault detection method constructed from the JS divergence can deal with unknown faults and incomplete data sets without additional data.
To overcome these problems, in-depth research has been conducted on this topic. Fault detection based on detrending and noise reduction has been proposed (He et al. [9], Demetriou and Polycarpou [10]): eliminating the trend makes fault symptoms more prominent, and noise reduction improves the signal-to-noise ratio, so the fault detection effect is improved. However, this approach still uses traditional detection methods and cannot effectively handle data coupling. In Zhang et al. [11] and Fu et al. [12], a fault detection method based on PCA dimension reduction and modal decomposition feature extraction is proposed. For multidimensional data, PCA reduces the dimensionality and eliminates correlation between dimensions; the modal decomposition method then extracts features for fault detection. This method can effectively handle strong coupling in the data; however, PCA dimension reduction loses some information, which reduces the fault detection performance. In Itani et al. [13], Kong et al. [14], Jones and Sheather [15], and Desforges et al. [16], bearing fault detection methods based on KDE are proposed. These studies analyzed the feasibility of KDE for fault detection and combined it with different classification methods in experiments. However, they only use one-dimensional KDE and cannot directly describe high-dimensional data.
In related work, the data distribution is reconstructed by KDE, and a cross-entropy function is constructed to measure the distribution difference and improve fault detection. However, this approach cannot reflect the correlation between dimensions, and the cross-entropy function does not describe the density distribution precisely, which reduces the fault detection performance, especially for unknown faults not included in the fault set.
In this study, the KDE method is extended to multidimensional data to avoid the information loss caused by applying KDE to each dimension separately and to better describe the probability density distribution of the data. Meanwhile, this study replaces the cross-entropy function traditionally used to measure density distribution differences with the JS divergence, thereby avoiding the asymmetry of the cross-entropy measure. Most fault identification methods rely only on a distance measure, which cannot effectively detect unknown faults. Based on the JS divergence, the distribution of the JS divergence between the sample density and the population density is derived using the sliding window principle; a detection threshold is thus assigned to realize the identification of unknown faults.
The remainder of this paper is organized as follows. Section 2 introduces the trend elimination and detection methods, separates the intrinsic and extrinsic signals in the observation data, and constructs the fault detection threshold via statistics. Section 3 extends the KDE method to multidimensional data and derives the optimal bandwidth; the JS divergence is then employed to measure the difference between density distributions. Section 4 uses the sliding window principle to sample the training data, obtains the distribution of the JS divergence between the sample density and the overall density, and derives the detection threshold via KDE. Section 5 identifies normal data, two known faults, and one unknown fault using the bearing data of the Case Western Reserve University Bearing Data Center. The experimental results show that the method identifies all types of faults well.

2. $T^2$ Statistics Fault Detection

During the operation of complex equipment or systems, the observed state can be divided into intrinsic and extrinsic parts. In general, the intrinsic part represents the main working state of the system, with a certain trend, monotonicity, and periodicity. The extrinsic part represents system noise, with zero mean, high-frequency vibration, and statistical stationarity. For the intrinsic part, the state equation of the system can describe its behavior; when a fault occurs there, the symptoms are relatively significant, and the corresponding fault detection methods are relatively mature. However, for high-frequency vibration signals, incipient faults are often hidden in the extrinsic part and easily covered by noise. Therefore, it is necessary to analyze the observed data in depth.

2.1. Signal Decomposition

In the initial operation stage of the equipment, unstable operation causes large data fluctuations, which affect both the system trend and the statistical characteristics of the data. Therefore, the data must be truncated to remove unstable signals [9]. Let $t_1, t_2, \ldots, t_m$ be the sampling times after removing the nonstationary period; the following $m$ observations are obtained:
$$Y = \left( y_{t_1}, y_{t_2}, \ldots, y_{t_m} \right).$$
Each sample $y_{t_i}$ contains $n$ features, expressed in component form as
$$y_{t_i} = \left( y_1(t_i), y_2(t_i), \ldots, y_n(t_i) \right)^T, \quad i = 1, 2, \ldots, m.$$
Then, the data $Y$ can be decomposed into
$$Y = \hat{Y} + R,$$
where $\hat{Y}$ denotes the intrinsic part, composed of the trend, and $R$ denotes the extrinsic part, composed of observation noise and fault data.
The intrinsic part is composed of multiple signals, and selecting an appropriate vector of basis functions $f(t) = \left( f_1(t), f_2(t), \ldots, f_s(t) \right)^T$ helps describe it. Traversing the $m$ data points to model the nonlinear data $Y$,
$$\left( y_1, y_2, \ldots, y_m \right) = \begin{pmatrix} \beta_1^1 & \beta_2^1 & \cdots & \beta_s^1 \\ \beta_1^2 & \beta_2^2 & \cdots & \beta_s^2 \\ \vdots & \vdots & \ddots & \vdots \\ \beta_1^n & \beta_2^n & \cdots & \beta_s^n \end{pmatrix} \begin{pmatrix} f_0(t_1) & f_0(t_2) & \cdots & f_0(t_m) \\ f_1(t_1) & f_1(t_2) & \cdots & f_1(t_m) \\ \vdots & \vdots & \ddots & \vdots \\ f_s(t_1) & f_s(t_2) & \cdots & f_s(t_m) \end{pmatrix}.$$
Note that
$$F \triangleq \begin{pmatrix} f_0(t_1) & f_0(t_2) & \cdots & f_0(t_m) \\ f_1(t_1) & f_1(t_2) & \cdots & f_1(t_m) \\ \vdots & \vdots & \ddots & \vdots \\ f_s(t_1) & f_s(t_2) & \cdots & f_s(t_m) \end{pmatrix}, \qquad \beta \triangleq \begin{pmatrix} \beta_1^1 & \beta_2^1 & \cdots & \beta_s^1 \\ \beta_1^2 & \beta_2^2 & \cdots & \beta_s^2 \\ \vdots & \vdots & \ddots & \vdots \\ \beta_1^n & \beta_2^n & \cdots & \beta_s^n \end{pmatrix}.$$
Then, Equation (4) can be expressed as
$$Y = \beta F.$$
Thus, the efficient estimator of $\beta$ is
$$\hat{\beta} = Y F^T \left( F F^T \right)^{-1}.$$
Using Equations (3) and (7), the signal can be decomposed as
$$\hat{Y} = \hat{\beta} F = Y F^T \left( F F^T \right)^{-1} F, \qquad R = Y - \hat{Y} = Y \left( I - F^T \left( F F^T \right)^{-1} F \right).$$
Usually, the choice of the basis function is a problem worthy of discussion, and it depends on prior knowledge of practical application scenarios; however, this is not the focus of this paper, and is therefore not covered here.
Remark 1.
For bearing data, the signal is generally stationary and periodic. Therefore, the Fourier transform is usually used to extract periodic features instead of more complex basis functions, such as polynomial or wavelet bases.
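As a concrete illustration, the decomposition of Equations (6)–(8) is ordinary multivariate least squares. The sketch below is a minimal NumPy example; the sinusoidal basis and the coefficient values are invented for the toy check, not taken from the paper's data:

```python
import numpy as np

def decompose(Y, F):
    """Split observations Y (n x m) into intrinsic and extrinsic parts.

    F is the (s x m) matrix of basis functions evaluated at the sample
    times; beta_hat = Y F^T (F F^T)^{-1} is the least-squares estimate
    in Y = beta F + R, and R = Y - beta_hat F is the residual.
    """
    beta_hat = Y @ F.T @ np.linalg.inv(F @ F.T)
    Y_hat = beta_hat @ F      # intrinsic part (trend)
    R = Y - Y_hat             # extrinsic part (noise + possible faults)
    return beta_hat, Y_hat, R

# toy check: one channel with a known sinusoidal trend plus small noise
t = np.linspace(0.0, 1.0, 200)
F = np.vstack([np.ones_like(t), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
beta_true = np.array([[0.5, 2.0, -1.0]])
Y = beta_true @ F + 0.01 * np.random.default_rng(0).standard_normal((1, t.size))
beta_hat, Y_hat, R = decompose(Y, F)
```

With a well-conditioned basis the coefficients are recovered to within the noise level, and the residual $R$ retains only the extrinsic part.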

2.2. $T^2$ Statistics Detection

For simplicity, write $r_i = r(t_i)$, $i = 1, 2, \ldots, m$. According to Equation (8), the training data after signal decomposition are $R = \left( r_1, r_2, \ldots, r_m \right)$, which are generally considered samples of a normal random vector with zero expectation, so that
$$r_i \sim N\left( 0, \Sigma \right),$$
where $\Sigma$ denotes the population covariance matrix. When $\Sigma$ is unknown, its unbiased estimate is
$$\hat{\Sigma} = \frac{R R^T}{m - 1}.$$
Let $Z = \left( z_1, z_2, \ldots, z_p \right)$ be the data in the test window; the sample mean $\bar{z}$ is
$$\bar{z} = \frac{1}{p} \sum_{i=1}^{p} z_i.$$
Then, $\bar{z}$ is still normally distributed, with
$$\bar{z} \sim N\left( 0, \tfrac{1}{p} \Sigma \right).$$
The $T^2$ statistic can be constructed as
$$T^2 = p \, \bar{z}^T \hat{\Sigma}^{-1} \bar{z}.$$
Solomons and Hotelling [17] report that the distribution of the $T^2$ statistic satisfies
$$\frac{m - n}{n \left( m - 1 \right)} T^2 = \frac{p \left( m - n \right)}{n \left( m - 1 \right)} \bar{z}^T \hat{\Sigma}^{-1} \bar{z} \sim F\left( n, m - n \right).$$
Therefore, at significance level $\alpha$, we have
$$\frac{m - n}{n \left( m - 1 \right)} T^2 = \frac{p \left( m - n \right)}{n \left( m - 1 \right)} \bar{z}^T \hat{\Sigma}^{-1} \bar{z} < F_{\alpha}\left( n, m - n \right).$$
If Equation (15) holds, the testing data $Z$ and the training data $R$ are considered to come from the same mode; otherwise, they are considered different. The false alarm rate of this criterion is $\alpha$.
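The detection rule of Equations (9)–(15) can be sketched as follows. This is a minimal NumPy/SciPy example; the two-dimensional toy data and the mean-shift fault are invented for illustration:

```python
import numpy as np
from scipy import stats

def t2_detect(R, Z, alpha=0.05):
    """Hotelling T^2 test of a window Z against training residuals R.

    R: (n, m) training residuals, assumed zero-mean normal;
    Z: (n, p) test window.  Returns the scaled statistic, the
    F(n, m - n) threshold at level alpha, and the alarm flag.
    """
    n, m = R.shape
    p = Z.shape[1]
    Sigma_hat = R @ R.T / (m - 1)                        # covariance estimate
    z_bar = Z.mean(axis=1)                               # window mean
    T2 = p * z_bar @ np.linalg.inv(Sigma_hat) @ z_bar    # T^2 statistic
    stat = (m - n) / (n * (m - 1)) * T2
    threshold = stats.f.ppf(1 - alpha, n, m - n)
    return stat, threshold, stat > threshold

rng = np.random.default_rng(1)
R = rng.standard_normal((2, 500))                 # normal-mode residuals
Z_fault = rng.standard_normal((2, 64)) + 1.0      # window with a mean shift
stat, thr, alarm = t2_detect(R, Z_fault)
```

A mean shift of one standard deviation over a 64-sample window pushes the statistic far above the $F$ quantile, triggering the alarm.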

3. Optimal Kernel Density Estimation

Section 2 introduced the fault detection method based on $T^2$ statistics, including the signal decomposition technique and the detection rule. However, this method assumes that the data follow a normal distribution; actual observation data may violate this hypothesis, so the discriminant performance of the $T^2$ statistic may not meet the design requirements. In addition, the statistic tests the data only through the intrinsic part $\hat{Y}$ and the covariance matrix $\hat{\Sigma}$, two attributes that are not sufficient to describe all statistical characteristics of the system. When an incipient fault is submerged in noise, it is easily missed. In this study, a KDE method for multidimensional data is constructed to describe the probabilistic and statistical characteristics of the data more accurately.

3.1. Optimal Bandwidth Theorem

For observed data, the frequency histogram can directly show the statistical characteristics. However, in practice the histogram is a discrete statistical method, the number of intervals is difficult to choose, and, more importantly, the discretization inconveniences subsequent data processing. To overcome these limitations, the KDE method is used: a nonparametric method that estimates the population probability density directly from the sampled data.
For any point $x \in \mathbb{R}^n$, assume the probability density of a certain mode is $f\left( x \right)$; the kernel density estimate of $f\left( x \right)$ based on the sampled data $R = \left( r_1, r_2, \ldots, r_m \right)$ of Section 2.1 is, as reported in Rao [18],
$$\hat{f}_K\left( x, h_m \right) = \frac{1}{m h_m^n} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_m} \right),$$
where $m$, $n$, $K(\cdot)$, and $h_m$ denote the number of samples, the dimension of the data, the kernel function, and the bandwidth, respectively.
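Equation (16) with a Gaussian kernel can be evaluated directly. The following sketch estimates a one-dimensional standard normal density; the bandwidth here is chosen by hand for illustration rather than by the optimal formula:

```python
import numpy as np

def kde(x, samples, h):
    """Evaluate Equation (16) at query points x.

    samples: (m, n) array of data r_1..r_m; x: (k, n) query points;
    h: bandwidth.  Gaussian kernel K(u) = (2*pi)^(-n/2) exp(-u'u/2).
    """
    m, n = samples.shape
    diff = (samples[None, :, :] - x[:, None, :]) / h          # (k, m, n)
    K = (2 * np.pi) ** (-n / 2) * np.exp(-0.5 * (diff ** 2).sum(axis=2))
    return K.sum(axis=1) / (m * h ** n)

rng = np.random.default_rng(0)
samples = rng.standard_normal((2000, 1))
grid = np.linspace(-5, 5, 501).reshape(-1, 1)
f_hat = kde(grid, samples, h=0.3)
```

The estimate integrates to one and, with enough samples, approaches the true density near its peak.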
For convenience, when no ambiguity arises, we abbreviate
$$\hat{f}_K\left( x \right) \triangleq \hat{f}_K\left( x, h_m \right), \qquad \int g\left( x \right) dx \triangleq \int_{\mathbb{R}^n} g\left( x \right) dx.$$
The kernel function $K(\cdot)$ satisfies $\int K\left( x \right) dx = 1$; therefore, $\int K\left( \frac{r_i - x}{h_m} \right) dx = h_m^n$, and thus $\int \hat{f}_K\left( x \right) dx = 1$. Hence $\hat{f}_K\left( x \right)$ satisfies positivity, continuity, and normalization, and it is reasonable to use it as a density estimate. The Gaussian kernel is a good choice, given by
$$K\left( x \right) = \left( 2 \pi \right)^{-\frac{n}{2}} e^{-\frac{x^T x}{2}}.$$
In this study, the performance of the kernel density estimator is characterized by the mean integrated squared error (MISE):
$$\mathrm{MISE}\left( \hat{f}_K\left( x \right) \right) = \int \mathbb{E}\left( \hat{f}_K\left( x \right) - f\left( x \right) \right)^2 dx.$$
Rao [18] shows that the estimate $\hat{f}_K\left( x \right)$ is not sensitive to the choice of kernel $K(\cdot)$; that is, the MISE obtained with different kernels is almost the same, which is reflected in the subsequent derivation. In contrast, the MISE strongly depends on the bandwidth $h_m$: if $h_m$ is too small, the density estimate $\hat{f}_K\left( x \right)$ takes an irregular shape because of increased randomness, while if $h_m$ is too large, the estimate is over-smoothed and shows insufficient detail.
The optimal bandwidth formula is provided in the following theorem, and it is one of the key theoretical results of this study.
Theorem 1.
For any n-dimensional probability density function $f(\cdot)$ and any kernel function $K(\cdot)$ with a symmetric form, if $\hat{f}_K(\cdot)$ in Equation (16) is used to estimate $f(\cdot)$ and the function $\mathrm{tr}\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right)$ is square-integrable, then the bandwidth $h_m$ minimizing $\mathrm{MISE}\left( \hat{f}_K(\cdot) \right)$ in Equation (19) satisfies
$$h_m = \left( \frac{m d_K^2}{n^3 c_K} \int \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) dx \right)^{-\frac{1}{n+4}},$$
where $c_K$ and $d_K$ are the constants
$$c_K = \int K^2\left( x \right) dx, \qquad d_K = \int x^T x \, K\left( x \right) dx.$$
Equation (20) is called the optimal bandwidth formula and h m denotes the optimal bandwidth.
A detailed proof of this theorem is given below.
Proof. 
It can be proved that the following two equations hold:
$$\mathbb{E} \hat{f}_K\left( x \right) = \int K\left( u \right) f\left( x + h_m u \right) du,$$
$$\mathbb{E} \hat{f}_K^2\left( x \right) = \frac{\int K^2\left( u \right) f\left( x + h_m u \right) du}{m h_m^n} + \frac{m - 1}{m} \left( \int K\left( u \right) f\left( x + h_m u \right) du \right)^2.$$
In fact,
$$\mathbb{E} \hat{f}_K\left( x \right) = \int \cdots \int \prod_{i=1}^{m} f\left( r_i \right) \cdot \frac{1}{m h_m^n} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_m} \right) dr_m \cdots dr_1 = \frac{1}{h_m^n} \int f\left( r \right) K\left( \frac{r - x}{h_m} \right) dr = \int f\left( x + h_m u \right) K\left( u \right) du.$$
In addition,
$$\begin{aligned}
\mathbb{E} \hat{f}_K^2\left( x \right) &= \int \cdots \int \prod_{i=1}^{m} f\left( r_i \right) \left( \frac{1}{m h_m^n} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_m} \right) \right)^2 dr_1 \cdots dr_m \\
&= \left( m h_m^n \right)^{-2} \int \cdots \int \prod_{i=1}^{m} f\left( r_i \right) \left[ \sum_{i=1}^{m} K^2\left( \frac{r_i - x}{h_m} \right) + \sum_{i \neq j}^{m} K\left( \frac{r_i - x}{h_m} \right) K\left( \frac{r_j - x}{h_m} \right) \right] dr_1 \cdots dr_m \\
&= \left( m h_m^n \right)^{-2} \left[ \sum_{i=1}^{m} \int f\left( r \right) K^2\left( \frac{r - x}{h_m} \right) dr + \sum_{i \neq j}^{m} \int f\left( r_i \right) K\left( \frac{r_i - x}{h_m} \right) f\left( r_j \right) K\left( \frac{r_j - x}{h_m} \right) dr_i \, dr_j \right] \\
&= \left( m h_m^n \right)^{-2} \left[ m \int f\left( r \right) K^2\left( \frac{r - x}{h_m} \right) dr + m \left( m - 1 \right) \left( \int f\left( r \right) K\left( \frac{r - x}{h_m} \right) dr \right)^2 \right] \\
&= \frac{1}{m h_m^n} \int K^2\left( u \right) f\left( x + h_m u \right) du + \frac{m - 1}{m} \left( \int K\left( u \right) f\left( x + h_m u \right) du \right)^2.
\end{aligned}$$
From Equation (23),
$$\mathbb{E} \hat{f}_K\left( x \right) - f\left( x \right) = \frac{h_m^2}{2} \int u^T \frac{\partial^2 f\left( x + \theta h_m u \right)}{\partial x \partial x^T} u \, K\left( u \right) du,$$
where $\theta$ is a constant between 0 and 1. According to Equations (23) and (24),
$$\mathbb{E} \hat{f}_K^2\left( x \right) - \left( \mathbb{E} \hat{f}_K\left( x \right) \right)^2 = \frac{\int K^2\left( u \right) f\left( x + h_m u \right) du}{m h_m^n} - \frac{1}{m} \left( \int K\left( u \right) f\left( x + h_m u \right) du \right)^2.$$
According to Equations (25) and (26), the following equation holds:
$$\begin{aligned}
\mathbb{E}\left( \hat{f}_K\left( x \right) - f\left( x \right) \right)^2 &= \mathbb{E} \hat{f}_K^2\left( x \right) - \left( \mathbb{E} \hat{f}_K\left( x \right) \right)^2 + \left( \mathbb{E} \hat{f}_K\left( x \right) - f\left( x \right) \right)^2 \\
&= \frac{\int K^2\left( u \right) f\left( x + h_m u \right) du}{m h_m^n} - \frac{1}{m} \left( \int K\left( u \right) f\left( x + h_m u \right) du \right)^2 + \left( \frac{h_m^2}{2} \int u^T \frac{\partial^2 f\left( x + \theta h_m u \right)}{\partial x \partial x^T} u \, K\left( u \right) du \right)^2.
\end{aligned}$$
To facilitate the subsequent reasoning, the following theorem is given.
Theorem 2.
For any matrix $\Phi$ and any kernel density function $K(\cdot)$ with a symmetric form,
$$\int x^T \Phi x \, K\left( x \right) dx = \frac{\mathrm{tr}\left( \Phi \right)}{n} \int x^T x \, K\left( x \right) dx.$$
Proof. 
If an odd function $g\left( x \right)$ is integrable on $\mathbb{R}$, then $\int g\left( x \right) dx = 0$. Similarly, it can be verified that a kernel function $K(\cdot)$ with a symmetric form satisfies
$$\int \sum_{i \neq j} \Phi_{ij} x_i x_j K\left( x \right) dx_1 \cdots dx_n = 0.$$
Then,
$$\begin{aligned}
\int x^T \Phi x \, K\left( x \right) dx &= \int \cdots \int x^T \Phi x \, K\left( x \right) dx_1 \cdots dx_n \\
&= \sum_{i} \Phi_{ii} \int x_i^2 K\left( x \right) dx_1 \cdots dx_n + \int \sum_{i \neq j} \Phi_{ij} x_i x_j K\left( x \right) dx_1 \cdots dx_n \\
&= \mathrm{tr}\left( \Phi \right) \int x_1^2 K\left( x \right) dx_1 \cdots dx_n = \frac{\mathrm{tr}\left( \Phi \right)}{n} \int x^T x \, K\left( x \right) dx.
\end{aligned}$$
Thus, the Theorem 2 is proved. □
For any unit-length vector $u \in \mathbb{R}^n$, a Taylor expansion gives
$$f\left( x + h_m u \right) = f\left( x \right) + h_m u^T \nabla f\left( x \right) + o\left( h_m \right),$$
$$\frac{\partial^2 f\left( x + \theta h_m u \right)}{\partial x_i \partial x_j} = \frac{\partial^2 f\left( x \right)}{\partial x_i \partial x_j} + \theta h_m u^T \nabla \frac{\partial^2 f\left( x \right)}{\partial x_i \partial x_j} + o\left( h_m \right).$$
If the bandwidth $h_m$ satisfies
$$\lim_{m \to \infty} h_m = 0, \qquad \lim_{m \to \infty} \frac{1}{m h_m^n} = 0,$$
then, from Equations (22)–(32), we get
$$\mathbb{E}\left( \hat{f}_K\left( x \right) - f\left( x \right) \right)^2 = \frac{c_K f\left( x \right)}{m h_m^n} + o\left( \frac{1}{m h_m^n} \right) + \frac{h_m^4 d_K^2}{4 n^2} \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) + o\left( h_m^4 \right).$$
In fact,
$$\begin{aligned}
\mathbb{E}\left( \hat{f}_K\left( x \right) - f\left( x \right) \right)^2 &= \frac{\int K^2\left( u \right) f\left( x + h_m u \right) du}{m h_m^n} - \frac{1}{m} \left( \int K\left( u \right) f\left( x + h_m u \right) du \right)^2 + \left( \frac{h_m^2}{2} \int u^T \frac{\partial^2 f\left( x + \theta h_m u \right)}{\partial x \partial x^T} u \, K\left( u \right) du \right)^2 \\
&= \frac{c_K f\left( x \right)}{m h_m^n} + o\left( \frac{1}{m h_m^n} \right) - \frac{f\left( x \right)^2}{m} + o\left( \frac{1}{m} \right) + \left( \frac{h_m^2}{2 n} \mathrm{tr}\left( \frac{\partial^2 f\left( x + \theta h_m u \right)}{\partial x \partial x^T} \right) d_K + o\left( h_m^2 \right) \right)^2 \\
&= \frac{c_K f\left( x \right)}{m h_m^n} + o\left( \frac{1}{m h_m^n} \right) + \left( \frac{h_m^2}{2 n} \mathrm{tr}\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) d_K + o\left( h_m^2 \right) \right)^2 \\
&= \frac{c_K f\left( x \right)}{m h_m^n} + o\left( \frac{1}{m h_m^n} \right) + \frac{h_m^4 d_K^2}{4 n^2} \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) + o\left( h_m^4 \right).
\end{aligned}$$
Based on Equation (33), if $\mathrm{tr}\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right)$ is square-integrable, then
$$\mathrm{MISE}\left( \hat{f}_K\left( x \right) \right) = \int \frac{c_K f\left( x \right)}{m h_m^n} dx + \frac{h_m^4 d_K^2}{4 n^2} \int \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) dx + o\left( \frac{1}{m h_m^n} \right) + o\left( h_m^4 \right) = \frac{c_K}{m h_m^n} + \frac{h_m^4 d_K^2}{4 n^2} \int \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) dx + o\left( \frac{1}{m h_m^n} \right) + o\left( h_m^4 \right).$$
When $\mathrm{MISE}\left( \hat{f}_K(\cdot) \right)$ is smallest, the derivative of Equation (35) with respect to $h_m$ vanishes:
$$\frac{\partial \, \mathrm{MISE}\left( \hat{f}_K\left( x \right) \right)}{\partial h_m} = 0.$$
Thus, the optimal bandwidth $h_m$ in Theorem 1 is obtained as
$$h_m = \left( \frac{m d_K^2}{n^3 c_K} \int \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) dx \right)^{-\frac{1}{n+4}}.$$
 □
Remark 2.
When the number of samples $m$ is fixed, an appropriate bandwidth $h_m$ can be selected via Equation (20) to construct a KDE that fits the sample distribution well. In Equation (20), the kernel function influences the bandwidth only through $c_K$ and $d_K$, which are nearly the same for different kernels; hence the kernel choice has only a slight effect on the final bandwidth.

3.2. Optimal Bandwidth Algorithm

The optimal bandwidth formula is given by Equation (20). However, $f\left( x \right)$ is unknown there, and therefore $\int \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) dx$ is also unknown. An approximate bandwidth $h_m$ can be obtained by replacing $f\left( x \right)$ in Equation (20) with the estimate $\hat{f}_K\left( x \right)$ of Equation (16); an iterative algorithm then yields a more accurate bandwidth. Theorem 3 shows that this algorithm is convergent.
Theorem 3.
For any n-dimensional probability density function $f(\cdot)$ and Gaussian kernel function $K(\cdot)$, if $\hat{f}_K(\cdot)$ in Equation (16) is used to estimate $f(\cdot)$, then the iterative formula for $h_m$,
$$h_{m,k+1} = \left( \frac{m d_K^2}{n^3 c_K} \int \mathrm{tr}^2\left( \frac{\partial^2 \hat{f}_K\left( x, h_{m,k} \right)}{\partial x \partial x^T} \right) dx \right)^{-\frac{1}{n+4}},$$
and it is convergent, where h m , k is the value of h m during the k th iteration.
Proof. 
For the Gaussian kernel function
$$K\left( u \right) = \left( 2 \pi \right)^{-\frac{n}{2}} e^{-\frac{u^T u}{2}},$$
the quantity $u^T u$ follows a $\chi^2$ distribution with $n$ degrees of freedom under the density $K\left( u \right)$, and its expectation equals the degrees of freedom:
$$d_K = \int u^T u \, K\left( u \right) du = n.$$
In addition,
$$c_K = \int K^2\left( u \right) du = \left( 2 \pi \right)^{-n} \int e^{-u^T u} du = \left( 4 \pi \right)^{-\frac{n}{2}}.$$
Substituting Equations (39) and (40) into Equation (20) and replacing $f\left( x \right)$ with $\hat{f}_K\left( x \right)$ from Equation (16), the iterative form for $h_m$ is obtained as
$$h_{m,k+1} = \left( \frac{\left( 4 \pi \right)^{\frac{n}{2}} m}{n} \int \mathrm{tr}^2\left( \frac{\partial^2 \hat{f}_K\left( x, h_{m,k} \right)}{\partial x \partial x^T} \right) dx \right)^{-\frac{1}{n+4}} = \left( \frac{\left( 4 \pi \right)^{\frac{n}{2}}}{n m h_{m,k}^{2n}} \int \mathrm{tr}^2\left( \frac{\partial^2}{\partial x \partial x^T} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_{m,k}} \right) \right) dx \right)^{-\frac{1}{n+4}}.$$
To facilitate the subsequent reasoning, the following lemma is given as
Lemma 1.
For any functions $f_1, f_2, \ldots, f_n$,
$$\int \left( f_1 + f_2 + \cdots + f_n \right)^2 dx \leq n \int \left( f_1^2 + f_2^2 + \cdots + f_n^2 \right) dx,$$
with equality if and only if $f_1\left( x \right) = f_2\left( x \right) = \cdots = f_n\left( x \right)$ almost everywhere.
Proof. 
In fact, for any functions $f_1, f_2, \ldots, f_n$,
$$\left( f_1\left( x \right) + f_2\left( x \right) + \cdots + f_n\left( x \right) \right)^2 \leq n \left( f_1\left( x \right)^2 + f_2\left( x \right)^2 + \cdots + f_n\left( x \right)^2 \right).$$
Integrating both sides of Equation (44),
$$\int \left( f_1 + f_2 + \cdots + f_n \right)^2 dx \leq n \int \left( f_1^2 + f_2^2 + \cdots + f_n^2 \right) dx.$$
Clearly, equality in Equation (43) holds under the condition that $f_1\left( x \right) = f_2\left( x \right) = \cdots = f_n\left( x \right)$ almost everywhere. □
The second derivative of Equation (39) with respect to $x_i$ is
$$\frac{\partial^2}{\partial x_i \partial x_i} K\left( x \right) = \left( 2 \pi \right)^{-\frac{n}{2}} e^{-\frac{x^T x}{2}} \left( x_i^2 - 1 \right).$$
In addition,
$$\int \left( \frac{\partial^2}{\partial x_j \partial x_j} K\left( \frac{r_i - x}{h_{m,k}} \right) \right)^2 dx = \int \left( \frac{\partial^2}{\partial x_j \partial x_j} \left( 2 \pi \right)^{-\frac{n}{2}} e^{-\frac{\left( r_i - x \right)^T \left( r_i - x \right)}{2 h_{m,k}^2}} \right)^2 dx = \frac{3}{4} \left( 4 \pi \right)^{-\frac{n}{2}} h_{m,k}^{n - 4}.$$
From Lemma 1 and Equation (47),
$$\int \mathrm{tr}^2\left( \frac{\partial^2}{\partial x \partial x^T} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_{m,k}} \right) \right) dx \leq n m \sum_{i,j} \int \left( \frac{\partial^2}{\partial x_j \partial x_j} K\left( \frac{r_i - x}{h_{m,k}} \right) \right)^2 dx = \frac{3}{4} \left( n m \right)^2 \left( 4 \pi \right)^{-\frac{n}{2}} h_{m,k}^{n - 4}.$$
When $h_{m,k}$ is sufficiently large, $K\left( \frac{r_i - x}{h_{m,k}} \right)$ is almost the same for all $i$, so the equality in Equation (48) approximately holds. Then
$$h_{m,k+1} = \left( \frac{\left( 4 \pi \right)^{\frac{n}{2}}}{n m h_{m,k}^{2n}} \cdot \frac{3}{4} \left( n m \right)^2 \left( 4 \pi \right)^{-\frac{n}{2}} h_{m,k}^{n - 4} \right)^{-\frac{1}{n+4}} = h_{m,k} \left( \frac{3}{4} n m \right)^{-\frac{1}{n+4}} < h_{m,k}.$$
Thus, when $h_{m,k}$ is large, the iteration decreases. Because $h_{m,k}$ is bounded below, the algorithm converges. □
In summary, the KDE method based on optimal bandwidth is given (See Algorithm 1), and the flowchart of the KDE method is shown in Figure 1.
Algorithm 1: Kernel density estimation (KDE) method based on optimal bandwidth.
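For intuition, the fixed-point iteration of Theorem 3 can be sketched in one dimension, where $\mathrm{tr}\left( \partial^2 \hat{f}_K / \partial x \partial x^T \right)$ reduces to $\hat{f}_K''$ and, for the Gaussian kernel, $c_K = 1/(2\sqrt{\pi})$ and $d_K = 1$. In this sketch the integral is approximated by a Riemann sum; the starting bandwidth and the grid are arbitrary implementation choices:

```python
import numpy as np

def plugin_bandwidth_1d(r, h0=0.5, iters=30, tol=1e-8):
    """One-dimensional fixed-point iteration of Theorem 3 (Gaussian kernel).

    h_{k+1} = (m * d_K^2 / (n^3 * c_K) * I(h_k))^(-1/(n+4)) with n = 1,
    c_K = 1/(2*sqrt(pi)), d_K = 1, and I(h) = integral of f_hat''(x; h)^2,
    evaluated by a Riemann sum on a fixed grid.
    """
    m = r.size
    cK = 1.0 / (2.0 * np.sqrt(np.pi))
    x = np.linspace(r.min() - 3.0, r.max() + 3.0, 2000)
    dx = x[1] - x[0]
    h = h0
    for _ in range(iters):
        u = (x[:, None] - r[None, :]) / h
        phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
        f2 = ((u ** 2 - 1.0) * phi).sum(axis=1) / (m * h ** 3)  # f_hat''
        I = (f2 ** 2).sum() * dx
        h_new = (m * I / cK) ** (-0.2)
        if abs(h_new - h) < tol:
            return h_new
        h = h_new
    return h

h_opt = plugin_bandwidth_1d(np.random.default_rng(0).standard_normal(1000))
```

As a sanity check, for standard normal data Equation (20) with the true curvature integral $3/(8\sqrt{\pi})$ reduces to the classical value $\left( 4/(3m) \right)^{1/5}$; the plug-in iteration lands in the same order of magnitude.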

4. Fault Detection Method Based on JS Divergence Distribution

In Section 3, we constructed a multidimensional KDE method based on the optimal bandwidth, which accurately describes the density distribution of multidimensional data. In this section, the JS divergence is used to measure the distribution difference, highlighting the differences in the statistical characteristics of data from different modes.

4.1. Mode Difference Index

In Section 3, the probability density estimate of multidimensional data was obtained using the kernel method, and the optimal bandwidth formula was derived. When the system fails, its state inevitably changes, the statistical characteristics of the system output change, and the density distribution of the observed data changes significantly. For two windows of sample data $R$ and $Z$, the cross entropy $H\left( R, Z \right)$ can be used to measure their distribution difference:
$$H\left( R, Z \right) = -\int \hat{f}_{K,Z}\left( x \right) \log \hat{f}_{K,R}\left( x \right) dx,$$
where $\hat{f}_{K,R}$ and $\hat{f}_{K,Z}$ denote the optimal KDEs of $R$ and $Z$ calculated via Equation (16).
However, $H\left( R, Z \right)$ does not satisfy the definition of a distance, because it does not necessarily satisfy positive definiteness or symmetry; that is, $H\left( R, Z \right) < 0$ or $H\left( R, Z \right) \neq H\left( Z, R \right)$ is possible.
  • The smaller the distribution difference, the smaller $H\left( R, Z \right)$ is (it may even be negative), so it is reasonable to use $H\left( R, Z \right)$ to measure the distribution difference between $R$ and $Z$.
  • However, a quantitative description of distribution difference should satisfy symmetry; otherwise, exchanging the two distributions changes the measured difference, which is difficult to accept.
The JS divergence $JS\left( R, Z \right)$ was used as a measure of the distribution difference between $R$ and $Z$ in Zhang et al. [19] and Bruni et al. [20]:
$$JS\left( R, Z \right) = \int \left[ \hat{f}_{K,R} \log \hat{f}_{K,R} + \hat{f}_{K,Z} \log \hat{f}_{K,Z} - \left( \hat{f}_{K,R} + \hat{f}_{K,Z} \right) \log \frac{\hat{f}_{K,R} + \hat{f}_{K,Z}}{2} \right] dx.$$
It is easy to verify that
$$JS\left( R, Z \right) \geq 0, \qquad JS\left( R, Z \right) = JS\left( Z, R \right).$$
In this paper, Equation (52) is used to measure the distribution difference between testing data Z and training data R for realizing fault detection and isolation.
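Numerically, Equation (51) can be approximated on a grid once the two densities have been estimated. A minimal one-dimensional sketch follows; the grid and the small epsilon guarding against log(0) are implementation choices, and the two Gaussian test densities are invented for the check:

```python
import numpy as np

def js_divergence(fR, fZ, dx, eps=1e-12):
    """Riemann-sum approximation of the JS divergence of Equation (51).

    fR, fZ: density values of the two modes on a common uniform grid
    with spacing dx.  Symmetric and non-negative by construction.
    """
    mix = 0.5 * (fR + fZ)
    integrand = (fR * np.log((fR + eps) / (mix + eps))
                 + fZ * np.log((fZ + eps) / (mix + eps)))
    return integrand.sum() * dx

x = np.linspace(-10, 20, 6001)
dx = x[1] - x[0]
norm = lambda mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)
js_same = js_divergence(norm(0.0), norm(0.0), dx)   # identical modes
js_far = js_divergence(norm(0.0), norm(10.0), dx)   # well-separated modes
```

Identical densities give zero, and essentially disjoint densities approach the upper bound $2 \log 2$ of this (unhalved) form of the JS divergence.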

4.2. Mode Discrimination Method

If the training data contain $q$ modes $R_1, R_2, \ldots, R_q$, the set of JS divergences
$$JS\left( Z, R_1 \right), JS\left( Z, R_2 \right), \ldots, JS\left( Z, R_q \right)$$
between the testing data $Z$ and the different modes can be calculated using Equation (51).
Let $i_0$ be the mode label corresponding to the minimum JS divergence:
$$i_0 = \arg\min \left\{ JS\left( Z, R_1 \right), JS\left( Z, R_2 \right), \ldots, JS\left( Z, R_q \right) \right\}.$$
It is then reasonable to assume that the testing data $Z$ and the training data $R_{i_0}$ belong to the same mode. However, for a new failure mode unknown at design time, this rule would assign the testing data $Z$ to the known failure mode $i_0$, which is obviously unreasonable. If $JS\left( Z, R_{i_0} \right)$ is too large, we instead believe that $Z$ comes from an unknown new failure mode, labeled $q + 1$. How to obtain the threshold $JS_{\mathrm{high}}$ for $JS\left( Z, R_{i_0} \right)$ is then the question to investigate; a method to determine $JS_{\mathrm{high}}$ is provided below.
For the training data $R_{i_0} = \left( r_1, r_2, \ldots, r_m \right)$ of mode $i_0$, the density estimate of the data set can be obtained using Equation (16):
$$\hat{f}_{K,R}\left( x \right) = \frac{1}{m h_m^n} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_m} \right).$$
In addition, fixing the length of the sampling window as $p$ ($p < m$) and sliding the window yields the sample sets $R_j = \left( r_j, r_{j+1}, \ldots, r_{j+p} \right) \subset R_{i_0}$, $j = 1, 2, \ldots, m - p$. For each $R_j$, the density estimate is
$$\hat{f}_{K,R_j}\left( x \right) = \frac{1}{p h_p^n} \sum_{i=j}^{j+p} K\left( \frac{r_i - x}{h_p} \right).$$
Using Equation (52), the divergence between the training data $R$ and the sample data $R_j$ is
$$JS_j = JS\left( R, R_j \right) = H\left( \hat{f}_{K,R} + \hat{f}_{K,R_j}, \; \frac{\hat{f}_{K,R} + \hat{f}_{K,R_j}}{2} \right) - H\left( \hat{f}_{K,R} \right) - H\left( \hat{f}_{K,R_j} \right),$$
where the two-argument $H$ denotes the cross entropy of Equation (50) and the one-argument $H\left( f \right) = -\int f \log f \, dx$ denotes the differential entropy.
Using Equation (55), we obtain a set of JS divergence values
$$\mathbf{JS} = \left\{ JS_1, JS_2, \ldots, JS_{m-p} \right\}.$$
This set provides the estimate $\hat{f}_{JS}\left( x \right)$ of the density function $f_{JS}\left( x \right)$ of the JS divergence:
$$\hat{f}_{JS}\left( x \right) = \frac{1}{\left( m - p \right) h_{m-p}} \sum_{j=1}^{m-p} K\left( \frac{JS_j - x}{h_{m-p}} \right).$$
If the significance level is $\alpha$, the probability mass of $\hat{f}_{JS}\left( x \right)$ beyond the threshold $JS_{\mathrm{high}}$ must satisfy
$$P_0 = \int_{JS_{\mathrm{high}}}^{+\infty} \hat{f}_{JS}\left( x \right) dx < \alpha.$$
Because the distribution of the JS divergence is not a common random distribution, the quantile cannot be obtained from a table; it can only be obtained by numerical integration. If $h$ is the integration step size and
$$\int_{h \left( i - 1 \right)}^{+\infty} \hat{f}_{JS}\left( x \right) dx \geq \alpha \geq \int_{h i}^{+\infty} \hat{f}_{JS}\left( x \right) dx,$$
it is reasonable to set
$$JS_{\mathrm{high}} = h \cdot i.$$
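The threshold computation of Equations (57)–(60) can be sketched as follows. Silverman's rule is used here as a stand-in for the bandwidth of the one-dimensional KDE over the JS samples, and the synthetic JS values are invented for the check:

```python
import numpy as np

def js_threshold(js_values, alpha=0.05, step=1e-4):
    """JS_high such that the upper-tail mass of f_hat_JS is below alpha.

    Fits a 1-D Gaussian KDE to the sliding-window JS samples
    (Equation (57)) and accumulates the tail integral of Equation (58)
    on a grid of spacing `step` until it drops below alpha
    (Equations (59)-(60)).  Silverman's rule sets the KDE bandwidth.
    """
    js = np.asarray(js_values, dtype=float)
    m = js.size
    h_kde = 1.06 * js.std() * m ** (-0.2)              # Silverman stand-in
    grid = np.arange(js.min() - 5 * h_kde, js.max() + 5 * h_kde, step)
    u = (grid[:, None] - js[None, :]) / h_kde
    f = np.exp(-0.5 * u ** 2).sum(axis=1) / (m * h_kde * np.sqrt(2 * np.pi))
    tail = f[::-1].cumsum()[::-1] * step               # upper-tail mass
    idx = np.argmax(tail < alpha)                      # first point below alpha
    return grid[idx]

rng = np.random.default_rng(0)
js_samples = rng.normal(0.10, 0.01, 500)               # synthetic JS values
js_high = js_threshold(js_samples)
```

For approximately normal JS samples the returned threshold sits near the upper $\left( 1 - \alpha \right)$ quantile, slightly widened by the KDE smoothing.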
The following fault detection and isolation criteria are constructed by Equation (58).
Criterion 1.
Let $i_0$ be the mode label corresponding to the minimum JS divergence (see Equation (38)), let $R_{i_0} = \left( r_1, r_2, \ldots, r_m \right)$ be the training data of mode $i_0$, and let $JS_{\mathrm{high}}$ be the upper bound of the JS divergence (see Equation (56)). If the testing data $Z = \left( z_1, z_2, \ldots, z_l \right)$ satisfy
$$JS\left( Z, R_{i_0} \right) \leq JS_{\mathrm{high}},$$
then the testing data $Z$ and the training data $R_{i_0}$ belong to the same mode; otherwise, the testing data $Z$ are considered to originate from an unknown new failure mode, labeled $q + 1$.
In conclusion, the fault diagnosis method based on optimal bandwidth is provided (See Algorithm 2), and the corresponding fault diagnosis method flowchart is shown in Figure 2.
Algorithm 2: Fault Diagnosis Method Based on Optimal KDE.
Remark 3.
Equations (54) and (55) show that the calculation result of JS divergence is directly related to the length of sampling data. Indeed, with the increase in the sampling data length, the density estimation obtained by Equation (54) can describe the distribution characteristics of samples more effectively, thereby significantly improving the accuracy of fault detection.

5. Numerical Simulation

The bearing data from the Case Western Reserve University Bearing Data Center were used as the diagnosis research object; this data set has served as a benchmark in many fault diagnosis studies, such as Smith and Randall [21], Lou and Loparo [22], and Rai and Mohanty [23].
The sampling frequency of the motor data was 12 kHz, the default sampling frequency for the Case Western Reserve University Bearing Data Center. The dataset contains four groups of sample data: normal data ( f 0 ), 0.007 inch inner raceway fault data ( f 1 ), 0.014 inch inner raceway fault data ( f 2 ), and 0.014 inch outer raceway fault data ( f 3 ). Each group of data has two dimensions: the acceleration data of the drive end ( f i D E ) and the acceleration data of the fan end ( f i F E ). All experiments were conducted on a Lenovo machine with a Ryzen 3700X CPU (3.60 GHz) and 16 GB RAM.

5.1. Data Preprocessing

The observed data in the process of the bearing operation show obvious periodicity, which needs to be eliminated. Taking normal data f 0 as an example, the main frequency in the observed signal can be obtained by fast Fourier transform (FFT), and the result of the FFT is shown in Figure 3.
Figure 3 indicates that the main frequency is approximately 1036 Hz, and thus, the basis function is constructed as
f(t) = \left( 1,\ \sin(1036 \times 2\pi t),\ \cos(1036 \times 2\pi t) \right)^{T} .
The estimation of β calculated using Equation (7) is
\hat{\beta} = \begin{pmatrix} 0.0116 & 0.0158 \\ 0.0548 & 0.0280 \\ 0.0326 & 0.0396 \end{pmatrix} .
Thus, the data after removing the intrinsic signal are shown in Figure 4, where Figure 4a represents the acceleration data of the drive end and Figure 4b represents the acceleration data of the fan end.
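The preprocessing chain, finding the dominant frequency with the FFT and then removing the periodic component by least squares against the basis (1, sin, cos), can be sketched on a synthetic stand-in signal. The 1036 Hz trend amplitude and noise level below are invented for illustration and are not the CWRU data.

```python
import numpy as np

fs = 12_000                                  # sampling frequency (Hz)
t = np.arange(3_000) / fs                    # 0.25 s of data
rng = np.random.default_rng(1)
# Synthetic stand-in for one channel: 1036 Hz periodic trend plus noise
x = 0.05 * np.sin(2 * np.pi * 1036 * t) + 0.02 * rng.standard_normal(t.size)

# Locate the main frequency from the single-sided amplitude spectrum
spec = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(x.size, d=1 / fs)
f_main = freqs[spec[1:].argmax() + 1]        # skip the DC bin

# Least-squares estimate of beta over the basis [1, sin, cos] (cf. Equation (7))
F = np.column_stack([np.ones_like(t),
                     np.sin(2 * np.pi * f_main * t),
                     np.cos(2 * np.pi * f_main * t)])
beta, *_ = np.linalg.lstsq(F, x, rcond=None)
residual = x - F @ beta                      # data with the intrinsic signal removed
```

The residual retains only the non-periodic part of the signal, which is what the later KDE-based detection operates on.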
In the subsequent fault detection process, the data of all modes are preprocessed by the same operation, and the results are recorded as f i .

5.2. Fault Detection Effect

5.2.1. Normal Data and Known Faults

For the normal data f 0 and the known faults f 1 , f 2 , the first 20,480 sample points are selected as the training set, recorded as f i train . The last 81,920 sample points are taken as the testing set, recorded as f i test . A total of 128 sample points are used as the detection object in each test. The training set data are shown in Figure 5, where Figure 5a,b represent the data f i train , i = 1 , 2 of the two dimensions, respectively.
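The split above, 20,480 training points followed by 81,920 testing points read out in 128-sample detection windows, can be sketched as follows (the helper name and the placeholder signal are assumptions):

```python
import numpy as np

def split_windows(x, n_train=20_480, n_test=81_920, win=128):
    """Split one channel into its training segment and the non-overlapping
    128-sample detection windows drawn from the testing segment."""
    train = x[:n_train]
    test = x[n_train:n_train + n_test]
    windows = test[: (len(test) // win) * win].reshape(-1, win)
    return train, windows

x = np.arange(102_400, dtype=float)          # placeholder signal of matching length
train, windows = split_windows(x)            # 81,920 / 128 = 640 test windows
```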
Figure 5 shows that the bearing data have high frequency, and the fault does not change the observed mean value; however, it changes the dispersion characteristics or the correlation of data.

5.2.2. Unknown Fault

The training data do not necessarily contain all types of patterns, and the detection of unknown faults has always been a difficult problem. f 3 is used as an unknown fault for fault detection; the training set contains no information about f 3 . The unknown fault data are shown in Figure 6, where Figure 6a represents the acceleration data at the drive end and Figure 6b represents the acceleration data at the fan end.
Figure 6 shows that the data of the unknown fault are close to the other two types of fault data. If the fault detection method is not sufficiently sensitive, the detection rate will drop significantly.

5.2.3. Detection Effect

The characteristics of the bearing data make bearing fault detection extremely challenging. The input is the training set f 0 train , the estimation accuracy is ε = 10 − 4 , and the maximum number of iterations is k max = 100 . According to Algorithm 1, the optimal bandwidth is
h m = 0.0445 .
The KDE of the training set is obtained by Equation (15), and the results are shown in Figure 7, where Figure 7a,c,e represent the two-dimensional frequency histograms of the training data f i train , i = 0 , 1 , 2 , and Figure 7b,d,f represent the two-dimensional KDE of the training data f i train , i = 0 , 1 , 2 .
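Given the bandwidth produced by Algorithm 1, the two-dimensional KDE of Equation (15) can be sketched with a product Gaussian kernel as below. The product-kernel form and the synthetic data are assumptions for illustration; Algorithm 1's iterative bandwidth search itself is not reproduced, so the reported optimum h m = 0.0445 is passed in directly.

```python
import numpy as np

def kde2d(samples, points, h):
    """Two-dimensional Gaussian KDE with a common bandwidth h per dimension.
    samples: (n, 2) training data; points: (m, 2) evaluation points."""
    n = samples.shape[0]
    d = (points[:, None, :] - samples[None, :, :]) / h   # (m, n, 2)
    k = np.exp(-0.5 * (d ** 2).sum(axis=2))              # product Gaussian kernel
    return k.sum(axis=1) / (n * 2.0 * np.pi * h ** 2)    # density at each point

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 0.2, size=(500, 2))            # stand-in training data
dens = kde2d(samples, np.array([[0.0, 0.0], [1.0, 1.0]]), h=0.0445)
# density is far higher at the data centre (0, 0) than at the remote point (1, 1)
```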
Figure 7 further shows that the bearing fault changes the dispersion characteristics and data correlation. Meanwhile, Figure 7 shows that the KDE of the training data obtained by Equation (15) is in good agreement with the distribution of the training data; therefore, this method can faithfully describe the distribution of multidimensional data.
The JS divergence of the training data and KDE of the distribution are obtained by Equations (51) and (58); the results are shown in Figure 8.
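The JS divergence between two estimated densities can be evaluated on a shared grid; below is a minimal one-dimensional sketch. The grid range and the Gaussian test densities are illustrative, and natural logarithms are used, so the value is bounded by ln 2.

```python
import numpy as np

def js_divergence(p, q, dx):
    """Jensen-Shannon divergence between two densities sampled on a common
    grid with spacing dx: JS = (KL(p || m) + KL(q || m)) / 2, m = (p + q) / 2."""
    p = p / (p.sum() * dx)                   # renormalise on the grid
    q = q / (q.sum() * dx)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                         # m > 0 wherever a > 0, so b[mask] is safe
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
gauss = lambda mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)
same = js_divergence(gauss(0.0), gauss(0.0), dx)   # identical densities
far = js_divergence(gauss(0.0), gauss(2.0), dx)    # well-separated densities
```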
When the significance level is α = 95 % , the detection thresholds of the training set, calculated using Equation (58), are
f 0 : J S high = 0.1375 , f 1 : J S high = 0.0995 , f 2 : J S high = 0.1225 .
Thus, the detection results of the JS divergence method on the testing data are shown in Figure 9. If a detection point falls within the threshold, the data to be detected belong to the same pattern; otherwise, they belong to a different pattern.
Furthermore, detection rates using different methods are shown in Table 1.
For the known faults, Table 1 indicates that the bearing fault identification based on multidimensional KDE and JS divergence achieves better results than the T 2 statistics detection method on the testing data. The detection rate of normal data f 0 increases from 95.08 % to 97.03 % , that of fault data f 1 increases from 81.33 % to 95.81 % , and that of fault data f 2 increases from 70.69 % to 95.36 % . Meanwhile, compared with the cross-entropy method, the detection rate of normal data f 0 increases from 96.95 % to 97.03 % ; that of fault data f 1 , from 94.41 % to 95.81 % ; and that of fault data f 2 , from 94.19 % to 95.36 % .
For the unknown fault f 3 , Table 1 shows that the T 2 statistics detection method cannot detect unknown faults at all. The cross-entropy method detects unknown faults with a rate of only 53.16 % , which is unsatisfactory. The JS divergence method constructed in this study identifies the unknown fault more accurately, with a detection rate of 69.49 % , because JS divergence measures the differences between distributions more accurately.

5.3. Influence of Window Width on Fault Diagnosis

The fault diagnosis effect is related to the data window width; therefore, the fault diagnosis effect under different window widths is investigated. The results are shown in Figure 10.
Figure 10 indicates that, as the detection window grows, the detection performance of the proposed method for known faults first rises and then stabilizes. This is because once the detection window reaches a certain length, the data to be detected already contain sufficient information, and further enlarging the window contributes little to the detection rate. For unknown faults, by contrast, the detection rate increases rapidly with the window length, because a longer window carries more information about the data to be detected and better characterizes the difference between the unknown fault and the known faults.

6. Conclusions

In this study, a method of bearing fault detection and identification was constructed using multidimensional KDE and JS divergence. The distribution characteristics of the JS divergence between the sample density distribution and the population density distribution were derived using the sliding sampling window method. Thus, the threshold for fault detection was provided, so that different faults, especially unknown faults, could be identified. The theory showed that the multidimensional KDE method can reduce the information loss caused by processing each dimension separately, and that JS divergence measures differences in density distributions more accurately than the traditional cross entropy. The experimental results verified these conclusions.
For a known fault, the detection effect of this method was obviously better than that of the traditional method, and it also had a certain degree of improvement compared with the cross-entropy method. Second, for unknown faults, the traditional method could not detect the distribution difference accurately, while the detection effect of the proposed method was significantly improved.
Furthermore, the detection effect of this method depends on the window width. The detection effect improved with a growth in the detection window. In this paper, under the condition of a given window width, the estimation formula for the optimal bandwidth of a multidimensional KDE was provided. The experimental results showed that the formula was applicable to any mode of data, and therefore, it had a certain universality.
However, this study has certain limitations. Firstly, although the calculation formula of multidimensional KDE is given in this study, the computational complexity will increase when the dimension is large, which may restrict the further application of the method. Secondly, the calculation of JS divergence is time consuming, which is not conducive to rapid fault diagnosis.
In future research, we can try to use PCA dimension reduction to address the computational complexity caused by very high dimensions, and optimize the algorithmic flow of the JS divergence to expedite the calculation. In a recent study, Ginzarly et al. [24] treated the prognosis of a vehicle's electrical machine using a hidden Markov model after modeling the machine with the finite element method. We will try to combine this method in future work and apply it to the fault detection of other systems.

Author Contributions

Conceptualization and methodology, J.W. (Juhui Wei); formal analysis and visualization, Z.H.; validation and data curation, J.W. (Jiongqi Wang); resources, D.W.; writing—review and editing, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 61903366, 61903086, and 61773021; the Natural Science Foundation of Hunan Province, grant numbers 2019JJ50745, 2019JJ20018, and 2020JJ4280; and the Foundation of Beijing Institute of Control Engineering, grant number HTKJ2019KL502007.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors express their appreciation to the Associate Editor and anonymous reviewers for their helpful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
KDE  kernel density estimation
JS   Jensen–Shannon
PCA  principal component analysis
MISE mean integrated squared error

References

  1. Sheather, S.J.; Jones, M.C. A reliable data-based bandwidth selection method for kernel density estimation. J. R. Stat. Soc. 1991, 53, 683–690. [Google Scholar] [CrossRef]
  2. Muir, D. Multidimensional Kernel Density Estimates over Periodic Domains. Circular Statistics. 2017. Available online: https://www.mathworks.com/matlabcentral/fileexchange/44129-multi-dimensional-kernel-density-estimates-over-periodic-domains (accessed on 21 February 2021).
  3. Laurent, B. Efficient estimation of integral functionals of a density. Ann. Stat. 1996, 24, 659–681. [Google Scholar] [CrossRef]
  4. Sugumaran, V.; Ramachandran, K.I. Fault diagnosis of roller bearing using fuzzy classifier and histogram features with focus on automatic rule learning. Expert Syst. Appl. 2011, 38, 4901–4907. [Google Scholar] [CrossRef]
  5. Scott, D.W. Averaged shifted histograms: Effective nonparametric density estimators in several dimensions. Ann. Stat. 1985, 13, 1024–1040. [Google Scholar] [CrossRef]
  6. Saruhan, H.; Sarıdemir, S.; Çiçek, A.; Uygur, İ. Vibration analysis of rolling element bearings defects. J. Appl. Res. Technol. 2014, 12, 384–395. [Google Scholar] [CrossRef]
  7. Razavi-Far, R.; Farajzadeh-Zanjani, M.; Saif, M. An integrated class-imbalanced learning scheme for diagnosing bearing defects in induction motors. IEEE Trans. Ind. Inform. 2017, 13, 2758–2769. [Google Scholar] [CrossRef]
  8. Harmouche, J.; Delpha, C.; Diallo, D. Incipient fault amplitude estimation using kl divergence with a probabilistic approach. Signal Process. 2016, 120, 1–7. [Google Scholar] [CrossRef] [Green Version]
  9. He, Z.; Shardt, Y.A.W.; Wang, D.; Hou, B.; Zhou, H.; Wang, J. An incipient fault detection approach via detrending and denoising. Control Eng. Pract. 2018, 74, 1–12. [Google Scholar] [CrossRef]
  10. Demetriou, M.A.; Polycarpou, M.M. Incipient fault diagnosis of dynamical systems using online approximators. IEEE Trans. Autom. Control 1998, 43, 1612–1617. [Google Scholar] [CrossRef]
  11. Zhang, X.; Polycarpou, M.M.; Parisini, T. A robust detection and isolation scheme for abrupt and incipient faults in nonlinear systems. IEEE Trans. Autom. Control 2002, 47, 576–593. [Google Scholar] [CrossRef]
  12. Fu, F.; Wang, D.; Li, W.; Li, F. Data-driven fault identifiability analysis for discrete-time dynamic systems. Int. J. Syst. Sci. 2020, 51, 404–412. [Google Scholar] [CrossRef]
  13. Itani, S.; Lecron, F.; Fortemps, P. A one-class classification decision tree based on kernel density estimation. Appl. Soft Comput. 2020, 91, 106250. [Google Scholar] [CrossRef] [Green Version]
  14. Kong, Y.; Li, D.; Fan, Y.; Lv, J. Interaction pursuit in high-dimensional multi-response regression via distance correlation. Ann. Stat. 2017, 45, 897–922. [Google Scholar] [CrossRef] [Green Version]
  15. Jones, M.C.; Sheather, S.J. Using non-stochastic terms to advantage in kernel-based estimation of integrated squared density derivatives. Stat. Probab. Lett. 1991, 11, 511–514. [Google Scholar] [CrossRef]
  16. Desforges, M.J.; Jacob, P.J.; Ball, A.D. Fault detection in rotating machinery using kernel-based probability density estimation. Int. J. Syst. Sci. 2000, 31, 1411–1426. [Google Scholar] [CrossRef]
  17. Solomons, L.M.; Hotelling, H. The limits of a measure of skewness. Ann. Math. Stat. 1932, 3, 141–142. [Google Scholar]
  18. Rao, P. Nonparametric Functional Estimation; Elsevier: Amsterdam, The Netherlands, 1983. [Google Scholar]
  19. Zhang, X.; Delpha, C.; Diallo, D. Incipient fault detection and estimation based on Jensen–Shannon divergence in a data-driven approach. Signal Process. 2019, 169, 107410. [Google Scholar] [CrossRef]
  20. Bruni, V.; Rossi, E.; Vitulano, D. On the equivalence between Jensen–Shannon divergence and Michelson contrast. IEEE Trans. Inf. Theory 2012, 58, 4278–4288. [Google Scholar] [CrossRef]
  21. Smith, W.A.; Randall, R.B. Rolling element bearing diagnostics using the case western reserve university data: A benchmark study. Mech. Syst. Signal Process. 2015, 64–65, 100–131. [Google Scholar] [CrossRef]
  22. Lou, X.; Loparo, K.A. Bearing fault diagnosis based on wavelet transform and fuzzy inference. Mech. Syst. Signal Process. 2004, 18, 1077–1095. [Google Scholar] [CrossRef]
  23. Rai, V.K.; Mohanty, A.R. Bearing fault diagnosis using FFT of intrinsic mode functions in Hilbert–Huang transform. Mech. Syst. Signal Process. 2007, 21, 2607–2615. [Google Scholar] [CrossRef]
  24. Ginzarly, R.; Hoblos, G.; Moubayed, N. From modeling to failure prognosis of permanent magnet synchronous machine. Appl. Sci. 2020, 10, 691. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Flowchart of KDE method based on optimal bandwidth.
Figure 2. Flowchart of fault diagnosis method based on optimal KDE.
Figure 3. Single-sided amplitude spectrum of f 0 .
Figure 4. Preprocessed data to remove trends by fast Fourier transform (FFT).
Figure 5. Training data f 1 , f 2 after being preprocessed.
Figure 6. Training data f 3 after being preprocessed.
Figure 7. Training data after being preprocessed.
Figure 8. The results of detection threshold.
Figure 9. Fault detection effect using JS divergence as index.
Figure 10. Fault diagnosis effect under different window width h m .
Table 1. Detection rate of normal and different failure modes using different methods.

Method | T 2 Statistics Detection | Cross Entropy | JS Divergence
Normal mode f 0 | 95.80 % | 96.95 % | 97.03 %
Known fault f 1 | 83.47 % | 94.41 % | 95.81 %
Known fault f 2 | 78.11 % | 94.19 % | 95.36 %
Unknown fault f 3 | \ | 53.16 % | 69.49 %
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Wei, J.; He, Z.; Wang, J.; Wang, D.; Zhou, X. Fault Detection Based on Multi-Dimensional KDE and Jensen–Shannon Divergence. Entropy 2021, 23, 266. https://doi.org/10.3390/e23030266