Article

Fault Detection Based on Multi-Dimensional KDE and Jensen–Shannon Divergence

1
College of Liberal Arts and Sciences, National University of Defense Technology, Changsha 410073, China
2
Beijing Institute of Spacecraft System Engineering, China Academy of Space Technology, Beijing 100094, China
*
Author to whom correspondence should be addressed.
Entropy 2021, 23(3), 266; https://doi.org/10.3390/e23030266
Submission received: 30 January 2021 / Revised: 20 February 2021 / Accepted: 22 February 2021 / Published: 24 February 2021

Abstract:
Weak fault signals, highly coupled data, and unknown faults commonly arise in fault diagnosis systems, degrading the detection and identification performance of methods based on $T^2$ statistics or cross entropy. This paper proposes a new fault diagnosis method based on optimal-bandwidth kernel density estimation (KDE) and the distribution of the Jensen–Shannon (JS) divergence to improve fault detection performance. KDE addresses weak-signal and coupled-fault detection, and JS divergence addresses unknown-fault detection. Firstly, the formula and algorithm for the optimal bandwidth of multidimensional KDE are presented, and the convergence of the algorithm is proved. Secondly, the JS divergence between data distributions is obtained from the optimal KDE and used for fault detection. Finally, a fault diagnosis experiment is conducted on the bearing data from the Case Western Reserve University Bearing Data Center. The results show that, for known faults, the proposed method achieves a detection rate 10% and 2% higher than the $T^2$ statistics and the cross entropy method, respectively. For unknown faults, $T^2$ statistics cannot detect faults effectively, and the proposed method achieves a detection rate approximately 15% higher than the cross entropy method. Thus, the proposed method effectively improves the fault detection rate.

1. Introduction

The development of industrial informatization has given rise to large amounts of data in various fields, making data processing a difficult problem in industry, especially for fault diagnosis. Although the explosive growth of data provides more information, typical data analysis theories often fail to achieve the necessary results. The main reason is that typical data analysis theory sets the data distribution type from prior information and performs analyses based on this assumption. Once the distribution type is set, subsequent analysis can only perform estimation and parametric analysis for that distribution type; however, as more data arrive and more information becomes available, the assumed distribution type may need to be modified. As a nonparametric estimation method, kernel density estimation (KDE) is well suited to the massive amounts of data available today. KDE makes no a priori assumption about the overall data distribution and starts directly from the sample data; when the sample size is sufficient, KDE can approximate different distributions. Furthermore, Sheather and Jones [1] provide the optimal bandwidth estimation formula for one-dimensional KDE and prove that the kernel estimator is asymptotically unbiased and consistent. However, as the dimension grows, multidimensional KDE becomes more complex, and its optimal bandwidth formula has not been provided. The distribution of multidimensional data has been described to a certain extent by estimating the kernel density of data reduced to different dimensions (Muir [2], Laurent [3]). In fact, the optimal KDE of multidimensional data is a problem that needs further study.
In the field of fault diagnosis, an essential problem is measuring the difference between samples. A frequency histogram has been used to indicate the distribution difference between two samples (Sugumaran and Ramachandran [4], Scott [5]); however, this method has three shortcomings: (1) the large number of discretization operations is time-consuming; (2) the result depends on the choice of intervals, which is subjective; (3) there is no intuitive index reflecting the difference. Based on KDE, the JS divergence can be used to measure the difference in data distributions, which overcomes the above shortcomings to a certain extent. For example, the failure of a rolling bearing, a key component of mechanical equipment, seriously affects the safe and stable operation of the equipment; incipient fault detection of rolling bearings helps avoid running equipment with faults and prevents serious safety accidents and economic losses, which has important practical and engineering significance.
In Saruhan et al. [6], vibration analysis of rolling element bearing (REB) defects is studied. REBs are the most widely used mechanical parts in rotating machinery under high loads and high rotational speeds. In addition, the characteristics of bearing faults are analyzed in detail in Razavi-Far et al. [7] and Harmouche et al. [8]. Compared with traditional fault diagnosis, the fault diagnosis of rolling bearings is more complex:
  • The fault signal is weak: Bearing data are high-frequency data, and the fault signal is often buried in these high-frequency components, causing traditional fault diagnosis methods to fail. KDE describes the data distribution with high accuracy, so it can identify weak signals.
  • The data are highly coupled: Bearing data take the form of vibration signals with strong coupling between different dimensions, making fault diagnosis difficult. Multidimensional KDE plays an important role in depicting the correlation of the data and can characterize the relationship between its dimensions.
  • The data set is incomplete: Most bearings work under normal conditions, and collected fault data are often scarce, which makes the fault data set incomplete and increases the difficulty of fault detection. The fault detection method constructed from the JS divergence can deal with unknown faults and incomplete data sets without additional data.
To overcome these problems, in-depth research has been conducted on this topic. Fault detection based on detrending and noise reduction has been proposed (He et al. [9], Demetriou and Polycarpou [10]): eliminating the trend makes fault symptoms more prominent, and noise reduction improves the signal-to-noise ratio, so the fault detection effect is improved. However, this approach still uses traditional detection methods and cannot effectively handle data coupling. In Zhang et al. [11] and Fu et al. [12], a fault detection method based on PCA dimension reduction and modal decomposition feature extraction is proposed. For multidimensional data, PCA reduces the dimensionality and eliminates correlation between dimensions; the modal decomposition method then extracts features for fault detection. This method can effectively handle strong coupling in the data; however, PCA dimension reduction loses some information, which reduces the fault detection performance. In Itani et al. [13], Kong et al. [14], Jones and Sheather [15], and Desforges et al. [16], bearing fault detection methods based on KDE are proposed. These studies analyzed the feasibility of KDE for fault detection and combined it with different classification methods in experiments. However, they only use one-dimensional KDE and cannot directly describe high-dimensional data.
In related work, the data distribution is reconstructed by KDE, and a cross-entropy function is constructed to measure the distribution difference and improve fault detection. However, this approach cannot reflect the correlation between dimensions, and the cross-entropy function does not describe the density distribution precisely, which reduces the fault detection performance, especially for unknown faults not included in the fault set.
In this study, the KDE method is extended to multidimensional data to avoid the information loss caused by applying KDE to each dimension separately and to better describe the probability density distribution of the data. Meanwhile, this study replaces the cross-entropy function traditionally used to measure density distribution differences with the JS divergence, thereby avoiding the asymmetry of the cross-entropy measure. Most fault identification methods rely only on a distance measure, which cannot effectively detect unknown faults. Based on the JS divergence, the distribution of the JS divergence between the sample density and the population density is derived using the sliding window principle; a detection threshold is thus assigned to realize the identification of unknown faults.
The remainder of this paper is organized as follows. Section 2 introduces the trend elimination and detection methods, separates the intrinsic and extrinsic signals in the observation data, and constructs the fault detection threshold via statistics. Section 3 extends the KDE method to multidimensional data and derives the optimal bandwidth; the JS divergence is then employed to measure the difference between density distributions. Section 4 uses the sliding window principle to sample the training data, obtains the distribution of the JS divergence between the sample density and the overall density, and derives the detection threshold via KDE. Section 5 identifies normal data, two known faults, and one unknown fault using the bearing data of the Case Western Reserve University Bearing Data Center. The experimental results show that the method identifies all types of faults well.

2. $T^2$ Statistics Fault Detection

During the operation of complex equipment or systems, the observed state can be divided into intrinsic and extrinsic parts. In general, the intrinsic part represents the main working state of the system, with a certain trend, monotonicity, and periodicity. The extrinsic part represents system noise, with zero mean, high-frequency vibration, and statistical stationarity. For the intrinsic part, the state equation of the system can describe its behavior; when a fault occurs there, the symptoms are relatively significant, and the corresponding fault detection methods are relatively mature. However, for high-frequency vibration signals, incipient faults are often hidden in the extrinsic part and easily covered by noise. Therefore, it is necessary to analyze the observed data in depth.

2.1. Signal Decomposition

In the initial operation stage of the equipment, unstable operation causes large data fluctuations, which affect both the system trend and the statistical characteristics of the data. Therefore, the data must be truncated to remove unstable signals [9]. Let $t_1, t_2, \ldots, t_m$ be the sampling times after removing the nonstationary period; the following $m$ observations are obtained:
$$Y = \left( y_{t_1}, y_{t_2}, \ldots, y_{t_m} \right).$$
Each sample $y_{t_i}$ contains $n$ features, expressed in component form as
$$y_{t_i} = \left( y_1(t_i), y_2(t_i), \ldots, y_n(t_i) \right)^T, \quad i = 1, 2, \ldots, m.$$
Then, the data $Y$ can be decomposed into
$$Y = \hat{Y} + R,$$
where $\hat{Y}$ denotes the intrinsic part, composed of the trend, and $R$ denotes the extrinsic part, composed of observation noise and fault data.
The intrinsic part is composed of multiple signals, and selecting an appropriate vector of basis functions $f(t) = \left( f_1(t), f_2(t), \ldots, f_s(t) \right)^T$ helps describe it. Traversing the $m$ data points to model the nonlinear data $Y$,
$$\left( y_1, y_2, \ldots, y_m \right) = \begin{pmatrix} \beta_1^1 & \beta_2^1 & \cdots & \beta_s^1 \\ \beta_1^2 & \beta_2^2 & \cdots & \beta_s^2 \\ \vdots & \vdots & \ddots & \vdots \\ \beta_1^n & \beta_2^n & \cdots & \beta_s^n \end{pmatrix} \begin{pmatrix} f_0(t_1) & f_0(t_2) & \cdots & f_0(t_m) \\ f_1(t_1) & f_1(t_2) & \cdots & f_1(t_m) \\ \vdots & \vdots & \ddots & \vdots \\ f_s(t_1) & f_s(t_2) & \cdots & f_s(t_m) \end{pmatrix}.$$
Note that
$$F \triangleq \begin{pmatrix} f_0(t_1) & f_0(t_2) & \cdots & f_0(t_m) \\ f_1(t_1) & f_1(t_2) & \cdots & f_1(t_m) \\ \vdots & \vdots & \ddots & \vdots \\ f_s(t_1) & f_s(t_2) & \cdots & f_s(t_m) \end{pmatrix}, \qquad \beta \triangleq \begin{pmatrix} \beta_1^1 & \beta_2^1 & \cdots & \beta_s^1 \\ \beta_1^2 & \beta_2^2 & \cdots & \beta_s^2 \\ \vdots & \vdots & \ddots & \vdots \\ \beta_1^n & \beta_2^n & \cdots & \beta_s^n \end{pmatrix}.$$
Then, Equation (4) can be expressed as
$$Y = \beta F.$$
Thus, the efficient estimator of $\beta$ is
$$\hat{\beta} = Y F^T \left( F F^T \right)^{-1}.$$
Using Equations (3) and (7), the signal can be decomposed as
$$\hat{Y} = \hat{\beta} F = Y F^T \left( F F^T \right)^{-1} F, \qquad R = Y - \hat{Y} = Y \left( I - F^T \left( F F^T \right)^{-1} F \right).$$
Usually, the choice of the basis function is a problem worthy of discussion, and it depends on prior knowledge of practical application scenarios; however, this is not the focus of this paper, and is therefore not covered here.
Remark 1.
For bearing data, the signal is generally stationary and periodic. Therefore, the Fourier transform is usually used to extract periodic features instead of more complex basis functions, such as polynomial or wavelet bases.
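As a concrete illustration, the decomposition of Equations (6)–(8) is ordinary multivariate least squares. The sketch below is a minimal NumPy example; the sinusoidal basis and the coefficient values are invented for the toy check, not taken from the paper's data:

```python
import numpy as np

def decompose(Y, F):
    """Split observations Y (n x m) into intrinsic and extrinsic parts.

    F is the (s x m) matrix of basis functions evaluated at the sample
    times; beta_hat = Y F^T (F F^T)^{-1} is the least-squares estimate
    in Y = beta F + R, and R = Y - beta_hat F is the residual.
    """
    beta_hat = Y @ F.T @ np.linalg.inv(F @ F.T)
    Y_hat = beta_hat @ F      # intrinsic part (trend)
    R = Y - Y_hat             # extrinsic part (noise + possible faults)
    return beta_hat, Y_hat, R

# toy check: one channel with a known sinusoidal trend plus small noise
t = np.linspace(0.0, 1.0, 200)
F = np.vstack([np.ones_like(t), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
beta_true = np.array([[0.5, 2.0, -1.0]])
Y = beta_true @ F + 0.01 * np.random.default_rng(0).standard_normal((1, t.size))
beta_hat, Y_hat, R = decompose(Y, F)
```

With a well-conditioned basis the coefficients are recovered to within the noise level, and the residual $R$ retains only the extrinsic part.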

2.2. $T^2$ Statistics Detection

For simplicity, write $r_i = r(t_i)$, $i = 1, 2, \ldots, m$. According to Equation (8), the training data after signal decomposition are $R = \left( r_1, r_2, \ldots, r_m \right)$, which are generally considered samples of a normal random vector with zero expectation, so that
$$r_i \sim N\left( 0, \Sigma \right),$$
where $\Sigma$ denotes the population covariance matrix. When $\Sigma$ is unknown, its unbiased estimate is
$$\hat{\Sigma} = \frac{R R^T}{m - 1}.$$
Let $Z = \left( z_1, z_2, \ldots, z_p \right)$ be the data in the test window; the sample mean $\bar{z}$ is
$$\bar{z} = \frac{1}{p} \sum_{i=1}^{p} z_i.$$
Then, $\bar{z}$ is still normally distributed, with
$$\bar{z} \sim N\left( 0, \tfrac{1}{p} \Sigma \right).$$
The $T^2$ statistic can be constructed as
$$T^2 = p \, \bar{z}^T \hat{\Sigma}^{-1} \bar{z}.$$
Solomons and Hotelling [17] report that the distribution of the $T^2$ statistic satisfies
$$\frac{m - n}{n \left( m - 1 \right)} T^2 = \frac{p \left( m - n \right)}{n \left( m - 1 \right)} \bar{z}^T \hat{\Sigma}^{-1} \bar{z} \sim F\left( n, m - n \right).$$
Therefore, at significance level $\alpha$, we have
$$\frac{m - n}{n \left( m - 1 \right)} T^2 = \frac{p \left( m - n \right)}{n \left( m - 1 \right)} \bar{z}^T \hat{\Sigma}^{-1} \bar{z} < F_{\alpha}\left( n, m - n \right).$$
If Equation (15) holds, the testing data $Z$ and the training data $R$ are considered to come from the same mode; otherwise, they are considered different. The false alarm rate of this criterion is $\alpha$.
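The detection rule of Equations (9)–(15) can be sketched as follows. This is a minimal NumPy/SciPy example; the two-dimensional toy data and the mean-shift fault are invented for illustration:

```python
import numpy as np
from scipy import stats

def t2_detect(R, Z, alpha=0.05):
    """Hotelling T^2 test of a window Z against training residuals R.

    R: (n, m) training residuals, assumed zero-mean normal;
    Z: (n, p) test window.  Returns the scaled statistic, the
    F(n, m - n) threshold at level alpha, and the alarm flag.
    """
    n, m = R.shape
    p = Z.shape[1]
    Sigma_hat = R @ R.T / (m - 1)                        # covariance estimate
    z_bar = Z.mean(axis=1)                               # window mean
    T2 = p * z_bar @ np.linalg.inv(Sigma_hat) @ z_bar    # T^2 statistic
    stat = (m - n) / (n * (m - 1)) * T2
    threshold = stats.f.ppf(1 - alpha, n, m - n)
    return stat, threshold, stat > threshold

rng = np.random.default_rng(1)
R = rng.standard_normal((2, 500))                 # normal-mode residuals
Z_fault = rng.standard_normal((2, 64)) + 1.0      # window with a mean shift
stat, thr, alarm = t2_detect(R, Z_fault)
```

A mean shift of one standard deviation over a 64-sample window pushes the statistic far above the $F$ quantile, triggering the alarm.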

3. Optimal Kernel Density Estimation

Section 2 introduced the fault detection method based on $T^2$ statistics, including the signal decomposition technique and the detection rule. However, this method assumes that the data follow a normal distribution; actual observation data may violate this hypothesis, so the discriminant performance of the $T^2$ statistic may not meet the design requirements. In addition, the statistic tests the data only through the intrinsic part $\hat{Y}$ and the covariance matrix $\hat{\Sigma}$, two attributes that are not sufficient to describe all statistical characteristics of the system. When an incipient fault is submerged in noise, it is easily missed. In this study, a KDE method for multidimensional data is constructed to describe the probabilistic and statistical characteristics of the data more accurately.

3.1. Optimal Bandwidth Theorem

For observed data, the frequency histogram can directly show the statistical characteristics. However, in practice the histogram is a discrete statistical method, the number of intervals is difficult to choose, and, more importantly, the discretization inconveniences subsequent data processing. To overcome these limitations, the KDE method is used: a nonparametric method that estimates the population probability density directly from the sampled data.
For any point $x \in \mathbb{R}^n$, assume the probability density of a certain mode is $f\left( x \right)$; the kernel density estimate of $f\left( x \right)$ based on the sampled data $R = \left( r_1, r_2, \ldots, r_m \right)$ of Section 2.1 is, as reported in Rao [18],
$$\hat{f}_K\left( x, h_m \right) = \frac{1}{m h_m^n} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_m} \right),$$
where $m$, $n$, $K(\cdot)$, and $h_m$ denote the number of samples, the dimension of the data, the kernel function, and the bandwidth, respectively.
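Equation (16) with a Gaussian kernel can be evaluated directly. The following sketch estimates a one-dimensional standard normal density; the bandwidth here is chosen by hand for illustration rather than by the optimal formula:

```python
import numpy as np

def kde(x, samples, h):
    """Evaluate Equation (16) at query points x.

    samples: (m, n) array of data r_1..r_m; x: (k, n) query points;
    h: bandwidth.  Gaussian kernel K(u) = (2*pi)^(-n/2) exp(-u'u/2).
    """
    m, n = samples.shape
    diff = (samples[None, :, :] - x[:, None, :]) / h          # (k, m, n)
    K = (2 * np.pi) ** (-n / 2) * np.exp(-0.5 * (diff ** 2).sum(axis=2))
    return K.sum(axis=1) / (m * h ** n)

rng = np.random.default_rng(0)
samples = rng.standard_normal((2000, 1))
grid = np.linspace(-5, 5, 501).reshape(-1, 1)
f_hat = kde(grid, samples, h=0.3)
```

The estimate integrates to one and, with enough samples, approaches the true density near its peak.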
For convenience, when no ambiguity arises, we abbreviate
$$\hat{f}_K\left( x \right) \triangleq \hat{f}_K\left( x, h_m \right), \qquad \int g\left( x \right) dx \triangleq \int_{\mathbb{R}^n} g\left( x \right) dx.$$
The kernel function $K(\cdot)$ satisfies $\int K\left( x \right) dx = 1$; therefore, $\int K\left( \frac{r_i - x}{h_m} \right) dx = h_m^n$, and thus $\int \hat{f}_K\left( x \right) dx = 1$. Hence $\hat{f}_K\left( x \right)$ satisfies positivity, continuity, and normalization, and it is reasonable to use it as a density estimate. The Gaussian kernel is a good choice, given by
$$K\left( x \right) = \left( 2 \pi \right)^{-\frac{n}{2}} e^{-\frac{x^T x}{2}}.$$
In this study, the performance of the kernel density estimator is characterized by the mean integrated squared error (MISE):
$$\mathrm{MISE}\left( \hat{f}_K\left( x \right) \right) = \int \mathbb{E}\left( \hat{f}_K\left( x \right) - f\left( x \right) \right)^2 dx.$$
Rao [18] shows that the estimate $\hat{f}_K\left( x \right)$ is not sensitive to the choice of kernel $K(\cdot)$; that is, the MISE obtained with different kernels is almost the same, which is reflected in the subsequent derivation. In contrast, the MISE strongly depends on the bandwidth $h_m$: if $h_m$ is too small, the density estimate $\hat{f}_K\left( x \right)$ takes an irregular shape because of increased randomness, while if $h_m$ is too large, the estimate is over-smoothed and shows insufficient detail.
The optimal bandwidth formula is provided in the following theorem, and it is one of the key theoretical results of this study.
Theorem 1.
For any n-dimensional probability density function $f(\cdot)$ and any kernel function $K(\cdot)$ with a symmetric form, if $\hat{f}_K(\cdot)$ in Equation (16) is used to estimate $f(\cdot)$ and the function $\mathrm{tr}\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right)$ is square-integrable, then the bandwidth $h_m$ minimizing $\mathrm{MISE}\left( \hat{f}_K(\cdot) \right)$ in Equation (19) satisfies
$$h_m = \left( \frac{m d_K^2}{n^3 c_K} \int \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) dx \right)^{-\frac{1}{n+4}},$$
where $c_K$ and $d_K$ are the constants
$$c_K = \int K^2\left( x \right) dx, \qquad d_K = \int x^T x \, K\left( x \right) dx.$$
Equation (20) is called the optimal bandwidth formula and h m denotes the optimal bandwidth.
A detailed proof of this theorem is given below.
Proof. 
It can be proved that the following two equations hold:
$$\mathbb{E} \hat{f}_K\left( x \right) = \int K\left( u \right) f\left( x + h_m u \right) du,$$
$$\mathbb{E} \hat{f}_K^2\left( x \right) = \frac{\int K^2\left( u \right) f\left( x + h_m u \right) du}{m h_m^n} + \frac{m - 1}{m} \left( \int K\left( u \right) f\left( x + h_m u \right) du \right)^2.$$
In fact,
$$\mathbb{E} \hat{f}_K\left( x \right) = \int \cdots \int \prod_{i=1}^{m} f\left( r_i \right) \cdot \frac{1}{m h_m^n} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_m} \right) dr_m \cdots dr_1 = \frac{1}{h_m^n} \int f\left( r \right) K\left( \frac{r - x}{h_m} \right) dr = \int f\left( x + h_m u \right) K\left( u \right) du.$$
In addition,
$$\begin{aligned}
\mathbb{E} \hat{f}_K^2\left( x \right) &= \int \cdots \int \prod_{i=1}^{m} f\left( r_i \right) \left( \frac{1}{m h_m^n} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_m} \right) \right)^2 dr_1 \cdots dr_m \\
&= \left( m h_m^n \right)^{-2} \int \cdots \int \prod_{i=1}^{m} f\left( r_i \right) \left[ \sum_{i=1}^{m} K^2\left( \frac{r_i - x}{h_m} \right) + \sum_{i \neq j}^{m} K\left( \frac{r_i - x}{h_m} \right) K\left( \frac{r_j - x}{h_m} \right) \right] dr_1 \cdots dr_m \\
&= \left( m h_m^n \right)^{-2} \left[ \sum_{i=1}^{m} \int f\left( r \right) K^2\left( \frac{r - x}{h_m} \right) dr + \sum_{i \neq j}^{m} \int f\left( r_i \right) K\left( \frac{r_i - x}{h_m} \right) f\left( r_j \right) K\left( \frac{r_j - x}{h_m} \right) dr_i \, dr_j \right] \\
&= \left( m h_m^n \right)^{-2} \left[ m \int f\left( r \right) K^2\left( \frac{r - x}{h_m} \right) dr + m \left( m - 1 \right) \left( \int f\left( r \right) K\left( \frac{r - x}{h_m} \right) dr \right)^2 \right] \\
&= \frac{1}{m h_m^n} \int K^2\left( u \right) f\left( x + h_m u \right) du + \frac{m - 1}{m} \left( \int K\left( u \right) f\left( x + h_m u \right) du \right)^2.
\end{aligned}$$
From Equation (23),
$$\mathbb{E} \hat{f}_K\left( x \right) - f\left( x \right) = \frac{h_m^2}{2} \int u^T \frac{\partial^2 f\left( x + \theta h_m u \right)}{\partial x \partial x^T} u \, K\left( u \right) du,$$
where $\theta$ is a constant between 0 and 1. According to Equations (23) and (24),
$$\mathbb{E} \hat{f}_K^2\left( x \right) - \left( \mathbb{E} \hat{f}_K\left( x \right) \right)^2 = \frac{\int K^2\left( u \right) f\left( x + h_m u \right) du}{m h_m^n} - \frac{1}{m} \left( \int K\left( u \right) f\left( x + h_m u \right) du \right)^2.$$
According to Equations (25) and (26), the following equation holds:
$$\begin{aligned}
\mathbb{E}\left( \hat{f}_K\left( x \right) - f\left( x \right) \right)^2 &= \mathbb{E} \hat{f}_K^2\left( x \right) - \left( \mathbb{E} \hat{f}_K\left( x \right) \right)^2 + \left( \mathbb{E} \hat{f}_K\left( x \right) - f\left( x \right) \right)^2 \\
&= \frac{\int K^2\left( u \right) f\left( x + h_m u \right) du}{m h_m^n} - \frac{1}{m} \left( \int K\left( u \right) f\left( x + h_m u \right) du \right)^2 + \left( \frac{h_m^2}{2} \int u^T \frac{\partial^2 f\left( x + \theta h_m u \right)}{\partial x \partial x^T} u \, K\left( u \right) du \right)^2.
\end{aligned}$$
To facilitate the subsequent reasoning, the following theorem is given.
Theorem 2.
For any matrix $\Phi$ and any kernel density function $K(\cdot)$ with a symmetric form,
$$\int x^T \Phi x \, K\left( x \right) dx = \frac{\mathrm{tr}\left( \Phi \right)}{n} \int x^T x \, K\left( x \right) dx.$$
Proof. 
If an odd function $g\left( x \right)$ is integrable on $\mathbb{R}$, then $\int g\left( x \right) dx = 0$. Similarly, it can be verified that a kernel function $K(\cdot)$ with a symmetric form satisfies
$$\int \sum_{i \neq j} \Phi_{ij} x_i x_j K\left( x \right) dx_1 \cdots dx_n = 0.$$
Then,
$$\begin{aligned}
\int x^T \Phi x \, K\left( x \right) dx &= \int \cdots \int x^T \Phi x \, K\left( x \right) dx_1 \cdots dx_n \\
&= \sum_{i} \Phi_{ii} \int x_i^2 K\left( x \right) dx_1 \cdots dx_n + \int \sum_{i \neq j} \Phi_{ij} x_i x_j K\left( x \right) dx_1 \cdots dx_n \\
&= \mathrm{tr}\left( \Phi \right) \int x_1^2 K\left( x \right) dx_1 \cdots dx_n = \frac{\mathrm{tr}\left( \Phi \right)}{n} \int x^T x \, K\left( x \right) dx.
\end{aligned}$$
Thus, the Theorem 2 is proved. □
For any unit-length vector $u \in \mathbb{R}^n$, a Taylor expansion gives
$$f\left( x + h_m u \right) = f\left( x \right) + h_m u^T \nabla f\left( x \right) + o\left( h_m \right),$$
$$\frac{\partial^2 f\left( x + \theta h_m u \right)}{\partial x_i \partial x_j} = \frac{\partial^2 f\left( x \right)}{\partial x_i \partial x_j} + \theta h_m u^T \nabla \frac{\partial^2 f\left( x \right)}{\partial x_i \partial x_j} + o\left( h_m \right).$$
If the bandwidth $h_m$ satisfies
$$\lim_{m \to \infty} h_m = 0, \qquad \lim_{m \to \infty} \frac{1}{m h_m^n} = 0,$$
then, from Equations (22)–(32), we get
$$\mathbb{E}\left( \hat{f}_K\left( x \right) - f\left( x \right) \right)^2 = \frac{c_K f\left( x \right)}{m h_m^n} + o\left( \frac{1}{m h_m^n} \right) + \frac{h_m^4 d_K^2}{4 n^2} \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) + o\left( h_m^4 \right).$$
In fact,
$$\begin{aligned}
\mathbb{E}\left( \hat{f}_K\left( x \right) - f\left( x \right) \right)^2 &= \frac{\int K^2\left( u \right) f\left( x + h_m u \right) du}{m h_m^n} - \frac{1}{m} \left( \int K\left( u \right) f\left( x + h_m u \right) du \right)^2 + \left( \frac{h_m^2}{2} \int u^T \frac{\partial^2 f\left( x + \theta h_m u \right)}{\partial x \partial x^T} u \, K\left( u \right) du \right)^2 \\
&= \frac{c_K f\left( x \right)}{m h_m^n} + o\left( \frac{1}{m h_m^n} \right) - \frac{f\left( x \right)^2}{m} + o\left( \frac{1}{m} \right) + \left( \frac{h_m^2}{2 n} \mathrm{tr}\left( \frac{\partial^2 f\left( x + \theta h_m u \right)}{\partial x \partial x^T} \right) d_K + o\left( h_m^2 \right) \right)^2 \\
&= \frac{c_K f\left( x \right)}{m h_m^n} + o\left( \frac{1}{m h_m^n} \right) + \left( \frac{h_m^2}{2 n} \mathrm{tr}\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) d_K + o\left( h_m^2 \right) \right)^2 \\
&= \frac{c_K f\left( x \right)}{m h_m^n} + o\left( \frac{1}{m h_m^n} \right) + \frac{h_m^4 d_K^2}{4 n^2} \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) + o\left( h_m^4 \right).
\end{aligned}$$
Based on Equation (33), if $\mathrm{tr}\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right)$ is square-integrable, then
$$\mathrm{MISE}\left( \hat{f}_K\left( x \right) \right) = \int \frac{c_K f\left( x \right)}{m h_m^n} dx + \frac{h_m^4 d_K^2}{4 n^2} \int \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) dx + o\left( \frac{1}{m h_m^n} \right) + o\left( h_m^4 \right) = \frac{c_K}{m h_m^n} + \frac{h_m^4 d_K^2}{4 n^2} \int \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) dx + o\left( \frac{1}{m h_m^n} \right) + o\left( h_m^4 \right).$$
When $\mathrm{MISE}\left( \hat{f}_K(\cdot) \right)$ is smallest, the derivative of Equation (35) with respect to $h_m$ vanishes:
$$\frac{\partial \, \mathrm{MISE}\left( \hat{f}_K\left( x \right) \right)}{\partial h_m} = 0.$$
Thus, the optimal bandwidth $h_m$ in Theorem 1 is obtained as
$$h_m = \left( \frac{m d_K^2}{n^3 c_K} \int \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) dx \right)^{-\frac{1}{n+4}}.$$
 □
Remark 2.
When the number of samples $m$ is fixed, an appropriate bandwidth $h_m$ can be selected via Equation (20) to construct a KDE that fits the sample distribution well. In Equation (20), the kernel function influences the bandwidth only through $c_K$ and $d_K$, which are nearly the same for different kernels; hence the kernel choice has only a slight effect on the final bandwidth.

3.2. Optimal Bandwidth Algorithm

The optimal bandwidth formula is given by Equation (20). However, $f\left( x \right)$ is unknown there, and therefore $\int \mathrm{tr}^2\left( \frac{\partial^2 f\left( x \right)}{\partial x \partial x^T} \right) dx$ is also unknown. An approximate bandwidth $h_m$ can be obtained by replacing $f\left( x \right)$ in Equation (20) with the estimate $\hat{f}_K\left( x \right)$ of Equation (16); an iterative algorithm then yields a more accurate bandwidth. Theorem 3 shows that this algorithm is convergent.
Theorem 3.
For any n-dimensional probability density function $f(\cdot)$ and Gaussian kernel function $K(\cdot)$, if $\hat{f}_K(\cdot)$ in Equation (16) is used to estimate $f(\cdot)$, then the iterative formula for $h_m$,
$$h_{m,k+1} = \left( \frac{m d_K^2}{n^3 c_K} \int \mathrm{tr}^2\left( \frac{\partial^2 \hat{f}_K\left( x, h_{m,k} \right)}{\partial x \partial x^T} \right) dx \right)^{-\frac{1}{n+4}},$$
and it is convergent, where h m , k is the value of h m during the k th iteration.
Proof. 
For the Gaussian kernel function
$$K\left( u \right) = \left( 2 \pi \right)^{-\frac{n}{2}} e^{-\frac{u^T u}{2}},$$
the quantity $u^T u$ follows a $\chi^2$ distribution with $n$ degrees of freedom under the density $K\left( u \right)$, and its expectation equals the degrees of freedom:
$$d_K = \int u^T u \, K\left( u \right) du = n.$$
In addition,
$$c_K = \int K^2\left( u \right) du = \left( 2 \pi \right)^{-n} \int e^{-u^T u} du = \left( 4 \pi \right)^{-\frac{n}{2}}.$$
Substituting Equations (39) and (40) into Equation (20) and replacing $f\left( x \right)$ with $\hat{f}_K\left( x \right)$ from Equation (16), the iterative form for $h_m$ is obtained as
$$h_{m,k+1} = \left( \frac{\left( 4 \pi \right)^{\frac{n}{2}} m}{n} \int \mathrm{tr}^2\left( \frac{\partial^2 \hat{f}_K\left( x, h_{m,k} \right)}{\partial x \partial x^T} \right) dx \right)^{-\frac{1}{n+4}} = \left( \frac{\left( 4 \pi \right)^{\frac{n}{2}}}{n m h_{m,k}^{2n}} \int \mathrm{tr}^2\left( \frac{\partial^2}{\partial x \partial x^T} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_{m,k}} \right) \right) dx \right)^{-\frac{1}{n+4}}.$$
To facilitate the subsequent reasoning, the following lemma is given as
Lemma 1.
For any functions $f_1, f_2, \ldots, f_n$,
$$\int \left( f_1 + f_2 + \cdots + f_n \right)^2 dx \leq n \int \left( f_1^2 + f_2^2 + \cdots + f_n^2 \right) dx,$$
with equality if and only if $f_1\left( x \right) = f_2\left( x \right) = \cdots = f_n\left( x \right)$ almost everywhere.
Proof. 
In fact, for any functions $f_1, f_2, \ldots, f_n$,
$$\left( f_1\left( x \right) + f_2\left( x \right) + \cdots + f_n\left( x \right) \right)^2 \leq n \left( f_1\left( x \right)^2 + f_2\left( x \right)^2 + \cdots + f_n\left( x \right)^2 \right).$$
Integrating both sides of Equation (44),
$$\int \left( f_1 + f_2 + \cdots + f_n \right)^2 dx \leq n \int \left( f_1^2 + f_2^2 + \cdots + f_n^2 \right) dx.$$
Clearly, equality in Equation (43) holds under the condition that $f_1\left( x \right) = f_2\left( x \right) = \cdots = f_n\left( x \right)$ almost everywhere. □
The second derivative of Equation (39) with respect to $x_i$ is
$$\frac{\partial^2}{\partial x_i \partial x_i} K\left( x \right) = \left( 2 \pi \right)^{-\frac{n}{2}} e^{-\frac{x^T x}{2}} \left( x_i^2 - 1 \right).$$
In addition,
$$\int \left( \frac{\partial^2}{\partial x_j \partial x_j} K\left( \frac{r_i - x}{h_{m,k}} \right) \right)^2 dx = \int \left( \frac{\partial^2}{\partial x_j \partial x_j} \left( 2 \pi \right)^{-\frac{n}{2}} e^{-\frac{\left( r_i - x \right)^T \left( r_i - x \right)}{2 h_{m,k}^2}} \right)^2 dx = \frac{3}{4} \left( 4 \pi \right)^{-\frac{n}{2}} h_{m,k}^{n - 4}.$$
From Lemma 1 and Equation (47),
$$\int \mathrm{tr}^2\left( \frac{\partial^2}{\partial x \partial x^T} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_{m,k}} \right) \right) dx \leq n m \sum_{i,j} \int \left( \frac{\partial^2}{\partial x_j \partial x_j} K\left( \frac{r_i - x}{h_{m,k}} \right) \right)^2 dx = \frac{3}{4} \left( n m \right)^2 \left( 4 \pi \right)^{-\frac{n}{2}} h_{m,k}^{n - 4}.$$
When $h_{m,k}$ is sufficiently large, $K\left( \frac{r_i - x}{h_{m,k}} \right)$ is almost the same for all $i$, so the equality in Equation (48) approximately holds. Then
$$h_{m,k+1} = \left( \frac{\left( 4 \pi \right)^{\frac{n}{2}}}{n m h_{m,k}^{2n}} \cdot \frac{3}{4} \left( n m \right)^2 \left( 4 \pi \right)^{-\frac{n}{2}} h_{m,k}^{n - 4} \right)^{-\frac{1}{n+4}} = h_{m,k} \left( \frac{3}{4} n m \right)^{-\frac{1}{n+4}} < h_{m,k}.$$
Thus, when $h_{m,k}$ is large, the iteration decreases. Because $h_{m,k}$ is bounded below, the algorithm converges. □
In summary, the KDE method based on optimal bandwidth is given (See Algorithm 1), and the flowchart of the KDE method is shown in Figure 1.
Algorithm 1: Kernel density estimation (KDE) method based on optimal bandwidth.
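For intuition, the fixed-point iteration of Theorem 3 can be sketched in one dimension, where $\mathrm{tr}\left( \partial^2 \hat{f}_K / \partial x \partial x^T \right)$ reduces to $\hat{f}_K''$ and, for the Gaussian kernel, $c_K = 1/(2\sqrt{\pi})$ and $d_K = 1$. In this sketch the integral is approximated by a Riemann sum; the starting bandwidth and the grid are arbitrary implementation choices:

```python
import numpy as np

def plugin_bandwidth_1d(r, h0=0.5, iters=30, tol=1e-8):
    """One-dimensional fixed-point iteration of Theorem 3 (Gaussian kernel).

    h_{k+1} = (m * d_K^2 / (n^3 * c_K) * I(h_k))^(-1/(n+4)) with n = 1,
    c_K = 1/(2*sqrt(pi)), d_K = 1, and I(h) = integral of f_hat''(x; h)^2,
    evaluated by a Riemann sum on a fixed grid.
    """
    m = r.size
    cK = 1.0 / (2.0 * np.sqrt(np.pi))
    x = np.linspace(r.min() - 3.0, r.max() + 3.0, 2000)
    dx = x[1] - x[0]
    h = h0
    for _ in range(iters):
        u = (x[:, None] - r[None, :]) / h
        phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
        f2 = ((u ** 2 - 1.0) * phi).sum(axis=1) / (m * h ** 3)  # f_hat''
        I = (f2 ** 2).sum() * dx
        h_new = (m * I / cK) ** (-0.2)
        if abs(h_new - h) < tol:
            return h_new
        h = h_new
    return h

h_opt = plugin_bandwidth_1d(np.random.default_rng(0).standard_normal(1000))
```

As a sanity check, for standard normal data Equation (20) with the true curvature integral $3/(8\sqrt{\pi})$ reduces to the classical value $\left( 4/(3m) \right)^{1/5}$; the plug-in iteration lands in the same order of magnitude.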

4. Fault Detection Method Based on JS Divergence Distribution

In Section 3, we constructed a multidimensional KDE method based on the optimal bandwidth, which accurately describes the density distribution of multidimensional data. In this section, the JS divergence is used to measure the distribution difference, highlighting the differences in the statistical characteristics of data from different modes.

4.1. Mode Difference Index

In Section 3, the probability density estimate of multidimensional data was obtained using the kernel method, and the optimal bandwidth formula was derived. When the system fails, its state inevitably changes, the statistical characteristics of the system output change, and the density distribution of the observed data changes significantly. For two windows of sample data $R$ and $Z$, the cross entropy $H\left( R, Z \right)$ can be used to measure their distribution difference:
$$H\left( R, Z \right) = -\int \hat{f}_{K,Z}\left( x \right) \log \hat{f}_{K,R}\left( x \right) dx,$$
where $\hat{f}_{K,R}$ and $\hat{f}_{K,Z}$ denote the optimal KDEs of $R$ and $Z$ calculated via Equation (16).
However, $H\left( R, Z \right)$ does not satisfy the definition of a distance, because it does not necessarily satisfy positive definiteness or symmetry; that is, $H\left( R, Z \right) < 0$ or $H\left( R, Z \right) \neq H\left( Z, R \right)$ is possible.
  • The smaller the distribution difference, the smaller $H\left( R, Z \right)$ is (it may even be negative), so it is reasonable to use $H\left( R, Z \right)$ to measure the distribution difference between $R$ and $Z$.
  • However, a quantitative description of distribution difference should satisfy symmetry; otherwise, exchanging the two distributions changes the measured difference, which is difficult to accept.
The JS divergence $JS\left( R, Z \right)$ was used as a measure of the distribution difference between $R$ and $Z$ in Zhang et al. [19] and Bruni et al. [20]:
$$JS\left( R, Z \right) = \int \left[ \hat{f}_{K,R} \log \hat{f}_{K,R} + \hat{f}_{K,Z} \log \hat{f}_{K,Z} - \left( \hat{f}_{K,R} + \hat{f}_{K,Z} \right) \log \frac{\hat{f}_{K,R} + \hat{f}_{K,Z}}{2} \right] dx.$$
It is easy to verify that
$$JS\left( R, Z \right) \geq 0, \qquad JS\left( R, Z \right) = JS\left( Z, R \right).$$
In this paper, Equation (52) is used to measure the distribution difference between testing data Z and training data R for realizing fault detection and isolation.
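Numerically, Equation (51) can be approximated on a grid once the two densities have been estimated. A minimal one-dimensional sketch follows; the grid and the small epsilon guarding against log(0) are implementation choices, and the two Gaussian test densities are invented for the check:

```python
import numpy as np

def js_divergence(fR, fZ, dx, eps=1e-12):
    """Riemann-sum approximation of the JS divergence of Equation (51).

    fR, fZ: density values of the two modes on a common uniform grid
    with spacing dx.  Symmetric and non-negative by construction.
    """
    mix = 0.5 * (fR + fZ)
    integrand = (fR * np.log((fR + eps) / (mix + eps))
                 + fZ * np.log((fZ + eps) / (mix + eps)))
    return integrand.sum() * dx

x = np.linspace(-10, 20, 6001)
dx = x[1] - x[0]
norm = lambda mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)
js_same = js_divergence(norm(0.0), norm(0.0), dx)   # identical modes
js_far = js_divergence(norm(0.0), norm(10.0), dx)   # well-separated modes
```

Identical densities give zero, and essentially disjoint densities approach the upper bound $2 \log 2$ of this (unhalved) form of the JS divergence.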

4.2. Mode Discrimination Method

If the training data contain $q$ modes $R_1, R_2, \ldots, R_q$, the set of JS divergences
$$JS\left( Z, R_1 \right), JS\left( Z, R_2 \right), \ldots, JS\left( Z, R_q \right)$$
between the testing data $Z$ and the different modes can be calculated using Equation (51).
Let $i_0$ be the mode label corresponding to the minimum JS divergence:
$$i_0 = \arg\min \left\{ JS\left( Z, R_1 \right), JS\left( Z, R_2 \right), \ldots, JS\left( Z, R_q \right) \right\}.$$
It is then reasonable to assume that the testing data $Z$ and the training data $R_{i_0}$ belong to the same mode. However, for a new failure mode unknown at design time, this rule would assign the testing data $Z$ to the known failure mode $i_0$, which is obviously unreasonable. If $JS\left( Z, R_{i_0} \right)$ is too large, we instead believe that $Z$ comes from an unknown new failure mode, labeled $q + 1$. How to obtain the threshold $JS_{\mathrm{high}}$ for $JS\left( Z, R_{i_0} \right)$ is then the question to investigate; a method to determine $JS_{\mathrm{high}}$ is provided below.
For the training data $R_{i_0} = \left( r_1, r_2, \ldots, r_m \right)$ of mode $i_0$, the density estimate of the data set can be obtained using Equation (16):
$$\hat{f}_{K,R}\left( x \right) = \frac{1}{m h_m^n} \sum_{i=1}^{m} K\left( \frac{r_i - x}{h_m} \right).$$
In addition, fixing the length of the sampling window as $p$ ($p < m$) and sliding the window yields the sample sets $R_j = \left( r_j, r_{j+1}, \ldots, r_{j+p} \right) \subset R_{i_0}$, $j = 1, 2, \ldots, m - p$. For each $R_j$, the density estimate is
$$\hat{f}_{K,R_j}\left( x \right) = \frac{1}{p h_p^n} \sum_{i=j}^{j+p} K\left( \frac{r_i - x}{h_p} \right).$$
Using Equation (52), the divergence between the training data $R$ and the sample data $R_j$ is
$$JS_j = JS\left( R, R_j \right) = H\left( \hat{f}_{K,R} + \hat{f}_{K,R_j}, \; \frac{\hat{f}_{K,R} + \hat{f}_{K,R_j}}{2} \right) - H\left( \hat{f}_{K,R} \right) - H\left( \hat{f}_{K,R_j} \right),$$
where the two-argument $H$ denotes the cross entropy of Equation (50) and the one-argument $H\left( f \right) = -\int f \log f \, dx$ denotes the differential entropy.
Using Equation (55), we obtain a set of JS divergence values
$$\mathbf{JS} = \left\{ JS_1, JS_2, \ldots, JS_{m-p} \right\}.$$
This set provides the estimate $\hat{f}_{JS}\left( x \right)$ of the density function $f_{JS}\left( x \right)$ of the JS divergence:
$$\hat{f}_{JS}\left( x \right) = \frac{1}{\left( m - p \right) h_{m-p}} \sum_{j=1}^{m-p} K\left( \frac{JS_j - x}{h_{m-p}} \right).$$
If the significance level is $\alpha$, the probability mass of $\hat{f}_{JS}\left( x \right)$ beyond the threshold $JS_{\mathrm{high}}$ must satisfy
$$P_0 = \int_{JS_{\mathrm{high}}}^{+\infty} \hat{f}_{JS}\left( x \right) dx < \alpha.$$
Because the distribution of the JS divergence is not a common random distribution, the quantile cannot be obtained from a table; it can only be obtained by numerical integration. If $h$ is the integration step size and
$$\int_{h \left( i - 1 \right)}^{+\infty} \hat{f}_{JS}\left( x \right) dx \geq \alpha \geq \int_{h i}^{+\infty} \hat{f}_{JS}\left( x \right) dx,$$
it is reasonable to set
$$JS_{\mathrm{high}} = h \cdot i.$$
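The threshold computation of Equations (57)–(60) can be sketched as follows. Silverman's rule is used here as a stand-in for the bandwidth of the one-dimensional KDE over the JS samples, and the synthetic JS values are invented for the check:

```python
import numpy as np

def js_threshold(js_values, alpha=0.05, step=1e-4):
    """JS_high such that the upper-tail mass of f_hat_JS is below alpha.

    Fits a 1-D Gaussian KDE to the sliding-window JS samples
    (Equation (57)) and accumulates the tail integral of Equation (58)
    on a grid of spacing `step` until it drops below alpha
    (Equations (59)-(60)).  Silverman's rule sets the KDE bandwidth.
    """
    js = np.asarray(js_values, dtype=float)
    m = js.size
    h_kde = 1.06 * js.std() * m ** (-0.2)              # Silverman stand-in
    grid = np.arange(js.min() - 5 * h_kde, js.max() + 5 * h_kde, step)
    u = (grid[:, None] - js[None, :]) / h_kde
    f = np.exp(-0.5 * u ** 2).sum(axis=1) / (m * h_kde * np.sqrt(2 * np.pi))
    tail = f[::-1].cumsum()[::-1] * step               # upper-tail mass
    idx = np.argmax(tail < alpha)                      # first point below alpha
    return grid[idx]

rng = np.random.default_rng(0)
js_samples = rng.normal(0.10, 0.01, 500)               # synthetic JS values
js_high = js_threshold(js_samples)
```

For approximately normal JS samples the returned threshold sits near the upper $\left( 1 - \alpha \right)$ quantile, slightly widened by the KDE smoothing.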
The following fault detection and isolation criteria are constructed by Equation (58).
Criterion 1.
Let $i_0$ be the mode label corresponding to the minimum JS divergence (see Equation (38)), let $R_{i_0} = \left( r_1, r_2, \ldots, r_m \right)$ be the training data of mode $i_0$, and let $JS_{\mathrm{high}}$ be the upper bound of the JS divergence (see Equation (56)). If the testing data $Z = \left( z_1, z_2, \ldots, z_l \right)$ satisfy
$$JS\left( Z, R_{i_0} \right) \leq JS_{\mathrm{high}},$$
then the testing data $Z$ and the training data $R_{i_0}$ belong to the same mode; otherwise, the testing data $Z$ are considered to originate from an unknown new failure mode, labeled $q + 1$.
In conclusion, the fault diagnosis method based on optimal bandwidth is provided (See Algorithm 2), and the corresponding fault diagnosis method flowchart is shown in Figure 2.
Algorithm 2: Fault Diagnosis Method Based on Optimal KDE.
Remark 3.
Equations (54) and (55) show that the calculation result of JS divergence is directly related to the length of sampling data. Indeed, with the increase in the sampling data length, the density estimation obtained by Equation (54) can describe the distribution characteristics of samples more effectively, thereby significantly improving the accuracy of fault detection.

5. Numerical Simulation

The bearing data from the Case Western Reserve University Bearing Data Center were used as the diagnosis research object; this data set has served as a benchmark in many fault diagnosis studies, such as Smith and Randall [21], Lou and Loparo [22], and Rai and Mohanty [23].
The sampling frequency of the motor data was 12 kHz, the default sampling frequency for the Case Western Reserve University Bearing Data Center. The dataset contains four groups of sample data: normal data ( f 0 ), 0.007 inch inner raceway fault data ( f 1 ), 0.014 inch inner raceway fault data ( f 2 ), and 0.014 inch outer raceway fault data ( f 3 ). Each group of data has two dimensions: the acceleration data of the drive end ( f i D E ) and the acceleration data of the fan end ( f i F E ). All experiments were conducted on a Lenovo machine with a Ryzen 3700X CPU (3.60 GHz) and 16 GB RAM.

5.1. Data Preprocessing

The observed data in the process of the bearing operation show obvious periodicity, which needs to be eliminated. Taking normal data f 0 as an example, the main frequency in the observed signal can be obtained by fast Fourier transform (FFT), and the result of the FFT is shown in Figure 3.
Figure 3 indicates that the main frequency is approximately 1036 Hz, and thus, the basis function is constructed as
f(t) = \left( 1,\ \sin(1036 \times 2\pi t),\ \cos(1036 \times 2\pi t) \right)^{T} .
The estimation of β calculated using Equation (7) is
\hat{\beta} = \begin{pmatrix} 0.0116 & 0.0158 \\ 0.0548 & 0.0280 \\ 0.0326 & 0.0396 \end{pmatrix} .
Thus, the data after removing the intrinsic signal are shown in Figure 4, where Figure 4a represents the acceleration data of the drive end and Figure 4b represents the acceleration data of the fan end.
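The preprocessing chain, finding the dominant frequency with the FFT and then removing the periodic component by least squares against the basis (1, sin, cos), can be sketched on a synthetic stand-in signal. The 1036 Hz trend amplitude and noise level below are invented for illustration and are not the CWRU data.

```python
import numpy as np

fs = 12_000                                  # sampling frequency (Hz)
t = np.arange(3_000) / fs                    # 0.25 s of data
rng = np.random.default_rng(1)
# Synthetic stand-in for one channel: 1036 Hz periodic trend plus noise
x = 0.05 * np.sin(2 * np.pi * 1036 * t) + 0.02 * rng.standard_normal(t.size)

# Locate the main frequency from the single-sided amplitude spectrum
spec = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(x.size, d=1 / fs)
f_main = freqs[spec[1:].argmax() + 1]        # skip the DC bin

# Least-squares estimate of beta over the basis [1, sin, cos] (cf. Equation (7))
F = np.column_stack([np.ones_like(t),
                     np.sin(2 * np.pi * f_main * t),
                     np.cos(2 * np.pi * f_main * t)])
beta, *_ = np.linalg.lstsq(F, x, rcond=None)
residual = x - F @ beta                      # data with the intrinsic signal removed
```

The residual retains only the non-periodic part of the signal, which is what the later KDE-based detection operates on.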
In the subsequent fault detection process, the data of all modes are preprocessed by the same operation, and the results are recorded as f i .

5.2. Fault Detection Effect

5.2.1. Normal Data and Known Faults

For the normal data f 0 and the known faults f 1 , f 2 , the first 20,480 sample points are selected as the training set, recorded as f i train . The last 81,920 sample points are taken as the testing set, recorded as f i test . A total of 128 sample points are used as the detection object in each test. The training set data are shown in Figure 5, where Figure 5a,b represent the data f i train , i = 1 , 2 of the two dimensions, respectively.
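The split above, 20,480 training points followed by 81,920 testing points read out in 128-sample detection windows, can be sketched as follows (the helper name and the placeholder signal are assumptions):

```python
import numpy as np

def split_windows(x, n_train=20_480, n_test=81_920, win=128):
    """Split one channel into its training segment and the non-overlapping
    128-sample detection windows drawn from the testing segment."""
    train = x[:n_train]
    test = x[n_train:n_train + n_test]
    windows = test[: (len(test) // win) * win].reshape(-1, win)
    return train, windows

x = np.arange(102_400, dtype=float)          # placeholder signal of matching length
train, windows = split_windows(x)            # 81,920 / 128 = 640 test windows
```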
Figure 5 shows that the bearing data have high frequency, and the fault does not change the observed mean value; however, it changes the dispersion characteristics or the correlation of data.

5.2.2. Unknown Fault

The training data do not necessarily contain all types of patterns, and the detection of unknown faults has always been a difficult problem. f 3 is used as an unknown fault for fault detection; the training set contains no information about f 3 . The unknown fault data are shown in Figure 6, where Figure 6a represents the acceleration data at the drive end and Figure 6b represents the acceleration data at the fan end.
Figure 6 shows that the data of the unknown fault are close to the other two types of fault data. If the fault detection method is not sufficiently sensitive, the detection rate will drop significantly.

5.2.3. Detection Effect

The characteristics of the bearing data make bearing fault detection extremely challenging. The input is the training set f 0 train , the estimation accuracy is ε = 10 − 4 , and the maximum number of iterations is k max = 100 . According to Algorithm 1, the optimal bandwidth is
h m = 0.0445 .
The KDE of the training set is obtained by Equation (15), and the results are shown in Figure 7, where Figure 7a,c,e represent the two-dimensional frequency histograms of the training data f i train , i = 0 , 1 , 2 , and Figure 7b,d,f represent the two-dimensional KDE of the training data f i train , i = 0 , 1 , 2 .
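Given the bandwidth produced by Algorithm 1, the two-dimensional KDE of Equation (15) can be sketched with a product Gaussian kernel as below. The product-kernel form and the synthetic data are assumptions for illustration; Algorithm 1's iterative bandwidth search itself is not reproduced, so the reported optimum h m = 0.0445 is passed in directly.

```python
import numpy as np

def kde2d(samples, points, h):
    """Two-dimensional Gaussian KDE with a common bandwidth h per dimension.
    samples: (n, 2) training data; points: (m, 2) evaluation points."""
    n = samples.shape[0]
    d = (points[:, None, :] - samples[None, :, :]) / h   # (m, n, 2)
    k = np.exp(-0.5 * (d ** 2).sum(axis=2))              # product Gaussian kernel
    return k.sum(axis=1) / (n * 2.0 * np.pi * h ** 2)    # density at each point

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 0.2, size=(500, 2))            # stand-in training data
dens = kde2d(samples, np.array([[0.0, 0.0], [1.0, 1.0]]), h=0.0445)
# density is far higher at the data centre (0, 0) than at the remote point (1, 1)
```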
Figure 7 further shows that the bearing fault changes the dispersion characteristics and data correlation. Meanwhile, Figure 7 shows that the KDE of the training data obtained by Equation (15) is in good agreement with the distribution of the training data; therefore, this method can faithfully describe the distribution of multidimensional data.
The JS divergence of the training data and KDE of the distribution are obtained by Equations (51) and (58); the results are shown in Figure 8.
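The JS divergence between two estimated densities can be evaluated on a shared grid; below is a minimal one-dimensional sketch. The grid range and the Gaussian test densities are illustrative, and natural logarithms are used, so the value is bounded by ln 2.

```python
import numpy as np

def js_divergence(p, q, dx):
    """Jensen-Shannon divergence between two densities sampled on a common
    grid with spacing dx: JS = (KL(p || m) + KL(q || m)) / 2, m = (p + q) / 2."""
    p = p / (p.sum() * dx)                   # renormalise on the grid
    q = q / (q.sum() * dx)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                         # m > 0 wherever a > 0, so b[mask] is safe
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

x = np.linspace(-5.0, 5.0, 2001)
dx = x[1] - x[0]
gauss = lambda mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)
same = js_divergence(gauss(0.0), gauss(0.0), dx)   # identical densities
far = js_divergence(gauss(0.0), gauss(2.0), dx)    # well-separated densities
```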
When the significance level is α = 95 % , the detection thresholds of the training set, calculated using Equation (58), are
f 0 : J S high = 0.1375 , f 1 : J S high = 0.0995 , f 2 : J S high = 0.1225 .
Thus, the detection results of the JS divergence method on the testing data are shown in Figure 9. If a detection point falls within the threshold, the data to be detected belong to the same pattern; otherwise, they belong to a different pattern.
Furthermore, detection rates using different methods are shown in Table 1.
For the known faults, Table 1 indicates that the bearing fault identification based on multidimensional KDE and JS divergence achieves better results than the T 2 statistics detection method on the testing data. The detection rate of normal data f 0 increases from 95.08 % to 97.03 % , that of fault data f 1 increases from 81.33 % to 95.81 % , and that of fault data f 2 increases from 70.69 % to 95.36 % . Meanwhile, compared with the cross-entropy method, the detection rate of normal data f 0 increases from 96.95 % to 97.03 % ; that of fault data f 1 , from 94.41 % to 95.81 % ; and that of fault data f 2 , from 94.19 % to 95.36 % .
For the unknown fault f 3 , Table 1 shows that the T 2 statistics detection method cannot detect unknown faults at all. The cross-entropy method detects unknown faults with a rate of only 53.16 % , which is unsatisfactory. The JS divergence method constructed in this study identifies the unknown fault more accurately, with a detection rate of 69.49 % , because JS divergence measures the differences between distributions more accurately.

5.3. Influence of Window Width on Fault Diagnosis

The fault diagnosis effect is related to the data window width; therefore, the fault diagnosis effect under different window widths is investigated. The results are shown in Figure 10.
Figure 10 indicates that, as the detection window grows, the detection performance of the proposed method for known faults first rises and then stabilizes. This is because once the detection window reaches a certain length, the data to be detected already contain sufficient information, and further enlarging the window contributes little to the detection rate. For unknown faults, by contrast, the detection rate increases rapidly with the window length, because a longer window carries more information about the data to be detected and better characterizes the difference between the unknown fault and the known faults.

6. Conclusions

In this study, a method of bearing fault detection and identification was constructed using multidimensional KDE and JS divergence. The distribution characteristics of the JS divergence between the sample density distribution and the population density distribution were derived using the sliding sampling window method. Thus, the threshold for fault detection was provided, so that different faults, especially unknown faults, could be identified. The theory showed that the multidimensional KDE method can reduce the information loss caused by processing each dimension separately, and that JS divergence measures differences in density distributions more accurately than the traditional cross entropy. The experimental results verified these conclusions.
For a known fault, the detection effect of this method was obviously better than that of the traditional method, and it also had a certain degree of improvement compared with the cross-entropy method. Second, for unknown faults, the traditional method could not detect the distribution difference accurately, while the detection effect of the proposed method was significantly improved.
Furthermore, the detection effect of this method depends on the window width. The detection effect improved with a growth in the detection window. In this paper, under the condition of a given window width, the estimation formula for the optimal bandwidth of a multidimensional KDE was provided. The experimental results showed that the formula was applicable to any mode of data, and therefore, it had a certain universality.
However, this study has certain limitations. Firstly, although the calculation formula of multidimensional KDE is given in this study, the computational complexity will increase when the dimension is large, which may restrict the further application of the method. Secondly, the calculation of JS divergence is time consuming, which is not conducive to rapid fault diagnosis.
In future research, we can try to use PCA dimension reduction to address the computational complexity caused by very high dimensions, and optimize the algorithmic flow of the JS divergence to expedite the calculation. In a recent study, Ginzarly et al. [24] treated the prognosis of a vehicle's electrical machine using a hidden Markov model after modeling the machine with the finite element method. We will try to combine this method in future work and apply it to the fault detection of other systems.

Author Contributions

Conceptualization and methodology, J.W. (Juhui Wei); formal analysis and visualization, Z.H.; validation and data curation, J.W. (Jiongqi Wang); resources, D.W.; writing—review and editing, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 61903366, 61903086, and 61773021; the Natural Science Foundation of Hunan Province, grant numbers 2019JJ50745, 2019JJ20018, and 2020JJ4280; and the Foundation of Beijing Institute of Control Engineering, grant number HTKJ2019KL502007.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors express their appreciation to the Associate Editor and anonymous reviewers for their helpful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
KDE  kernel density estimation
JS   Jensen–Shannon
PCA  principal component analysis
MISE mean integrated squared error

References

  1. Sheather, S.J.; Jones, M.C. A reliable data-based bandwidth selection method for kernel density estimation. J. R. Stat. Soc. 1991, 53, 683–690. [Google Scholar] [CrossRef]
  2. Muir, D. Multidimensional Kernel Density Estimates over Periodic Domains. Circular Statistics. 2017. Available online: https://www.mathworks.com/matlabcentral/fileexchange/44129-multi-dimensional-kernel-density-estimates-over-periodic-domains (accessed on 21 February 2021).
  3. Laurent, B. Efficient estimation of integral functionals of a density. Ann. Stat. 1996, 24, 659–681. [Google Scholar] [CrossRef]
  4. Sugumaran, V.; Ramachandran, K.I. Fault diagnosis of roller bearing using fuzzy classifier and histogram features with focus on automatic rule learning. Expert Syst. Appl. 2011, 38, 4901–4907. [Google Scholar] [CrossRef]
  5. Scott, D.W. Averaged shifted histograms: Effective nonparametric density estimators in several dimensions. Ann. Stat. 1985, 13, 1024–1040. [Google Scholar] [CrossRef]
  6. Saruhan, H.; Sarıdemir, S.; Çiçek, A.; Uygur, İ. Vibration analysis of rolling element bearings defects. J. Appl. Res. Technol. 2014, 12, 384–395. [Google Scholar] [CrossRef]
  7. Razavi-Far, R.; Farajzadeh-Zanjani, M.; Saif, M. An integrated class-imbalanced learning scheme for diagnosing bearing defects in induction motors. IEEE Trans. Ind. Inform. 2017, 13, 2758–2769. [Google Scholar] [CrossRef]
  8. Harmouche, J.; Delpha, C.; Diallo, D. Incipient fault amplitude estimation using kl divergence with a probabilistic approach. Signal Process. 2016, 120, 1–7. [Google Scholar] [CrossRef] [Green Version]
  9. He, Z.; Shardt, Y.A.W.; Wang, D.; Hou, B.; Zhou, H.; Wang, J. An incipient fault detection approach via detrending and denoising. Control Eng. Pract. 2018, 74, 1–12. [Google Scholar] [CrossRef]
  10. Demetriou, M.A.; Polycarpou, M.M. Incipient fault diagnosis of dynamical systems using online approximators. IEEE Trans. Autom. Control 1998, 43, 1612–1617. [Google Scholar] [CrossRef]
  11. Zhang, X.; Polycarpou, M.M.; Parisini, T. A robust detection and isolation scheme for abrupt and incipient faults in nonlinear systems. IEEE Trans. Autom. Control 2002, 47, 576–593. [Google Scholar] [CrossRef]
  12. Fu, F.; Wang, D.; Li, W.; Li, F. Data-driven fault identifiability analysis for discrete-time dynamic systems. Int. J. Syst. Sci. 2020, 51, 404–412. [Google Scholar] [CrossRef]
  13. Itani, S.; Lecron, F.; Fortemps, P. A one-class classification decision tree based on kernel density estimation. Appl. Soft Comput. 2020, 91, 106250. [Google Scholar] [CrossRef] [Green Version]
  14. Kong, Y.; Li, D.; Fan, Y.; Lv, J. Interaction pursuit in high-dimensional multi-response regression via distance correlation. Ann. Stat. 2017, 45, 897–922. [Google Scholar] [CrossRef] [Green Version]
  15. Jones, M.C.; Sheather, S.J. Using non-stochastic terms to advantage in kernel-based estimation of integrated squared density derivatives. Stat. Probab. Lett. 1991, 11, 511–514. [Google Scholar] [CrossRef]
  16. Desforges, M.J.; Jacob, P.J.; Ball, A.D. Fault detection in rotating machinery using kernel-based probability density estimation. Int. J. Syst. Sci. 2000, 31, 1411–1426. [Google Scholar] [CrossRef]
  17. Solomons, L.M.; Hotelling, H. The limits of a measure of skewness. Ann. Math. Stat. 1932, 3, 141–142. [Google Scholar]
  18. Rao, P. Nonparametric Functional Estimation; Elsevier: Amsterdam, The Netherlands, 1983. [Google Scholar]
  19. Zhang, X.; Delpha, C.; Diallo, D. Incipient fault detection and estimation based on Jensen–Shannon divergence in a data-driven approach. Signal Process. 2019, 169, 107410. [Google Scholar] [CrossRef]
  20. Bruni, V.; Rossi, E.; Vitulano, D. On the equivalence between Jensen–Shannon divergence and Michelson contrast. IEEE Trans. Inf. Theory 2012, 58, 4278–4288. [Google Scholar] [CrossRef]
  21. Smith, W.A.; Randall, R.B. Rolling element bearing diagnostics using the case western reserve university data: A benchmark study. Mech. Syst. Signal Process. 2015, 64–65, 100–131. [Google Scholar] [CrossRef]
  22. Lou, X.; Loparo, K.A. Bearing fault diagnosis based on wavelet transform and fuzzy inference. Mech. Syst. Signal Process. 2004, 18, 1077–1095. [Google Scholar] [CrossRef]
  23. Rai, V.K.; Mohanty, A.R. Bearing fault diagnosis using FFT of intrinsic mode functions in Hilbert–Huang transform. Mech. Syst. Signal Process. 2007, 21, 2607–2615. [Google Scholar] [CrossRef]
  24. Ginzarly, R.; Hoblos, G.; Moubayed, N. From modeling to failure prognosis of permanent magnet synchronous machine. Appl. Sci. 2020, 10, 691. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Flowchart of KDE method based on optimal bandwidth.
Figure 2. Flowchart of fault diagnosis method based on optimal KDE.
Figure 3. Single-sided amplitude spectrum of f 0 .
Figure 4. Preprocessed data to remove trends by fast Fourier transform (FFT).
Figure 5. Training data f 1 , f 2 after being preprocessed.
Figure 6. Training data f 3 after being preprocessed.
Figure 7. Training data after being preprocessed.
Figure 8. The results of detection threshold.
Figure 9. Fault detection effect using JS divergence as index.
Figure 10. Fault diagnosis effect under different window width h m .
Table 1. Detection rate of normal and different failure modes using different methods.

Method | T 2 Statistics Detection | Cross Entropy | JS Divergence
Normal mode f 0 | 95.80 % | 96.95 % | 97.03 %
Known fault f 1 | 83.47 % | 94.41 % | 95.81 %
Known fault f 2 | 78.11 % | 94.19 % | 95.36 %
Unknown fault f 3 | \ | 53.16 % | 69.49 %
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Wei, J.; He, Z.; Wang, J.; Wang, D.; Zhou, X. Fault Detection Based on Multi-Dimensional KDE and Jensen–Shannon Divergence. Entropy 2021, 23, 266. https://doi.org/10.3390/e23030266