Fault Detection Based on Multi-Dimensional KDE and Jensen–Shannon Divergence

Weak fault signals, highly coupled data, and unknown faults commonly exist in fault diagnosis systems, causing low detection and identification performance of fault diagnosis methods based on T2 statistics or cross entropy. This paper proposes a new fault diagnosis method based on optimal-bandwidth kernel density estimation (KDE) and the Jensen–Shannon (JS) divergence distribution for improved fault detection performance. KDE addresses weak-signal and coupled-fault detection, and JS divergence addresses unknown fault detection. Firstly, the formula and algorithm for the optimal bandwidth of multidimensional KDE are presented, and the convergence of the algorithm is proved. Secondly, the JS divergence between the data is obtained based on the optimal KDE and used for fault detection. Finally, a fault diagnosis experiment based on the bearing data from the Case Western Reserve University Bearing Data Center is conducted. The results show that, for known faults, the proposed method achieves detection rates 10% and 2% higher than the T2 statistics and cross-entropy methods, respectively. For unknown faults, T2 statistics cannot effectively detect faults, and the proposed method achieves a detection rate approximately 15% higher than the cross-entropy method. Thus, the proposed method can effectively improve the fault detection rate.


Introduction
The development of industrial informatization has given rise to a large amount of data in various fields, making data processing a difficult problem in industry, especially for fault diagnosis. The explosive growth of data provides more information, yet typical data analysis theories often fail to achieve the necessary results. The main reason for this failure is that typical data analysis theory often sets the data distribution type through prior information and performs analyses based on this assumption. Once the distribution type is set, subsequent estimation and parametric analysis are restricted to that distribution type; however, as data grow and more information becomes available, the assumed type of data distribution may need to be modified. As a nonparametric estimation method, kernel density estimation (KDE) is well suited to today's massive data. KDE makes no a priori assumption about the overall data distribution and starts directly from the sample data. When the sample size is sufficient, KDE can approximate different distributions. Furthermore, Sheather and Jones [1] provide the optimal bandwidth estimation formula for one-dimensional KDE and prove that the kernel estimator is asymptotically unbiased and consistent in density estimation. However, as the dimension grows, multidimensional KDE becomes more complex, and its optimal bandwidth formula has not been provided. The distribution of multidimensional data has been described to a certain extent by estimating the kernel density of the data reduced to different dimensions (Muir [2], Laurent [3]). In fact, the optimal KDE of multidimensional data is a problem that needs to be studied further.
In the field of fault diagnosis, an essential problem is measuring the difference between samples. A frequency histogram has been used to indicate the distribution difference between two samples (Sugumaran and Ramachandran [4], Scott [5]); however, this method has three shortcomings: (1) the large number of discrete operations requires considerable time; (2) the process depends on the subjective selection of the intervals; (3) there is no intuitive index to reflect the difference. Based on KDE, the JS divergence can be used to measure the difference in data distribution, which overcomes the above shortcomings to a certain extent. For example, the failure of a rolling bearing, a key component of mechanical equipment, seriously affects the safe and stable operation of the equipment; incipient fault detection of rolling bearings can prevent equipment from running with faults and avoid serious safety accidents and economic losses, which has important practical and engineering significance.
In Saruhan et al. [6], vibration analysis of rolling element bearing (REB) defects is studied; REBs are the most widely used mechanical parts in rotating machinery under high loads and high rotational speeds. In addition, the characteristics of bearing faults are analyzed in detail in Razavi-Far et al. [7] and Harmouche et al. [8]. Compared with traditional fault diagnosis, the fault diagnosis of rolling bearings is more complex:
• The fault signal is weak: Bearing data are high-frequency data, and the fault signal is often covered by these high-frequency signals, leading to the failure of traditional fault diagnosis methods. KDE is highly accurate in describing data distribution, so it can identify weak signals.
• The data are highly coupled: Bearing data take the form of vibration signals, and there is strong coupling between the signals of different dimensions, making fault diagnosis difficult. Multidimensional KDE plays an important role in depicting the correlation of data and can characterize the relationship between different dimensions.
• The data set is incomplete: Most bearings work under normal conditions, and the collected fault data are often few, which makes the fault data set imperfect and increases the difficulty of fault detection. The fault detection method constructed with JS divergence can deal with unknown faults and incomplete data sets without using additional data sets.
To overcome these problems, in-depth research has been conducted on this topic. Fault detection technology based on trend elimination and noise reduction has been proposed previously (He et al. [9], Demetriou and Polycarpou [10]): the signal trend is removed and the signal-to-noise ratio is enhanced by noise reduction, thereby improving the fault detection effect. However, this approach uses traditional detection methods and cannot effectively solve the problem of data coupling. In Zhang et al. [11] and Fu et al. [12], a fault detection method based on PCA dimension reduction and modal decomposition feature extraction is proposed. For multidimensional data, PCA is performed to reduce the data dimensions and eliminate correlation between dimensions; then, modal decomposition is used to extract features for fault detection. This method can effectively handle strong coupling in the data; however, some information is lost in the PCA dimension reduction, which reduces the fault detection effect. In Itani et al. [13], Kong et al. [14], Jones and Sheather [15], and Desforges et al. [16], bearing fault detection methods based on KDE are proposed. These studies analyzed the feasibility of the KDE method for fault detection and combined it with different classification methods in experiments. However, these methods only use one-dimensional KDE and cannot directly describe high-dimensional data.
In previous work, the data distribution is reconstructed by KDE, and a cross-entropy function is constructed to measure the distribution difference in order to improve fault detection results. However, this approach cannot reflect the correlation between different dimensions, and the cross-entropy function is not precise in describing the density distribution, which reduces the fault detection effect, especially for unknown faults not included in the fault set.
In this study, the KDE method is extended to multidimensional data to avoid the information loss caused by applying KDE to each dimension separately, and to better describe the probability density distribution of the data. Meanwhile, this study improves the traditional method, which uses the cross-entropy function as the measure of density distribution difference, by using JS divergence instead, thereby avoiding the asymmetry of the cross-entropy function. Most fault identification methods are based only on distance measurement; however, relying only on distance measurement cannot effectively detect unknown faults. Based on JS divergence, the distribution characteristics of the JS divergence between the sample density distribution and the population density distribution are derived using the sliding window principle. Thus, the detection threshold of fault identification is assigned to realize the identification of unknown faults. The remainder of this paper is organized as follows. In Section 2, the trend elimination method and detection method are introduced, the intrinsic and extrinsic signals in the observation data are separated, and the fault detection threshold is constructed via statistics. In Section 3, the KDE method is extended to multidimensional data, the optimal bandwidth is derived, and JS divergence is employed to measure the difference between probability density distributions. In Section 4, the sliding window principle is used to sample the training data to obtain the distribution characteristics of the JS divergence between the sample density distribution and the overall density distribution, and the detection threshold of fault identification is obtained using the KDE method. In Section 5, the normal data, two known faults, and one unknown fault are identified using the bearing data of the Case Western Reserve University Bearing Data Center as the fault diagnosis data.
The experimental results show that the method can identify all types of faults well.

T2 Statistics Fault Detection
In the operation of complex equipment or systems, the observed state can be divided into intrinsic and extrinsic parts. In general, the intrinsic part represents the main working state of the system, which has a certain trend, monotonicity, and periodicity. The extrinsic part represents system noise, which is zero-mean, high-frequency, and statistically stationary. For the intrinsic part, the state equation of the system can be used to describe its law. When a fault occurs in the intrinsic part, the symptoms are relatively significant, and the corresponding fault detection methods are relatively mature. However, for high-frequency vibration signals, the incipient fault is often hidden in the extrinsic part, where it is easily covered by noise. Therefore, it is necessary to analyze the observed data in depth.

Signal Decomposition
In the initial operation stage of the equipment, the unstable operation of the system causes large data fluctuations, which not only have a great effect on the system trend but also affect the statistical characteristics of the data. Therefore, it is necessary to truncate the data to remove unstable signals [9]. The corresponding times of the time series after removing the nonstationary period are t_1, t_2, ..., t_m, and the following m observation data are obtained: Y = [y(t_1), y(t_2), ..., y(t_m)].
Each sample y(t_i) contains n features, expressed in component form as y(t_i) = [y_1(t_i), y_2(t_i), ..., y_n(t_i)]^T, i = 1, 2, ..., m.
Then, the data Y can be decomposed into Y = Ŷ + R, where Ŷ denotes the intrinsic part, composed of the trend, and R denotes the extrinsic part, composed of observation noise and fault data. The intrinsic part is composed of multiple signals, and selecting an appropriate basis function f(t) = [f_1(t), f_2(t), ..., f_s(t)]^T helps describe it: Ŷ = βF, where β ∈ R^{n×s} is a coefficient matrix and F = [f(t_1), f(t_2), ..., f(t_m)]. Traversing the m data to model the nonlinear data Y by least squares, the efficient estimator of β is β̂ = Y F^T (F F^T)^{-1} (Equation (7)). Using this estimator, the signal can be decomposed into R = Y − β̂F (Equation (8)). Usually, the choice of the basis function is a problem worthy of discussion, and it depends on prior knowledge of the practical application scenario; however, this is not the focus of this paper and is therefore not covered here.

Remark 1.
For the bearing data, the data are generally stationary and periodic. Therefore, the Fourier transform is usually used to extract periodic features instead of more complex basis functions, such as polynomial or wavelet basis functions.
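As a concrete illustration of the decomposition above, the following sketch fits a Fourier basis by least squares and splits a signal into its intrinsic (trend) and extrinsic (residual) parts. The function name `decompose_signal` and the `freqs` input are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def decompose_signal(y, t, freqs):
    """Split an observed signal into an intrinsic (periodic trend) part and
    an extrinsic (residual) part via least squares on a Fourier basis.

    y     : (m,) observed samples
    t     : (m,) sampling times
    freqs : dominant frequencies, e.g. picked from an FFT of y
    """
    # Basis f(t): one sine and one cosine column per dominant frequency
    F = np.column_stack(
        [np.sin(2.0 * np.pi * f * t) for f in freqs]
        + [np.cos(2.0 * np.pi * f * t) for f in freqs]
    )
    # Least-squares coefficient estimate (the efficient estimator of beta)
    beta, *_ = np.linalg.lstsq(F, y, rcond=None)
    trend = F @ beta          # intrinsic part Y_hat
    return trend, y - trend   # extrinsic part R
```

The residual returned here plays the role of the extrinsic data R used by the detection methods in the following sections.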

T2 Statistics Detection
For simplicity, denote r_i = r(t_i), i = 1, 2, ..., m. According to Equation (8), the training data after signal decomposition are R = [r_1, r_2, ..., r_m], which is generally considered a normal random vector with expectation 0, i.e., r_i ~ N(0, Σ), where Σ denotes the total covariance matrix. When the covariance matrix Σ is unknown, its unbiased estimation is given by

Σ̂ = (1 / (m − 1)) Σ_{i=1}^{m} r_i r_i^T.

Let Z = [z_1, z_2, ..., z_p] be the data in the test window to be tested; the sample mean z̄ is

z̄ = (1 / p) Σ_{i=1}^{p} z_i.

Then, z̄ is still normally distributed, with z̄ ~ N(0, Σ/p). The T2 statistic can be constructed as

T2 = p z̄^T Σ̂^{-1} z̄.

Reference Solomons and Hotelling [17] reports that the distribution of the T2 statistic satisfies

((m − n) / (n(m − 1))) T2 ~ F(n, m − n).

Therefore, if the significance level is α and

((m − n) / (n(m − 1))) T2 ≤ F_α(n, m − n),

the testing data Z and the training data R are considered to come from the same mode; otherwise, they are considered different. The error rate of this criterion is α.
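The T2 detection criterion above can be sketched as follows. The helper name `t2_detect` is hypothetical, and the F-distribution critical value is passed in as a parameter (it would normally come from statistical tables or a stats library) to keep the sketch self-contained.

```python
import numpy as np

def t2_detect(R, Z, f_crit):
    """Hotelling T^2 test: is the test window Z from the same mode as R?

    R      : (m, n) training residuals, assumed zero-mean normal
    Z      : (p, n) test window
    f_crit : critical value of F(n, m - n) at the chosen significance level
    Returns True when Z is flagged as a different mode (possible fault).
    """
    m, n = R.shape
    p = Z.shape[0]
    Sigma = np.cov(R, rowvar=False)       # unbiased covariance estimate
    z_bar = Z.mean(axis=0)                # sample mean of the test window
    t2 = p * z_bar @ np.linalg.solve(Sigma, z_bar)   # T^2 = p z_bar' Sigma^-1 z_bar
    stat = (m - n) / (n * (m - 1)) * t2              # ~ F(n, m - n) under the null
    return stat > f_crit
```

A test window with a shifted mean drives z̄ away from zero and inflates the statistic past the critical value, which is exactly the behaviour the criterion exploits.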

Optimal Kernel Density Estimation
Section 2 introduced the fault detection method based on T2 statistics, including the signal decomposition technology and the detection method itself. However, the fault detection method based on T2 statistics assumes that the data satisfy a normal distribution, while the actual observation data may not meet this hypothesis, which can prevent the discriminant performance of the T2 statistics from satisfying the design requirements. In addition, the statistics test the data only through the intrinsic part Ŷ and the covariance matrix Σ̂. These two attributes are not sufficient to describe all statistical characteristics of the system; when an incipient fault is submerged in data noise, it is easy to miss the detection. In this study, a KDE method for multidimensional data is constructed to describe the probability and statistical characteristics of the data more accurately.

Optimal Bandwidth Theorem
For observed data, the frequency histogram can directly show the statistical characteristics. However, in practical applications, the frequency histogram is a discrete statistical method, the number of histogram intervals is difficult to choose, and, more importantly, the discretization operation complicates subsequent data processing. To overcome these limitations, the KDE method is used. This method is a nonparametric estimation method that estimates the population probability density distribution directly from the sampling data.
For any point x ∈ R^n, assuming that the probability density of a certain mode is f(x), the kernel density estimate of f(x) is built from the sampling data R = [r_1, r_2, ..., r_m] in Section 2.1. As reported in Rao [18], the estimation formula is

f̂_K(x) = (1 / (m h_m^n)) Σ_{i=1}^{m} K((r_i − x) / h_m),      (16)

where m, n, K(·), and h_m denote the number of sampling data, the dimension of the sampling data, the kernel function, and the bandwidth, respectively. The kernel function K(·) satisfies ∫ K(x) dx = 1 together with positive definiteness and continuity; therefore, f̂_K(x) satisfies positive definiteness, continuity, and normality, and it is reasonable to use it as a density estimate. The Gaussian kernel function is a good choice, given by

K(x) = (2π)^{-n/2} exp(−‖x‖² / 2).

In this study, the performance of the kernel density estimator is characterized by the mean integrated square error (MISE).
Reference Rao [18] shows that the estimation result f̂_K(x) is not sensitive to the selection of the kernel function K(·); that is, the MISE of the estimation results obtained using different kernel functions is almost the same, which is reflected in the subsequent derivation process. In addition, the MISE depends on the selection of the bandwidth h_m. If h_m is too small, the density estimate f̂_K(x) shows an irregular shape because of the increased randomness; if h_m is too large, the density estimate f̂_K(x) is too averaged to show sufficient detail.
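A minimal sketch of the multivariate Gaussian-kernel estimator of Equation (16); the function name `kde` is illustrative.

```python
import numpy as np

def kde(x, R, h):
    """Multivariate Gaussian-kernel density estimate f_hat(x) of Equation (16).

    x : (n,) evaluation point
    R : (m, n) sample data
    h : bandwidth h_m
    """
    m, n = R.shape
    u = (R - x) / h                                     # (r_i - x) / h_m
    K = np.exp(-0.5 * np.sum(u * u, axis=1)) / (2.0 * np.pi) ** (n / 2.0)
    return K.sum() / (m * h ** n)
```

Evaluating this estimator on a grid and varying `h` makes the bandwidth trade-off described above directly visible: a tiny `h` produces a spiky curve, a large `h` an over-smoothed one.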
The optimal bandwidth formula is provided in the following theorem, and it is one of the key theoretical results of this study.

Theorem 1. For any n-dimensional probability density function f(·) and any kernel function K(·) with a symmetric form, if f̂_K(·) in Equation (16) is used to estimate f(·), then the MISE is minimized by

h_m = ( n³ c_K / ( m d_K² ∫ ( tr( ∂²f(x)/∂x∂x^T ) )² dx ) )^{1/(n+4)},      (20)

where c_K and d_K are two constant values given by

c_K = ∫ K(x)² dx,    d_K = ∫ x^T x K(x) dx.

Equation (20) is called the optimal bandwidth formula, and h_m denotes the optimal bandwidth.
A detailed proof of this theorem is given below.
Proof. The proof begins by decomposing the MISE of f̂_K(x) into a variance term and an integrated squared-bias term. Expanding f in a Taylor series around x, with θ representing a constant value between 0 and 1 in the remainder term, the expectation and variance of f̂_K(x) are obtained (Equations (22)–(27)).
To facilitate the subsequent reasoning, the following theorem is given.

Theorem 2.
For any matrix Φ, if K(·) is a kernel density function with a symmetric form, then

∫ x^T Φ x K(x) dx = (d_K / n) tr(Φ).

Similarly, it can be verified that a kernel function K(·) with a symmetric form satisfies ∫ x K(x) dx = 0. Thus, Theorem 2 is proved.
For any unit-length vector u ∈ R^n, a Taylor expansion yields the bias of f̂_K(x). If the bandwidth h_m satisfies h_m → 0 and m h_m^n → ∞ as m → ∞, then from Equations (22)–(32), the MISE can be approximated as

MISE ≈ c_K / (m h_m^n) + (h_m⁴ / 4) (d_K / n)² ∫ ( tr( ∂²f(x)/∂x∂x^T ) )² dx.      (35)

When MISE(f̂_K(·)) is smallest, the derivative of Equation (35) with respect to h_m is 0, which means

−n c_K / (m h_m^{n+1}) + h_m³ (d_K / n)² ∫ ( tr( ∂²f(x)/∂x∂x^T ) )² dx = 0.

Thus, the optimal bandwidth h_m in Theorem 1 is obtained as Equation (20).

Remark 2. When the number of samples m is determined, the appropriate bandwidth h_m can be selected using Equation (20) to construct the KDE, which can better fit the sample distribution.
In Equation (20), the influence of the kernel function on bandwidth selection enters only through c_K and d_K, which are almost the same under different kernel function selections and have only a slight effect on the final bandwidth selection.

Optimal Bandwidth Algorithm
The optimal bandwidth formula is given by Equation (20). However, f(x) is unknown in Equation (20), and therefore ∫ (tr(∂²f(x)/∂x∂x^T))² dx is also unknown. An approximate value of the bandwidth parameter h_m can be obtained by replacing f(x) with f̂_K(x) from Equation (16). Furthermore, an iterative algorithm can be used to calculate a more accurate bandwidth parameter, and Theorem 3 shows that the algorithm is convergent.

Theorem 3. For any n-dimensional probability density function f(·) and the Gaussian kernel function K(·), if f̂_K(·) in Equation (16) is used to estimate f(·), then an iterative calculation formula for h_m is obtained by substituting f̂_K(·), computed with the current bandwidth h_{m,k}, into Equation (20) to produce h_{m,k+1}, and the iteration is convergent, where h_{m,k} is the value of h_m during the k-th iteration.

Proof. For the Gaussian kernel function, c_K = (4π)^{-n/2}, and d_K = ∫ x^T x K(x) dx is the expectation of a χ² distribution with n degrees of freedom, i.e., d_K = n. Substituting these values into Equation (20), and substituting f̂_K(x) in Equation (16) for f(x), the iterative form for calculating h_m is obtained. To facilitate the subsequent reasoning, the following lemma is given.

Lemma 1. For any functions f_1, f_2, ..., f_n, the inequality

∫ ( Σ_{i=1}^{n} f_i(x) )² dx ≤ n Σ_{i=1}^{n} ∫ f_i(x)² dx

holds, with equality if and only if f_1(x) = f_2(x) = ... = f_n(x) almost everywhere.

Proof. In fact, for any functions f_1, f_2, ..., f_n, the Cauchy–Schwarz inequality gives

( Σ_{i=1}^{n} f_i(x) )² ≤ n Σ_{i=1}^{n} f_i(x)².

Integrating both sides of this pointwise inequality yields the lemma, and equality holds on the condition that f_1(x) = f_2(x) = ... = f_n(x) almost everywhere.

Applying Lemma 1 to the second derivatives of the Gaussian kernel terms bounds tr(∂²f̂_K(x)/∂x∂x^T). When h_{m,k} is sufficiently large, K((r_i − x)/h_{m,k}) is almost the same everywhere, so the equality in this bound holds; in that regime the iteration satisfies h_{m,k+1} < h_{m,k}, i.e., the sequence decreases. Because h_{m,k} has a lower bound, the algorithm converges.
In summary, the KDE method based on optimal bandwidth is given (see Algorithm 1), and the flowchart of the KDE method is shown in Figure 1.
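The outer loop of Algorithm 1 can be sketched as follows. Since the exact update expression of Theorem 3 depends on the kernel and the data, it is passed in here as a function; the Silverman rule-of-thumb initialisation and the function names are assumptions of this sketch, not taken from the paper.

```python
import numpy as np

def optimal_bandwidth(R, update, eps=1e-4, k_max=100):
    """Fixed-point iteration for the optimal bandwidth h_m (Algorithm 1 sketch).

    R      : (m, n) sample data
    update : callable mapping (h_k, R) -> h_{k+1}, standing in for the
             iterative formula of Theorem 3
    eps    : estimation accuracy (stopping tolerance)
    k_max  : maximum number of iterations
    """
    m, n = R.shape
    # Rule-of-thumb starting value (an assumption of this sketch)
    sigma = R.std(axis=0).mean()
    h = (4.0 / (n + 2)) ** (1.0 / (n + 4)) * m ** (-1.0 / (n + 4)) * sigma
    for _ in range(k_max):
        h_new = update(h, R)
        if abs(h_new - h) < eps:   # converged to within eps
            return h_new
        h = h_new
    return h
```

The `eps` and `k_max` parameters correspond to the estimation accuracy and maximum iteration count used in the experiments of Section 5.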

Fault Detection Method Based on JS Divergence Distribution
Section 3 constructed a multidimensional KDE method based on the optimal bandwidth, which can accurately describe the density distribution of multidimensional data. In this section, JS divergence is used to measure the distribution difference, which highlights the differences in the statistical characteristics of data from different modes.

Mode Difference Index
In Section 3, the probability density estimation of multidimensional data is obtained using the kernel function method, and the optimal bandwidth formula is derived. When the system fails, the state of the system inevitably changes, the statistical characteristics of the system output also change, and thus the density distribution of the observed data changes significantly. For two groups of sample window data R and Z, the cross entropy H(R, Z) can be used to measure the difference between the distributions of R and Z:

H(R, Z) = − ∫ f̂_{K,R}(x) log f̂_{K,Z}(x) dx,

where f̂_{K,R} and f̂_{K,Z} represent the optimal KDEs of R and Z calculated using Equation (16). However, H(R, Z) does not satisfy the definition of a distance, because it does not necessarily satisfy positive definiteness and symmetry; that is, H(R, Z) < 0 or H(R, Z) ≠ H(Z, R) is possible.
• The smaller the difference between the distributions, the smaller H(R, Z) is (it may even be negative); in this sense, it is still reasonable to use H(R, Z) to measure the distribution difference between R and Z.
• However, a quantitative description of distribution difference must satisfy symmetry; otherwise, exchanging the positions of the two distributions changes the measured difference, which is difficult to accept.
The JS divergence JS(R, Z) was used as a measure of the distribution difference between R and Z in Zhang et al. [19] and Bruni et al. [20], as follows:

JS(R, Z) = (1/2) KL(f̂_{K,R} ‖ f̂_M) + (1/2) KL(f̂_{K,Z} ‖ f̂_M),   f̂_M = (f̂_{K,R} + f̂_{K,Z}) / 2,      (51)

where KL(f ‖ g) = ∫ f(x) log(f(x)/g(x)) dx denotes the Kullback–Leibler divergence. It is easy to get that

0 ≤ JS(R, Z) = JS(Z, R) ≤ log 2.      (52)

In this paper, the JS divergence of Equation (51) is used to measure the distribution difference between testing data Z and training data R for realizing fault detection and isolation.
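A minimal numerical sketch of the JS divergence between two estimated densities sampled on a uniform one-dimensional grid (the multidimensional case integrates over a grid of the corresponding dimension); the `eps` guard and the function name are illustrative.

```python
import numpy as np

def js_divergence(f_r, f_z, grid):
    """JS divergence between two densities sampled on a uniform 1-D grid:
    JS(R, Z) = 0.5*KL(f_R || M) + 0.5*KL(f_Z || M), with M = (f_R + f_Z)/2.

    f_r, f_z : density values on `grid`
    """
    dx = grid[1] - grid[0]
    eps = 1e-12                          # guards log(0) where both densities vanish
    m = 0.5 * (f_r + f_z)
    kl_rm = np.sum(f_r * np.log((f_r + eps) / (m + eps))) * dx
    kl_zm = np.sum(f_z * np.log((f_z + eps) / (m + eps))) * dx
    return 0.5 * (kl_rm + kl_zm)
```

Unlike the cross entropy, this quantity is symmetric, non-negative, and bounded by log 2, which is what makes it usable as a thresholded difference index.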

Mode Discrimination Method
If the training data have q patterns R_1, R_2, ..., R_q, the JS divergence set {JS(Z, R_1), JS(Z, R_2), ..., JS(Z, R_q)} between the testing data Z and the different modes can be calculated using Equation (51).
If i_0 is the mode label corresponding to the minimum JS divergence, i.e., i_0 = argmin_i JS(Z, R_i), it is reasonable to assume that the testing data Z and the training data R_{i_0} belong to the same mode. However, for a new failure mode that may be unknown in the application, this criterion still evaluates the testing data Z as the known failure mode of type i_0, which is obviously unreasonable.
If JS(Z, R_{i_0}) is too large, we believe that the testing data Z come from an unknown new failure mode, whose label is q + 1. However, how to obtain the threshold JS_high for JS(Z, R_{i_0}) is a problem that should be investigated. A method to determine JS_high is provided below.
For the training data R_{i_0} = [r_1, r_2, ..., r_m] of the i_0-th mode, the density estimate of the data set can be obtained using Equation (16).
In addition, if the length of the sampling window is fixed as p (p < m), new sampling data R^{(j)} = [r_j, r_{j+1}, ..., r_{j+p}] ⊂ R_{i_0}, j = 1, 2, ..., m − p, are obtained by sliding the sampling window. For each R^{(j)}, the density of the data set can be estimated using Equation (16). Using Equation (51), the divergence between the training data R_{i_0} and the sample data R^{(j)} can be obtained as JS_j = JS(R_{i_0}, R^{(j)}). In this way, a series of JS divergence values {JS_1, JS_2, ..., JS_{m−p}} is obtained. This set provides the estimation formula f̂_JS(x) of the density function f_JS(x) of the JS divergence, again via one-dimensional KDE. If the significance level is α, the threshold JS_high is determined by

∫_0^{JS_high} f̂_JS(x) dx = α.      (58)

Because the distribution of the JS divergence is not a common random distribution, the quantile cannot be obtained from a table; instead, it can only be obtained by numerical integration: if h is the step size, JS_high is the smallest multiple of h at which the cumulative numerical integral of f̂_JS reaches α. The following fault detection and isolation criterion is constructed from Equation (58).
If JS(Z, R_{i_0}) ≤ JS_high, the testing data Z and the training data R_{i_0} are considered to belong to the same mode; otherwise, the testing data Z are considered to originate from an unknown new failure mode, and their label is marked as q + 1.
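The sliding-window threshold construction can be sketched as follows: a one-dimensional Gaussian KDE of the collected JS values is integrated step by step until the cumulative mass reaches α, matching the numerical quantile of Equation (58). The rule-of-thumb bandwidth, the integration safety bound, and the function name are assumptions of this sketch.

```python
import numpy as np

def js_threshold(js_values, alpha=0.95, step=1e-3):
    """Detection threshold JS_high: KDE the sliding-window JS divergences
    and integrate numerically with the given step until the cumulative
    probability mass reaches alpha (the JS distribution has no closed
    form, so the quantile is found by integration)."""
    js = np.asarray(js_values, dtype=float)
    m = len(js)
    bw = 1.06 * js.std() * m ** (-0.2)        # rule-of-thumb bandwidth (assumption)
    x, cdf = 0.0, 0.0
    x_max = js.max() + 10.0 * bw              # safety bound for the integration
    while cdf < alpha and x < x_max:
        # one-dimensional Gaussian KDE of the JS values, evaluated at x
        dens = np.exp(-0.5 * ((js - x) / bw) ** 2).sum() / (m * bw * np.sqrt(2.0 * np.pi))
        cdf += dens * step
        x += step
    return x
```

Starting the integration at zero reflects the fact that the JS divergence is non-negative; the kernel mass smoothed below zero is simply not counted, which makes the threshold slightly conservative.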
In conclusion, the fault diagnosis method based on optimal bandwidth is provided (See Algorithm 2), and the corresponding fault diagnosis method flowchart is shown in Figure 2.

Algorithm 2: Fault Diagnosis Method Based on Optimal KDE
Input: Training data: R_1, R_2, ..., R_q; significance level: α; testing data: Z = [z_1, z_2, ..., z_l]. Output: Pattern classification labels for the testing data Z.
The equations above show that the calculation result of the JS divergence is directly related to the length of the sampling data. Indeed, with an increase in the sampling data length, the density estimate describes the distribution characteristics of the samples more effectively, thereby significantly improving the accuracy of fault detection.
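An end-to-end sketch of the classification rule of Algorithm 2 for one-dimensional data, assuming a fixed bandwidth and precomputed per-mode thresholds (which Algorithm 1 and Equation (58) would normally supply); all function names are illustrative.

```python
import numpy as np

def _kde_grid(data, grid, h):
    """One-dimensional Gaussian KDE of `data` evaluated on `grid`."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u * u).sum(axis=1) / (len(data) * h * np.sqrt(2.0 * np.pi))

def _js(p, q, dx):
    """JS divergence between two densities on a uniform grid."""
    eps = 1e-12
    m = 0.5 * (p + q)
    kl = lambda a: np.sum(a * np.log((a + eps) / (m + eps))) * dx
    return 0.5 * (kl(p) + kl(q))

def classify(modes, z, grid, h, js_high):
    """Label the test window z with the mode of minimum JS divergence, or
    with the unknown label (index q for 0-based modes 0..q-1) when even
    that minimum divergence exceeds the corresponding threshold."""
    dx = grid[1] - grid[0]
    f_z = _kde_grid(np.asarray(z), grid, h)
    divs = [_js(_kde_grid(np.asarray(R), grid, h), f_z, dx) for R in modes]
    i0 = int(np.argmin(divs))
    if divs[i0] > js_high[i0]:
        return len(modes)        # unknown new failure mode
    return i0
```

The unknown-mode branch is what distinguishes this rule from a plain nearest-distribution classifier: a window far from every known mode is rejected rather than forced into the closest label.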

Numerical Simulation
The bearing data from the Case Western Reserve University Bearing Data Center were used as the diagnosis research object; they have been used as a benchmark in many fault diagnosis studies, such as Smith and Randall [21], Lou and Loparo [22], and Rai and Mohanty [23].
The sampling frequency of the motor data was 12 kHz, the default for the Case Western Reserve University Bearing Data Center. The data set contains four groups of sample data: normal data (f_0), 0.007 inch inner raceway fault data (f_1), 0.014 inch inner raceway fault data (f_2), and 0.014 inch outer raceway fault data (f_3). Each group of data has two dimensions: the acceleration data of the drive end (f_i-DE) and the acceleration data of the fan end (f_i-FE). All experiments were conducted on a Lenovo machine with a Ryzen 3700X 3.60 GHz CPU and 16 GB RAM.

Data Preprocessing
The observed data in the process of the bearing operation show obvious periodicity, which needs to be eliminated. Taking normal data f 0 as an example, the main frequency in the observed signal can be obtained by fast Fourier transform (FFT), and the result of the FFT is shown in Figure 3.
The estimation β̂ of β is calculated using Equation (7). Thus, the data after removing the intrinsic signal are shown in Figure 4, where Figure 4a represents the acceleration data of the drive end and Figure 4b represents the acceleration data of the fan end. In the subsequent fault detection process, the data of all modes are preprocessed in the same way, and the results are recorded as f_i.
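The FFT-based extraction of the dominant frequencies used to build the Fourier basis can be sketched as follows; the function name and the choice of `k` are illustrative.

```python
import numpy as np

def dominant_freqs(y, fs, k=3):
    """Return the k dominant frequencies of a signal via the FFT, as used
    to build the Fourier basis for removing the periodic intrinsic part.

    y  : (m,) observed signal
    fs : sampling frequency in Hz (12 kHz for the CWRU bearing data)
    """
    spec = np.abs(np.fft.rfft(y - y.mean()))      # magnitude spectrum, DC removed
    freqs = np.fft.rfftfreq(len(y), d=1.0 / fs)
    top = np.argsort(spec)[-k:]                   # indices of the k largest peaks
    return np.sort(freqs[top])
```

The frequencies returned here would then parameterise the sine/cosine basis whose least-squares fit gives β̂ and, after subtraction, the residual data f_i.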

Normal Data and Known Faults
For the normal data f_0 and the known faults f_1 and f_2, the first 20,480 sample points are selected as the training set, recorded as f_{i-train}. The last 81,920 sample points are taken as the testing set, recorded as f_{i-test}. A total of 128 sample points are used as the detection object in each test. The training set data are shown in Figure 5, where Figure 5a,b represent the data f_{i-train}, i = 1, 2 of the two dimensions, respectively. Figure 5 shows that the bearing data have high frequency, and the fault does not change the observed mean value; however, it changes the dispersion characteristics or the correlation of the data.

Unknown Fault
The training data do not necessarily contain all types of patterns, and the detection of unknown faults has always been a difficult problem. Here, f_3 is used as an unknown fault for fault detection; the training set contains no information about f_3. The unknown fault data are shown in Figure 6, where Figure 6a represents the acceleration data at the drive end and Figure 6b represents the acceleration data at the fan end. Figure 6 shows that the unknown fault data are close to the other two types of fault data. If the fault detection method is not sensitive, the detection rate will be reduced significantly.

Detection Effect
The characteristics of bearing data make bearing fault detection extremely challenging. The input of the training set is f_{0-train}, the estimation accuracy is ε = 10^-4, and the maximum number of iterations is k_max = 100. According to Algorithm 1, the optimal bandwidth is h_m = 0.0445.
The KDE of the training set is obtained by Equation (16), and the results are shown in Figure 7, where Figure 7a,c,e represent the two-dimensional frequency histograms of the training data f_{i-train}, i = 0, 1, 2, and Figure 7b,d,f represent the two-dimensional KDEs of the training data f_{i-train}, i = 0, 1, 2. Figure 7 further shows that the bearing fault changes the dispersion characteristics and data correlation. Meanwhile, Figure 7 shows that the KDE of the training data obtained by Equation (16) is in good agreement with the data distribution of the training data, and therefore, this method can truly describe the distribution of multidimensional data. When the significance level is α = 95%, the detection thresholds of the training set, calculated using Equation (58), are

f_0: JS_high = 0.1375;  f_1: JS_high = 0.0995;  f_2: JS_high = 0.1225.

Thus, the detection results of the JS divergence method on the testing data are shown in Figure 9. If a detection point falls within the threshold, the data set to be detected is in the same pattern; otherwise, the data belong to different patterns. Furthermore, the detection rates of the different methods are shown in Table 1. For the known faults, Table 1 indicates that the bearing fault identification based on multidimensional KDE and JS divergence achieves better results than the T2 statistics detection method on the testing data: the detection rate of normal data f_0 increases from 95.08% to 97.03%, that of fault data f_1 from 81.33% to 95.81%, and that of fault data f_2 from 70.69% to 95.36%. Meanwhile, compared with the cross-entropy method, the detection rate of normal data f_0 increases from 96.95% to 97.03%, that of fault data f_1 from 94.41% to 95.81%, and that of fault data f_2 from 94.19% to 95.36%.
For the unknown fault f_3, Table 1 shows that the T2 statistics detection method cannot detect the unknown fault. The method using cross entropy as a measure can detect the unknown fault only with a detection rate of 53.16%, which is not satisfactory. The JS divergence method constructed in this study can identify the unknown fault accurately, and the detection rate reaches 69.49%. This is because JS divergence measures the differences between distributions more accurately.

Influence of Window Width on Fault Diagnosis
The fault diagnosis effect is related to the width of the data window; therefore, the fault diagnosis effect under different window widths is investigated, and the results are shown in Figure 10. Figure 10 indicates that, as the detection window grows, the detection performance of the proposed method for known faults first rises and then stabilizes. This is because, when the length of the detection window increases to a certain extent, the data to be detected already contain sufficient information; if the detection window continues to increase, its contribution to the improvement of the fault detection rate is small. Meanwhile, for unknown faults, the detection rate increases rapidly with the length of the detection window, because the longer the detection window, the more information the data to be detected contain, and the better the difference between the unknown fault and the known faults is characterized.

Conclusions
In this study, a method of bearing fault detection and identification was constructed using multidimensional KDE and JS divergence. The distribution characteristics of JS divergence between the sample density distribution and population density distribution were derived using the sliding sampling window method. Thus, the threshold of fault detection was provided, and therefore, different faults, especially unknown faults, could be identified. The theory showed that the multidimensional KDE method could reduce information loss caused by processing each dimension; the JS divergence is more accurate than the traditional cross entropy to measure the difference in density distribution. The experimental results verified the above conclusions.
For a known fault, the detection effect of this method was obviously better than that of the traditional method, and it also showed a certain degree of improvement over the cross-entropy method. Moreover, for unknown faults, the traditional method could not detect the distribution difference accurately, while the detection effect of the proposed method was significantly improved.
Furthermore, the detection effect of this method depends on the window width. The detection effect improved with a growth in the detection window. In this paper, under the condition of a given window width, the estimation formula for the optimal bandwidth of a multidimensional KDE was provided. The experimental results showed that the formula was applicable to any mode of data, and therefore, it had a certain universality.
However, this study has certain limitations. Firstly, although the calculation formula of multidimensional KDE is given in this study, the computational complexity will increase when the dimension is large, which may restrict the further application of the method. Secondly, the calculation of JS divergence is time consuming, which is not conducive to rapid fault diagnosis.
In future research, we can try to use the PCA dimension reduction method to address the computational complexity caused by very high dimensions, and optimize the algorithm flow of the JS divergence to expedite the calculation. In the latest study, Ginzarly et al. [24] treat the prognosis of a vehicle's electrical machine using a hidden Markov model after modeling the electrical machine with the finite element method. Therefore, we will try to combine this method in future work and apply it to the fault detection of other systems.

Acknowledgments:
The authors express their appreciation to the Associate Editor and anonymous reviewers for their helpful suggestions.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

KDE    kernel density estimation
JS     Jensen–Shannon
PCA    principal component analysis
MISE   mean integrated square error