Bearing Fault Diagnosis Based on a Hybrid Classifier Ensemble Approach and the Improved Dempster-Shafer Theory

Bearing fault diagnosis of a rotating machine plays an important role in reliable operation. A novel intelligent fault diagnosis method for roller bearings has been developed, based on a proposed hybrid classifier ensemble (HCE) approach and an improved Dempster-Shafer theory (DST). The improved DST considers the combination of unreliable evidence sources, the uncertainty information of the basic probability assignment, and the relative credibility of the evidence in the decision-making weights under the framework of fuzzy preference relations (FPR); it can effectively deal with conflicts between pieces of evidence and thus improve the diagnostic accuracy of the hybrid classifier ensemble. The effectiveness of the improved DST has been verified via a numerical example. In addition, deep neural network (DNN), support vector machine (SVM), and extreme learning machine (ELM) techniques have been utilized in the single-stage classification, based on the singular spectrum entropy, power spectrum entropy, time-frequency entropy, and wavelet packet energy spectrum entropy. The performance of the proposed hybrid ensemble classifier has been demonstrated on a bearing test rig and compared with the original Dempster-Shafer theory. The overall error rate can be greatly reduced with the hybrid ensemble classifier and the improved Dempster-Shafer theory.


Introduction
Rolling element bearings are key components widely used in rotating machines. An unexpected failure of a rolling element bearing may cause a sudden breakdown of the mechanical system or even a severe catastrophe. Therefore, many bearing fault diagnosis methods have been developed based on vibration signal analysis and feature extraction [1][2][3]. However, some of them are performed manually with low efficiency by means of the knowledge and experience of experts, which is not practical in real applications. Thus, there is still growing attention towards the development of intelligent bearing fault diagnosis techniques. For example, a novel intelligent fault diagnosis method has been proposed based on the affinity propagation clustering algorithm and an adaptive feature selection technique [4]. Qin et al. [5] proposed a model for fault diagnosis of gearboxes in wind turbines based on deep belief networks (DBNs), using improved logistic sigmoid units and the impulsive signatures. In addition, a three-stage intelligent fault diagnosis clustering technique has been proposed for industrial process monitoring [6].

This work is organized as follows. The theories of entropy feature extraction and the single-stage classifiers are briefly reviewed in Section 2. The improved Dempster-Shafer theory (DST) for dealing with conflicting evidence is given in Section 3, where the performance of the proposed approach is also demonstrated using two examples. The hybrid classifier ensemble (HCE) approach combined with the improved DST is adopted to identify bearing faults automatically, and its effectiveness is demonstrated on a test rig in Section 4. Conclusions are drawn in Section 5.

Methodologies
The techniques of entropy feature extraction and the classifiers used in HCE are briefly introduced in this section.

Entropy Feature Extraction
Feature extraction is crucial in pattern recognition and mechanical fault diagnosis. However, traditional signal processing methods, such as the Fourier transform, are not suitable for analyzing non-linear and non-stationary bearing vibration signals; time-frequency analysis techniques are much more suitable for extracting bearing fault features. Several advanced time-frequency signal processing techniques have been adopted for feature extraction. For example, variational mode decomposition (VMD) [27] is a recently proposed self-adaptive decomposition method with a solid theoretical foundation [28].
Moreover, traditional statistical properties and frequency-domain signatures cannot meet the requirements because of the non-linear and non-stationary characteristics of the decomposed components [29]. Many non-linear parameter estimation methods have proven able to extract the feature information, such as the entropy theory introduced in reference [30] to estimate the complexity and stationarity of a signal. Entropy features can also be applied to quantify malfunctions and reflect the uncertainty of vibration signals. In addition, different entropy features obtained in different domains can jointly give a full description of a vibration signal. Thus, singular spectrum entropy (SSE) [31], power spectrum entropy (PSE) [32], time-frequency entropy (TFE) [33], and wavelet packet energy spectrum entropy (WPESE) [34] are used to calculate the feature sets in this work; they are associated with the singular spectrum in the time domain, the power spectrum in the frequency domain, the time-frequency spectrum, and the wavelet packet energy spectrum in the time-frequency domain, respectively. These four entropy features are introduced below.

Singular Spectrum Entropy
SSE indicates the uncertainty degree of the signal energy distribution obtained by singular spectrum analysis, which can effectively represent the signal energy change in the time domain [31]. Based on the delay embedding technique, an arbitrary signal {x_i} (i = 1, 2, ..., N) is mapped to an embedded space represented by the M × N matrix U, where M is the length of the embedded space and N is the number of samples [31]. The singular values {λ_i} of the matrix U are obtained by singular value decomposition (SVD). The SSE of the signal is then defined via information entropy theory as

H_SSE = −∑_{i=1}^{M} p_i ln p_i,  p_i = λ_i / ∑_{j=1}^{M} λ_j,

where p_i is the ratio of the i-th singular value to the whole singular spectrum.
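The SSE computation above can be sketched as follows, assuming the singular values of the trajectory matrix have already been obtained from an SVD routine (the function name is illustrative, not from the paper):

```python
import math

def singular_spectrum_entropy(singular_values):
    """Shannon entropy of the normalized singular spectrum:
    p_i = lambda_i / sum(lambda), H = -sum p_i * ln(p_i)."""
    total = sum(singular_values)
    ps = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in ps)

# A flat spectrum (energy spread evenly over components) maximizes entropy;
# a spectrum dominated by one singular value yields low entropy.
flat = singular_spectrum_entropy([1.0, 1.0, 1.0, 1.0])    # equals ln(4)
peaked = singular_spectrum_entropy([10.0, 0.1, 0.1, 0.1])
```

A periodic, stable signal concentrates its energy in a few singular components (low SSE), while a complex or noisy signal spreads it out (high SSE), which is why SSE serves as a time-domain complexity feature.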

Power Spectrum Entropy
PSE reflects the complexity and stability of a signal and indicates the distribution of signal energy in the frequency domain [32]. The proportional distribution of energy over different frequencies is treated as a probability distribution. Let X(ω_i) be obtained by the discrete Fourier transform of a signal {x_t}. As explained in reference [32], the power spectrum is calculated as

S_i = (1/N) |X(ω_i)|²,  i = 1, 2, ..., N,

where S = {S_1, S_2, ..., S_N} can be regarded as a partition of the signal {x_t}. Hence the PSE is defined as

H_PSE = −∑_{i=1}^{N} q_i ln q_i,  q_i = S_i / ∑_{j=1}^{N} S_j,

where q_i is the ratio of the i-th power spectrum component to the whole spectrum.
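A minimal, self-contained PSE sketch (using a naive O(N²) DFT for illustration rather than an FFT library; function names are illustrative):

```python
import cmath
import math

def power_spectrum(signal):
    """Naive DFT power spectrum S_k = |X(w_k)|^2 / N (O(N^2), illustrative)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        xk = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                 for t, x in enumerate(signal))
        spectrum.append(abs(xk) ** 2 / n)
    return spectrum

def power_spectrum_entropy(signal):
    """Shannon entropy of the normalized power spectrum."""
    s = power_spectrum(signal)
    total = sum(s)
    qs = [v / total for v in s if v > 0]
    return -sum(q * math.log(q) for q in qs)

# A pure tone concentrates power in just two symmetric DFT bins,
# so its PSE is close to ln(2); broadband signals score much higher.
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
pse_tone = power_spectrum_entropy(tone)
```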

Time-Frequency Entropy
TFE is used to quantitatively measure a time-frequency representation [33]. Divide the time-frequency plane into L equal blocks, and let the information (energy) of the entire plane be η and that of each block be γ_i (i = 1, 2, ..., L). As explained in reference [33], the time-frequency entropy is calculated as

H_TFE = −∑_{i=1}^{L} δ_i ln δ_i,  δ_i = γ_i / η,

where δ_i is the ratio of the i-th block energy to the whole energy.
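The block-partition idea can be sketched as follows, assuming the time-frequency plane is available as a matrix of energy values (the block layout and names are illustrative):

```python
import math

def time_frequency_entropy(tf_plane, blocks):
    """Entropy of the energy distribution over L equal blocks of a
    time-frequency plane, given as a list of rows of energy values."""
    total = sum(sum(row) for row in tf_plane)          # eta: whole-plane energy
    deltas = []
    for r0, r1, c0, c1 in blocks:                      # block: row/col ranges
        gamma = sum(tf_plane[r][c]
                    for r in range(r0, r1) for c in range(c0, c1))
        if gamma > 0:
            deltas.append(gamma / total)
    return -sum(d * math.log(d) for d in deltas)

# Four equal quadrants of a uniform 4x4 plane: delta_i = 1/4, entropy = ln(4).
plane = [[1.0] * 4 for _ in range(4)]
blocks = [(0, 2, 0, 2), (0, 2, 2, 4), (2, 4, 0, 2), (2, 4, 2, 4)]
tfe_uniform = time_frequency_entropy(plane, blocks)
```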

Wavelet Packet Energy Spectrum Entropy
A sequence J_k^j (k = 0, 1, 2, ..., 2^j − 1) represents the decomposition result of a j-layer wavelet packet transform (WPT). The sum of squares of the reconstructed signal in each frequency band after the WPT is taken as the wavelet packet energy. As explained in reference [34], the energy of the i-th band is

E_i = ∑_k |W_i(k)|²,

where W_i(k) denotes the reconstructed coefficients of each node. Thus, WPESE can be defined as

H_WPESE = −∑_i P_i ln P_i,  P_i = E_i / ∑_j E_j.
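As a self-contained illustration of band energies and WPESE, the sketch below uses the Haar wavelet (the paper's experiments use Db10; Haar is chosen here only so the filter bank fits in a few lines, and all names are illustrative):

```python
import math

def haar_step(x):
    """One Haar analysis step: (approximation, detail) half-band outputs."""
    s = math.sqrt(2.0)
    a = [(x[2 * i] + x[2 * i + 1]) / s for i in range(len(x) // 2)]
    d = [(x[2 * i] - x[2 * i + 1]) / s for i in range(len(x) // 2)]
    return a, d

def wavelet_packet_bands(x, levels):
    """Full wavelet packet tree: 2^levels frequency bands at the last layer."""
    bands = [x]
    for _ in range(levels):
        nxt = []
        for band in bands:
            a, d = haar_step(band)
            nxt.extend([a, d])
        bands = nxt
    return bands

def wpese(x, levels=3):
    """Entropy of the normalized band-energy distribution."""
    energies = [sum(c * c for c in band)
                for band in wavelet_packet_bands(x, levels)]
    total = sum(energies)
    ps = [e / total for e in energies if e > 0]
    return -sum(p * math.log(p) for p in ps)

# A constant signal keeps all its energy in the lowest band: entropy 0.
wpese_const = wpese([1.0] * 16, levels=3)
```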

Classification Models
The differences between the classifiers in HCE should be increased to enhance the complementarity between the classification methods, so that the diagnostic object can be described comprehensively. Three supervised classification models are selected: the deep neural network (DNN), the support vector machine (SVM), and the extreme learning machine (ELM).

DNN is one of the most widely used intelligent methods in pattern recognition, fault diagnosis, and classification. It is a deep learning technique comprising unsupervised layer-by-layer greedy training followed by global parameter tuning using the back-propagation algorithm. DNN can not only solve complex non-linear problems but also extract features in a high-dimensional space. Many DNN models have been developed: for example, a DNN-based model was used to identify the fault condition of roller bearings [35], and a deep Boltzmann machine combined with a multi-grained scanning forest ensemble was developed for the fault diagnosis of industrial big data [7]. Thus, DNN is adopted as a single-stage classifier in HCE in this work.
SVM is a well-known shallow learning method for classification and regression. It has good generalization capability for small-sample classification [36] and has been widely used in fault diagnosis and prognostics. To improve its performance, particle swarm optimization (PSO) is adopted to optimize the SVM parameters.
ELM is a single-hidden-layer feedforward neural network [37,38]. Its input weights are set randomly, the network is then expressed as a linear system, and the output weights are calculated analytically via the generalized inverse of a matrix [38]; the weights between the hidden layer and the output layer therefore need no iterative adjustment. The performance of ELM depends on the randomly assigned input weights and thresholds, so in this work the fruit fly optimization algorithm (FOA) is used to improve the traditional ELM. Both SVM and ELM are utilized in HCE in this work.
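The ELM training described above (random input weights, analytic output weights) can be sketched as follows. This is a minimal NumPy sketch on a toy regression target, assuming a tanh activation; it omits the FOA tuning step, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_elm(X, y, hidden=40):
    """Basic ELM: random input weights/biases, output weights solved
    analytically with the Moore-Penrose pseudoinverse."""
    w = rng.normal(size=(X.shape[1], hidden))   # random input weights (fixed)
    b = rng.normal(size=hidden)                 # random biases (fixed)
    H = np.tanh(X @ w + b)                      # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ y                # analytic output weights
    return w, b, beta

def predict_elm(model, X):
    w, b, beta = model
    return np.tanh(X @ w + b) @ beta

# Toy regression: learn y = x1 + x2 on random points in [-1, 1]^2.
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] + X[:, 1]
model = train_elm(X, y)
max_err = np.max(np.abs(predict_elm(model, X) - y))
```

Because only `beta` is fitted (a single least-squares solve), training is extremely fast, which is why the randomly drawn `w` and `b` dominate the model's quality and motivate tuning them with an optimizer such as FOA.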

Dempster-Shafer Theory
DST is one of the most powerful tools for ensembles of multiple classifiers, as it can deal with incomplete, uncertain, and unclear information in multi-sensor information fusion [39]. DST was initially developed by Shafer in 1976. Assume Θ = {D_1, D_2, ..., D_n} is a set of mutually exclusive and collectively exhaustive events, called the frame of discernment (FOD). A basic probability assignment (BPA) is a map m from 2^Θ to [0, 1] satisfying [40]

m(∅) = 0 and ∑_{A⊆Θ} m(A) = 1.

Based on the belief function theory, two independent BPAs can be combined by Dempster's rule, denoted as m = m_1 ⊕ m_2, which is defined as

m(A) = (1 / (1 − K)) ∑_{B∩C=A} m_1(B) m_2(C) for A ≠ ∅, and m(∅) = 0,
where K = ∑_{B∩C=∅} m_1(B) m_2(C). The conflict coefficient K measures the conflict between two pieces of evidence: the larger K is, the greater the conflict. It should be noted that conflict may exist between the pieces of evidence fused in HCE. To address this issue, a new weighted average approach is proposed, which considers not only the support degree between the pieces of evidence but also the uncertainty information of the BPA. This improved version of DST is given in the following subsection.
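Dempster's rule can be sketched directly from the definition, representing each BPA as a dictionary mapping focal elements (frozensets) to masses (a minimal sketch; names are illustrative):

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination for two BPAs given as
    {frozenset: mass} dicts over the same frame of discernment."""
    combined, K = {}, 0.0
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            combined[A] = combined.get(A, 0.0) + mB * mC
        else:
            K += mB * mC                       # conflict mass
    if K >= 1.0:
        raise ValueError("totally conflicting evidence: K = 1")
    return {A: v / (1.0 - K) for A, v in combined.items()}

A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
m1 = {A: 0.6, B: 0.3, frozenset("ABC"): 0.1}
m2 = {A: 0.7, C: 0.2, frozenset("ABC"): 0.1}
m = dempster_combine(m1, m2)   # here K = 0.39, so masses are rescaled by 1/0.61
```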

The Improved Dempster-Shafer Theory Approach
It is crucial to detect the relatively reliable evidence in the process of information fusion. In multiple classifier systems, the conflict caused by the classifiers' results cannot be ignored. Thus, an improved DST approach is developed in this work and introduced in detail below. First, since cosine similarity reflects the confidence degree of the evidence itself, it is employed to indicate the support degree between the pieces of evidence. Second, since DST can be considered a generalized probability theory, entropy can be used to measure the quantitative uncertainty of a BPA; therefore, entropy based on FPR is applied to indicate the relative reliability preference between the bodies of evidence (BOE). Considering these two aspects, the improved DST is more reasonable in dealing with conflicts than the original DST. The proposed technique comprises three parts, as shown in Figure 1: the measurement of the degree of support between evidence using the cosine similarity, the calculation of the weight of the BPA, and the improved fusion of the BPAs.


The Cosine Similarity
The cosine similarity is used to measure the confidence degree of evidence [41]. Let Θ = {θ_1, θ_2, ..., θ_n} be a frame of discernment. Employing the cosine similarity function, as explained in reference [41], the similarity degree between two pieces of evidence m_i and m_j is

sim(m_i, m_j) = (m_i · m_j) / (‖m_i‖ ‖m_j‖),

where m_i · m_j is the inner product of m_i and m_j, and ‖·‖ denotes the vector norm. For an n-source fusion system, the similarity measure matrix is SMM = (sim(m_i, m_j))_{n×n}. The support degree of the evidence m_i can then be defined as

Sup(m_i) = ∑_{j=1, j≠i}^{n} sim(m_i, m_j).

Thus, the credibility degree of the evidence m_i is

crd_i = Sup(m_i) / ∑_{j=1}^{n} Sup(m_j).
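The similarity, support, and credibility steps can be sketched together as follows, treating each BPA as a vector over a shared ordering of focal elements (a minimal sketch; names are illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two mass vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def credibility_degrees(bpas):
    """Support of each BPA = sum of its similarities to the others;
    credibility = support normalized to sum to one."""
    focal_order = sorted({f for m in bpas for f in m}, key=sorted)
    vecs = [[m.get(f, 0.0) for f in focal_order] for m in bpas]
    sup = [sum(cosine_similarity(vecs[i], vecs[j])
               for j in range(len(vecs)) if j != i)
           for i in range(len(vecs))]
    total = sum(sup)
    return [s / total for s in sup]

A, B = frozenset("A"), frozenset("B")
# Two agreeing sources and one conflicting source:
bpas = [{A: 0.9, B: 0.1}, {A: 0.8, B: 0.2}, {A: 0.1, B: 0.9}]
crd = credibility_degrees(bpas)
```

The conflicting third source is dissimilar to the other two, so it receives the lowest support and hence the lowest credibility degree.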

The Uncertainty Measurement of the Weights
Deng entropy [42] is used to measure the quantitative uncertainty of a BPA in this work. Assume m(·) is a mass function defined on the frame of discernment. As explained in reference [42], the Deng entropy E_d(m) of the BPA is

E_d(m) = −∑_A m(A) log_2 ( m(A) / (2^{|A|} − 1) ),

where A is a focal element of m and |A| is the cardinality of A.
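A minimal sketch of Deng entropy (the function name is illustrative):

```python
import math

def deng_entropy(m):
    """Deng entropy E_d(m) = -sum_A m(A) * log2( m(A) / (2^|A| - 1) )
    for a BPA given as a {frozenset: mass} dict."""
    return -sum(v * math.log2(v / (2 ** len(A) - 1))
                for A, v in m.items() if v > 0)

A, B = frozenset("A"), frozenset("B")
# For singleton-only focal elements, Deng entropy reduces to Shannon entropy:
h_crisp = deng_entropy({A: 0.5, B: 0.5})      # equals 1 bit
# Mass on a multi-element focal set carries extra uncertainty:
h_vague = deng_entropy({frozenset("AB"): 1.0})  # equals log2(3) > 1
```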
The FPR analysis based on the Deng entropy is adopted to denote the relative reliability preference between bodies of evidence. Fuzzy sets have been widely used in various applications and play an important role in the decision-making process [43]. The concepts of FPR and the additive consistency of FPR are introduced briefly.
The fuzzy preference matrix is constructed from the variance of the entropy. If the system has more than two pieces of evidence, the variance of the entropy is calculated as explained in reference [25], where Var_i denotes the variance associated with the i-th piece of evidence. The off-diagonal elements ρ_ij and ρ_ji of the fuzzy preference matrix can then be computed from these variances.
Let P be a fuzzy preference matrix for the set of alternatives M = {M_1, M_2, ..., M_n}. As explained in reference [43], P = (ρ_ij)_{n×n}, where ρ_ij denotes the degree of preference of alternative M_i over alternative M_j. As explained in reference [44], P is called an additive consistent FPR if it satisfies the additive consistency property

ρ_ij = ρ_ik + ρ_kj − 0.5 for all 1 ≤ i, j, k ≤ n.

Based on the complete fuzzy preference relation P, a consistency matrix P̄ satisfying the additive consistency can be constructed as explained in reference [26]. The boundary constant ξ and the consistency degree ς are then calculated as explained in reference [26], where χ_i is the average of the preference values of alternative M_i, ε is the maximum of all χ_i, and µ is the minimum of all χ_i. The boundary constant ξ keeps the preference values in the consistency matrix P̄ between zero and one, and ς represents the consistency degree between P and P̄. The larger the value of ς, the more consistent the fuzzy preference relation; if ς is close to one, the information of the fuzzy preference relation is highly consistent. As explained in reference [26], the modified consistency matrix P̃ is obtained with the modified constant κ = ξ × ς, κ ∈ [0, 1]. Finally, the ranking value R_i of the alternative M_i in the set M is calculated such that ∑_{i=1}^{n} R_i = 1, where 1 ≤ i ≤ n.

The Improved Fusion Algorithm
With the credibility degree crd_i and the ranking value R_i of the alternative BPAs, the support degree of the BPA is denoted as PSup_i. Based on the weight PSup_i, the weighted average of the evidence (WAE) is

m̄(A) = ∑_{i=1}^{k} w_i m_i(A),  w_i = PSup_i / ∑_{j=1}^{k} PSup_j.

The modified mass function obtained by Equation (26) is then fused with Dempster's rule of combination n − 1 times when there are n pieces of evidence.
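The weighted-average-then-fuse scheme can be sketched end to end as follows, given the weights PSup_i already computed (a self-contained sketch with its own copy of Dempster's rule; names are illustrative):

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule for BPAs given as {frozenset: mass} dicts."""
    combined, K = {}, 0.0
    for (B, mB), (C, mC) in product(m1.items(), m2.items()):
        A = B & C
        if A:
            combined[A] = combined.get(A, 0.0) + mB * mC
        else:
            K += mB * mC
    return {A: v / (1.0 - K) for A, v in combined.items()}

def fuse_weighted_average(bpas, weights):
    """Weighted-average evidence (WAE), then Dempster self-combination
    n-1 times for n pieces of evidence."""
    total = sum(weights)
    w = [x / total for x in weights]            # normalized weights
    focal = {f for m in bpas for f in m}
    wae = {f: sum(w[i] * m.get(f, 0.0) for i, m in enumerate(bpas))
           for f in focal}
    result = wae
    for _ in range(len(bpas) - 1):              # n-1 combinations
        result = dempster_combine(result, wae)
    return result

A, B = frozenset("A"), frozenset("B")
bpas = [{A: 0.9, B: 0.1}, {A: 0.8, B: 0.2}, {A: 0.1, B: 0.9}]
# Down-weighting the conflicting third source lets A dominate the fusion:
fused = fuse_weighted_average(bpas, [0.45, 0.45, 0.10])
```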

Numerical Verification
A numerical example obtained from reference [21] is used to verify the effectiveness of the improved method in dealing with conflicting evidence. Suppose the recognition target is A, based on the data of five different types of sensors given in Table 1, with the FOD Θ = {A, B, C}. The results using different combination rules are shown in Table 2.

Table 2. Fusion results of different combination rules.

Method               | m(A)   | m(B)   | m(C)   | m(A, C)
Murphy [20]          | 0.9620 | 0.0210 | 0.0138 | 0.0032
Deng et al. [21]     | 0.9820 | 0.0039 | 0.0107 | 0.0034
Zhang et al. [22]    | 0.9820 | 0.0034 | 0.0115 | 0.0032
The proposed method  | 0.9886 | 0.0004 | 0.0091 | 0.0032

As can be seen in Table 2, although most of the evidence supports target A, a wrong decision is still reached with Dempster's method. When the number of pieces of evidence is not adequate, the performance of Murphy's method is not satisfactory. The simple averaging and other weighted averaging rules can provide reasonable results, but the proposed method is better at dealing with conflicting evidence.

An Example of Fault Diagnosis Application
Another example, given in reference [45], is utilized to further demonstrate the effectiveness of the improved DST in fault diagnosis. The BPAs of the sensor data are directly adopted from reference [46]. Suppose the frame of discernment is F, comprising three types of fault in a motor rotor, denoted as F_1 = {Rotor unbalance}, F_2 = {Rotor misalignment}, and F_3 = {Pedestal looseness}. Three vibration accelerometer sensors are installed at different positions to collect the vibration signals, denoted by S = {S_1, S_2, S_3}. The vibration signal components at the 1×, 2×, and 3× frequencies (× denotes the rotor rotating frequency) are considered as the fault features, as shown in Table 3 (the obtained BPAs).

The modified mass function can also be calculated with the proposed method. The weighted average of the evidence shown in Table 4 is obtained by Equation (26). It can be seen that the probability of F_2 is the largest, so F_2 can be preliminarily judged as the fault type. The modified mass function is then fused with Dempster's rule of combination. The fusion results given in reference [46] were obtained by Equation (10), applying Dempster's rule 2 times, and are also shown in Tables 5-7. The Target column represents the fault type identified by the fusion diagnosis.

The improved DST is used to solve the fusion issue in the fault diagnosis described above. According to the results shown in Tables 5-7, the conflict between the sensor reports is resolved with the proposed method, which successfully detects the fault type F_2, consistent with the results given in reference [46]. Thus, both methods can handle the conflicting pieces of evidence and identify the fault type F_2 well. Moreover, as shown in Figures 2-4, the belief degrees assigned to the target F_2 at the 1×, 2×, and 3× frequencies using the proposed method are 0.9277, 0.9858, and 0.6321, respectively, all higher than those of the method in reference [46].

Experimental Analysis
The effectiveness of the improved Dempster-Shafer (D-S) evidence theory in dealing with conflicting evidence has been verified in the previous section. The proposed HCE framework in roller bearing fault diagnosis and the robustness of improved DST in information fusion will be illustrated in this section. The present technique is then applied for the rolling bearing fault diagnosis experiments on the Machinery Fault Simulator Magnum (MFS-MG) test-rig. The flowchart of the fault diagnosis using the proposed procedure is shown as Figure 5.


The Experimental Set-Up
As shown in Figure 6, the vibration data set was acquired on the MFS-MG test rig, with a defective bearing of type ER-12K installed on the left side of the shaft. Accelerometer sensors were installed vertically and horizontally on the bearing seats. The sampling frequency was set to 25,600 Hz, and the rotating frequency of the motor was 29.87 Hz (about 1792 rpm). Four fault types, ball (B), cage (C), inner race (IR), and outer race (OR), as well as a normal (N) condition, were used in the experiments. Each segment of the collected original vibration signal had 10,240 data points. The original vibration data and their frequency spectra are shown in Figure 7.

Entropy Feature Sets
Four entropy features were extracted from the vibration signals. The original vibration signal was decomposed with the VMD method, and the decomposed intrinsic mode functions (IMFs) were obtained. The key parameters used in VMD should be selected based on empirical values; interested readers can refer to reference [47]. Assume IMF_i = {x_1, x_2, ..., x_K}, where K is the number of data points of IMF_i. The SSE, PSE, and TFE of each IMF_i were extracted using Equations (2), (5), and (6), respectively. Moreover, the WPESE of each original segment was obtained using Equation (8); here, a 3-level decomposition was used in the WPT with the mother wavelet Db10. Since there were 112 samples for each experimental condition, the feature matrix had 560 rows and 4 columns. Figure 8 shows the entropy feature sets. The data sets were divided into two parts: 75% of each class was randomly selected as training data, and the remaining 25% was used as testing data, giving a 420 × 5 training matrix and a 140 × 5 testing matrix. The desired classes were labeled 1, 2, 3, 4, and 5; for example, outputs 1 and 3 correspond to the first and third classes, respectively. In this way, the three supervised classifiers could be used to identify the bearing faults.

Classification Using Single-Stage Classifier
DNN, SVM, and ELM were separately adopted in the single-stage classification based on the entropy signatures obtained above. A range of hidden-layer sizes was tested to find an optimal DNN structure, and the number of hidden-layer neurons yielding the highest classification accuracy was selected; the optimum DNN structure was then constructed with this number. Figure 9 shows the classification accuracies of the DNN for different numbers of hidden-layer neurons using the mini-batch gradient descent (MBGD) algorithm. It can be seen in Figure 10 that the determined optimal number of hidden-layer neurons is 13. In the SVM technique, the Gaussian radial basis function (RBF) was selected as the kernel function, and particle swarm optimization (PSO) was used to determine the optimized SVM parameters: the acceleration constants c_1 = 1.5 and c_2 = 1.7, the inertia weight ψ = 1, the population size pop = 20, and the maximum number of iterations maxgen = 100. In addition, the parameters of the FOA used in the ELM, namely the population size and the maximum number of iterations, were set to 20 and 100, respectively, while the initial positions were set randomly.
The aim of classification was to assign an input pattern to one of the 5 classes concerned in the present study and represented by the classification labels. The classification results of the testing data set obtained by preliminary diagnosis are shown in Figures 10-12. The performances of DNN, ELM, and SVM are illustrated in Tables 8-10, respectively. The meaning of Y-axis in Figures 10a, 11a, and 12a represents five bearing conditions, denoted by four fault types B, C, IR, OR as well as a normal condition (N).  Figure 10a shows the desired output and the output of the trained DNN. Figure 10b shows the absolute error of the DNN output with respect to the desired output, where a sample is misclassified when the absolute error is large. As can be seen from Table 8, the average classification accuracy of DNN is 88.57%. Figure 11a illustrates the desired output and the output of the trained ELM, while Figure 11b shows the absolute error of the ELM output with respect to the desired output. As can be seen from Table 9, the average classification accuracy of the testing data set using the ELM approach is about 80.81%. Similarly, Figure 12a shows the desired output and the output of the trained SVM, and Figure 12b shows the absolute error of the SVM output with respect to the desired output. As can be seen from Table 10, the average classification accuracy of the testing data set using the SVM approach is only 77.14%. It can be found that the classification rates separately using these three techniques were not good enough. Among them, DNN achieved the best classification results based on the deep learning technique as well as its optimal structures, compared with SVM and the ELM. The accuracy using single-stage classifier was still not good enough. Therefore, the data fusion method is necessary to be employed to increase the classification accuracy. 
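The per-sample absolute errors and accuracies reported above come from comparing each classifier's numeric output with the desired class label. A minimal sketch, using made-up numbers rather than the paper's data:

```python
# Sketch: absolute error and accuracy from a classifier's numeric outputs.
import numpy as np

desired = np.array([1, 2, 3, 4, 5, 1, 2])                 # target class labels
output = np.array([1.1, 2.0, 2.6, 4.0, 5.2, 1.0, 4.1])    # classifier outputs

abs_err = np.abs(output - desired)        # curves like those in Figures 10b-12b
predicted = np.rint(output).astype(int)   # round to the nearest class label
accuracy = np.mean(predicted == desired)  # fraction of correctly labeled samples

print(accuracy)
```

Here the last sample (output 4.1 against label 2) has a large absolute error and is counted as misclassified, giving an accuracy of 6/7.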

Results Using the HCE Algorithm and the Improved DST
Since the classification results were obtained separately by single classifiers, they can be further syncretized. In this work, the fusion of the primary classification results was carried out using the improved DST method. First, three types of evidence were introduced: E1, E2, and E3 were the classification results of the supervised classifiers DNN, ELM, and SVM, respectively. Both the original Dempster's rule and the proposed method were used to achieve the fusion results. In fact, counter-intuitive results are often obtained when Dempster's rule of combination is utilized in some cases, especially when the BOEs to be combined are highly conflicting.
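The counter-intuitive behaviour of the standard rule under high conflict can be seen in a short implementation of Dempster's rule of combination (this is the classic rule, not the paper's improved version), applied to Zadeh's well-known conflicting example:

```python
# Sketch of Dempster's rule of combination for two BPAs, where each BPA is a
# dict mapping a focal element (frozenset of hypotheses) to its mass.
from itertools import product

def dempster_combine(m1, m2):
    k = 0.0      # total conflict mass
    fused = {}
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + ma * mb
        else:
            k += ma * mb
    # normalize by (1 - k): the source of the counter-intuitive behaviour
    return {s: v / (1.0 - k) for s, v in fused.items()}, k

A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
m1 = {A: 0.99, B: 0.01}   # evidence 1 strongly supports A
m2 = {C: 0.99, B: 0.01}   # evidence 2 strongly supports C
fused, conflict = dempster_combine(m1, m2)
# conflict is 0.9999, yet the entire fused mass is assigned to B
```

Even though both sources give B only 1% support, the fused BPA assigns B a mass of 1, which motivates weighting or pre-processing the evidence before combination.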
In order to improve the diagnostic accuracy, the original DST and the proposed DST were both used to fuse the preliminary diagnoses of the HCE. The results of the different methods are given in Table 11. In the fusion stage, each testing sample corresponded to a probabilistic output, which formed a body of evidence. The X-axis in Figures 13-15 represents the 140 bodies of evidence, while the Y-axis represents the fusion results of the evidence using the different methods. The fusion result of the HCE with the proposed DST is shown in Figure 13, while the fusion result using the HCE and the original DST is shown in Figure 14. A sample is misclassified when its fusion result is smaller than or equal to 0.5. It can be seen in Figures 13 and 14 that the classification accuracy using the proposed HCE with the improved DST is the highest, about 97.86%. The accuracy using the original DST is about 92.86%, which is also better than those using a single-stage classifier. Figure 15 illustrates the results using the technique given in reference [25]; this result is better than that achieved using the original DST, but still worse than that of the proposed method. This demonstrates that the proposed HCE approach combined with the improved DST can be used reliably and automatically for roller bearing fault detection, and that the fault detection accuracy can be significantly improved by applying the HCE approach.
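The general idea behind credibility-weighted fusion can be illustrated with a simple sketch: weight each classifier's BPA by its average cosine similarity to the others, average the BPAs with those weights, and then combine the averaged BPA by Dempster's rule. This shows only the concept; the paper's improved DST additionally uses entropy-based uncertainty and fuzzy preference relations, which are not reproduced here:

```python
# Hedged sketch of credibility-weighted evidence fusion over singleton BPAs.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def weighted_fuse(bpas):
    """bpas: (n_evidence, n_classes) array of singleton BPAs (rows sum to 1)."""
    n = len(bpas)
    # credibility of each piece of evidence = mean similarity to the others
    w = np.array([np.mean([cosine(bpas[i], bpas[j])
                           for j in range(n) if j != i]) for i in range(n)])
    w /= w.sum()
    avg = w @ bpas                # credibility-weighted average BPA
    # Dempster combination of the averaged BPA with itself (singletons only)
    prod = avg * avg
    return prod / prod.sum()

# three classifiers' probabilistic outputs over five classes (illustrative)
E = np.array([[0.70, 0.10, 0.10, 0.05, 0.05],
              [0.60, 0.15, 0.10, 0.10, 0.05],
              [0.10, 0.65, 0.10, 0.10, 0.05]])
fused = weighted_fuse(E)
```

Because the third classifier disagrees with the other two, it receives a lower credibility weight, and the fused result follows the majority rather than being dominated by the conflict.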

Conclusions
It is crucial to identify the relatively reliable evidence among the collected multi-source evidence in the process of information fusion. The HCE approach combined with the improved DST has been proposed for the fault diagnosis of roller bearings. The effects of the support degree among the pieces of evidence, the uncertainty information of the BPA, and the relative credibility of the evidence on the weights are all considered in this improved DST, which can effectively deal with conflicts between pieces of evidence and thereby improve the diagnostic accuracy. The cosine similarity is employed to indicate the confidence degree between the pieces of evidence, entropy features are used to measure the quantitative uncertainty of the BPA, and an entropy-based FPR is employed to indicate the relative reliability preference between BOEs. Thus, the improved DST is much more reasonable in dealing with conflicts than the original DST. The effectiveness of the improved Dempster-Shafer theory has been verified via two examples.
In addition, SSE, PSE, TFE, and WPESE features have been utilized in the single-stage classification with DNN, SVM, and ELM in this work. The performance of the proposed HCE approach combined with the improved DST has been demonstrated on a bearing test-rig and compared with the original DST. It can be found that the overall error rate of the HCE approach can be greatly reduced using the improved DST, so that the accuracy of rolling element bearing diagnosis is successfully raised. Since there is usually not enough (complete) fault data for a rotating machine in practice, it is often difficult to make decisions with small-sample and incomplete data. The proposed technique will be further investigated under these cases in the future.
Author Contributions: Conceptualization and methodology, Y.W. and F.L.; data analysis and validation, F.L.; writing-review and editing and funding acquisition, Y.W. and A.Z.