An Ensemble Deep Convolutional Neural Network Model with Improved D-S Evidence Fusion for Bearing Fault Diagnosis

Intelligent machine health monitoring and fault diagnosis are becoming increasingly important for modern manufacturing industries. Current fault diagnosis approaches mostly depend on expert-designed features for building prediction models. In this paper, we propose IDSCNN, a novel bearing fault diagnosis algorithm based on an ensemble of deep convolutional neural networks and an improved Dempster–Shafer (D-S) theory based evidence fusion. The convolutional neural networks take as inputs the root mean square (RMS) maps computed from the fast Fourier transform (FFT) features of the vibration signals from two sensors. The improved D-S evidence theory is implemented via a distance matrix of the evidences and a modified Gini index. Extensive evaluations on the Case Western Reserve dataset show that IDSCNN achieves better fault diagnosis performance than existing machine learning methods by fusing complementary or conflicting evidences from different models and sensors and by adapting to different load conditions.


Introduction
As one of the core components of rotating machinery, rolling element bearings constrain relative motion to the desired motion only and reduce friction between moving parts. In actual production, bearings are often expected to work 24 h per day. Any bearing failure may lead to unexpected consequences for the whole machine. As bearing failures bring downtime, expensive repairs and hidden costs to enterprises, real-time monitoring and precise fault diagnosis are critical to avoid catastrophic damage.
In the past decades, a variety of methods have been developed in bearing fault diagnosis, such as vibration analysis [1], acoustic analysis [2], noise analysis [3], thermal imaging analysis [4] and so on, among which the vibration analysis has proven to be the most efficient [5]. Many vibration signal processing tools have been used in signal preprocessing such as Fourier spectral analysis [6], wavelet analysis [7], empirical mode decomposition [8], and multi-wavelet transformation [9]. These vibration analysis methods have achieved good performance from non-adaptive analysis to adaptive analysis and from qualitative analysis to quantitative analysis [10]. However, good performance with these methods highly depends on the expert experience and knowledge.
In addition to the application of diverse signal preprocessing approaches, many artificial intelligence methods have been applied to feature extraction and classification, the two main steps in bearing fault diagnosis. Time domain statistical analysis, multi-wavelet transformation and the fast Fourier transform are commonly used for feature extraction. Feature selection methods such as principal component analysis (PCA) [11] and independent component analysis (ICA) [12] are employed to select the most useful features. Classifiers such as the Bayes classifier [13], k-nearest neighbor (KNN) [14], multi-layer perceptron neural networks (MLP) [15], support vector machines (SVM) [16], decision trees (DT) [17], and random forests (RF) [18] have all been applied to bearing fault diagnosis. Among them, MLP is known for its capability to learn features with complex and nonlinear patterns. It has also been reported that SVM is superior to the other AI algorithms in fault diagnosis due to its memory efficiency and good performance when the number of features is greater than the number of samples [19][20][21].
Recently, deep learning methods have achieved remarkable success in pattern recognition and machine learning application domains due to their outstanding capability to learn complex and robust representations. Many classification problems have been successfully solved by various deep learning models such as the deep belief network [22], deep Boltzmann machine [23], deep autoencoder [24], convolutional neural network (CNN) [25] and so on. Deep learning has also been applied to fault diagnosis recently. Guo [26] proposed a hierarchical adaptive deep convolution neural network (ADCNN) for bearing fault analysis, which first outputs the fault type and then analyzes the fault size. Chen et al. [27] proposed a 2D-CNN model for gearbox fault diagnosis by combining time domain statistical features with 256 statistical frequency domain root mean square (RMS) values. A 1D CNN model based on the raw time series data was proposed for motor fault detection by Ince et al. [28]. Tran et al. [29] developed a deep belief model for reciprocating compressor fault analysis.
Although deep learning methods have achieved great success in fault diagnosis, two potential issues remain. First, selecting appropriate network structures, parameters, and algorithms is critical to the success of deep learning approaches. Second, the signal data in all current deep learning methods for fault diagnosis come from a single sensor, which may not provide reliable information in harsh working environments. The adaptability and the anti-interference ability of different sensors vary widely, so prediction models based on a single source of information could lead to misdiagnosis. It is desirable to exploit multi-sensor information sources for more reliable fault diagnosis through an effective information fusion approach, which is complicated by differences in sensor locations and sensor quality.
In this paper, we propose a novel algorithm for bearing fault diagnosis that integrates deep neural networks trained on multiple sensor signals with a data fusion model based on an improved Dempster-Shafer (D-S) evidence theory [30]. Our fusion model can combine multiple uncertain evidences and give the fusion result by merging consensus information and excluding conflicting information. In recent years, D-S evidence theory and its variants have been widely used in multi-sensor data fusion, decision analysis, fault detection and other industrial fields. However, two defects in traditional D-S evidence theory remain, which we address in our proposed fusion model: one is how to evaluate the basic probability assignment of the evidence body objectively, and the other is how to resolve conflicting evidences coming from different sources.
The rest of this paper is structured as follows. Section 2 introduces our improved D-S evidence theory (IDS) through calculating the similarity among different evidences and assigning the weights to the original evidences by the modified Gini index. Section 3 presents the proposed IDSCNN fault diagnosis model with detailed description. We trained several CNN models with different structures and parameters, which are synthesized by our IDS method. Section 4 describes the experiment setup and presents the evaluation results on two sets of experiments that prove the superiority of the improved fusion model and the IDSCNN fault diagnosis algorithm. A conclusion is presented at the end of this paper.

Improved D-S Evidence Theory for Information Fusion
D-S Evidence Theory is a mathematical theory and general framework for reasoning with uncertainty, which allows one to combine multiple (usually conflicting) evidences from different sources and arrive at a degree of belief (represented by a mathematical object called belief function) that takes into account all the available evidence. D-S evidence theory has been shown [31,32] to achieve better performance in data fusion based classification compared to the traditional probability theory due to its capability to grasp the unknown and uncertainty. This method has been widely used in many fields in recent years such as machine health monitoring and fault diagnosis [33,34], engineering design [35], security defense [36], target recognition and tracking [36,37], decision-making [38], and information fusion [37,39]. A simple integration method [34] such as voting (used in LIBSVM (a library for Support Vector Machines) for multi-classification problem) has proven to have a relatively poor performance in bearing fault diagnosis compared to using the D-S evidence theory. Given all that, we chose the D-S evidence theory for model fusion in this paper.

Preliminaries
D-S evidence theory assumes a finite set of elements Θ = {A_1, A_2, ..., A_n}, which is called the frame of discernment. The symbol m is a measure on the subsets of Θ and is called a basic probability assignment function (BPAF). This function is subject to the following qualifications:

$$m(\emptyset) = 0, \qquad \sum_{A \subseteq \Theta} m(A) = 1 \tag{1}$$

D-S evidence theory provides a very useful synthesis formula, which can combine evidence from different evidence sources. For A ⊂ Θ, given finite basic probability assignment functions m_1, m_2, ..., m_n in the frame of discernment Θ, the synthesis formula is defined as follows:

$$m(A) = \frac{1}{1-k} \sum_{A_1 \cap A_2 \cap \cdots \cap A_n = A} m_1(A_1)\, m_2(A_2) \cdots m_n(A_n), \qquad A \neq \emptyset \tag{2}$$

$$k = \sum_{A_1 \cap A_2 \cap \cdots \cap A_n = \emptyset} m_1(A_1)\, m_2(A_2) \cdots m_n(A_n) \tag{3}$$

Here k represents the degree of conflict among the evidences, and the coefficient 1/(1 − k) is called the normalization factor, which ensures that the BPAs sum to one.
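As a minimal sketch of the synthesis formula for two evidence sources, the snippet below represents each BPA as a dictionary from focal elements (Python `frozenset`s) to masses. The function name `dempster_combine` and the numeric masses are illustrative assumptions, not values from this paper.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two BPAs ({frozenset: mass}) with Dempster's rule.

    Returns the fused BPA and the conflict factor k.
    """
    combined = {}
    conflict = 0.0  # k: total mass falling on empty intersections
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if abs(1.0 - conflict) < 1e-12:
        raise ValueError("complete conflict (k = 1): the rule is undefined")
    fused = {s: w / (1.0 - conflict) for s, w in combined.items()}
    return fused, conflict


# Hypothetical two-source example over the frame {A, B}
A, B = frozenset("A"), frozenset("B")
theta = A | B  # mass on theta expresses "unknown"
m1 = {A: 0.6, B: 0.1, theta: 0.3}
m2 = {A: 0.5, B: 0.3, theta: 0.2}
fused, k = dempster_combine(m1, m2)
```

For these assumed inputs the conflict factor is k = 0.23, and the fused masses again sum to one after division by 1 − k. More than two sources can be fused by applying the function repeatedly, since the rule is associative.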
The traditional D-S evidence theory presents a good method for evidence fusion. However, some limitations still exist, which can cause evidence fusion to fail under certain circumstances, such as complex practical environments in which different evidences are likely to conflict. Much research has been done to solve this evidence paradox issue. Table 1 shows four common paradoxes [40], namely the complete conflict paradox, the 0 trust paradox, the 1 trust paradox and the high conflict paradox, which cause fusion difficulty for traditional D-S theory. It should be noted that, in Table 1 and Table 4, propositions A, B, C, D, E and Θ are the elements of the frame of discernment, which contains all the categories and can be interpreted as the bearing fault types in our problem.
In the complete conflict paradox, applying Formula (3) to the two evidences m_1 and m_2 gives the conflict factor k = 1, which causes the denominator of Formula (2) to become zero. Although evidences m_1, m_3 and m_4 support the same proposition, the D-S combination rule cannot be used to synthesize the evidences under this circumstance. For the 0 trust paradox, the total conflict factor is k = 0.99 and the BPAs (basic probability assignments) are m(A) = 0, m(B) = 0.727, m(C) = 0.273 according to Formulas (2) and (3). Although evidences m_1, m_2, and m_4 support proposition A, m_3(A) = 0 totally negates this proposition. Under this circumstance, the BPA for proposition A will always be 0, no matter how many other evidences support A or how strongly they do so.
Table 1. BPAs for four common paradoxes.
To address these issues, we propose an improved D-S evidence theory based on a Euclidean distance matrix and a modified Gini index, as stated below.
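The 0 trust behaviour can be checked numerically. In the sketch below, every evidence assigns mass only to single propositions, so the only non-empty intersections are those where all sources name the same proposition. The BPA values are assumptions chosen to reproduce the figures quoted above (the contents of Table 1 are not available here), and `combine_singletons` is a hypothetical helper name.

```python
import numpy as np


def combine_singletons(bpas):
    """Dempster's rule when all mass sits on single propositions.

    bpas: array-like of shape (n_evidences, n_propositions).
    Returns the fused BPA vector and the conflict factor k.
    """
    bpas = np.asarray(bpas, dtype=float)
    agreement = bpas.prod(axis=0)  # mass where every evidence agrees
    k = 1.0 - agreement.sum()      # everything else is conflict
    if np.isclose(k, 1.0):
        raise ZeroDivisionError("complete conflict: k = 1")
    return agreement / (1.0 - k), k


# Assumed illustrative evidences; columns are propositions A, B, C
zero_trust = [
    [0.5, 0.2, 0.3],
    [0.5, 0.2, 0.3],
    [0.0, 0.9, 0.1],  # a single evidence gives A zero belief
    [0.5, 0.2, 0.3],
]
fused, k = combine_singletons(zero_trust)
```

With these assumed inputs, `fused` is approximately (0, 0.727, 0.273) and k is approximately 0.99: the single zero entry forces the product for A to zero regardless of the other three evidences. Passing two fully opposed evidences such as (1, 0, 0) and (0, 1, 0) raises the error, which corresponds to the complete conflict case.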

The Improved D-S Evidence Theory
In this section, we first calculate the similarity matrix through a Euclidean distance function. Then, we combine the similarity matrix with a modified Gini index to get the evidence credibility. We use the evidence credibility as weights to modify the original evidences and reduce the conflict among them.
Define ε_i as the creditable factor of the evidence E_i; all the creditable factors then form the creditable vector of the evidence set, written as ε = (ε_1, ε_2, ..., ε_n), ε_i ∈ (0, 1]. Define the BPA matrix B_{n×N}, whose rows are the evidences m_i (i = 1, 2, ..., n), where n stands for the number of evidences and N stands for the number of propositions in the frame of discernment Θ. Thus, B_ij = m_i(A_j) stands for the j-th BPA value of the i-th evidence. Define p_i = (m_i(A_1), m_i(A_2), ..., m_i(A_N)) as the i-th row of the BPA matrix B_{n×N}; then the norm ||p_i − p_j|| is the Euclidean distance between p_i and p_j, which measures the similarity between E_i and E_j.
Thus, we can get a distance matrix D of size n × n with elements

$$D = \left( d_{ij} \right)_{n \times n}, \qquad d_{ij} = \lVert p_i - p_j \rVert = \sqrt{\sum_{k=1}^{N} \left( m_i(A_k) - m_j(A_k) \right)^2} \tag{4}$$

In Equation (4), 0 ≤ m_i(A_j) ≤ 1 and Σ_{j=1}^{N} m_i(A_j) = 1, so the maximum value of d_ij is max d_ij = √2. We then define a regularized element from d_ij for use in the evidence credibility. The credibility factor ε_i reflects the degree to which evidence E_i deviates from the rest of the evidence set. This means that m_i is consistent with the other evidences when its credibility factor ε_i is relatively large and close to 1, while m_i is singular compared with the others when ε_i is relatively small and close to 0. Thus, ε_i should be a decreasing function of d_ij such that ε_i = Σ_{j=1}^{n} f(d_ij). Here, we employ the modified Gini index

$$\varepsilon_i = 4 \sum_{j=1,\, j \neq i}^{n} d_{ij} \left( 1 - d_{ij} \right)$$

as the decreasing function that satisfies our requirements. Within the argument range [0.5, 1], ε_i is a decreasing function: the credibility factor ε_i increases toward 1 when the distance d_ij between two evidences decreases, and otherwise decreases toward 0 (Figure 1).
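A short numerical sketch of the two ingredients above: the pairwise Euclidean distance matrix between evidences, and the shape of the function f(d) = 4d(1 − d) on [0.5, 1]. The BPA matrix is an assumed toy example, and the regularization that maps raw distances into the argument range is deliberately not reproduced, since its definition is not available in this text.

```python
import numpy as np


def distance_matrix(bpa_matrix):
    """Pairwise Euclidean distances between rows (evidences) of a BPA matrix."""
    b = np.asarray(bpa_matrix, dtype=float)
    diff = b[:, None, :] - b[None, :, :]  # shape (n, n, N)
    return np.sqrt((diff ** 2).sum(axis=-1))


def modified_gini(d):
    """f(d) = 4 d (1 - d): equals 1 at d = 0.5 and falls to 0 at d = 1."""
    d = np.asarray(d, dtype=float)
    return 4.0 * d * (1.0 - d)


# Assumed toy evidences over three propositions
bpa = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],  # fully opposed to the first row
    [0.5, 0.5, 0.0],
])
D = distance_matrix(bpa)

grid = np.linspace(0.5, 1.0, 6)
values = modified_gini(grid)
```

Here `D[0, 1]` equals √2, the largest distance two BPA vectors can have, and `values` decreases monotonically from 1 to 0 over the grid, which is the behaviour the credibility factor relies on.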
The regularized evidence credibility is denoted ε^r_i. The credibility factor ε_i reflects the degree of similarity between a given evidence and the others, and the creditable factor ε^r_i is its normalized result. When the similarity between a given evidence and the others increases, the creditable factor ε^r_i increases, and vice versa. After we get the creditable factor ε^r_i, we can correct the raw evidence.
Define m_i(A_j) as the BPA of the j-th proposition in the i-th raw evidence and m*_i(A_j) as the rectified BPA, obtained by weighting the raw evidence with its credibility. Because m_i(A_j) ∈ [0, 1] and ε^r_i ∈ [0, 1], the modified evidences satisfy Σ_{A⊂Θ} m*(A) ≤ 1, so we rescale the result. It should be noted that we introduce an end elimination mechanism here and set the minimum BPA to zero in the rescaling expression. This enhances the BPAs of the other propositions and gives higher confidence to the surviving propositions. In fault diagnosis, we intuitively prefer the diagnosis result with higher probability.
The above results are normalized so that they can be compared within the same range in the later validation. The decision rule for the improved D-S evidence theory compares the modified BPAs and selects the fault proposition with the maximum BPA as the fusion result.
In summary, there are five steps in our improved D-S evidence theory for evidence fusion:
(1) Calculate the distance matrix D and its elements d_ij among the raw evidences.
(2) Calculate the evidence credibility ε_i using the modified Gini index expression.

The IDSCNN Ensemble CNN Model for Bearing Fault Diagnosis
There are three steps in our IDSCNN diagnosis model: data preparation, model training, and model testing, as shown in Figure 2.

Data Preparation
Data acquisition and preprocessing are needed to train our CNN models, as described in Blocks 1-5 of Figure 2. The raw signals for our bearing fault experiments are accelerometer vibration signals from two sensors. For a given raw accelerometer signal, we use a sliding window of size 512 with shift step size 200 to scan the signal and generate the raw data samples. Thus, for any two consecutive samples, there will be an overlap of 300 data points.
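The segmentation step can be sketched as follows. The function name `sliding_windows` and the synthetic ramp signal are assumptions for illustration. The overlap between consecutive samples is the window size minus the step, so the default of 512 and 200 used below yields 312 shared points, while a 500-point window with the same step yields 300.

```python
import numpy as np


def sliding_windows(signal, window=512, step=200):
    """Cut a 1-D signal into overlapping segments of length `window`."""
    signal = np.asarray(signal)
    n_segments = (len(signal) - window) // step + 1
    if n_segments <= 0:
        return np.empty((0, window), dtype=signal.dtype)
    starts = np.arange(n_segments) * step
    return np.stack([signal[s:s + window] for s in starts])


# Synthetic stand-in for one accelerometer channel
signal = np.arange(2000)
samples = sliding_windows(signal)
```

For this 2000-point ramp the result has shape (8, 512), and the second segment starts at index 200 of the original signal.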
Feature extraction from raw sensor data is critical in machine monitoring. As shown in Figure 1, we use the root mean square (RMS) over a sub-band of the frequency spectrum as the feature for our CNN model, which has the advantage of maintaining the energy shape at the spectrum peaks [27,41]. Our sampling length N_s is set to 500, as suggested by Wade [42], who carried out a benchmark study on the CWRU data by trial and error and suggested N_s = 500 for the 12 k data. To avoid spectrum leakage, we multiply the selected time domain signal by a Hanning window and then obtain the FFT spectrum. We use the FFT size (number of Fourier coefficients) N_fft = 16,384 to get the frequency spectrum. Since the frequency spectrum is symmetrical, we only take the single-sided data for constructing the RMS maps that serve as the input data for training the CNN models. The sub-band length b for the RMS values changes according to the size of the RMS maps. Each sub-band should have the same length, except the last one if N_fft/2 cannot be divided by the RMS map size without remainder. For example, if the size of the RMS map is 32 × 32, b = 8192/(32 × 32) = 8, and if the size of the RMS map is 16 × 16, b = 8192/(16 × 16) = 32. For convenience of calculation, we take two RMS map sizes (16 × 16 and 32 × 32) as the sizes of our training data.
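The following sketch illustrates one way to build an RMS map from a single segment under the parameters quoted above. It assumes that the RMS is taken over the FFT magnitudes within each sub-band and that the single-sided length divides evenly by the map size; the function name `rms_map` and the random test segment are illustrative only.

```python
import numpy as np


def rms_map(segment, n_fft=16384, side=32):
    """Hanning window -> zero-padded FFT -> single-sided spectrum -> sub-band RMS."""
    segment = np.asarray(segment, dtype=float)
    windowed = segment * np.hanning(len(segment))
    # Zero-pad to n_fft coefficients and keep the single-sided half
    spectrum = np.abs(np.fft.fft(windowed, n=n_fft))[: n_fft // 2]
    b = (n_fft // 2) // (side * side)  # sub-band length
    bands = spectrum[: b * side * side].reshape(side * side, b)
    rms = np.sqrt((bands ** 2).mean(axis=1))
    return rms.reshape(side, side)


rng = np.random.default_rng(0)
segment = rng.standard_normal(500)  # stand-in for one N_s = 500 sample
map32 = rms_map(segment, side=32)   # b = 8
map16 = rms_map(segment, side=16)   # b = 32
```

Because both maps are derived from the same 8192-point spectrum, each cell of the 16 × 16 map summarizes four consecutive cells of the flattened 32 × 32 map, which is why the larger map carries finer frequency detail.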

The IDSCNN Model based on CNNs
As shown in Figure 2, our IDSCNN prediction model is composed of an ensemble of CNN classifiers trained with two sensor signals, whose outputs are fused using the improved D-S fusion algorithm. Convolutional neural networks are selected here due to their capability to learn hierarchical representations. The structure of our CNN models is shown in Figure 3. It is composed of three convolutional layers plus a fully connected layer. To explore the effect of parameters on the prediction performance, we have evaluated different parameter configurations, as shown in Table 2. There are two choices each for the input size (16 × 16 × 1 or 32 × 32 × 1), convolutional layer 2 ([4,4,10,16] or [4,4,10,20]), convolutional layer 3 ([4,4,16,12] or [4,4,16,16]) and the strides of the three convolutional layers ((1,2,1) or (1,2,2)). We implemented the CNN models using TensorFlow 1.2.0rc2. All experiments were conducted on a computer equipped with an NVIDIA 960M GPU. We use the Adam stochastic optimization algorithm for training due to its good performance, computational efficiency and low memory requirements.
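The configuration space can be enumerated directly. The sketch below assumes that the architectures are the full grid over the four two-way choices listed above, which gives 2 × 2 × 2 × 2 = 16 configurations, and that one model is trained per configuration and sensor. The dictionary keys and sensor labels are hypothetical names.

```python
from itertools import product

input_sizes = [(16, 16, 1), (32, 32, 1)]
conv2_filters = [(4, 4, 10, 16), (4, 4, 10, 20)]
conv3_filters = [(4, 4, 16, 12), (4, 4, 16, 16)]
stride_options = [(1, 2, 1), (1, 2, 2)]  # one stride per convolutional layer
sensors = ["drive_end", "fan_end"]

# Full grid over the four binary choices
architectures = [
    {"input_size": size, "conv2": c2, "conv3": c3, "strides": strides}
    for size, c2, c3, strides in product(
        input_sizes, conv2_filters, conv3_filters, stride_options
    )
]

# One model per (sensor, architecture) pair
models = [(sensor, arch) for sensor in sensors for arch in architectures]
```

Under this assumption `architectures` holds 16 entries and `models` holds 32, with eight architectures per input size and sensor.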

Model Testing
To investigate IDSCNN performance, we build the IDSCNN models in three different ways: (1) we combine all the diagnosis results from CNN models trained with signals from the drive-end sensor; (2) we combine all the diagnosis results from CNN models trained with signals from the fan-end sensor; and (3) we combine all the results from CNN models trained with signals from both sensors. The results of these combinations are discussed in the Results section.

Experiment Set-Up
To facilitate experimental verification and performance comparison with other related research, we evaluate our diagnosis models on the widely used bearing data from the Case Western Reserve University (CWRU) Bearing Data Center [42,43]. There are four bearing fault types included in the datasets (Figure 4): normal, inner race fault, outer race fault, and ball fault. Single point faults were introduced to the test bearings using electro-discharge machining (EDM). The fault datasets are further categorized by fault size (0.007 inch, 0.014 inch, 0.021 inch). The test stand is shown in the upper right corner of Figure 2. It consists of a motor (left), a torque transducer/encoder (center), a dynamometer (right), and control electronics (not shown). Vibration data were collected using two accelerometers. At both the drive end and the fan end of the motor housing, the accelerometers were attached to the housing with magnetic bases at the 6 o'clock position. All the normal baseline data were collected at 48,000 samples/second, while all faulty bearing data were collected at 12,000 samples/second. Vibration data were recorded for motor loads of 0 to 3 horsepower (motor speeds of 1797 to 1720 RPM).
Thus, we can get ten fault conditions for each load. The most common way of evaluating a deep learning model is to use the training data for modeling and to test the performance of the model on the test data.
In order to get an objective result, the test data should not appear in the training data; otherwise, the test result will be overly optimistic. Thus, we apply random uniform sampling to the original accelerometer dataset. As shown in Table 3, 10,000 training samples and 2500 testing samples (1000 training and 250 test samples for each fault condition) are picked for each load condition to generate Datasets A, B, and C, respectively.
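A minimal sketch of such a non-overlapping split is given below. It assumes that each fault condition provides a pool of candidate samples indexed from 0; the pool size of 2000 and the helper name `split_indices` are hypothetical.

```python
import numpy as np


def split_indices(n_available, n_train=1000, n_test=250, rng=None):
    """Draw disjoint train/test index sets uniformly at random."""
    rng = np.random.default_rng() if rng is None else rng
    if n_available < n_train + n_test:
        raise ValueError("not enough samples for a disjoint split")
    order = rng.permutation(n_available)
    return order[:n_train], order[n_train:n_train + n_test]


rng = np.random.default_rng(42)
n_conditions = 10  # ten fault conditions per load
pool_size = 2000   # hypothetical number of candidate samples per condition
splits = [split_indices(pool_size, rng=rng) for _ in range(n_conditions)]

n_train_total = sum(len(train) for train, _ in splits)
n_test_total = sum(len(test) for _, test in splits)
```

Drawing both index sets from a single permutation guarantees that no sample index appears in both the training and the test set, and ten conditions give the 10,000 and 2500 totals stated for each load.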

Results
In this section, we present two sets of experiment results. In experiment set 1, we evaluate how well our improved D-S (IDS) evidence fusion model addresses the paradoxes compared with several existing fusion methods. In experiment set 2, we evaluate the performance of IDSCNN for bearing fault diagnosis and compare its performance with other methods.

Evaluation of the Improved D-S Evidence (IDS) Fusion Algorithm
To compare the IDS method with existing evidence fusion methods, we take the four paradoxes described in Section 2.1 as examples. Evidences can be divided into consistent evidences and conflict evidences. The former support the same proposition, and the latter disagree with the other evidences. From Table 1, we can see that m_1, m_3, and m_4 in the complete conflict paradox; m_1, m_2, and m_4 in the 0 trust paradox; m_2, m_3, and m_4 in the 1 trust paradox; and m_1, m_3, m_4, and m_5 in the high conflict paradox are the consistent evidences in the four paradox groups, respectively. The remaining evidences are conflict evidences. Traditional D-S evidence theory has been shown to fail in dealing with these four common paradoxes. Here we take four modified D-S evidence fusion algorithms from Yager [44], Sun [45], Murphy [46] and Deng [47] for comprehensive analysis. The fusion results are presented in Table 4.
In Table 4, we can see that both Yager and Sun resolve the conflicts by allotting the conflict factor to the unknown proposition Θ, which, however, increases the uncertainty. Yager's method fails in all four conditions and cannot solve the paradoxes when the number of evidences is more than two. Under the four paradoxical circumstances, Yager and Sun fail to obtain reasonable results due to the high uncertainty assigned to the unknown proposition Θ. The synthesis results from Murphy, Deng and our IDS method show good performance and relatively rational results. Overall, our IDS fusion method achieves the best result for three of the four paradoxes and the second best result for the remaining one. In the following diagnosis experiments, we choose the proposition with the maximum BPA as the fusion result. Intuitively, the closer the maximum BPA is to 1, the more confidence it conveys. As shown in Table 4, our IDS method has the largest maximum BPA of 0.9284 in the complete conflict paradox, 0.7418 in the 0 trust paradox and 0.9406 in the 1 trust paradox, and is ranked second in the high conflict paradox with a BPA of 0.6210. In general, our proposed IDS achieves good performance in dealing with paradoxical evidences for fault diagnosis.

Evaluation of IDSCNN for Bearing Fault Diagnosis on the CWRU Dataset
As described before, the CWRU bearing datasets were acquired at different load levels. To check how the load level affects the vibration signals, we randomly extract one sample for each fault type (0, 1, ..., 9) and load condition. We then plot the 16 × 16 input data from Datasets A, B, and C. In Figure 5, we can see that the RMS maps for a given fault type under different loads share significant similarities (column-wise similarity) in the CWRU bearing dataset. The RMS maps, however, vary significantly from fault type to fault type. Furthermore, even under the same fault state with the same load, there is a big difference between the drive-end RMS map and the fan-end RMS map. This means that different sensors carry different information, so combining the information from different sensors should give more information for fault diagnosis. Since we fix the length of the frequency spectrum, the 16 × 16 and 32 × 32 RMS maps come from the same frequency spectrum, but the 32 × 32 RMS maps carry more information than the 16 × 16 RMS maps.
First, we conduct experiments to evaluate the performance of our individual CNN models. In an actual fault diagnosis scenario, the bearing load changes all the time. It is thus desirable that the fault prediction model can adapt to different load conditions. To test the load adaptability of our models, we trained the CNN models on training Datasets A, B and C, corresponding to three different load conditions. We then tested their performance on testing Datasets A, B and C under the three load conditions. There are nine combinations of training sets and testing sets, as shown in the top part of Table 5. For each model, we use the 10,000 samples for training and test it on the 2500 samples of the different conditions. We should note that A, B and C stand for the different datasets for training and testing in the subsequent sections.
As described in Table 2, we have developed 16 CNN architectures to evaluate. We build a model for each of these architectures and each sensor. Thus, we built 32 CNN models in total. To avoid random sampling errors, we repeated the same modeling process 20 times and took the average values as the final results, as shown in Table 5.
In Table 5, we can see that all CNN models have almost perfect performance on their own training condition, while their performance varies on the other testing datasets. Comparing the average accuracy (AVG) of the 32 CNN models leads to some interesting observations. First, model #2 is the best model, with an accuracy of 97.89%. However, its adaptability from training set C to testing set A is 89.81%, which is lower than that of models #4, #9, #11 and #13. On the other hand, model #27 might be the worst model since its AVG is only 89.11%, but its local adaptability from training set B to testing set A (91.93%) is better than that of almost all CNN models with 32 × 32 input at the fan end. This tells us that even the best selected model may perform poorly under certain circumstances, and even the worst model may perform relatively well under some conditions.
We also compared the performance of models trained with drive-end and fan-end signals with 16 × 16 input size (Table 5). The average accuracy of the eight CNN models trained on drive-end signals is 96.54%, compared with 90.97% for the models trained on fan-end signals. This means that the signal from the drive end is more useful than that from the fan end for fault diagnosis. This may be caused by several factors, such as sensor quality, sensor location and environmental effects.
Next, we evaluate how our improved D-S fusion algorithm helps improve the fault diagnosis performance of our CNN models. Figure 6 shows the experimental results for the individual CNNs and the IDSCNN models with different fusion sources. In Figure 6a, we trained eight CNN models with the parameter settings in Table 2 for each load condition (Datasets A, B and C) with an input size of 16 × 16 and tested them on all three test sets A, B and C. We then measured the minimum, maximum and average fault prediction performance, along with the performance of the IDS fusion model, which takes the outputs of the eight models and uses the improved D-S fusion algorithm to make predictions. The last row of Figure 6a shows that the fusion model ids-de-16 achieves an average performance higher than the maximum performance of the individual CNNs. This is also true for the ids-de-32 model in Figure 6c. For the fusion models trained with the fan-end sensor, which has lower diagnosis quality, the fusion model achieves better performance than the average performance of the 16 models. Their performance is also higher than that of the individual CNN models for most test scenarios.
In Table 5 and subsequent contents, A, B and C stand for three different datasets. The first letter stands for the training dataset and the second letter stands for the testing dataset. Example: A→B means we trained our models on Dataset A and tested them on Dataset B.
For further analysis, we applied our IDS fusion method to all 16 CNN models at the drive end, all 16 CNN models at the fan end, and all 32 CNN models at both ends (ids-de-all, ids-fe-all and ids-all, respectively). From Figure 6e, we make the following observations. First, the fusion models of both the fan end and the drive end improve the accuracy by about 2%, a significant improvement for this challenging problem. Second, after combining all 32 CNN models through our IDS method, the final diagnosis accuracy based on two sensors reaches 98.92%, which is 9.81 percentage points higher than the worst single-sensor CNN model in Table 5.
To further validate the robustness of the proposed method, we selected the three best IDSCNN models (ids-all), one for each load condition. We then evaluated their performance on 20 randomly sampled test datasets for each load condition and show the results as box plots in Figure 7. Our first observation is that these IDSCNN models all achieved 100% accuracy on the test datasets of the same load condition, which is rare for the individual CNN models, as shown in Table 5. For test datasets generated from different load conditions, our IDSCNN models achieved an average accuracy higher than 97% with small performance variation for A→B, A→C, B→A, B→C and C→B. The exception is C→A: the best IDSCNN model trained with signals from load condition C reached an average accuracy of only 93% with large variation, which is much lower than in the other cases.
We compare our best DSCNN models (CNN models fused with the traditional D-S method) and our best IDSCNN models with five other bearing fault diagnosis models in Figure 8. The DSCNN method, which combines the CNN models with traditional D-S evidence theory, has higher accuracy than FFT-SVM, FFT-MLP, FFT-DNN [48], WDCNN [49] and WDCNN (AdaBN) [49] on all conditions except C→A, where its accuracy is lower than that of the WDCNN (AdaBN) model. As shown in the last row of Figure 8, our IDSCNN models achieved the best diagnosis results under all six test scenarios when compared with the first five models. In particular, the accuracy on C→A improved from 88.3% for WDCNN (AdaBN) and 86.0% for DSCNN to 93.8%. This can be attributed to the larger number of paradoxical evidences under the C→A diagnosis condition, which the traditional D-S fusion method cannot handle while our improved D-S fusion algorithm can. The diagnosis evidences under the other test conditions may have relatively higher consistency, so the diagnosis accuracies of DSCNN and IDSCNN are very similar.
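Why highly conflicting evidence is a problem for traditional D-S fusion can be illustrated with classical Dempster's rule of combination. The sketch below makes several assumptions: each source's output is treated as a mass function over singleton fault types only, the fault type names and mass values are made up for illustration, and the function name `dempster_combine` is hypothetical. It shows the classical rule, not the improved fusion algorithm proposed in this paper.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions defined on singleton hypotheses only.

    Returns (fused_masses, conflict), where conflict is the mass K assigned
    to contradictory pairs and the fused masses are normalized by 1 - K.
    """
    hypotheses = set(m1) | set(m2)
    # With singleton hypotheses, only identical hypotheses intersect.
    agreement = {h: m1.get(h, 0.0) * m2.get(h, 0.0) for h in hypotheses}
    total_agreement = sum(agreement.values())  # equals 1 - K
    if total_agreement == 0.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    fused = {h: v / total_agreement for h, v in agreement.items()}
    return fused, 1.0 - total_agreement


# Two sources that strongly disagree (illustrative values only).
source_1 = {"inner_race": 0.99, "ball": 0.01, "outer_race": 0.00}
source_2 = {"inner_race": 0.00, "ball": 0.01, "outer_race": 0.99}

fused, conflict = dempster_combine(source_1, source_2)
```

With these inputs almost all of the product mass is in conflict, and the fused result places all belief on `ball`, the hypothesis that both sources considered unlikely. This counterintuitive behavior is the kind of paradox that motivates rectifying the evidence before fusion.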
Figure 7. Accuracy of the best IDSCNN models on 20 repeated experiments.
To figure out how our improved D-S fusion algorithm improves the prediction performance, we compared the confusion matrices of a drive-end CNN model, a fan-end CNN model and the IDSCNN fusion model, as shown in Figure 9. The vertical axis of Figure 9 represents the true labels, while the horizontal axis represents the predicted labels. The values in the matrices are the numbers of predicted samples for each fault type. In Figure 9a, CNN model #7, trained on Dataset C with the drive-end signal and tested on Dataset A, performs well except for a large number of misclassifications of Type 3, Type 4 and Type 5 samples. Its total accuracy is 83.8%. On the other hand, Figure 9b shows that CNN model #25, trained on Dataset C with the fan-end signal and tested on Dataset A, performs well except for a significant number of misclassifications of Type 1 and Type 2 samples. Overall, it has an accuracy of 79.2%. These two models have complementary fault classification capabilities, which are exploited by the IDSCNN model. Figure 9c shows that the fusion model achieved the highest performance among the three: the total diagnosis accuracy improved from 83.8% for the drive-end CNN model and 79.2% for the fan-end CNN model to 92.4% after information fusion.
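The way a total accuracy and per-type error patterns are read off a confusion matrix can be sketched as follows. The matrix below is a hypothetical 3 × 3 example with rows as true labels and columns as predicted labels; it is not taken from Figure 9, and the function names are illustrative.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal (true label == predicted label)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_recall(cm):
    """For each true class, the fraction of its samples predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


# Hypothetical confusion matrix: rows = true labels, columns = predictions.
cm = [
    [90, 10, 0],
    [5, 80, 15],
    [0, 20, 80],
]

accuracy = overall_accuracy(cm)   # 250 correct out of 300 samples
recall = per_class_recall(cm)     # one value per true class
```

Low entries in `recall` point to the fault types a model tends to misclassify, which is how the complementary weaknesses of two models can be spotted.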
Figure 9. Confusion matrices for the C→A test: (a) #7 CNN model trained with drive-end signal (see Table 5); (b) #25 CNN model trained with fan-end signal (see Table 5); (c) IDSCNN fusion model.
To validate that the performance difference between DSCNN and IDSCNN is statistically significant, we repeated the C→A test 20 times. The DSCNN and IDSCNN models were each trained with samples from load condition C (Dataset C), and each test dataset was randomly sampled under load condition A (Dataset A). We calculated five statistics (max, min, median, mean and standard deviation) from the 20 results of the DSCNN and IDSCNN models. In Figure 10, the best performance of DSCNN is 86.8%, while the worst performance of IDSCNN is 93.0%. The average classification accuracy of the DSCNN models is 86.0 ± 0.4%, while the IDSCNN models classify 93.8 ± 0.4% of the testing data correctly. Both methods have the same standard deviation of 0.4%, and it is apparent in Figure 10 that the classification accuracy of the proposed IDSCNN model is higher than that of the DSCNN model on the C→A test, with an average improvement of 7.8 percentage points.
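The five statistics used in this comparison can be computed as in the sketch below. The accuracy lists are hypothetical placeholders, not the 20 measured results, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, median, stdev


def summarize(accuracies):
    """Return max, min, median, mean and sample standard deviation."""
    return {
        "max": max(accuracies),
        "min": min(accuracies),
        "median": median(accuracies),
        "mean": mean(accuracies),
        "std": stdev(accuracies),  # assumption: sample (n - 1) estimator
    }


# Hypothetical accuracies (%) from repeated runs of two methods.
method_a = [80.0, 82.0, 81.0, 79.0, 83.0]
method_b = [90.0, 91.0, 92.0, 89.0, 93.0]

summary_a = summarize(method_a)
summary_b = summarize(method_b)

# True when the worst run of method B still beats the best run of method A.
ranges_disjoint = summary_b["min"] > summary_a["max"]
mean_gain = summary_b["mean"] - summary_a["mean"]
```

Non-overlapping ranges, as in the comparison of the best DSCNN run with the worst IDSCNN run, are an informal indication of a consistent difference; a formal significance test would be a separate step.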

Conclusions
This paper presents IDSCNN, an ensemble convolutional neural network model with improved D-S evidence fusion for bearing fault diagnosis. We proposed an improved fusion algorithm based on traditional D-S evidence theory that rectifies the raw evidence through evidence credibility, which helps to address the paradoxical evidence problem of the traditional theory. Our extensive experiments on the Case Western Reserve University (CWRU) bearing datasets showed that the traditional D-S fusion method fails to combine the evidence from two sensors located at the fan end and drive end of the test rig because of evidence conflicts. In contrast, our IDSCNN model can deal with four common types of paradoxical evidence and achieved higher diagnosis accuracy by fusing signals from two sensors when compared with SVM, MLP, DNN, WDCNN, WDCNN (AdaBN) and DSCNN models. Our IDSCNN model has shown good adaptability on the CWRU bearing fault datasets under different load conditions, which makes it suitable for high-performance bearing fault diagnosis under varying load conditions with multi-sensor signals.