In this section, we present two sets of experiment results. In experiment set 1, we evaluate how well our improved D-S (IDS) evidence fusion model addresses the paradoxes compared with several existing fusion methods. In experiment set 2, we evaluate the performance of IDSCNN for bearing fault diagnosis and compare its performance with other methods.

#### 3.1. Evaluation of the Improved D-S Evidence (IDS) Fusion Algorithm

To compare the IDS method with existing evidence fusion methods, we take the four paradoxes described in Section 3.1 as examples. Evidence can be divided into consistent evidence and conflicting evidence: the former supports the same proposition, whereas the latter disagrees with the other pieces of evidence. From Table 1, we can see that $m_{1}$, $m_{3}$, and $m_{4}$ in the complete conflict paradox; $m_{1}$, $m_{2}$, and $m_{4}$ in the 0 trust paradox; $m_{2}$, $m_{3}$, and $m_{4}$ in the 1 trust paradox; and $m_{1}$, $m_{3}$, $m_{4}$, and $m_{5}$ in the high conflict paradox are the consistent pieces of evidence in their respective paradox groups. The remaining pieces of evidence are conflicting. Traditional D-S evidence theory has been shown to fail on these four common paradoxes. Here we take four modified D-S evidence fusion algorithms, from Yager [44], Sun [45], Murphy [46] and Deng [47], for a comprehensive analysis. The fusion results are presented in Table 4.
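To make the failure of the traditional rule concrete, the following minimal sketch implements the classical Dempster combination rule and applies it to two toy cases. The BPA values are hypothetical illustrations, not the entries of Table 1, and the dictionary representation (a `frozenset` of hypotheses mapped to a mass) is an assumption made for this example only.

```python
from functools import reduce
from itertools import product


def dempster_combine(m1, m2):
    """Fuse two BPAs (dict: frozenset of hypotheses -> mass) with Dempster's rule."""
    fused = {}
    conflict = 0.0  # conflict factor K: mass falling on empty intersections
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            fused[common] = fused.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if 1.0 - conflict < 1e-12:
        raise ValueError("complete conflict (K = 1): Dempster's rule is undefined")
    return {h: w / (1.0 - conflict) for h, w in fused.items()}


A, B, C = frozenset("A"), frozenset("B"), frozenset("C")

# Hypothetical 0 trust case: m2 assigns zero belief to A, m1 and m3 favour A.
zero_trust = [
    {A: 0.5, B: 0.2, C: 0.3},
    {A: 0.0, B: 0.9, C: 0.1},
    {A: 0.55, B: 0.1, C: 0.35},
]
fused_zero_trust = reduce(dempster_combine, zero_trust)
print(fused_zero_trust[A])  # A stays at 0.0 however strongly m1 and m3 support it

# Hypothetical complete conflict case: the two sources share no hypothesis.
try:
    dempster_combine({A: 1.0}, {B: 1.0})
except ValueError as err:
    print(err)
```

In the first case a single dissenting source vetoes proposition A permanently; in the second the normalisation factor vanishes and the rule cannot produce a result at all.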

In Table 4, we can see that both Yager and Sun resolve the conflicts by allotting the conflict factor to the unknown proposition $\mathsf{\Theta}$, which, however, increases the uncertainty. Yager’s method fails in all four conditions and cannot resolve the paradoxes when there are more than two pieces of evidence. Under the four paradoxical circumstances, neither Yager nor Sun obtains reasonable results, because of the high uncertainty assigned to the unknown proposition $\mathsf{\Theta}$. The methods of Murphy and Deng and our IDS method perform well and give relatively rational results. Overall, our IDS fusion method achieves the best result for three of the four paradoxes and the second-best result for the remaining one.
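The effect of moving the conflict factor to $\mathsf{\Theta}$ can be illustrated with a short sketch of Yager's rule for two sources. The BPA values and the three-hypothesis frame are hypothetical, and Sun's credibility-weighted variant is not reproduced here.

```python
from itertools import product


def yager_combine(m1, m2, frame):
    """Yager's rule for two BPAs: the conflict mass is moved to the whole frame Theta."""
    theta = frozenset(frame)
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            fused[common] = fused.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb
    # No normalisation: the conflicting mass becomes "unknown" instead.
    fused[theta] = fused.get(theta, 0.0) + conflict
    return fused


A, B, C = frozenset("A"), frozenset("B"), frozenset("C")
frame = {"A", "B", "C"}
theta = frozenset(frame)

# Hypothetical highly conflicting pair of sources.
m1 = {A: 0.99, B: 0.01}
m2 = {B: 0.01, C: 0.99}

fused = yager_combine(m1, m2, frame)
for focal, mass in fused.items():
    print(sorted(focal), round(mass, 4))
```

Almost all of the mass ends up on $\mathsf{\Theta}$ in this toy case, so the fused result is valid but says very little about which proposition to choose.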

In the following diagnosis experiments, we choose the proposition with the maximum BPA as the fusion result. Intuitively, the closer the maximum BPA is to 1, the more confident the decision. As shown in Table 4, our IDS method has the largest maximum BPA in the complete conflict paradox (0.9284), the 0 trust paradox (0.7418) and the 1 trust paradox (0.9406), and ranks second in the high conflict paradox with a BPA of 0.6210. In general, our proposed IDS method performs well in dealing with paradoxical evidence for fault diagnosis.
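As an illustration of the maximum-BPA decision rule, the sketch below applies it to the output of Murphy's averaging approach, one of the compared baselines (it is not the IDS algorithm). It assumes that every focal element is a single hypothesis, and the four BPAs are hypothetical values rather than the data of Table 1.

```python
def dempster_singletons(m1, m2):
    """Dempster's rule when every focal element is a single hypothesis."""
    hyps = set(m1) | set(m2)
    agree = {h: m1.get(h, 0.0) * m2.get(h, 0.0) for h in hyps}
    total = sum(agree.values())  # equals 1 - K for normalised inputs
    if total == 0.0:
        raise ValueError("complete conflict: Dempster's rule is undefined")
    return {h: w / total for h, w in agree.items()}


def murphy_fuse(evidences):
    """Murphy's approach: average the n BPAs, then combine the average n - 1 times."""
    n = len(evidences)
    hyps = set().union(*evidences)
    avg = {h: sum(m.get(h, 0.0) for m in evidences) / n for h in hyps}
    fused = avg
    for _ in range(n - 1):
        fused = dempster_singletons(fused, avg)
    return fused


def decide(bpa):
    """Return the proposition with the maximum BPA."""
    return max(bpa, key=bpa.get)


# Hypothetical evidences: the second source gives zero belief to "A".
evidences = [
    {"A": 0.5, "B": 0.2, "C": 0.3},
    {"A": 0.0, "B": 0.9, "C": 0.1},
    {"A": 0.55, "B": 0.1, "C": 0.35},
    {"A": 0.55, "B": 0.1, "C": 0.35},
]
fused = murphy_fuse(evidences)
print(decide(fused), round(fused[decide(fused)], 3))
```

Because the sources are averaged before they are combined, the single zero assigned to "A" no longer vetoes it, and the maximum-BPA rule selects the proposition that the majority of sources support.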

#### 3.2. Evaluation of IDSCNN for Bearing Fault Diagnosis on the CWRU Dataset

As described before, the CWRU bearing datasets are acquired at different load levels. To check how the load level affects the vibration signals, we randomly extract one sample for each fault type (0, 1, ..., 9) and load condition. We then plot the 16 × 16 input data from Datasets A, B, and C. In Figure 5, we can see that the RMS maps for a given fault type under different loads are highly similar (column-wise similarity) in the CWRU bearing dataset. The RMS maps, however, vary significantly from fault type to fault type. Furthermore, even for the same fault state and the same load, the drive-end RMS map differs greatly from the fan-end RMS map. This means that different sensors carry different information, so combining the information from different sensors should benefit fault diagnosis. Since we fix the length of the frequency spectrum, the 16 × 16 and 32 × 32 RMS maps come from the same frequency spectrum, but the 32 × 32 RMS maps carry more information than the 16 × 16 RMS maps.
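The relation between the two map sizes can be sketched as follows. This is only an illustration under stated assumptions: the spectrum is split into equal-width bands, one RMS value is computed per band, and the values are laid out row by row. The function name `rms_map`, the 4096-bin length and the synthetic spectrum are hypothetical; the actual preprocessing is the one described earlier in the paper.

```python
import math


def rms_map(spectrum, side):
    """Split a spectrum into side*side equal bands and return a side x side RMS map."""
    n_bands = side * side
    if len(spectrum) % n_bands != 0:
        raise ValueError("spectrum length must be a multiple of side * side")
    width = len(spectrum) // n_bands
    rms = [
        math.sqrt(sum(x * x for x in spectrum[i * width:(i + 1) * width]) / width)
        for i in range(n_bands)
    ]
    # Row-major layout (an assumption of this sketch).
    return [rms[r * side:(r + 1) * side] for r in range(side)]


# Synthetic fixed-length spectrum (placeholder, not CWRU data).
spectrum = [abs(math.sin(0.01 * k)) for k in range(4096)]

map16 = rms_map(spectrum, 16)  # 256 bands of 16 bins each
map32 = rms_map(spectrum, 32)  # 1024 bands of 4 bins each

# Each coarse band covers four consecutive fine bands, so its mean square
# is the average of the four fine mean squares.
coarse = map16[0][0] ** 2
fine = sum(v ** 2 for v in map32[0][:4]) / 4
print(len(map16), len(map32), math.isclose(coarse, fine))
```

Under these assumptions the 16 × 16 map can be recovered from the 32 × 32 map but not the other way round, which is one way to read the statement that the larger map carries more information.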

First, we conduct experiments to evaluate the performance of our individual CNN models. In real fault diagnosis scenarios, the bearing load changes constantly. It is thus desirable that the fault prediction model can adapt to different load conditions. To test the load adaptability of our models, we trained the CNN models on training Datasets A, B and C, which correspond to three different load conditions. We then tested their performance on testing Datasets A, B and C under the three load conditions. There are nine combinations of training and testing sets, as shown in the top part of Table 5. For each model, we use the 10,000 samples for training and test it on the 2500 samples of each load condition. Note that, in the subsequent sections, A, B and C denote the datasets used for training and testing.
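The evaluation grid can be enumerated explicitly, as in the sketch below. The accuracy values are placeholders rather than results from Table 5, and treating the average accuracy (AVG) as the plain mean over the nine train/test combinations is an assumption of this example.

```python
from itertools import product

LOADS = ("A", "B", "C")

# All (training set, testing set) pairs: 3 x 3 = 9 combinations.
scenarios = list(product(LOADS, repeat=2))
same_load = [(tr, te) for tr, te in scenarios if tr == te]
cross_load = [(tr, te) for tr, te in scenarios if tr != te]


def average_accuracy(acc_by_scenario):
    """Mean accuracy over the nine scenarios (assumed definition of AVG)."""
    missing = set(scenarios) - set(acc_by_scenario)
    if missing:
        raise ValueError(f"missing scenarios: {sorted(missing)}")
    return sum(acc_by_scenario[s] for s in scenarios) / len(scenarios)


# Placeholder accuracies in percent, for illustration only.
placeholder = {s: (100.0 if s[0] == s[1] else 95.0) for s in scenarios}

print(len(scenarios), len(same_load), len(cross_load))
print(round(average_accuracy(placeholder), 2))
```

The six cross-load pairs are the ones that probe load adaptability; the three same-load pairs only measure how well a model fits the condition it was trained on.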

As described in Table 2, we developed 16 CNN architectures to evaluate. We built one model for each of these architectures and each sensor, giving 32 CNN models in total. To avoid random sampling errors, we repeated the same modeling process 20 times and took the average values as the final results, as shown in Table 5.

In Table 5, we can see that all CNN models perform almost perfectly on test data from the load condition they were trained on, whereas their performance varies on the other testing datasets. Comparing the average accuracy (AVG) of the 32 CNN models yields some interesting observations. First, model #2 is the best model, with an accuracy of 97.89%. However, its adaptability from training set C to testing set A is 89.81%, which is lower than that of models #4, #9, #11 and #13. On the other hand, model #27 might be the worst model, since its AVG is only 89.11%, but its local adaptability from training set B to testing set A (91.93%) is better than that of almost all CNN models with 32 × 32 input at the fan end. This tells us that even the best selected model may perform poorly under certain circumstances, and even the worst model may perform relatively well under some conditions.

We also compared the performance of the models trained with drive-end and fan-end signals with 16 × 16 input size (Table 5). The average accuracy of the eight CNN models trained with drive-end signals is 96.54%, compared to 90.97% for the models trained with fan-end signals. This means that the signal from the drive end is more useful than that from the fan end for fault diagnosis. This may have several causes, such as sensor quality, sensor location and environmental effects. Next, we evaluate how our improved D-S fusion algorithm helps improve the fault diagnosis performance of our CNN models.

Figure 6 shows the experimental results for the individual CNNs and the IDSCNN models with different fusion sources. In Figure 6a, we trained eight CNN models with the parameter settings in Table 2 for each load condition (Datasets A, B, and C) with an input size of 16 × 16 and tested them on all three test sets A, B, and C. We then measured the minimum, maximum, and average fault prediction performance, along with the performance of the IDS fusion model, which takes the outputs of the eight models and uses the IDS fusion algorithm to make a prediction. The last row of Figure 6a shows that the fusion model ids-de-16 achieves an average performance higher than the maximum performance of the individual CNNs. This is also true for the ids-de-32 model in Figure 6c. For the fusion models trained with the fan-end sensor, which has lower diagnosis quality, the fusion model achieves better performance than the average performance of the 16 models. Their performance is also higher than that of the individual CNN models for most test scenarios.

For further analysis, we applied our IDS fusion method to all 16 CNN models at the drive end, all 16 CNN models at the fan end, and all 32 CNN models at both ends (ids-de-all, ids-fe-all, and ids-all, respectively). From Figure 6e, we make the following observations. First, the fusion models of both the fan end and the drive end improve the accuracy by about 2%, a significant improvement for this challenging problem. Second, after combining all 32 CNN models through our IDS method, the final diagnosis accuracy based on two sensors reaches 98.92%, which is 9.81 percentage points higher than that of the worst single-sensor CNN model in Table 5.

To further validate the robustness of the proposed method, we selected the three best IDSCNN models (ids-all), one for each load condition. We then evaluated their performance on 20 randomly sampled test datasets for each load condition and show the results as box plots in Figure 7. Our first observation is that these IDSCNN models all achieved 100% accuracy on the test datasets of the same load condition, which is rare for the individual CNN models, as shown in Table 5. For test datasets generated from different load conditions, our IDSCNN models achieved an average accuracy higher than 97% with small performance variation for A→B, A→C, B→A, B→C, and C→B. The exception is C→A: the best IDSCNN model trained with signals from load condition C reached an average accuracy of 93% with large variation, which is much lower than in the other cases.

We compare our best DSCNN models (CNN models fused with the traditional D-S method) and the best IDSCNN models with five other bearing fault diagnosis models in Figure 8. We can see that the DSCNN method, which combines the CNN models with traditional D-S evidence theory, has higher accuracy than FFT-SVM, FFT-MLP, FFT-DNN [48], WDCNN [49] and WDCNN (AdaBN) [49] on all test conditions except C→A, where its accuracy is lower than that of the WDCNN (AdaBN) model. As shown in the last row of Figure 8, our IDSCNN models achieved the best diagnosis results under all six test scenarios when compared with the first five models. In particular, the accuracy on C→A improved from 88.3% for WDCNN (AdaBN) and 86.0% for DSCNN to 93.8%. This can be attributed to the larger amount of paradoxical evidence under the C→A diagnosis condition, which the traditional D-S fusion method cannot handle while our IDS fusion algorithm can. The diagnosis evidence under the other test conditions may be more consistent, so the diagnosis accuracies of DSCNN and IDSCNN are very similar there.

To understand how our improved D-S fusion algorithm improves the prediction performance, we compared the confusion matrices of a drive-end CNN model, a fan-end CNN model, and the IDSCNN fusion model, as shown in Figure 9. The vertical axis of Figure 9 represents the true labels, while the horizontal axis represents the predicted labels. The values in the matrices are the numbers of predicted samples for each fault type. In Figure 9a, CNN model #7, trained on Dataset C with the drive-end signal and tested on test Dataset A, performs well except for a large number of misclassifications of Type 3, Type 4 and Type 5 samples. Its total accuracy is 83.8%. On the other hand, Figure 9b shows that CNN model #25, trained on Dataset C with the fan-end signal and tested on test Dataset A, performs well except for a significant number of misclassifications of Type 1 and Type 2 samples. Overall, it has an accuracy of 79.2%. These two models have complementary fault classification capabilities, which are exploited by the IDSCNN model. Figure 9c shows that the fusion model achieved the highest performance among the three. The total diagnosis accuracy improved from 83.8% for the drive-end CNN model and 79.2% for the fan-end CNN model to 92.4% after information fusion.
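For reference, the total accuracy and the per-type accuracy can be read off a confusion matrix as in the following sketch. It uses the same orientation as Figure 9 (rows are true labels, columns are predicted labels), but the 3 × 3 matrix is a hypothetical toy example, not the data behind the figure.

```python
def total_accuracy(cm):
    """Fraction of samples on the diagonal of a confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_type_accuracy(cm):
    """Fraction of each true fault type that is predicted correctly (row-wise)."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


# Hypothetical confusion matrix: rows are true labels, columns are predictions.
toy_cm = [
    [50, 0, 0],
    [5, 40, 5],
    [0, 10, 40],
]

print(round(total_accuracy(toy_cm), 4))
print(per_type_accuracy(toy_cm))
```

Looking at the row-wise values alongside the total makes it easy to see which fault types a given model confuses, which is the kind of complementarity between sensors that fusion can exploit.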

To validate that the performance difference between DSCNN and IDSCNN is statistically significant, we repeated the C→A test 20 times. The DSCNN and IDSCNN models were each trained with samples from load condition C (Dataset C), and each test dataset was randomly sampled under load condition A (Dataset A). We calculated five statistics (maximum, minimum, median, mean and standard deviation) from the 20 results of the DSCNN and IDSCNN models. In Figure 10, we can see that the best performance of DSCNN is 86.8%, while the worst performance of IDSCNN is 93.0%. The average classification accuracy of the DSCNN models is 86.0 ± 0.4%, while the IDSCNN models classify 93.8 ± 0.4% of the testing data correctly. Both methods have the same standard deviation of 0.4%, and it is apparent in Figure 10 that the classification accuracy of the proposed IDSCNN model is higher than that of the DSCNN model on the C→A test, with an average improvement of 7.8 percentage points.
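The five summary statistics can be computed as in the sketch below. The two accuracy lists are short placeholders rather than the 20 measured results, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or the population form was used.

```python
import statistics


def summarize(accuracies):
    """Max, min, median, mean and (sample) standard deviation of repeated runs."""
    return {
        "max": max(accuracies),
        "min": min(accuracies),
        "median": statistics.median(accuracies),
        "mean": statistics.fmean(accuracies),
        "std": statistics.stdev(accuracies),  # sample form: an assumption
    }


# Placeholder accuracies in percent (not the measured C->A results).
dscnn_runs = [85.6, 86.0, 86.4, 85.8, 86.2]
idscnn_runs = [93.4, 93.8, 94.2, 93.6, 94.0]

dscnn_stats = summarize(dscnn_runs)
idscnn_stats = summarize(idscnn_runs)

# The ranges do not overlap when the worst IDSCNN run beats the best DSCNN run.
ranges_separated = idscnn_stats["min"] > dscnn_stats["max"]
mean_gain = idscnn_stats["mean"] - dscnn_stats["mean"]
print(ranges_separated, round(mean_gain, 1))
```

A formal significance test, for example a two-sample t-test, could be added on top of these statistics; non-overlapping ranges with equal spread already indicate a clear separation between the two methods.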