Fault Diagnosis of Motor Bearings Based on a One-Dimensional Fusion Neural Network

Deep learning has been an important topic in fault diagnosis of motor bearings, which can avoid the need for extensive domain expertise and cumbersome artificial feature extraction. However, existing neural networks have low fault recognition rates and low adaptability under variable load conditions. In order to solve these problems, we propose a one-dimensional fusion neural network (OFNN), which combines Adaptive one-dimensional Convolution Neural Networks with Wide Kernel (ACNN-W) and Dempster-Shafer (D-S) evidence theory. Firstly, the original vibration time-domain signals of a motor bearing acquired by two sensors are resampled. Then, four frameworks of ACNN-W optimized by RMSprop are utilized to learn features adaptively and pre-classify them with Softmax classifiers. Finally, the D-S evidence theory is used to comprehensively determine the class vector output by the Softmax classifiers to achieve fault detection of the bearing. The proposed method adapts to different load conditions by incorporating complementary or conflicting evidences from different sensors through experiments on the Case Western Reserve University (CWRU) motor bearing database. Experimental results show that the proposed method can effectively enhance the cross-domain adaptive ability of the model and has a better diagnostic accuracy than other existing experimental methods.


Introduction
Bearings are one of the most important mechanical components in rotating machinery. According to statistics, 40% of motor failures are bearing failures [1]. In order to reduce maintenance costs and prevent harmful or even damaging consequences of faults and failures, we need to find and replace fault bearings early. There are many ways to detect motor bearing faults, such as fault diagnosis through stray flux analysis [2], Park's vector method (PVA) [3], instantaneous power factor (IPF) monitoring [4] and so on. Among them, using the acceleration of the motor housing to determine the existence of these faults is the most accurate method, which has become a very well-developed field in recent years [5]. Real time bearing vibration signal fault detection methods, including traditional methods and deep learning methods, have become a hot topic recently [6].
The traditional methods require feature extraction of the bearing vibration signal, dimension reduction, and classification. Feature extraction is mainly for signals in the time-domain, frequency domain and time-frequency domain. Traditional time-domain techniques extract features from deep learning and ensemble learning. Therefore, the proposed method uses D-S evidence theory to ensemble deep neural networks with multiple sensor information to construct a data fusion model for bearing fault diagnosis. The proposed fusion model can combine multiple uncertain evidences and provide fusion results by combining consensus information and exclusion information. It has the following characteristics: (1) Compared with the machine learning-based damage assessment method proposed by the traditional method, the proposed method can adaptively extract features directly from the original vibration signal without manual feature extraction. Traditional machine learning damage detection methods use hand-crafted features that are not only sub-optimal, but also have high computational complexity. (2) The ACNN-W can effectively suppress over-fitting and improve the generalization performance of the network. (3) In order to overcome the limitations of individual deep learning models, reduce the impact of random initialization of neural networks, and make full use of information in different fields, we use D-S evidence theory to make comprehensive decisions on OFNN.
The remainder of this paper is as follows: Section 2 introduces the related work. Section 3 presents our method. Section 4 illustrates the evaluation of the ACNN-W model and optimizer and learning rate selection. Experiments in Section 5 expound the evaluation and analysis of the OFNN network. Section 6 is our conclusion.

Related Work
The method of classifying based on artificial extraction can be used in bearing fault diagnosis, but the method relies on manually acquired prior knowledge and cumbersome manual feature extraction, and the final accuracy is greatly affected by the selected features. Therefore, in this paper, we adopt a deep learning scheme to combine feature extraction and feature classification into one step. Some present methods of bearing fault diagnosis based on deep learning are described in the following paragraphs.
Xie [28] proposed rotating machinery fault diagnosis based on convolutional neural network and empirical mode decomposition (EMD). The EMD is used to extract the frequency domain information of the signal, and CNN is used as the classification. Guo et al. [29] proposed multi-scale continuous wavelet transform combined with CNN for bearing fault detection, but they only used CNN for classification without fully utilizing CNN's powerful feature extraction capability. Ince [30] proposed to combine feature extraction and classification tasks into a network. Abdeljaber [31] used a one-dimensional convolutional neural network for the vibration detection of steel frames, and directly used the powerful learning ability of one-dimensional convolutional neural networks to classify faults.
Pan [32] believed that the above works used the same load to collect signals for training and testing CNN, which limits the further application of the model. Therefore, the LiftingNet deep learning network is proposed to train and test signals with different sampling frequencies and different loads. However, this method normalizes the amplitude, which leads the model not to recognize the severity of the fault.
Zhang [23] proposed WDCNN, which is an end-to-end diagnostic method for the case of variable loads. However, this method is only a single model, which has certain limitations and is more susceptible to random initialization. In addition, they performed a min-max regularization operation on the input, which made the signal distribution susceptible to individual extreme values, resulting in excessive distribution differences under different loads. Li [33] proposed to treat the original signal as a two-dimensional frequency domain signal as an input and use a multi-sensor classifier to integrate fault diagnosis. It can well overcome the limitations of a single model, but the original signal is a one-dimensional time domain signal, and transforming it into a two-dimensional signal may destroy some spatial information. Therefore, we propose a one-dimensional fusion neural network based on D-S evidence theory, using the original signal without min-max regularization as input, and directly using the one-dimensional convolutional neural network to adaptively extract and classify the original time domain signal, and adopt it in the network. Moreover, RMSprop optimization and BN are used to suppress over-fitting of the network.

Proposed OFNN for Bearing Fault Diagnosis
We propose the Adaptive one-dimensional Convolution Neural Networks with Wide Kernel (ACNN-W) as the basis for the One-dimensional Fusion Neural Network (OFNN). As shown in Figure 1, ACNN-W is proposed to learn features adaptively from raw mechanical data without prior knowledge. And we use ACNN-W as sub-classifiers of OFNN, combined with D-S evidence theory for evidence fusion. In this Section, we introduce the design of ACNN-W's framework and the application of D-S evidence theory in one-dimensional fusion neural networks.

Proposed OFNN for Bearing Fault Diagnosis
We propose the Adaptive one-dimensional Convolution Neural Networks with Wide Kernel (ACNN-W) as the basis for the One-dimensional Fusion Neural Network (OFNN). As shown in Figure 1, ACNN-W is proposed to learn features adaptively from raw mechanical data without prior knowledge. And we use ACNN-W as sub-classifiers of OFNN, combined with D-S evidence theory for evidence fusion. In this Section, we introduce the design of ACNN-W's framework and the application of D-S evidence theory in one-dimensional fusion neural networks.

Architecture Design for ACNN-W
CNN consists of three layers [34], which are convolutional layers, pooling layers and fully connected layers. As shown in Figure 2, the first few layers of a typical ACNN-W consist of a combination of two types of layers-convolutional layers, followed by pooling layers-and the last layer is a fully-connected layer. Next, we will describe them in more detail. The structural parameters of ACNN-W are shown in Table 1. The one-dimensional convolutional neural network extracts feature information from the first eight layers. The first nine layers use the ReLU function as the activation function, and the Softmax layer behind the network converts the output of the neural network into a probability distribution. Compared with the

Architecture Design for ACNN-W
CNN consists of three layers [34], which are convolutional layers, pooling layers and fully connected layers. As shown in Figure 2, the first few layers of a typical ACNN-W consist of a combination of two types of layers-convolutional layers, followed by pooling layers-and the last layer is a fully-connected layer. Next, we will describe them in more detail.

Proposed OFNN for Bearing Fault Diagnosis
We propose the Adaptive one-dimensional Convolution Neural Networks with Wide Kernel (ACNN-W) as the basis for the One-dimensional Fusion Neural Network (OFNN). As shown in Figure 1, ACNN-W is proposed to learn features adaptively from raw mechanical data without prior knowledge. And we use ACNN-W as sub-classifiers of OFNN, combined with D-S evidence theory for evidence fusion. In this Section, we introduce the design of ACNN-W's framework and the application of D-S evidence theory in one-dimensional fusion neural networks.

Architecture Design for ACNN-W
CNN consists of three layers [34], which are convolutional layers, pooling layers and fully connected layers. As shown in Figure 2, the first few layers of a typical ACNN-W consist of a combination of two types of layers-convolutional layers, followed by pooling layers-and the last layer is a fully-connected layer. Next, we will describe them in more detail. The structural parameters of ACNN-W are shown in Table 1. The one-dimensional convolutional neural network extracts feature information from the first eight layers. The first nine layers use the ReLU function as the activation function, and the Softmax layer behind the network converts the output of the neural network into a probability distribution. Compared with the  The structural parameters of ACNN-W are shown in Table 1. The one-dimensional convolutional neural network extracts feature information from the first eight layers. The first nine layers use the ReLU function as the activation function, and the Softmax layer behind the network converts the output of the neural network into a probability distribution. Compared with the WDCNN [23], the ACNN-W network has fewer layers, requires less network parameters and has a stronger ability to avoid over-fitting. In order to effectively suppress the neural network over-fitting and improve the ability to express, we use the Dropout algorithm in the fully connected layer. The algorithm turns the value of a layer of neurons to 0 with a certain probability, thereby preventing co-adaptation between neurons. Similarly, Zhang [23] proposed that when the first layer of the one-dimensional convolutional network is a large convolution kernel, a larger receptive field can be obtained, and the useful features for diagnosis can be learned autonomously and the over-fitting can be suppressed, so the network we propose uses a large convolution kernel at the first level.
The convolutional layer consists of a number of 1-D filters with weighting parameters. These filters combine with the input data and get an output called feature map, and each filter shares the same weighting parameters for all slices of input data to reduce training time and model complexity.
In order to improve the training efficiency of the network, reduce the internal covariate transfer, and enhance the generalization ability of the neural network. We used Batch Normalization (BN), which can accelerate deep network training by reducing internal covariate shift, reduce the fluctuation and improve the recognition rate [35].
Commonly used neural network activation functions are the sigmoid function (Sigmoid), the hyperbolic tangent function (Tanh), and the rectified linear unit (ReLU). When the absolute value of the input value is relatively large, the derivative values of the Sigmoid and Tanh functions are close to 0, which causes the error value to be propagated downward when updating the weight with the error back propagation, and the vanishing gradient problem during training the underlying network. Conversely, when the input value of the ReLU function is greater than 0, its reciprocal value is 1, which can well overcome the vanishing gradient problem.

The Cost Function of the ACNN-W
Two cost functions have been developed to measure the error between the input vector and the target vector, which are the traditional mean square error cost function and the newly developed cross entropy cost function. Compared with the mean square error cost function, the cross entropy cost function exhibits faster convergence speed and stronger global optimization ability. So cross entropy is a widely used loss function that characterizes the distance between two probability distributions. The smaller the cross entropy, the closer the two probability distributions are. Given two probability distributions p and q, we can use q to denote the cross entropy of p as: where the probability distribution function satisfies: Combined with mini-batch, the loss function cross entropy can be expressed as: where m is the mini-batch size, q is the actual output value of the Softmax layer, and p is its target distribution [36].

The Optimizer of the ACNN-W
The optimizer plays an important role in the training speed and classification accuracy of the neural network model. The commonly used optimizer has three kinds of optimizers: RMSprop, Adam and Adadelta. Qu [21] demonstrates the superiority of the RMSProp optimizer in bearing fault diagnosis. The optimizer can effectively prevent the premature convergence problem in the deep learning process by adaptively retaining the learning rate based on the mean value of the nearest magnitude of the weight gradient, and is suitable for processing non-stationary data such as a vibration signal. So we use the RMSprop optimizer, and the RMSprop optimizer training steps are as follows: Input: Global learning rate ε, decay rate p, Initial parameter θ, Constant σ is standing at 10 −6 (for stable values) Initialize cumulative variables r = 0 While not reach the stop criterion do Take a small batch of m samples {x 1 , x 2 , · · · x m } from the training set and the corresponding label is {y 1 , y 2 , · · · y m } Gradient calculation:

The Application of D-S Evidence Theory in OFNN
To reduce the impact of random initialization of the network, we use at least two ACNN-W networks to train the same data set. Therefore, as shown in Figure 1, the proposed method constructs four sub-classifiers based on ACNN-W. The vibration data measured by the two sensors in the same period is cut and used as the input of the classifiers 1, 2 and the classifiers 3, 4, respectively. In addition, the classifiers 1, 2, 3, and 4 use random initialization operations and are trained individually. After the training is completed, the vibration signals that need to be verified are respectively input into the trained network, and the class vectors of the Softmax layers of the four networks are taken out for D-S evidence fusion, thereby obtaining the final diagnosis result. D-S evidence theory is mainly used to deal with uncertainty reasoning, which was proposed by Harvard mathematician Dempster and his student Shafer. In the currently used combination strategy, voting is simple and convenient, and has been widely applied to different ensemble learning methods. However, the main disadvantage of voting majority consent rules is that all individual models have the same weight and cannot fully exploit hidden information. The D-S evidence theory can be considered as a general extension of Bayesian theory, which can effectively deal with incomplete data. D-S evidence theory is not the probability of calculating propositions, but the probability of supporting evidence to support propositions and provides an alternative method for dealing with uncertainty inference based on incomplete information, solving the prior probability problem by tracking explicit probability measures that may lack information [37][38][39].
We use the fault type as the identification framework of D-S evidence theory Θ = {A 1 , A 2 , . . . , An}, and the probability value of the Softmax layer output of the sub-classifiers can well satisfy the condition of the basic probability assignment function (BPA) m: where proposition A is a non-zero subset that identifies the frame, the symbol m is a measure of the subset of Θ and m(A) represents the degree of trust in A.
For A ⊂ Θ, in the recognition framework, the Dempster synthesis rules for the finite basic probability distribution functions m 1 , m 2 , . . . , m n are as follows: where k represents the degree of conflict evidence, and the coefficient 1/(1 − k) is called the normalization factor, ensuring that the sum of the probabilities of BPA is 1.
According to the Dempster synthesis rule, we can fuse the class vectors output of the softmax layer of each ACNN-W, and use the category with the highest probability value of output as the final diagnostic category.

Enhancement and Division of Data Sets
We used a test database from the Case Western Reserve University (CWRU) bearing database with a sampling frequency of 12 kHz for verification experiments. The experiment used a 1.49 kW Reliance Electric motor, whose bearings were replaceable, and the bearing type used in this drive end is 6205-2RS JEM SKF. Figure 3 shows the data sampling system. The bearings used are processed by electrical discharge machining (EDM). In the inner raceway, the rolling elements and the outer raceway introduce faults ranging from 0.1778 mm in diameter to 1.016 mm in diameter, respectively. The faulty bearing was then reinstalled into the test motor and the vibration data of the motor load from 0.75 to 2.24 kW was recorded [40]. We use the fault type as the identification framework of D-S evidence theory Θ = {A1, A2, …, An}, and the probability value of the Softmax layer output of the sub-classifiers can well satisfy the condition of the basic probability assignment function (BPA) m: where proposition A is a non-zero subset that identifies the frame, the symbol m is a measure of the subset of Θ and m(A) represents the degree of trust in A.
For A ⊂ Θ, in the recognition framework, the Dempster synthesis rules for the finite basic probability distribution functions m1, m2, …, mn are as follows: where k represents the degree of conflict evidence, and the coefficient 1/(1 − k) is called the normalization factor, ensuring that the sum of the probabilities of BPA is 1.
According to the Dempster synthesis rule, we can fuse the class vectors output of the softmax layer of each ACNN-W, and use the category with the highest probability value of output as the final diagnostic category.

Enhancement and Division of Data Sets
We used a test database from the Case Western Reserve University (CWRU) bearing database with a sampling frequency of 12 kHz for verification experiments. The experiment used a 1.49 kW Reliance Electric motor, whose bearings were replaceable, and the bearing type used in this drive end is 6205-2RS JEM SKF. Figure 3 shows the data sampling system. The bearings used are processed by electrical discharge machining (EDM). In the inner raceway, the rolling elements and the outer raceway introduce faults ranging from 0.1778 mm in diameter to 1.016 mm in diameter, respectively. The faulty bearing was then reinstalled into the test motor and the vibration data of the motor load from 0.75 to 2.24 kW was recorded [40]. Data set enhancement techniques are commonly used in the field of computer vision to increase the generalization of the network by adding training samples. In view of the fact that the vibration signal collected by the acceleration sensor is a one-dimensional signal, overlap sampling is used in this paper. As shown in Figure 4, we take 2048 data points of the continuous vibration signal as a sample, and offset it by a set amount to be the second sample. Data set enhancement techniques are commonly used in the field of computer vision to increase the generalization of the network by adding training samples. In view of the fact that the vibration signal collected by the acceleration sensor is a one-dimensional signal, overlap sampling is used in this paper. As shown in Figure 4, we take 2048 data points of the continuous vibration signal as a sample, and offset it by a set amount to be the second sample. As shown in Table 2, in this paper, the shock signals of 0.75 kW, 1.49 kW, and 2.24 kW are respectively corresponding to three large data sets of A, B, and C. Each large data set contains a training set and a test set. Ten states under 0.75 kW, 1.49 kW, and 2.24 kW were numbered 1 to 10, and each state was divided by an offset of 20 to make 5000 samples. Each state of the same horsepower is randomly selected from 500 samples into the test set, and the rest is placed in the training set. Each sample consists of two parts, the drive sensor signal and the fan end sensor signal at the same time.

Evaluation and Settings of ACNN-W
Since the OFCNN model is based on ACNN-W, we need to evaluate ACNN-W first. In the next, we evaluate the ACNN-W network using the driver signal as an example.

Settings of the ACNN-W
We experimented with the Tensorflow deep learning framework, using a computer configured with CPU i3-3120M and 12 GB of memory, with a selected mini-batch size of 256.
In order to observe the influence of the BN layer on the model, as shown in Figures 5 and 6, experiments with the BN layer were performed using the training set C and the test set C. It can be seen that the BN layer can accelerate the steady decline of the loss function and the steady increase of the recognition rate, and reduce the fluctuation of the recognition rate. As shown in Table 2, in this paper, the shock signals of 0.75 kW, 1.49 kW, and 2.24 kW are respectively corresponding to three large data sets of A, B, and C. Each large data set contains a training set and a test set. Ten states under 0.75 kW, 1.49 kW, and 2.24 kW were numbered 1 to 10, and each state was divided by an offset of 20 to make 5000 samples. Each state of the same horsepower is randomly selected from 500 samples into the test set, and the rest is placed in the training set. Each sample consists of two parts, the drive sensor signal and the fan end sensor signal at the same time.

Evaluation and Settings of ACNN-W
Since the OFCNN model is based on ACNN-W, we need to evaluate ACNN-W first. In the next, we evaluate the ACNN-W network using the driver signal as an example.

Settings of the ACNN-W
We experimented with the Tensorflow deep learning framework, using a computer configured with CPU i3-3120M and 12 GB of memory, with a selected mini-batch size of 256.
In order to observe the influence of the BN layer on the model, as shown in Figures 5 and 6, experiments with the BN layer were performed using the training set C and the test set C. It can be seen that the BN layer can accelerate the steady decline of the loss function and the steady increase of the recognition rate, and reduce the fluctuation of the recognition rate.
Considering the effect of different learning rate settings on the actual convergence rate, different learning rates are used for model training. Each group of optimizers and learning rates are tested 10 times, two epochs are trained each time. The average recognition rate of training set C and test set A was used as an evaluation index, and the experimental results are shown in Table 3: It can be seen from Table 3 that the Adadelta optimizer has a poor classification result when the learning rate is low, and has a high classification result when the learning rate is large. The experimental results of the RMSprop and Adam optimizers are the opposite. After comparing the best classification results of each optimizer horizontally, we found that the RMSprop optimizer can achieve the best classification result by taking the highest recognition rate of the training set and the test set at the learning rate of 0.01. Therefore, in this paper, we select the RMSprop optimizer and set the optimization rate to 0.01.

Evaluation and Settings of ACNN-W
Since the OFCNN model is based on ACNN-W, we need to evaluate ACNN-W first. In the next, we evaluate the ACNN-W network using the driver signal as an example.

Settings of the ACNN-W
We experimented with the Tensorflow deep learning framework, using a computer configured with CPU i3-3120M and 12 GB of memory, with a selected mini-batch size of 256.
In order to observe the influence of the BN layer on the model, as shown in Figures 5 and 6, experiments with the BN layer were performed using the training set C and the test set C. It can be seen that the BN layer can accelerate the steady decline of the loss function and the steady increase of the recognition rate, and reduce the fluctuation of the recognition rate.  Considering the effect of different learning rate settings on the actual convergence rate, different learning rates are used for model training. Each group of optimizers and learning rates are tested 10 times, two epochs are trained each time. The average recognition rate of training set C and test set A was used as an evaluation index, and the experimental results are shown in Table 3: It can be seen from Table 3 that the Adadelta optimizer has a poor classification result when the learning rate is low, and has a high classification result when the learning rate is large. The experimental results of the RMSprop and Adam optimizers are the opposite. After comparing the best classification results of each optimizer horizontally, we found that the RMSprop optimizer can achieve the best classification result by taking the highest recognition rate of the training set and the test set at the learning rate of 0.01. Therefore, in this paper, we select the RMSprop optimizer and set the optimization rate to 0.01.
For the experimental data set, we believe that if the input data is processed by Min-Max Normalization, a very small number of mutation values will affect the overall data distribution. Therefore, we will enter whether to perform Min-Max Normalization processing for experimental comparison. In order to avoid random sampling errors, each of the following experiments was repeated 20 times to average. As shown in the following figure (A-B, A on the left means the training set label, and B on the right means the verification set label): It can be seen from Figure 7 that for the CWRU bearing database, the Min-Max Normalization processing of the input is beneficial to improve the generalization ability of the network, and is more conducive to learn more accurate features of the ACNN-W network. Even the performance of the four ACNN-Ws fused by D-S evidence is not too prominent, which takes the signal collected by the drive end as input. Therefore, the proposed ACNN-W uses BN and does not use Min-Max  For the experimental data set, we believe that if the input data is processed by Min-Max Normalization, a very small number of mutation values will affect the overall data distribution. Therefore, we will enter whether to perform Min-Max Normalization processing for experimental comparison. In order to avoid random sampling errors, each of the following experiments was repeated 20 times to average. As shown in the following figure (A-B, A on the left means the training set label, and B on the right means the verification set label): It can be seen from Figure 7 that for the CWRU bearing database, the Min-Max Normalization processing of the input is beneficial to improve the generalization ability of the network, and is more conducive to learn more accurate features of the ACNN-W network. Even the performance of the four ACNN-Ws fused by D-S evidence is not too prominent, which takes the signal collected by the drive end as input. Therefore, the proposed ACNN-W uses BN and does not use Min-Max Normalization for experiments.

Network Visualization Assessment
CNN is often seen as a black box, and the internal operating mechanism is difficult to understand. The nonlinear dimensionality reduction algorithm, t-SNE, was proposed for visualization of neural networks [41]. Based on the principle that t-SNE can convert the Euclidean distance into a conditional probability to express the similarity between points, we can use it to project points in highdimensional space into low-dimensional space. Here, we select C-A with a lower recognition rate for visualization to further understand the internal operation of the model. Figure 8 shows the characteristics of the 16 large convolution kernels of the first layer after training the C training set. Figure 9 shows the visualization of the feature distribution of the verification set A in the four convolutional layers and the fully connected layers after training by the C training set. Using t-SNE we can find that as the layers get deeper, the features become more and more separable. It is not easily distinguishable when the original signal is input, while after feature extraction by the neural network, these features become easily divided in the fully connected layer. The results show that the low-level original signal in the input layer is converted into high-level features layer by layer, which can improve the accuracy and robustness of classification. This reflects the powerful adaptive feature extraction capability of the ACNN-W networks. In addition, we can see from the visualization of the fully connected layer that there are still some overlapping regions in the two clusters, which is consistent with our final experimental results. Therefore, the results show that our method has good feature learning ability and can learn basic fault features from the original vibration signal adaptively.

Network Visualization Assessment
CNN is often seen as a black box, and the internal operating mechanism is difficult to understand. The nonlinear dimensionality reduction algorithm, t-SNE, was proposed for visualization of neural networks [41]. Based on the principle that t-SNE can convert the Euclidean distance into a conditional probability to express the similarity between points, we can use it to project points in high-dimensional space into low-dimensional space. Here, we select C-A with a lower recognition rate for visualization to further understand the internal operation of the model. Figure 8 shows the characteristics of the 16 large convolution kernels of the first layer after training the C training set. Figure 9 shows the visualization of the feature distribution of the verification set A in the four convolutional layers and the fully connected layers after training by the C training set. Using t-SNE we can find that as the layers get deeper, the features become more and more separable. It is not easily distinguishable when the original signal is input, while after feature extraction by the neural network, these features become easily divided in the fully connected layer. The results show that the low-level original signal in the input layer is converted into high-level features layer by layer, which can improve the accuracy and robustness of classification. This reflects the powerful adaptive feature extraction capability of the ACNN-W networks. In addition, we can see from the visualization of the fully connected layer that there are still some overlapping regions in the two clusters, which is consistent with our final experimental results. Therefore, the results show that our method has good feature learning ability and can learn basic fault features from the original vibration signal adaptively.

Network Visualization Assessment
CNN is often seen as a black box, and the internal operating mechanism is difficult to understand. The nonlinear dimensionality reduction algorithm, t-SNE, was proposed for visualization of neural networks [41]. Based on the principle that t-SNE can convert the Euclidean distance into a conditional probability to express the similarity between points, we can use it to project points in highdimensional space into low-dimensional space. Here, we select C-A with a lower recognition rate for visualization to further understand the internal operation of the model. Figure 8 shows the characteristics of the 16 large convolution kernels of the first layer after training the C training set. Figure 9 shows the visualization of the feature distribution of the verification set A in the four convolutional layers and the fully connected layers after training by the C training set. Using t-SNE we can find that as the layers get deeper, the features become more and more separable. It is not easily distinguishable when the original signal is input, while after feature extraction by the neural network, these features become easily divided in the fully connected layer. The results show that the low-level original signal in the input layer is converted into high-level features layer by layer, which can improve the accuracy and robustness of classification. This reflects the powerful adaptive feature extraction capability of the ACNN-W networks. In addition, we can see from the visualization of the fully connected layer that there are still some overlapping regions in the two clusters, which is consistent with our final experimental results. Therefore, the results show that our method has good feature learning ability and can learn basic fault features from the original vibration signal adaptively.  Original Signal Conv1 Conv2 Conv3 Conv4 FC Figure 9. Feature visualization via t-SNE: feature representations for all test signals extracted from raw signal, four convolutional layers and the fully connected layer respectively.    Original Signal Conv1

Evaluation of the OFNN
Conv2 Conv3 Conv4 FC Figure 9. Feature visualization via t-SNE: feature representations for all test signals extracted from raw signal, four convolutional layers and the fully connected layer respectively.   ACNN-W-AVG-DA in Table 4 represents the average of the accuracy obtained by using only the drive end sensor signal experiment, ACNN-W-AVG-FA is the average of the accuracy obtained by the experiment using only the fan end sensor signal. OFNN-DE indicates the mean of the accuracy rate obtained by DS evidence fusion, using the drive end sensor signals. OFNN-FA indicates the mean of the accuracy rate obtained by DS evidence fusion, using the fan end sensor signals. The OFNN indicates that both the drive end sensor signals and the fan end sensor signals are used. The average accuracy of the four CNN models trained from the drive end signal is 98.66%, while the average accuracy of the model trained using the fan end signal is 92.31%. That means that the signal from the drive is more conducive to fault diagnosis than the fan end, and the multi-sensor fusion model achieves better performance than a single sensor signal training model. For most test cases, the performance of it is also higher than the performance of a single CNN model. In order to clearly observe the classification of the network, we select the most representative C-A to draw the confusion matrix.  As shown in Figure 12, the vertical axis of the confusion matrix represents the actual label for each bearing state, while the horizontal axis represents the predicted label. We can find that the classification effect of the model trained by the drive end signal is better than that of the model trained by the fan end signal. The model in Figure 12a has a poor classification effect on labels 1, 2, 3, and 10. Correspondingly, the model in Figure 12b is mainly that the classification effects of labels 3 and 5 are poor. Although the recognition rate of the third type of labels is relatively low, the two models still have complementary functions in other fault type classifications. After the fusion of evidence, it can ACNN-W-AVG-DA in Table 4 represents the average of the accuracy obtained by using only the drive end sensor signal experiment, ACNN-W-AVG-FA is the average of the accuracy obtained by the experiment using only the fan end sensor signal. OFNN-DE indicates the mean of the accuracy rate obtained by DS evidence fusion, using the drive end sensor signals. OFNN-FA indicates the mean of the accuracy rate obtained by DS evidence fusion, using the fan end sensor signals. The OFNN indicates that both the drive end sensor signals and the fan end sensor signals are used. The average accuracy of the four CNN models trained from the drive end signal is 98.66%, while the average accuracy of the model trained using the fan end signal is 92.31%. That means that the signal from the drive is more conducive to fault diagnosis than the fan end, and the multi-sensor fusion model achieves better performance than a single sensor signal training model. For most test cases, the performance of it is also higher than the performance of a single CNN model. In order to clearly observe the classification of the network, we select the most representative C-A to draw the confusion matrix.  As shown in Figure 12, the vertical axis of the confusion matrix represents the actual label for each bearing state, while the horizontal axis represents the predicted label. We can find that the classification effect of the model trained by the drive end signal is better than that of the model trained by the fan end signal. The model in Figure 12a has a poor classification effect on labels 1, 2, 3, and 10. Correspondingly, the model in Figure 12b is mainly that the classification effects of labels 3 and 5 are poor. Although the recognition rate of the third type of labels is relatively low, the two models still have complementary functions in other fault type classifications. After the fusion of evidence, it can be seen from the picture c that the recognition rate of label 3 has been greatly improved, and other types of labels can almost all be recognized. The proposed method is more accurate and stable than a single CNN. be seen from the picture c that the recognition rate of label 3 has been greatly improved, and other types of labels can almost all be recognized. The proposed method is more accurate and stable than a single CNN.

Algorithm Comparison and Analysis
In order to verify the recognition performance of the proposed algorithm and the current mainstream intelligent fault diagnosis algorithm, FFT-SVM, FFT-DNN [20], WDCNN [21] and other newly proposed algorithms are compared: As shown in the Table 5, the average accuracy of the proposed OFNN model combined with D-S evidence theory under variable load conditions is higher than 99%. In A-B, B-A, B-C and C-B, the performance of FFT-SVM, FFT-DNN, WDCNN, TICNN, Ensemble TICNN [26] and IDSCNN [33] has less variation, and the accuracy of A-C and C-A is much higher than other methods. Moreover, in conjunction with Figure 7, we can see that ACNN-W exhibits better performance than WDCNN. Therefore, the excellent performance of OFNN can be attributed to the good adaptive and generalization capabilities of the proposed ACNN-W and the excellent performance of the D-S evidence theory. In particular, the accuracy of C-A increased from 93.8% of IDSCNN to 97.0%. This shows that one-dimensional convolutional neural networks are more suitable for extracting spatially related information in the original sequence than two-dimensional convolutional networks in dealing with one-dimensional signals. Compared to WDCNN, we have more comprehensively retained the original input signal, better retaining the hidden features inside the sample. The proposed method relies entirely on the one-dimensional convolutional neural network with RMSprop and BN for feature extraction without excessive manual intervention. However, due to the increase in the number of integrated individual models, the proposed method requires more computation time than the individual model. With the rapid development of modern hardware technology and training algorithms, it is believed that the integrated deep learning method will be completed more efficiently [42].

Algorithm Comparison and Analysis
In order to verify the recognition performance of the proposed algorithm and the current mainstream intelligent fault diagnosis algorithm, FFT-SVM, FFT-DNN [20], WDCNN [21] and other newly proposed algorithms are compared: As shown in the Table 5, the average accuracy of the proposed OFNN model combined with D-S evidence theory under variable load conditions is higher than 99%. In A-B, B-A, B-C and C-B, the performance of FFT-SVM, FFT-DNN, WDCNN, TICNN, Ensemble TICNN [26] and IDSCNN [33] has less variation, and the accuracy of A-C and C-A is much higher than other methods. Moreover, in conjunction with Figure 7, we can see that ACNN-W exhibits better performance than WDCNN. Therefore, the excellent performance of OFNN can be attributed to the good adaptive and generalization capabilities of the proposed ACNN-W and the excellent performance of the D-S evidence theory. In particular, the accuracy of C-A increased from 93.8% of IDSCNN to 97.0%. This shows that one-dimensional convolutional neural networks are more suitable for extracting spatially related information in the original sequence than two-dimensional convolutional networks in dealing with one-dimensional signals. Compared to WDCNN, we have more comprehensively retained the original input signal, better retaining the hidden features inside the sample. The proposed method relies entirely on the one-dimensional convolutional neural network with RMSprop and BN for feature extraction without excessive manual intervention. However, due to the increase in the number of integrated individual models, the proposed method requires more computation time than the individual model. With the rapid development of modern hardware technology and training algorithms, it is believed that the integrated deep learning method will be completed more efficiently [42].

Conclusions
In this paper, we propose a method of OFNN, combined Dempster-Shafer evidence theory and the proposed Adaptive One-dimensional Convolution Neural Networks with Wide Kernel, for intelligent fault diagnosis of rolling bearings. It can be effectively achieved higher accuracy under variable load conditions, compared with the FFT-SVM, FFT-DNN, WDCNN, TICNN, Ensemble TICNN and IDSCNN models. Under the premise that we assume that the evidence provided by each classifier is independent of each other, it proves that DS evidence theory can effectively synthesize the output information of multi-classifiers, and better overcomes the local optimal problem caused by random initialization of networks. The proposed method can get rid of the dependence on manual feature extraction during diagnosing experimental bearing vibration data, and give full play to the advantages of one-dimensional neural network for feature extraction of one-dimensional original vibration signals. It is more effective and more robust than the existing intelligent diagnosis methods, overcoming the limitations of individual deep learning model. The new application of combining deep learning with integrated learning is very promising. However, as the number of integrated individual models increases, so does the amount of computer resources that are occupied. It is a research direction for bearing fault diagnosis in the future how to use integrated learning for diagnosis faster and at a lower cost. We will continue to study this topic in the future, combined with the hardware implementation of neural networks.