A Lightweight Deep Convolutional Neural Network Based Fault Diagnosis of Rotating Machinery

To improve the fault diagnosis performance for rotating machinery, an efficient, noise-resistant end-to-end deep learning (DL) algorithm is proposed based on the advantages of the wavelet packet transform in vibration signal processing (the capability to extract multiscale information and more spectral distribution features) and of deep convolutional neural networks (good classification performance, data-driven design and high transfer-learning ability). First, a vibration signal is subjected to pyramid wavelet packet decomposition, and each sub-band coefficient is used as the input of one channel of a deep convolutional network (DCN). Then, based on lightweight modeling requirements and techniques, a new DCN structure is designed for the fault diagnosis. The proposed algorithm is compared with the support vector machine algorithm and published DL algorithms on a bearing dataset produced by Case Western Reserve University. The experimental results show that the proposed algorithm is superior to the existing algorithms in terms of accuracy, memory space, computational complexity, noise resistance, and transfer performance.


Introduction
Rotating machinery systems have been extensively applied in a number of engineering fields (e.g., aviation, ships and warships, machine tools and vehicles) and play an increasingly pivotal role. Rotating machinery damage and faults not only severely affect the reliability and safety of the entire system but also cause tremendous economic losses. Therefore, researchers have been continuously conducting relevant research. One current focus of research is to extract and classify fault features, i.e., to further improve the identification accuracy and system monitorability by analyzing initial weak fault signals to achieve real-time monitoring and diagnosis.
Based on the differences in the feature extraction and fault diagnosis algorithms, rotating machinery fault diagnosis algorithms can be classified into two types, namely, vibration analysis and intelligent diagnosis [1][2][3][4][5]. In vibration analysis, signal decomposition techniques (e.g., the wavelet transform (WT) [6] and empirical mode decomposition (EMD)) are used to directly detect fault frequencies in the original data [7]. Most fault frequencies are hidden in low-frequency deterministic components and high-frequency noise components and are consequently very difficult to observe in the spectra. As a result, vibration analysis algorithms have relatively poor practical performance. Intelligent diagnosis is a newer research direction whose main algorithms include artificial neural networks (ANNs) and support vector machines (SVMs). Rojas et al. [8] extracted features from signals using the Fourier transform and classified them using an SVM. Xia et al. [9] proposed using the Volterra series as a feature for describing the operating conditions of machinery systems and used a backpropagation neural network for classification and discrimination. Lei et al. [10] proposed extracting features using EMD combined with wavelet packet decomposition (WPD) and inputting selected sensitive features into an ANN for fault diagnosis. Gangsar et al. [11] proposed extracting fault features from signals using the WT and classifying them with an SVM. While conventional intelligent diagnosis algorithms have been extensively applied in machinery signal fault diagnosis, they have the following disadvantages [12][13][14]. (1) The accuracy of the fault diagnosis depends on the quality of the feature extraction. Vibration signals collected in an industrial environment are complex, unstable and noisy. Thus, to ensure the quality of the feature extraction, suitable features must be manually designed based on the characteristics of the different faults.
The quality of the features directly determines the system performance. Therefore, the system feature extraction is not automatic. (2) Conventional intelligent diagnosis algorithms are based on shallow learning models, which are unable to effectively learn nonlinear relations in complex systems.
To address these deficiencies, deep learning (DL) has been gradually used in machinery signal fault diagnosis. Based on the model used in the DL, the available studies are classified into three types, namely, those involving autoencoders (AEs), convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [4]. Jia et al. [15] pretrained a three-layer deep neural network (DNN) by stacking AEs and obtained final prediction results by fine-tuning the DNN. Li et al. [16] proposed a deep random forest fusion structure, which extracts signal features using the WT in conjunction with a deep Boltzmann machine and produces classification results by feature fusion using a random forest. Gan et al. [17] extracted the wavelet packet energy of the original signal as the feature input and then established a layered deep belief network-based diagnostic network, with its first layer diagnosing the fault type and the second layer determining the severity of the fault. Relying on network optimization using sparse regularization, Sun et al. [18] proposed an algorithm for predicting motor faults. This type of algorithm can be relatively easily realized and can obtain feature representation by learning, but it is slow in convergence and has a weak transfer-learning ability. The main RNN algorithms include RNNs and long short-term memory neural networks (LSTMNNs). Zhao et al. [19] proposed an LSTMNN-based fault diagnosis algorithm. This type of algorithm is effective in detecting faults in time-series data and is able to discover problems that arise as time elapses, but it has a relatively high network complexity and a weak transfer-learning ability.
There are a gradually increasing number of studies involving deep CNNs (DCNNs) in the fault diagnosis field. Abdeljaber et al. [20] established a one-dimensional (1D) LeNet5 network based on the LeNet5 architecture for detecting structural damage in machinery. DCNN models that differ in structure have also been proposed, based on 1D CNNs, for predicting faults in various types of rotating machinery [21][22][23][24]. However, most of these models use the original signal as the input but neglect its frequency-domain characteristics. DCNN models that differ in structure have been proposed based on CNNs for fault diagnosis [25][26][27]. In [25], a 1D problem was converted into a two-dimensional (2D) signal processing problem using a continuous WT scalogram. In [26,27], a 1D problem was converted into a 2D signal-processing problem by a non-overlapping equal division of the original signal, before a fault prediction was performed using a 2D convolution algorithm. Zhao et al. [28] proposed to use a wavelet packet-residual network hybrid algorithm for predicting faults in gearboxes. This type of algorithm is effective in analyzing multidimensional data and can effectively extract local features. Compared to 1D processing, 2D representation has a relatively complex structure, and 2D processing requires more time and computational resources in the training and testing processes. For example, the computational load of a 3 × 1 convolution is less than one-third that of a 3 × 3 convolution.
While the available DL algorithms are effective, the collection of original signals in an actual industrial production environment is relatively significantly affected by various components, requiring that the algorithms be highly resistant to disturbances. In addition, with the development of the industrial internet of things (IIoT), the installation of a health management unit in rotating machinery has become a trend. Constrained by costs, as well as by actual hardware and environmental conditions, the improvement of the functional intelligence level of software has high requirements for operating environment resources. As a result, the requirements for the computational complexity of DL algorithms and the space occupied by models are becoming increasingly stringent. Consequently, there are increasingly higher "small-size, lightweight, and rapidness" requirements for DL algorithms [25,29]. The complex and varied conditions of rotating machinery (with significant changes in the rotational speed and load, particularly when the system starts and shuts down) result in unstable collected data. As the conditions change and time elapses, the sample distribution no longer meets the same distribution requirement, leading to a demand for a high transfer-learning ability. The available studies are deficient in lightweight DL (as shown in Table 1), cannot meet the new requirements of the IIoT, and fail to sufficiently consider the noise resistance and transfer-learning ability of algorithms. Hence, in this study, a lightweight network structure is proposed based on the advantages of the WT and DL. The main contributions of this study are as follows.
(1) A 1D CNN structure is proposed that can effectively improve the identification accuracy and that has a relatively high transfer-learning ability and noise resistance.
(2) The proposed network structure is relatively lightweight and has a high computational speed, and it occupies little space.
The paper is organized as follows. Section 2 elaborates the theoretical basis of the WPT and the DCN for fault diagnosis underlying the proposed method. The lightweight CNN structure design is described in Section 3. In Section 4, experiments and analyses based on a bearing dataset produced by Case Western Reserve University verify that the proposed algorithm is superior to the existing algorithms. Finally, concluding remarks are given in Section 5.

Wavelet Packet Transform (WPT)
The framework of the wavelet packet transform is an extension of the wavelet transform [30]. The wavelet packet functions are time-frequency functions and can be described as

$w_{j,k}^{n}(t) = 2^{j/2} w^{n}(2^{j} t - k),$

where j and k are integers indexing the scale and translation operations, respectively, and the index n is a modulation (oscillation) parameter. The first two wavelet packet functions are the scaling function and the mother wavelet function:

$w^{0}(t) = \varphi(t), \qquad w^{1}(t) = \psi(t).$

When n = 2, 3, . . . , the functions can be defined by the following recursive relationships:

$w^{2n}(t) = \sqrt{2} \sum_{k} h(k)\, w^{n}(2t - k), \qquad w^{2n+1}(t) = \sqrt{2} \sum_{k} g(k)\, w^{n}(2t - k),$

where h(k) and g(k) are the quadrature mirror filters (QMFs) associated with the predefined scaling function and the mother wavelet function, respectively. The wavelet packet coefficients $w_{j,k}^{n}$ are computed by the inner product $\langle f(t), w_{j,k}^{n}(t) \rangle$, which can be defined as

$w_{j,k}^{n} = \int f(t)\, w_{j,k}^{n}(t)\, dt.$

The framework of the WPT algorithm broken into three resolution levels is shown in Figure 1; the structure of the WPT is a perfect binary tree.
WPT can further obtain the detailed wavelet coefficients of the signal in the high-frequency region and provides a more detailed and comprehensive time-frequency plane tiling than the discrete wavelet transform (DWT) does [31]. As shown in Figure 2, these advantages have led to the use of WPT in discrete signal processing applications such as the fault diagnosis of rotating machinery [11], image processing [32] and video processing [33].
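As a concrete sketch of the pyramid decomposition, the recursion above can be implemented directly, here assuming the Haar filter pair for h(k) and g(k) (the actual wavelet basis used in the paper may differ). Unlike the DWT, both the low- and high-pass outputs are split again at every level, producing the perfect binary tree of 2^L leaf sub-bands:

```python
import math

# Haar quadrature mirror filters: H for approximation, G for detail.
H = (1 / math.sqrt(2), 1 / math.sqrt(2))
G = (1 / math.sqrt(2), -1 / math.sqrt(2))

def filter_pair(x):
    """One decomposition step: split x into low- and high-pass sub-bands."""
    low = [H[0] * x[2 * k] + H[1] * x[2 * k + 1] for k in range(len(x) // 2)]
    high = [G[0] * x[2 * k] + G[1] * x[2 * k + 1] for k in range(len(x) // 2)]
    return low, high

def wpt(x, levels):
    """Full wavelet packet decomposition: every sub-band is split again,
    yielding a perfect binary tree with 2**levels leaf sub-bands."""
    bands = [list(x)]
    for _ in range(levels):
        next_bands = []
        for band in bands:
            low, high = filter_pair(band)
            next_bands.extend([low, high])
        bands = next_bands
    return bands
```

Because the Haar pair is orthonormal, each split preserves the signal energy exactly, so the total energy of the 2^L sub-bands equals that of the original signal.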

Deep Convolutional Networks (DCNs)
DCNs, proposed by LeCun, are an important branch of DL [34,35]. DCNs consist of convolutional and pooling layers as well as activation and loss functions. The convolutional layers use various types of convolution kernels and perform convolution operations on the input signals. Convolution operations are translation invariant for 1D signals and can support neurons in learning features with a relatively high robustness [36].
The pooling layers perform down-sampling operations, which output a single value for each small region of the input; generally, the maximum or mean value is used. These down-sampling operations reduce the dimensionality of the feature maps and introduce nonlinearity into the processing of 1D signals.
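The down-sampling step can be sketched as follows (the window size and input values are illustrative, not taken from the paper's configuration):

```python
def max_pool_1d(x, size):
    """Non-overlapping 1D max pooling: each window of `size` consecutive
    samples is replaced by its maximum value."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]
```

Replacing `max` with a mean computation yields average pooling; both shrink the feature map by the pooling factor.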
The activation functions stimulate neurons based on a series of input values, the weights of the interneuron connections and the activation rules. The loss function of a CNN is used to evaluate the difference between the network output and the actual value at the training stage; its value is then used to update the weights between the neurons. The purpose of training is to minimize the value of the loss function.


A batch normalization (BN) layer normalizes the data input into each layer of the network (normalized to a mean of 0 and a variance of 1). The BN layers not only improve the gradient flow in the network, allow a higher learning rate, and improve the training speed, but they also reduce the strong dependence on initialization.
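The normalization performed by a BN layer can be sketched as follows (omitting the learned scale and shift parameters and the running statistics used at inference time):

```python
def batch_norm(batch, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance.
    A real BN layer additionally applies a learned scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n
    return [(v - mean) / (var + eps) ** 0.5 for v in batch]
```

The small epsilon guards against division by zero when a batch has near-zero variance.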

One-by-One Convolutions
A 1 × 1 convolution filter is the same as a normal filter except that its size is 1 × 1, so it does not consider the relationships between neighboring positions in the previous layer. One-by-one convolutions first appeared in the Network in Network paper and were used to deepen and widen network structures by reducing the dimensionality in inception networks [37]. To reduce the computational complexity of a system, a 1 × 1 convolution can be used to reduce the dimensionality by compressing the channels. In [38], the effect of 1 × 1 convolutions with different channel-number proportions on the performance is discussed and verified, and a squeeze ratio of 0.5 was found to be a satisfactory trade-off. Here, a two-layer convolutional network is presented in Figure 3 as an example. Figure 3a shows a dimensionality reduction without a 1 × 1 convolution, and Figure 3b shows a dimensionality reduction using a 1 × 1 convolution.
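The savings can be illustrated with a small calculation (the channel counts here are hypothetical, not the paper's configuration): replacing a direct 3 × 1 convolution from 64 to 64 channels with a 1 × 1 squeeze to 32 channels (squeeze ratio 0.5) followed by a 3 × 1 convolution back to 64 channels reduces the weight count by a third:

```python
def conv1d_weights(kernel, c_in, c_out):
    """Weight count of a 1D convolutional layer (biases omitted)."""
    return kernel * c_in * c_out

# Direct path: one 3x1 convolution, 64 -> 64 channels.
direct = conv1d_weights(3, 64, 64)
# Bottleneck path: 1x1 squeeze 64 -> 32, then 3x1 convolution 32 -> 64.
squeezed = conv1d_weights(1, 64, 32) + conv1d_weights(3, 32, 64)
```

Here `direct` is 12,288 weights while `squeezed` is 8,192, a one-third reduction from the bottleneck alone.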


Lightweight CNN Structure Design
For 1D fault-signal classification problems, an end-to-end network structure (shown in Figure 4) is proposed. Functionally, this network can be divided into two stages: the first performs the wavelet packet transform (WPT), with the aim of extracting finer information from a frequency-domain perspective, and the second is the designed lightweight CNN.






DCNN Design
The computational resources required by a DCNN are affected by the number of channels, the kernel size, the data length and the connection mode. To improve the ability to predict faults in rotating machinery, different numbers of channels and different structures are used in the CNN, guided by the main factors affecting the network convolutions. In this study, seven DCNs are designed as 1D convolutional networks, and their computational loads and storage space are compared (as shown in Table 2). The number of channels and the kernel size differ among the structures (models). In Table 2, K/C represents the kernel size/number of channels (e.g., 11 × 1/4 represents a kernel size of 11 × 1 and 4 kernels (channels)). In network design, deepening the network can enhance its ability to learn features while reducing the number of network parameters and the training load.
As shown in Figure 5, the computational complexity and number of parameters of network structure model 2 are 16% and 9% lower, respectively, than those of model 1.
The WPT-CNN structure is described as follows.

1. The conv_1 to conv_5 convolutional layers of the network include convolution operations, rectified linear unit (ReLU) activation functions and BN (as shown in Figure 6). The use of the ReLU activation functions and BN increases the rate of convergence and prevents the gradient explosion and vanishing problems. Maximum pooling is used in the pool_1 to pool_4 pooling layers so that the features that pass through them are translation invariant, thereby removing some noise while reducing the dimensionality of the features.

2. The kernel lengths of the conv_1 to conv_5 convolutional layers are set to 11, 3, 3, 3 and 3, respectively, exploiting the smoothing effect of the convolution kernel. The length of the conv_1 layer is set to 11 to suppress noise.

3. An alternating cascade of convolutional layers with kernel sizes of 3 × 1 and 1 × 1 is used in the conv_3 to conv_5 convolutional blocks to reduce the dimensionality while increasing the depth of the network, thereby reducing the number of training parameters and realizing a lightweight model. This method has been proven effective elsewhere [29].

4. In the second half of the network, the conv_6 and pool_5 layers substitute for a commonly used fully connected layer to reduce the risk of overfitting caused by a fully connected layer while reducing the number of network parameters. A linear activation function is used in the conv_6 convolutional layer because a linear mapping is required to map the number of channels of the input feature to the number of classes; other activation functions cannot achieve this and may even slow down the convergence of the network. Global average pooling is used in the pool_5 pooling layer so that the feature map of each input channel corresponds to one output feature, thereby improving the consistency between the feature maps and the output classes. The stability of the pooling process is also enhanced by the summation of the spatial information. The maximum number of channels is 512. The WPT-CNN network structures of models 1 and 2 are shown in Figure 5a,b, respectively. Here, the example shown in the network map is used for the analysis. The signal has 16 dimensions and 16 channels after passing through the pool_5 pooling layer. After flattening, 16 × 16 = 256 dimensions are obtained, and a fully connected layer would then be needed to reduce the dimensionality to 10 for the classification; thus, a total of 256 × 10 = 2560 parameters would need to be trained. However, when the conv_6 and pool_5 layers substitute for the fully connected layer, only the parameters in the conv_6 layer (a total of 1 × 16 × 10 = 160 parameters) need to be trained, whereas the pooling layer has no trainable parameters. Hence, the number of parameters decreases 16-fold from 2560 to 160, saving the memory space occupied by the model parameters.

5. The cross-entropy loss is used as the loss function, as shown below:

$H(p, q) = -\sum_{x} p(x) \log q(x),$

where p(x) is the label of the training set and q(x) is the label value predicted by the network.
In classification problems, the cross-entropy function is often used as the loss function because, during model optimization, its gradient is related only to the prediction for the correct class. In this way, updating the network parameters increases the score of the correct class without directly affecting the other classes.
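With a one-hot label distribution p, the sum collapses to the negative log of the predicted probability of the correct class, which is why the loss involves only that prediction. A minimal sketch:

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) = -sum_x p(x) * log q(x) between the true
    label distribution p and the predicted distribution q.
    eps guards against log(0) for zero-probability predictions."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))
```

For a one-hot label on class 1 and a prediction assigning it probability 0.7, the loss is simply -log(0.7).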



Random Number Initialization
In DL, each convolution kernel is initialized with random values, which has a certain impact on the training and performance of the network. In this study, Xavier initialization is used to promote the convergence of the model [39]. The weight $w \in \mathbb{R}^{d_{in} \times d_{out}}$ is initialized using the following equation:

$w \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{d_{in} + d_{out}}},\ \frac{\sqrt{6}}{\sqrt{d_{in} + d_{out}}}\right].$
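The standard Xavier (Glorot) uniform scheme can be sketched directly from this equation:

```python
import math
import random

def xavier_uniform(d_in, d_out, rng=random):
    """Xavier (Glorot) uniform initialization: draw each weight from
    U[-limit, limit] with limit = sqrt(6 / (d_in + d_out))."""
    limit = math.sqrt(6.0 / (d_in + d_out))
    return [[rng.uniform(-limit, limit) for _ in range(d_out)]
            for _ in range(d_in)]
```

Scaling the range by the fan-in and fan-out keeps the variance of activations roughly constant across layers, which is what promotes stable convergence.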

Performance Analysis of Network Parameters
The proposed WPT-CNN hybrid network structure is lightweight and has significant advantages over commonly used network structures in terms of the computational load and the number of parameters. In these networks, the convolutional layers and the fully connected layers are the parts that perform floating-point computation and contain trainable parameters. The calculation equations are as follows:

$Params_{conv} = (K_h \times K_w \times C_{in} + 1) \times C_{out},$
$FLOP_{conv} = H \times W \times (K_h \times K_w \times C_{in} + 1) \times C_{out},$
$Params_{fc} = (I + 1) \times O,$
$FLOP_{fc} = (2 \times I - 1) \times O,$

where $Params_{conv}$ and $FLOP_{conv}$ represent the number of parameters and the floating-point computation of a convolutional layer, respectively; $Params_{fc}$ and $FLOP_{fc}$ represent the number of parameters and the floating-point computation of a fully connected layer, respectively; H, W and $C_{in}$ are the height, width and number of channels of the input feature map, respectively; $K_h$ and $K_w$ are the height and width of the convolution kernel; $C_{out}$ is the number of convolution kernels; I is the dimensionality of the input; and O is the dimensionality of the output. Various network structures, including WPT-CNN, DNN, 1D LeNet-5, the residual networks ResNet-18 and ResNet-50, and Visual Geometry Group (VGG)-16, are compared in terms of the number of parameters and the floating-point computation. DNN is the deep neural network proposed in [15] for machinery fault diagnosis, pretrained by stacking AEs; it contains three hidden layers with 600, 200 and 100 neurons, respectively. 1D LeNet-5 is a 1D CNN obtained by improving LeNet-5 and used for motor fault detection. ResNet-18, ResNet-50 and VGG-16 are commonly used CNN frameworks. To accommodate 1D input data, the 2D convolutions in these three networks were changed to 1D convolutions for the calculation. The input length was determined by the input lengths reported in the literature and serves as the evaluation basis for the computational complexity and the number of parameters (see Table 3).
Comparison of networks in terms of parameter count and computational load. The calculation results show that the proposed algorithm has the lowest floating-point computational load (3 to 500 times lower than that of the available algorithms) and the fewest parameters (3 to 2000 times fewer than those of the available algorithms). In addition, the proposed hybrid network contains 14 convolutional layers and is thus much deeper, giving it a stronger feature-extraction ability than some of the available algorithms (e.g., [23][24][25][26][27]). See the subsequent comparison experiments for details.
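The parameter and FLOP counting above can be sketched in a few lines. This is a minimal illustration using the standard formulas for convolutional and fully connected layers; the example layer sizes below are illustrative, not the paper's exact network.

```python
# Sketch of the parameter/FLOP counting behind Table 3 (standard formulas;
# the example layer sizes are illustrative, not the paper's exact network).

def conv_params(kh, kw, c_in, c_out):
    """Trainable parameters of a convolutional layer (weights + biases)."""
    return (kh * kw * c_in + 1) * c_out

def conv_flops(h, w, kh, kw, c_in, c_out):
    """Multiply-accumulate operations over an H x W output feature map."""
    return h * w * (kh * kw * c_in + 1) * c_out

def fc_params(i, o):
    """Trainable parameters of a fully connected layer (weights + biases)."""
    return (i + 1) * o

def fc_flops(i, o):
    """Multiply-accumulate operations of a fully connected layer."""
    return (i + 1) * o

# A 1D convolution is the special case W = K_w = 1: e.g., kernel size 11,
# 1 input channel, 16 output channels, output length 64.
print(conv_params(11, 1, 1, 16))        # (11*1*1 + 1) * 16 = 192
print(conv_flops(64, 1, 11, 1, 1, 16))  # 64 * 192 = 12288
print(fc_params(128, 10))               # (128 + 1) * 10 = 1290
```

Note how the conv-layer FLOPs scale with the output feature-map size while its parameter count does not, which is why lightweight designs trade channel width against depth.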

Experiment and Analysis
The hardware and software environments used to implement the algorithm are shown in Table 4.

Introduction of the Dataset
The bearing dataset collected by the Case Western Reserve University (CWRU) Bearing Data Center was used to validate the effectiveness of the proposed WPT-CNN network structure. Figure 7 shows the experimental platform for the data collection [40]. The dataset contains data for bearings under four different conditions, that is, normal bearings, bearings with a faulty ball (ball), bearings with a faulty inner race (inner_race) and bearings with a faulty outer race (outer_race). For each type of fault, there are three fault diameters, that is, 0.007 in., 0.014 in. and 0.021 in. Thus, with the inclusion of normal bearings, this dataset contains data for 10 classification groups of bearings (as shown in Table 5).
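The 10-class structure of Table 5 can be enumerated directly: one healthy condition plus three fault locations at three fault diameters each. The label strings below are illustrative, not the paper's exact notation.

```python
# Enumerating the 10 bearing-condition classes implied by Table 5
# (label strings are illustrative, not the paper's exact notation).
fault_types = ["ball", "inner_race", "outer_race"]
diameters = ["0.007in", "0.014in", "0.021in"]

classes = ["normal"] + [f"{ft}_{d}" for ft in fault_types for d in diameters]
print(len(classes))  # 10 classification groups
```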

Data Preparation
Because the volume of the experimental data was limited, the data were augmented using the overlap sampling method reported in [15,24], as shown in Figure 8. In this study, the overlap length was determined based on the data length. In the experiments, the dataset was divided into training, validation and test sets in an 8:1:1 ratio; 800 training samples, 100 test samples and 100 validation samples were randomly selected. In the training process, the validation set was used to examine the identification accuracy after every 10 epochs.
As shown in Table 5, the dataset was divided into four sub-datasets, A, B, C and D, corresponding to the data collected under motor loads of 0, 1, 2 and 3 hp, respectively. Datasets A, B, C and D each contain 800 training samples and 100 test samples of each class, totaling 8000 training samples and 1000 test samples.
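The overlap sampling augmentation can be sketched as a sliding window over each long vibration record. The window and stride values below are illustrative; the paper derives the overlap length from the data length.

```python
# Sketch of overlap (sliding-window) sampling used for data augmentation.
# Window and stride are illustrative values, not the paper's exact settings.
def overlap_sample(signal, window=1024, stride=256):
    """Cut overlapping fixed-length segments from one long vibration record."""
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, stride)]

record = list(range(10000))   # stand-in for a raw vibration record
segments = overlap_sample(record)
print(len(segments))          # (10000 - 1024) // 256 + 1 = 36
print(len(segments[0]))       # 1024
```

A stride smaller than the window length is what multiplies the number of training samples relative to non-overlapping slicing.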

Experimental Description
The CWRU dataset was collected in an environment with a relatively low level of ambient noise, whereas an actual industrial environment contains numerous noise sources. Therefore, to assess the noise robustness of the proposed algorithm, noise was added to the samples in the original test set.
White Gaussian noise at 10%, 30%, 50%, 70%, 90% and 100% levels was added to the data to simulate actual conditions. The signal-to-noise ratio (SNR) used when adding the noise signal is defined as follows:

SNR = 10 log10(P_signal / P_noise) (13)

where P_signal and P_noise are the power of the original signal and the noise signal, respectively. Figure 9 shows the original vibration signal from a bearing with a 0.007 in. faulty inner race under 0 load and the signals with various percentages of added noise. Note that the data with 0% noise represent the results under normal conditions. The Daubechies-1 wavelet function was selected for all experiments; its short filter has a low smoothing effect, which is favorable for the extraction of detailed signal features.
Various types of noise are present in an industrial production environment, so fault prediction algorithms must be highly resistant to noise. Moreover, because the load and the operating rotational speed of a machine change during operation, the distribution of the collected sample data varies between conditions; algorithms are therefore required to have a relatively high transfer-learning ability across loads.
To examine the noise resistance of the proposed algorithm, the learning model with the highest accuracy for the validation set in 1000 iterations was selected. Noise-containing data were randomly generated 10 times for each sample in the training set of dataset A and used for testing and statistically analyzing the experimental results, which were represented in the form of the mean and standard deviation. Similarly, dataset A was used as a training set to train the model, and datasets B, C and D were used to test the accuracy of the algorithm (denoted by AB, AC and AD, respectively) in order to determine the transfer-learning ability for various loads.

Experiment 1:
The effect of the WPT decomposition level on the accuracy and noise resistance of the algorithm.
In this section, to determine the best WPT decomposition level, network structure model 2 with three convolution kernels in the first convolutional layer is selected. The accuracy of the proposed network structure is obtained on the validation and test sets for the raw time-domain input and for different WPT decomposition levels under different percentages of noise. Table 6 summarizes the statistical experimental results. Because the original signal was used as the input signal, the input had a length of 1024. In the WPT, as the decomposition level increases, the length of each sub-band decreases by a factor of 2, whereas the number of sub-bands increases by a factor of 2. For example, at decomposition levels 3, 4 and 5, the length of each sub-band is 128, 64 and 32, and there are 8, 16 and 32 sub-bands, respectively. The experimental results show that for a 1D input, a sub-band length of 64 resulted in a satisfactory division of the frequency domain, a moderate length and a large amount of information.
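The sub-band bookkeeping above can be verified with a minimal Haar (Daubechies-1) wavelet packet decomposition. This is a bare-bones sketch of the transform, not the paper's implementation: level L splits a length-N signal into 2^L sub-bands of length N / 2^L.

```python
# Minimal Haar (db1) wavelet packet decomposition: level 4 on 1024 samples
# yields 16 sub-bands of 64 coefficients each (a sketch, not the paper's code).
import math

def haar_step(x):
    """One orthonormal Haar split into approximation and detail halves."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def wpt(x, level):
    """Return all sub-band coefficient vectors at the given level."""
    bands = [list(x)]
    for _ in range(level):
        bands = [half for band in bands for half in haar_step(band)]
    return bands

signal = [math.sin(0.05 * n) for n in range(1024)]
bands = wpt(signal, level=4)
print(len(bands), len(bands[0]))   # 16 sub-bands of length 64
```

Because the Haar filter pair is orthonormal, the total signal energy is preserved across the sub-bands, which is what makes the coefficients a faithful multichannel input for the CNN.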
Experiment 2: Effects of various sizes of convolution kernels in the first layer on the noise resistance of the algorithm.
Table 6 shows that a WPT decomposition level of 4 yields the highest noise resistance; therefore, level-4 WPT and network structure model 2 are used in this section. An experiment was conducted to compare the effects of various sizes of convolution kernels in the first layer, and Table 7 summarizes the results. The experimental data in Table 7 show that, taking both the computational load and the performance into account, a first-layer convolution kernel of size 11 produced a satisfactory noise-suppressing effect.
Experiment 3: Comparison of the performance of three different CNN models.
The effects of the number of channels on the noise resistance and transfer-learning ability of the network structures were comparatively analyzed. The data with 0% noise represent the results under normal conditions.
As demonstrated in Table 8, model 2 was superior to model 1, with the same number of channels in each metric, suggesting that a 1 × 1 convolution can help improve the performance.
The number of channels of each model can be set to 4, 8, 16, 32, 64 or 128. When there are 4 or 8 channels, a dimensionality reduction is generally not performed, to ensure that there are sufficient channels for depicting features. When there are many channels (e.g., 16, 32, 64 or 128), a dimensionality reduction can be performed but does not significantly improve the performance. This is because the inter-elemental relationships in a 1D signal are simple compared with those in a 2D signal, so an excessive number of channels does not yield an improved performance. However, an increase in the number of channels is accompanied by a multifold increase in the number of parameters and in the computational complexity, lowering the cost effectiveness.
Experiment 4: Comparison of the performance of network structures with different depths.
In DL, the network depth affects the performance and computational complexity of an algorithm. Here, network structures with various depths were designed based on model 2 (see Table 9) and compared in terms of noise resistance and transferability. Table 10 summarizes the experimental results. The experimental data show that model 2 with a five-layer structure was satisfactory in terms of the performance, number of parameters and computational complexity.
Experiment 5: Comparison of the transfer-learning ability for various levels of noise.
The transfer-learning ability of DL model 2, which had the best performance, was comparatively analyzed for various levels of noise. Table 11 summarizes the experimental results under different dataset combinations. For example, AB represents the test accuracy on sub-dataset B using a model trained on sub-dataset A. The experimental data show that the various noise intensity levels weakened the transfer-learning ability of model 2 to a certain extent (3-10%).
Experiment 6: Comparison of the noise resistance and transfer-learning ability of various algorithms.
To determine its noise resistance and transfer-learning ability, the proposed algorithm was compared with the available algorithms by using dataset A. Tables 12 and 13 summarize the experimental results. The experimental data show that the proposed algorithm exhibited the highest resistance to various noise intensity levels, 5-58% higher than that of the available algorithms.
The experimental data show that the proposed algorithm with the structure of model 2 exhibited the highest performance for various datasets, and its transfer-learning ability was 0.5-44% higher than that of the available algorithms.

Visualization of the Network Learning Process
A CNN is difficult to interpret directly because it operates much like a black box. Therefore, the output features of the network training process and the model testing process were subjected to a dimensionality reduction using the t-distributed stochastic neighbor embedding (t-SNE) technique and then visualized to facilitate an understanding of the operation of the entire network. Figure 10 shows the visualized network training process, demonstrating the feature distribution from the 0th to the 500th iteration of the proposed algorithm on dataset A. As shown in Figure 10, as training progresses, the feature distance between labels of different types increases, whereas the feature distance between labels of the same type decreases, and the features of each class gradually form a cluster. This observation suggests that as the number of training iterations increases, the network's feature-learning ability and classification accuracy improve. Figure 11 shows the visualized network testing process, demonstrating the feature distribution as the data pass through the conv_1 to conv_6 layers of the hybrid network. As shown in Figure 11, the features of all the samples are mixed and indistinguishable at the network initialization stage. As the number of network layers increases, the features gradually become distinguishable, and in the conv_5 layer, the features with the same labels self-organize into one cluster.

Conclusions
This study investigates fault prediction in rotating machinery. Based on the ability of the wavelet packet transform to extract more frequency-domain information and on the low computational load of 1D convolutions, a convolutional network-based lightweight deep learning fault prediction algorithm is proposed. This network can be regarded as having two layers: the first performs the wavelet packet transform (WPT), with the aim of extracting finer information from a frequency-domain perspective, and the second is the designed, relatively lightweight CNN (as shown in Figure 4). To validate the effectiveness of the proposed WPT-CNN network structure, the bearing dataset collected by the Case Western Reserve University (CWRU) Bearing Data Center was used for the experiments. The comparison results demonstrate that the proposed algorithm is superior to the available algorithms in noise resistance (5-58% higher) and transfer-learning ability (0.5-44% higher).
In addition, the proposed algorithm also has the lowest computational complexity and memory space requirement (72.5% and 88.5% lower than those of the available algorithms, respectively). Therefore, the proposed algorithm can effectively improve the identification accuracy and is relatively lightweight.
Future research will focus on developing a new model for the deep convolutional neural network optimization problem by improving the regularization term and optimization strategy, building on the results of this paper, to predict the fault characteristics of rotating machinery. Furthermore, to meet the actual working conditions of industrial production, future research will also focus on improving the generalization ability of the model and further reducing its size.