An Intelligent Gear Fault Diagnosis Methodology Using a Complex Wavelet Enhanced Convolutional Neural Network

As a typical example of large and complex mechanical systems, rotating machinery is prone to various kinds of mechanical faults. Among these faults, one of the prominent causes of malfunction arises in gear transmission chains. Although fault signatures can be collected via vibration signals, they are always submerged in overwhelming interfering content. Therefore, identifying the critical fault's characteristic signal is far from an easy task. To improve the recognition accuracy of a fault's characteristic signal, a novel intelligent fault diagnosis method is presented. In this method, a dual-tree complex wavelet transform (DTCWT) is employed to acquire the multiscale signal's features. In addition, a convolutional neural network (CNN) is utilized to automatically recognise fault features from the multiscale signal features. The experimental results for gear fault recognition show the feasibility and effectiveness of the proposed method, especially for the gear's weak fault features.


Introduction
With the rapid development of modern industry and technology, industrial applications are becoming more complicated and more precise. These changes put forward higher requirements for equipment maintenance. Rotating machinery is an important component in industrial applications, and has been widely used in many crucial areas, but any potential fault may lead to enormous economic loss [1]. Therefore, mechanical condition monitoring and fault diagnosis (CM-FD) for rotating machinery to avoid accidents and increase machine reliability has become an important research area in industry [2].
Known as key elements in rotating machinery, gears are widely used in manufacturing industry and have received significant attention in the field of CM-FD. Typical gear faults include chipped teeth, tooth breakage, root crack, wear, pitting, and surface damage [3]. These failure forms may lead to system imbalance and machining precision deterioration. To diagnose multiple gear faults, fault identification has become an important subject for extensive research for the past few decades [4,5].
The framework of traditional fault diagnosis includes three main steps: (1) signal acquisition; (2) feature extraction and selection; and (3) fault classification [6], as shown in Figure 1. A proper signal acquisition method is the premise and crux of accurate CM-FD. CM-FD approaches can be divided into categories according to the data acquisition medium.
Pattern recognition can be defined as identifying or classifying complex signal samples or objects [18]. Therefore, CM-FD can also be considered a pattern recognition problem. Among the various frameworks in pattern recognition, supervised and unsupervised learning are the two major approaches. Unsupervised learning mainly focuses on describing the hidden structure of unlabeled data. Some studies [19,20] have shown the potential to perform CM-FD in a completely unsupervised manner. Supervised learning, on the other hand, has exhibited an extraordinary ability for classification problems. It mainly focuses on the relationship between the input explanatory independent feature vector and the dependent class or cluster [21]. N. Saravanan and K.I. Ramachandran [22] proposed a method based on a discrete wavelet transform (DWT) and an artificial neural network (ANN), and an accuracy of 95% was obtained. In [23], the authors combined a continuous wavelet transform (CWT) and an ANN for fault classification, and the fault estimation was 98.28% accurate. Ali, J.B. et al. [24] proposed a method using empirical mode decomposition (EMD) and an ANN, and the classification accuracy was 93%. Xian [25] proposed a mechanical failure classification using DWT and a support vector machine (SVM); the failure classification accuracy in validation was 94.33%. In [26], the authors presented a feature extraction method, and the average classification accuracy was reported as 95.76%. Babu, N.R. and Mohan, B.J. [27] proposed a method using EMD and SVM, and the fault classification accuracy was 95.33%. In this research, supervised classification is employed for the gear fault diagnosis.
More recently, convolutional neural networks (CNNs) have aroused heated discussion in the scientific and industrial communities [28]. Qin reported a relation classification task utilizing a CNN approach to automatically control feature learning from raw sentences [29]. Yan reported a depth estimation method using two CNN architectures from raw images [30]. Zhu proposed a framework with a fully convolutional network (FCN) and a deep CNN for traffic sign detection and recognition [31]. In [32], Chen, Z.Q. et al. used 256 statistical signal features to construct a 16 × 16 feature map and then utilized a CNN for gearbox fault identification, and the classification was reported as 98.35% accurate. According to the studies mentioned above, the CNN achieved better results than the peer methods.
Inspired by the idea of CNN, we present an intelligent fault diagnosis method using wavelet enhanced CNN. The schematic of the proposed method is shown in Figure 2. In this method, DTCWT is employed to acquire the multiscale signal features with a fixed decomposition level from the raw vibration signal. The CNN approach is utilized to automatically enable fault feature recognition from the multiscale signal's features. After the network weight coefficients are set via a training set (labeled data), the novel method is more efficient for fault recognition compared with traditional methods, and also makes mechanical fault diagnosis move toward real artificial intelligence. The contributions of this paper are summarized as follows.
(1) The paper proposes an intelligent fault diagnosis method, which combines the traditional decomposition signal analysis technology and artificial intelligence technology. Different level DTCWT decomposition signals comprise a component matrix of multiscale signal features. Then, CNN is employed for fault pattern recognition. Because of the engagement of CNN to learn the features, the model does not depend on any prior knowledge.
(2) A gear fault case study is used to verify the proposed method. The experimental result shows that the proposed method has good generalization ability for fresh signals.
The rest of this paper is organized as follows. The signal decomposition method DTCWT is briefly described in Section 2. The learning method CNN is presented in Section 3. Section 4 gives the proposed fault diagnosis method. Section 5 details a simulation experiment for the fault classification based on the CNN. A typical gear fault case study is carried out in Section 6, where the model training process and a validation experiment are also presented. The major findings of this work are summarized in Section 7.

Figure 2. Schematic of the proposed method (panel labels: Raw Signal, DTCWT, 2D Signal, Layer #1, Layer #2, SoftMax).

Signal Decomposition
The useful transient features are usually buried in heavy background noise and other irrelevant vibrations [33]. A basic challenge of CM-FD is how to properly extract the fault feature under a low signal-to-noise ratio (SNR) [34]. To keep the calculation time for the pattern recognition in this paper acceptable, proper data pre-processing is necessary. In the literature, DTCWT is reported to enjoy merits such as a higher degree of design freedom, approximate shift-invariance, and inhibited frequency aliasing [35]. Therefore, compared with conventional waveforms implemented in the time domain, DTCWT has a better extraction ability for the periodic non-stationary fault features. In this research, DTCWT is utilized to perform the multiscale decomposition on the raw acquired data.
Materials 2017, 10, 790

DTCWT Framework
The wavelet transform has been exploited with great success across many applications. In wavelet theory, a finite-energy signal x(t) can be decomposed in terms of wavelets and scaling functions, where φ(t) is the scaling function and ψ(t) is the wavelet function. The scaling coefficients c(n) and wavelet coefficients d(j, n) are computed via inner products of x(t) with these functions.
Although the wavelet transform has many advantages, it still suffers from some fundamental problems such as fixed oscillatory behavior, shift variance, aliasing, and lack of directionality. Inspired by the Fourier transform, the complex wavelet transform (CWT) ψ^C(t) is proposed with a complex-valued scaling function and a complex-valued wavelet. The filterbank topology of DTCWT is shown in Figure 3, where the wavelet functions in 'Tree ℜe' and in 'Tree ℑm' form an approximate Hilbert transform pair, with Hilbert[·] denoting the Hilbert transform operator. In the time domain, there is an equivalent expression, as shown in Equation (6), where h_1^ℜe(n) and h_1^ℑm(n) are real-valued finite impulse response (FIR) filters corresponding to ψ^ℜe(t) and ψ^ℑm(t).

Figure 3. Filterbank topology of the DTCWT (panel label: Imag. Tree).

In each filtering tree, the functions ψ^(·)(t) and φ^(·)(t) satisfy a two-scale relationship, where the superscript (·) can be either ℜe or ℑm. The complex-valued wavelet coefficient series d_l^C(k) is calculated via inner product computation between the input signal and the wavelet systems {Ξ_{j,k}[ψ^ℜe]} and {Ξ_{j,k}[ψ^ℑm]}. Here, the notation Ξ_{j,k}[·] denotes the translation and dilation operations applied simultaneously to a function, and the binary operator ⟨·, ·⟩ represents the inner product operation. In the reconstruction phase, d_l(t) and a_i(t) can be retrieved from these coefficient series. Let J be the decomposition stage depth of the dual-tree wavelet decomposition in Figure 3; then J + 1 wavelet sub-bands, including d_1(t), . . . , d_J(t) as detail coefficient series and c_1(t) as the approximation series, will be produced.
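As a point of reference for the expansion, the inner products, and the Hilbert-pair condition described in this subsection, the textbook forms from the general wavelet and DTCWT literature are sketched below. They are not recovered from this paper's own numbered equations, so the 2^{j/2} normalization, the summation limits, the use of i for the imaginary unit, and the superscripts ℜe and ℑm are assumptions.

```latex
% Assumption: standard orthonormal wavelet series (not the paper's exact Equation)
\begin{align}
x(t) &= \sum_{n=-\infty}^{\infty} c(n)\,\phi(t-n)
      + \sum_{j=0}^{\infty}\sum_{n=-\infty}^{\infty} d(j,n)\,2^{j/2}\,\psi(2^{j}t-n) \\
% Coefficients as inner products of x(t) with the shifted/dilated functions
c(n)   &= \int_{-\infty}^{\infty} x(t)\,\phi(t-n)\,dt \\
d(j,n) &= 2^{j/2}\int_{-\infty}^{\infty} x(t)\,\psi(2^{j}t-n)\,dt \\
% Complex wavelet built from the real-tree and imaginary-tree wavelets
\psi^{C}(t) &= \psi^{\Re e}(t) + \mathrm{i}\,\psi^{\Im m}(t) \\
% Approximate Hilbert transform pair between the two trees
\psi^{\Im m}(t) &\approx \mathrm{Hilbert}\!\left[\psi^{\Re e}(t)\right]
\end{align}
```

When the last relation holds, the spectrum of the complex wavelet is approximately one-sided, which is the usual explanation for the approximate shift-invariance mentioned earlier.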

Wavelet Basis Construction
In this paper, a dual-tree complex wavelet basis, constructed in Ref [16], is employed to acquire the multiscale signal features. The time-frequency atoms of the wavelet basis are shown in Figure 4. As can be observed in Figure 4, this quarter shift basis is advantageous owing to its smooth envelope and annihilated energy leakage.

Convolutional Layer
Generally, a CNN is designed to deal with the variability of two-dimensional (2D) shapes. A basic stage in a CNN is composed of a convolutional layer and a pooling layer [36]. Each level consists of a certain number of feature maps, which means that CNNs have a good hierarchical feature representation ability from a lower level to a higher level [37]. Through the propagation of a CNN, the feature map's size will decrease layer by layer and the extracted features are more global. Related works show that CNNs are also most popular for audio signal processing in view of its efficiency and higher-level information detection ability through a series of lower-level detectors [38,39].
Given a series of time-domain signals x(t), after the DTCWT multiscale decomposition, the signal can be represented as x = [x_1, x_2, …, x_S], where S is the number of the training samples and L is the decomposition level. The corresponding network output can be written as y = [y_1, y_2, …, y_S]. Each y_s denotes the output class from the finite set of classes. A convolution operation is the feature extraction process [29]. Defining w_{ji}^l as the filters of a sliding filter bank and b_j^l as the bias, the convolutional layer output feature maps can be expressed as

$$x_j^{l} = \mathrm{relu}\Big(\sum_{i} x_i^{l-1} * w_{ji}^{l} + b_j^{l}\Big)$$

where i indexes the input feature maps, j indexes the output feature maps, l denotes the layer, x_i^{l−1} is the i-th input feature map in the (l−1)-th layer, and relu(·) indicates that the activation function in the network is the rectified linear unit (ReLU).
A typical example of a convolutional layer is shown in Figure 5. In Figure 5a, multiscale wavelet sub-bands after DTCWT decomposition are displayed and each column of the colored matrix means the corresponding DTCWT sub-band signal. Each rectangle marked by different colors represents a different convolutional kernel. With the slide of the convolutional kernel, output feature maps are generated (Figure 5b). After the sliding filtering, several feature maps are acquired according to the filter setting.
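The convolutional layer described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `relu` and `conv_layer`, the array layout, the stride of 1, and the absence of padding are all assumptions, and the sliding operation is written as cross-correlation, which is how most deep learning libraries implement "convolution".

```python
import numpy as np

def relu(z):
    """Rectified linear unit applied element-wise."""
    return np.maximum(z, 0.0)

def conv_layer(x_prev, w, b):
    """Sketch of x_j^l = relu(sum_i x_i^{l-1} * w_ji^l + b_j^l).

    x_prev: input feature maps, shape (I, H, W)
    w:      filters, shape (J, I, kh, kw)   (hypothetical layout)
    b:      biases, shape (J,)
    Returns output feature maps of shape (J, H-kh+1, W-kw+1),
    assuming stride 1 and no padding.
    """
    n_out, _, kh, kw = w.shape
    _, height, width = x_prev.shape
    out = np.zeros((n_out, height - kh + 1, width - kw + 1))
    for j in range(n_out):
        for r in range(out.shape[1]):
            for c in range(out.shape[2]):
                # Patch covers all input maps, so the sum over i is implicit.
                patch = x_prev[:, r:r + kh, c:c + kw]
                out[j, r, c] = np.sum(patch * w[j]) + b[j]
    return relu(out)
```

For an 8-row sub-band map and 3 × 3 kernels, this layout yields feature maps with 6 rows, which illustrates how the map size shrinks layer by layer.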

Pooling Layer
Pooling significantly reduces the computational complexity of the subsequent processing steps. Max-pooling and average-pooling are two of the most common pooling methods across various tasks [40]. In this research, max-pooling is chosen for the resolution reduction. Max-pooling can be written in terms of a sub-sampling function down(·) [41], which computes the max value of each m × n region (m is the vertical downscale, and n is the horizontal downscale) in the X_i^{l−1} map.
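The sub-sampling function can be illustrated with a short NumPy sketch. The function name `down` follows the text, but the non-overlapping regions and the trimming of rows or columns that do not fill a complete m × n region are assumptions of this example.

```python
import numpy as np

def down(feature_map, m, n):
    """Max value of each non-overlapping m x n region of a 2D map.

    m is the vertical downscale and n the horizontal downscale.
    Trailing rows/columns that do not fill a region are dropped (assumption).
    """
    height, width = feature_map.shape
    out_h, out_w = height // m, width // n
    trimmed = feature_map[:out_h * m, :out_w * n]
    # Split each axis into (blocks, within-block) and reduce within blocks.
    blocks = trimmed.reshape(out_h, m, out_w, n)
    return blocks.max(axis=(1, 3))
```

Applied to a 4 × 4 map with m = n = 2, the result is a 2 × 2 map holding the largest value of each quadrant.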

Output Layer
The output layer determines the class label of the input signal, and consists of a fully connected layer and a softmax layer [29]. The fully connected layer uses the sigmoid activation function, denoted sig(·). The final layer is composed of softmax units, which compute the conditional probability of each class. Here, y_s is the actual output of the network, K is the number of classes, a_s is the feature vector derived by the fully connected layer, and θ is the parameter set to be learned via Adam, an algorithm for the first-order gradient-based optimization of stochastic objective functions [41].
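A softmax unit of the kind described above can be sketched as follows. The linear scoring `theta @ a_s`, the shape of `theta` (one row of parameters per class), and the function name are assumptions made for illustration; the paper's exact parameterization is not shown in the text.

```python
import numpy as np

def softmax_probabilities(a_s, theta):
    """Conditional class probabilities for one feature vector.

    a_s:   feature vector from the fully connected layer, shape (D,)
    theta: hypothetical parameter matrix, shape (K, D), one row per class
    Returns a length-K vector of probabilities that sums to 1.
    """
    scores = theta @ a_s
    scores = scores - scores.max()  # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()
```

The predicted class is then the index of the largest probability, for example `int(np.argmax(softmax_probabilities(a_s, theta)))`.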

The Proposed Mechanical Fault Diagnosis Method
DTCWT possesses a powerful ability for extracting useful features from vibration signals because of its tight frame and shift invariance [42]. Besides, as a type of feed-forward artificial neural network, CNNs possess a good hierarchical feature representation ability from a lower level to a higher level [32]. Therefore, in this paper, a novel intelligent mechanical fault diagnosis method based on DTCWT and a CNN is proposed to improve the identification accuracy of mechanical faults. A flow chart of the proposed method is presented in Figure 6, and is illustrated in the following steps.
Step 1: Place the necessary sensors on the measured equipment, and acquire the physical signal with a data acquisition system. The necessary preprocessing of the raw signal (anti-aliasing filtering and low-pass filtering) is also performed.
Step 2: The acquired signals are decomposed into wavelet sub-bands using DTCWT with a decomposition depth n. The resulting DTCWT wavelet sub-bands are then placed as the rows of a matrix, so that the DTCWT components are fused into a 2D signal map for the following CNN fault classification. Theoretically, a higher decomposition level will lead to a better result at the cost of a higher computational burden. However, in practical applications, computational efficiency is also an indispensable factor. In this paper, the DTCWT decomposition level is set as 7. Therefore, with seven detail sub-bands and one approximation sub-band, the constructed 2D signal map dimension is 8 × L, where L denotes the length of the signal.
Step 3: Randomly separate the acquired signal records into two groups, the training dataset and the testing dataset, and collect an identical number of signal records for each fault type. The training dataset is used to train the CNN framework, which is presented in Figure 2. Due to the limited capacity of the dataset, sixfold cross validation [43] is used for the performance evaluation. The proportion of training dataset to testing dataset is 5:1. After the training iterations, the model is saved. The testing dataset is utilized to validate the trained CNN model. In this paper, two convolutional layers are employed for the fault classification in the CNN framework.
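Steps 2 and 3 can be sketched as below. The sketch assumes that the DTCWT sub-band signals have already been computed and reconstructed to equal length by some external routine, which is not shown; the function names `build_signal_map` and `sixfold_splits` are hypothetical, and the split is meant to be applied to the records of one fault type at a time so that every class contributes the same number of records.

```python
import numpy as np

def build_signal_map(subbands):
    """Stack equal-length sub-band signals as the rows of a 2D signal map.

    With a decomposition level of 7, `subbands` holds 8 signals
    (7 detail series and 1 approximation series), giving an 8 x L map.
    """
    return np.vstack([np.asarray(s, dtype=float) for s in subbands])

def sixfold_splits(n_records, n_folds=6, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    Records are shuffled once, cut into n_folds parts, and each part
    serves as the testing set in turn (5:1 ratio when n_folds is 6).
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_records)
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate(
            [folds[i] for i in range(n_folds) if i != k]
        )
        yield train_idx, test_idx
```

With 1440 records per class, each fold uses 1200 records for training and 240 for testing, which is the 5:1 proportion stated in Step 3.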

Simulation Experiment
The changing health state of gear teeth can lead to variations of amplitude and phase modulations of the meshing vibrations. As such, a trend analysis on the intensity of the modulation components can be effectively used to track the health state of gear pairs [44]. To verify the effectiveness of the proposed method, as well as that of the neural network structure for fault diagnosis applications, simulated gear crack fault signals are established as below.

In the simulated signal, x_um(t) = e^{−β(t − m×T_h)} sin(2π512t + φ_m) for 1 ≤ m ≤ 10; L denotes the amplitude of the impulse; wgn(t) is the white Gaussian noise series with 4 dB; the term a_um(t) = e^{−β(t − m×T_h)} represents the periodic amplitude modulation of the i-th impulse; and β = 90 + 0.05 × (i − 1) represents the system's damping characteristic. The impulse period T_h is 0.1025. Meanwhile, random variables {φ_m | m = 1, 2, · · · , 10}, which range over (−π, π], are utilized to simulate the inconsistency inherent in the periodic impacts due to a variety of factors such as slip, varying load angle, and the transition path of engineering mechanical systems. The sampling frequency is 2048 Hz. In this simulation experiment, 10 simulation signals are constructed for the classification. One of the time domain signals is shown in Figure 7a, and its frequency domain representation is shown in Figure 7b.
Figure 8b is the corresponding frequency domain of Figure 8a. As can be seen in Figure 8, the main energy of the signal is located in the four relatively high frequency components. In the vibration measurement, each sampling record contains 2048 discretized sampling points. That is, the duration of each record is 1 second. For each fault class, 1200 records are used for model training and 240 records are used for performance testing. The network used in this simulation experiment is shown in Figure 2.
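A simulated record of this kind could be generated roughly as follows. Because the governing equation is not reproduced in the text, several points are assumptions of this sketch: the impulses are summed, each impulse is zero before its onset m × T_h, T_h is taken in seconds, the amplitude defaults to 1, "4 dB" is interpreted as the signal-to-noise ratio, and phases are drawn from [−π, π). The function name and its arguments are hypothetical.

```python
import numpy as np

def simulate_condition(i, amplitude=1.0, fs=2048, duration=1.0,
                       period=0.1025, n_impulses=10, snr_db=4.0, seed=0):
    """Sketch of one simulated record for condition C_i (i = 1..10)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * duration)) / fs
    beta = 90 + 0.05 * (i - 1)          # damping characteristic from the text
    clean = np.zeros_like(t)
    for m in range(1, n_impulses + 1):
        onset = m * period
        phi = rng.uniform(-np.pi, np.pi)  # random phase of the m-th impulse
        active = t >= onset               # assumption: impulse starts at onset
        decay = np.exp(-beta * (t[active] - onset))
        clean[active] += decay * np.sin(2 * np.pi * 512 * t[active] + phi)
    clean *= amplitude
    # Assumption: noise power is set so that the SNR equals snr_db.
    noise_power = np.mean(clean ** 2) / 10 ** (snr_db / 10)
    noise = rng.normal(0.0, np.sqrt(noise_power), t.size)
    return t, clean + noise
```

Calling `simulate_condition(1)` returns a one-second record of 2048 samples; changing `i` only alters the damping term β slightly, which is why the ten conditions are hard to separate by eye.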
There are 32 kernels in convolutional Layer #1, and the size of each kernel is set as 3 × 3. Following the convolutional layer, there is a ReLU activation layer. After that, there is an additional layer that drops ten percent of the nodes in order to prevent over-fitting. In Layer #2, there are 10 feature maps, and the layer again includes convolution, activation, and dropout layers. The configuration of Layer #2 is set similarly to that of Layer #1, except that the kernel size of Layer #2 is chosen as 2 × 2. There is also a fully connected layer whose output dimension is equal to the number of fault classes, 10. In the output layer, softmax activation is chosen for the classification to represent the categorical distribution, and the Adam optimizer is used to minimize the categorical cross entropy.
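The size of the two convolutional layers can be checked with simple arithmetic. The sketch below assumes a single-channel 8 × 2048 input map, one bias per kernel, stride 1, and no padding; none of these settings are stated in the text, and the helper names are hypothetical.

```python
def conv_params(n_kernels, kernel_h, kernel_w, in_channels):
    """Trainable parameters of a conv layer: weights plus one bias per kernel."""
    return n_kernels * (kernel_h * kernel_w * in_channels + 1)

def valid_output_shape(height, width, kernel_h, kernel_w):
    """Output map size for stride 1 and no padding (assumption)."""
    return height - kernel_h + 1, width - kernel_w + 1

# Layer #1: 32 kernels of size 3 x 3 on the single-channel 8 x 2048 map.
layer1_params = conv_params(32, 3, 3, in_channels=1)
shape1 = valid_output_shape(8, 2048, 3, 3)

# Layer #2: 10 feature maps with 2 x 2 kernels over the 32 maps of Layer #1.
layer2_params = conv_params(10, 2, 2, in_channels=32)
shape2 = valid_output_shape(*shape1, 2, 2)
```

Under these assumptions both convolutional layers are small (a few hundred to about a thousand parameters each), so most trainable weights would sit in the fully connected layer.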
A confusion matrix is an effective visualization tool to estimate the performance of a classification algorithm [32]. Each column of the confusion matrix represents the instances in a predicted class (output class), while each row represents the instances in an actual class (target class). Figure 9 presents the confusion matrix of the CNN model for 10 patterns, where Ci denotes the simulated condition in Equation (17). After 600 training epochs, the result shows a good classification effect. As can be seen in Figure 9, the trained CNN model achieves high prediction performance, with a 99.58% accuracy rate and a total error of 0.42%.
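The row and column convention of the confusion matrix can be made concrete with a small sketch. The function names are hypothetical and class labels are assumed to be integers starting at 0.

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes):
    """Rows are actual (target) classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

def accuracy(cm):
    """Fraction of records on the diagonal, i.e. correctly classified."""
    return np.trace(cm) / cm.sum()
```

As an illustration of the arithmetic only, 2390 correct predictions out of 2400 testing records (240 per class over 10 classes) would give an accuracy of 99.58% and an error of 0.42%; the actual number of misclassified records is not stated in the text.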
The above simulation result shows that the CNN model has adequate fault pattern recognition ability and exhibits good generalization ability. However, the proposed method has so far been applied only to the simulation signal; experiments on measured signals are indispensable to validate its actual performance.

Experiment and Data Acquisition
The data used to train the proposed algorithm in this paper are collected on a custom-built gearbox test rig; the structure sketch of the experimental set-up is shown in Figure 10. The set-up is composed of a speed controller, an alternating current (AC) servo motor, a cylindrical reduction gearbox, a load rotor, balance disk mass, and other auxiliary mechanisms. After the set-up is started, the speed controller is engaged so that the machine works at a constant speed. The load motor is used to provide mechanical loads. The loading force is similar to that of the actual working condition. There is a one-stage reduction gearbox in this experiment. The driving gear has 55 teeth, and the driven gear has 75 teeth. The faulty gear is used as the driven gear. Details about the gear pair are available in Table 1.
In this research, four different conditions were studied by exchanging the driven gear (the fault gear). Four typical gear conditions are simulated in the gearbox test bed: a normal condition, a tooth crack fault (shown in Figure 11a), a tooth break fault (shown in Figure 11b), and a weak tooth crack fault (shown in Figure 11c). The description of the four gearbox conditions is listed in Table 2.
The comprehensive fault diagnosis experimental platform is presented in Figure 12. The Sony EX data acquisition system is employed to acquire the fault signal data, and an LC0101T accelerometer is used to collect them. As can be seen in Figure 12, the measuring point is located on the lateral wall of the gearbox next to the fault gear. The vibration signals of the gearbox in all operational conditions in Table 2 are measured by the accelerometer and then stored by the data acquisition system, which is equipped with anti-alias filtering. The sampling frequency is set to 12,800 Hz.
The weak tooth crack fault signal over 0.5 s at a sampling rate fs = 12,800 Hz (shown in Figure 13) is composed of a periodic sequence of transients occurring at 13 Hz. The current rotating speed is approximately 780 rpm, and the test gear rotating speed is 572 rpm (9.53 Hz). As can be seen in Figure 14a, the potential fault modes are masked by noise and irrelevant interference in the time domain vibration signal: the periodic group sparse signals are buried in strong background noise. The corresponding Fourier spectrum is shown in Figure 14b. It can be observed from the figure that the energy of the signal is distributed over the whole frequency range. The frequency content is too complicated for the characteristic frequency component to be identified directly.
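The rotating speeds and frequencies quoted above can be checked from the tooth counts given earlier (55 teeth on the driving gear, 75 on the driven gear). The short sketch below assumes that the 780 rpm figure refers to the driving gear, which is consistent with the reported 572 rpm of the test gear; the function names are illustrative.

```python
def driven_speed_rpm(driving_rpm, driving_teeth, driven_teeth):
    """Speed of the driven gear in a single-stage gear pair."""
    return driving_rpm * driving_teeth / driven_teeth


def rpm_to_hz(rpm):
    """Convert revolutions per minute to rotation frequency in Hz."""
    return rpm / 60.0


driving_rpm = 780.0
test_gear_rpm = driven_speed_rpm(driving_rpm, driving_teeth=55, driven_teeth=75)

print(rpm_to_hz(driving_rpm))              # 13.0
print(test_gear_rpm)                       # 572.0
print(round(rpm_to_hz(test_gear_rpm), 2))  # 9.53
```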
In this research, the fault frequencies are generally lower than 512 Hz; therefore, a low-pass filter (1024 Hz) and a down-sampling operation are used to pre-process the signal and improve computational efficiency. The time domain signal and the Fourier spectrum after pre-processing are presented in Figure 14.
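The pre-processing described above can be sketched as follows. The text gives only the cut-off frequency (1024 Hz) and the original sampling rate (12,800 Hz), so the filter type (a Hamming-windowed sinc FIR filter), its length (101 taps), and the down-sampling factor (5, giving 2560 Hz) are assumptions made for illustration and may differ from what was actually used.

```python
import math

FS_HZ = 12800.0      # sampling frequency stated in the text
CUTOFF_HZ = 1024.0   # low-pass cut-off stated in the text
FACTOR = 5           # assumed down-sampling factor (not given in the text)


def lowpass_taps(cutoff_hz, fs_hz, num_taps=101):
    """Hamming-windowed sinc FIR low-pass filter, normalized to unit DC gain."""
    fc = cutoff_hz / fs_hz            # cut-off in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def filter_and_downsample(signal, taps, factor):
    """Filter the signal and keep every `factor`-th fully overlapped output."""
    n_taps = len(taps)
    out = []
    for start in range(0, len(signal) - n_taps + 1, factor):
        segment = signal[start:start + n_taps]
        out.append(sum(t * x for t, x in zip(taps, segment)))
    return out


taps = lowpass_taps(CUTOFF_HZ, FS_HZ)
# 0.5 s test tone at 100 Hz, well inside the pass band.
tone = [math.sin(2 * math.pi * 100.0 * n / FS_HZ) for n in range(6400)]
reduced = filter_and_downsample(tone, taps, FACTOR)
print(len(tone), len(reduced), FS_HZ / FACTOR)
```

Computing only every fifth filter output avoids evaluating samples that would be discarded, which is where the efficiency gain of combining the two steps comes from.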

DTCWT Decomposition and Normalization
After applying DTCWT to the time domain signal, the decomposition signals of the wavelet sub-spaces and the approximation sub-space are displayed in the zoom-in plots of Figure 15 (1024 points), where the x axis indicates the sub-space signal (1 is the lowest frequency component, 8 is the highest frequency component), the y axis is the time axis, and the z axis is the corresponding physical quantity (amplitude in Figure 15a and energy in Figure 15b). A seven-stage DTCWT decomposition was performed on the acquired signal, giving seven wavelet sub-spaces and one approximation sub-space. As mentioned before, the fault features of the raw signal are submerged in overwhelming interfering contents; the fault symptoms are easier to identify in the multiscale signal sub-spaces.
Since each of the eight decomposition sub-bands generated by DTCWT can be regarded as a lower dimensional subset of a 2D signal, the one-dimensional (1D) time domain signal can be used to construct a higher dimensional signal. As shown in Figure 2, two-dimensional data are constructed by concatenating the decompositions along the vertical dimension.
Different decomposition sub-bands may differ considerably in value owing to their different energy distributions. Therefore, it is necessary to adjust the measured values to a uniform scale. In this paper, feature scaling is used to limit all values to the range [0, 1]. The feature scaling step is defined as
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
where X is the original signal, X' is the new signal after normalization, and X_min and X_max are the minimum and maximum values of X.
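The construction of the normalized two-dimensional input can be sketched as follows, assuming the standard min-max form of feature scaling and assuming that each sub-band is scaled separately (the text does not say whether the scaling is applied per sub-band or to the whole patch). The sub-band values below are toy numbers, not DTCWT outputs, and the function names are illustrative.

```python
def min_max_scale(values):
    """Scale a sequence linearly so that its values lie in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant sequence carries no range information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_normalized_patch(sub_bands):
    """Stack equal-length sub-band signals as rows and scale each row."""
    lengths = {len(band) for band in sub_bands}
    if len(lengths) != 1:
        raise ValueError("all sub-bands must have the same length")
    return [min_max_scale(band) for band in sub_bands]


# Toy stand-ins for eight sub-band signals with very different amplitudes.
toy_sub_bands = [[(row + 1) * 10.0 * s for s in (-1.0, 0.5, 0.0, 2.0)]
                 for row in range(8)]
patch = build_normalized_patch(toy_sub_bands)
print(len(patch), len(patch[0]))  # 8 4
print(patch[0])
```

After scaling, every row spans the same range regardless of the energy of its sub-band, which is the purpose of the normalization step.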

CNN Training
The vibration signals were collected from the test rig described in the previous section under four different operation conditions: (1) normal condition; (2) tooth crack fault; (3) tooth break fault; and (4) weak tooth crack fault. All of the raw vibration signals were collected at a uniform sampling frequency of 12,800 Hz. In the experiment, 630 records of vibration signals were collected for each condition. Therefore, the dataset contains a total of 2520 records. Among the 2520 records, 480 are randomly selected as the testing dataset and the others are used as the training dataset.
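The random split described above can be sketched as follows. The record identifiers, the fixed seed, and the function name are placeholders for illustration; the actual selection procedure is not detailed in the text.

```python
import random

CONDITIONS = ["normal", "tooth crack", "tooth break", "weak tooth crack"]
RECORDS_PER_CONDITION = 630
NUM_TEST = 480


def split_records(seed=0):
    """Shuffle all (condition, index) records and split off the test set."""
    records = [(cond, i) for cond in CONDITIONS
               for i in range(RECORDS_PER_CONDITION)]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(records)
    return records[NUM_TEST:], records[:NUM_TEST]


train_set, test_set = split_records()
print(len(train_set), len(test_set))  # 2040 480
```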
The network used in this paper is shown in Figure 2. The input of the network for each signal is a normalized patch, which is convolved by a series of two convolutional layers. The size of the kernels in the first layer was chosen to be 3 × 3. Following the first convolutional layer, there is an activation layer to increase the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer. Layer #2 is also a convolutional layer, with 10 feature maps and a kernel size of 2 × 2. After a convolutional and a max pooling layer, the high-level reasoning in the neural network is done via fully connected layers with full connections to all activations in the previous layer.
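As a rough illustration of how the two kernel sizes above change the size of the feature maps, the sketch below propagates an input shape through both convolutions. The input size of 8 × 1024 (eight sub-bands by 1024 points), a stride of 1, and the absence of padding are all assumptions; the text does not state them, and the max pooling layer is not modeled because its size is not given.

```python
def conv_output_shape(height, width, kernel_h, kernel_w):
    """Output size of a stride-1 convolution without padding."""
    if kernel_h > height or kernel_w > width:
        raise ValueError("kernel larger than input")
    return height - kernel_h + 1, width - kernel_w + 1


input_shape = (8, 1024)                            # hypothetical patch size
after_layer1 = conv_output_shape(*input_shape, 3, 3)
after_layer2 = conv_output_shape(*after_layer1, 2, 2)

print(after_layer1)  # (6, 1022)
print(after_layer2)  # (5, 1021)
```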
After a dropout of 10 percent of the network nodes, there is a final fully connected layer. The output dimension of this layer equals the number of fault types tested in the experiment. In the final output layer, softmax activation is chosen for the classification to represent the categorical distribution.
In this study, we adopt the Adam optimizer [41] to minimize the categorical cross entropy. The cross entropy represents the dissimilarity of the approximated output distribution (after softmax) from the true distribution of labels. Adam is a first-order gradient-based algorithm, designed for the optimization of stochastic objective functions with adaptive weight updates based on lower-order moments.
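The interplay of softmax, categorical cross entropy, and Adam can be illustrated on a toy problem. The sketch below applies Adam directly to four output logits for a single labeled sample rather than to network weights, so it is not the training procedure of the paper. The moment decay rates and epsilon are the commonly used Adam defaults and are assumed here; the learning rate of 0.003 is borrowed from the configuration reported later in this section.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot):
    """Categorical cross entropy between predictions and a one-hot label."""
    return -sum(y * math.log(p) for p, y in zip(probs, one_hot) if y > 0)


def adam_minimize(logits, one_hot, lr=0.003, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=200):
    """Run Adam on the logits; d(loss)/d(logit) = softmax(logit) - label."""
    z = list(logits)
    m = [0.0] * len(z)   # first-moment (mean) estimates
    v = [0.0] * len(z)   # second-moment (uncentered variance) estimates
    for t in range(1, steps + 1):
        grad = [p - y for p, y in zip(softmax(z), one_hot)]
        for i, g in enumerate(grad):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)   # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            z[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return z


label = [0.0, 0.0, 1.0, 0.0]            # one-hot label for one of four classes
start = [0.0, 0.0, 0.0, 0.0]            # uniform prediction before training
loss_before = cross_entropy(softmax(start), label)
trained = adam_minimize(start, label)
loss_after = cross_entropy(softmax(trained), label)
print(loss_before, loss_after)
```

The loss starts at ln 4, the cross entropy of a uniform guess over four classes, and decreases as the logit of the labeled class grows relative to the others.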
Batch size and learning rate are two important parameters for the algorithm's performance. Batch size defines the number of samples that are propagated through the network in one pass. Learning rate determines how quickly the network weights change. Proper parameters can optimize the network training process and reach the best accuracy rate. In this research, experiments with different configurations were carried out to find the best performance. All of the experiments were performed under a Linux OS on a machine with an Intel Core i5-4460 CPU @ 3.2 GHz. The performance for different configurations of the network's architecture is presented in Table 3. As can be seen in the table, the F score reaches 0.9980. Therefore, we propose to use a batch size of 60, a learning rate of 0.003 (red rectangle box in Table 3), and 50 epochs.
The performance curves during the training of the established model are shown in Figure 16. In Figure 16a, the red solid descending curve corresponds to the loss function values for the training sets and for the testing sets during training. In Figure 16b, the blue solid ascending curve shows the change in accuracy rate during the training process. The results show that the loss function value stabilizes after 30 epochs, and the accuracy rate stabilizes after 20 epochs at almost 0.998.
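To relate the chosen batch size to the amount of training, the sketch below counts weight updates, assuming the training set consists of the 2040 records left after the 480 testing records are removed from the 2520 collected, and that every batch triggers one update. The function name is illustrative.

```python
import math


def updates_per_epoch(num_records, batch_size):
    """Number of batches (and weight updates) needed to cover all records."""
    return math.ceil(num_records / batch_size)


num_train = 2520 - 480
per_epoch = updates_per_epoch(num_train, batch_size=60)
total_updates = per_epoch * 50   # 50 epochs, as proposed in the text

print(per_epoch)      # 34
print(total_updates)  # 1700
```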

Experiment Results
The confusion matrix of the gear fault diagnosis experiment is shown in Figure 17, where the labels of the four gearbox conditions are defined in Table 2. As can be seen in the figure, the trained model generalizes well, with only one misclassification among the 480 testing records. The accuracy rate of the proposed method is therefore 99.79%. This result implies that the proposed classification method is valid not only for the simulation signal but also for actual gear fault diagnosis.
Performance comparisons among different methods are displayed in Table 4, where the second column gives the accuracy rates reported in the corresponding literature and the last column gives the accuracy rates tested on the gear fault diagnosis experiment of this research. Figure 18 compares the tested accuracy rates for the gear fault diagnosis experiment. Compared with the methods mentioned above, the proposed method obtains a higher accuracy, which indicates that the combination of DTCWT and CNN is suitable for gear fault diagnosis.
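The accuracy of 99.79% quoted for the experiment follows from 479 correct predictions out of 480. The sketch below computes it from a confusion matrix with rows as actual classes and columns as predicted classes. The matrix is a hypothetical stand-in for Figure 17: the per-class counts of 120 and the position of the single misclassification are placeholders, since neither is given in the text.

```python
def accuracy_from_confusion(matrix):
    """Overall accuracy: correctly classified records over all records."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical 4 x 4 confusion matrix with one misclassified record.
hypothetical_matrix = [
    [120, 0, 0, 0],
    [0, 120, 0, 0],
    [0, 0, 120, 0],
    [0, 1, 0, 119],
]

accuracy = accuracy_from_confusion(hypothetical_matrix)
print(f"{100 * accuracy:.2f}%")  # 99.79%
```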

Conclusions
In this paper, we propose an intelligent fault diagnosis method using a wavelet enhanced CNN for gear fault pattern recognition, in order to improve recognition accuracy and calculation efficiency. DTCWT is employed to implement multiscale decompositions on gearbox vibration signals. The different wavelet sub-band signals are used to construct the high dimension signals. After normalization, the high dimension signal is used to train and validate the established model. The major findings of this work can be summarized as follows: (1) A wavelet enhanced CNN is verified to be an effective method to recognize the fault type in mechanical systems. Compared with the traditional CM-FD method, the proposed method is less dependent on prior knowledge and on human diagnosticians. (2) Different configurations and parameters of the network's architecture are also studied in this paper (Table 3). An optimized configuration and parameters were identified during the network training process.
The proposed diagnosis method for gearbox applications can also be extended to other rotating mechanical systems. In the future, it is worthwhile to investigate its application to more complicated mechanical fault pattern recognition problems in a completely unsupervised manner. Meanwhile, additional advanced signal processing approaches using some a priori knowledge may enhance its applicability and enable a more quantitative analysis.