Construction of a Sensitive and Speed Invariant Gearbox Fault Diagnosis Model Using an Incorporated Utilizing Adaptive Noise Control and a Stacked Sparse Autoencoder-Based Deep Neural Network

Gearbox fault diagnosis based on the analysis of vibration signals has been a major research topic for a few decades due to the advantages of vibration characteristics. Such characteristics are used for early fault detection to guarantee the enhanced safety of complex systems and their cost-effective operation. Many fault diagnosis models have been developed for classifying various fault types in gearboxes. However, the classification results of conventional fault classification models degrade when they are applied to gearbox systems with multi-level tooth cut gear (MTCG) faults operating under variable shaft speeds, because these conditions make it difficult to discriminate the gear fault types. Owing to the improved computational capabilities of modern systems, deep neural networks (DNNs) are becoming popular in a variety of research fields, such as image and natural language processing. DNNs are capable of improving classification results even when addressing complex problems such as diagnosing gearbox MTCG faults. In this research, adaptive noise control (ANC) and a stacked sparse autoencoder-based deep neural network (SSA-DNN) are used to construct a sensitive fault diagnosis model that can diagnose a gearbox system with MTCG fault types under varying shaft rotation speeds, despite the complexity of this task. The ANC is applied to the gear vibration characteristics to remove a significant level of noise along the frequency spectrum of the vibration signals and to capture the most fault-informative components of each fault case. Next, the autoencoder learns the characteristic features of the gear faults from these fault-informative components to separate the fault types considered in this study. Furthermore, the SSA-DNN replaces the separate feature extraction, feature selection, and classification processes of traditional fault diagnosis schemes with a single high-performance unit.
The experimental results show that the proposed model outperforms conventional methodologies with higher classification accuracy.


Introduction
Different types of gearboxes are used in various equipment such as vehicles, industrial machinery, and electrical generators. However, they are prone to defects due to harsh and continuous working conditions. Gear defects can damage the gearbox system and become a root cause of damage to the whole mechanical device, which may lead to serious economic losses and threats to personal safety. Hence, the condition monitoring of gearboxes is essential, and it would be beneficial if gear defects in gearboxes could be detected in the early stages. The general non-destructive method for condition monitoring of gearboxes is based on sensing the vibration characteristics, which contain the fault-related components [1]. The complex sideband frequencies are distributed around the meshing frequency and its harmonics, which are considered as intrinsic components in [...]

Sensors 2021, 21, 18

[...] that can efficiently determine tiny different components of non-stationary vibration signals of an MTCG in a gearbox system operating under varying speeds. The deep learning technique has emerged as an advantageous tool that has been applied in the fields of natural language processing, computer vision, image processing, and pattern recognition, and has succeeded in discriminating barely distinguishable components in categories through multiple non-linear transformations [24][25][26]. In other words, deep neural networks (DNNs) are suitable for use in the construction of sensitive and non-linear models. Instead of manually extracting the features and selecting the most separable ones, DNNs can be efficiently used for unsupervised hierarchical feature extraction and feature learning [27]. Thus, this study employs a stacked sparse autoencoder (SSA)-based DNN for identifying the fault types of an MTCG gearbox system based on the vibration signals with reduced noise components delivered by the ANC module.
The major contributions of this study are summarized as follows: (1) an adaptive noise control approach is designed for de-noising and preserving fault-related elements of raw vibration signals to obtain the optimized subbands on its outputs which mostly contain the essential informative components of vibration signals, and (2) the SSA-DNN utilizes the optimized subbands for identifying the MTCG defect types. The efficiency of the proposed model is evaluated by applying it to the vibration dataset collected from the MTCG gearbox that contains signals collected under six levels of tooth cut fault, such as 6.6%, 10%, 20%, 30%, 40%, and 50% cut as well as signals collected under normal operating conditions. The experimental dataset was collected under variable shaft rotating speeds, such as 300 RPM, 600 RPM, 900 RPM, and 1200 RPM, respectively. The results demonstrate the improved fault classification performance in comparison with the existing models.
The rest of this paper is organized as follows. Section 2 presents a gearbox experimental dataset along with the characteristics of vibration for normal and defective gears. The details of the proposed method are provided in Section 3. Section 4 describes the experiment configuration and the process of parameter tuning for the proposed network. Section 5 presents the results and discussion, and Section 6 contains the concluding remarks.

Figure 1 shows the experimental setup used for exploring the vibration characteristics of the MTCG gearbox system. A three-phase AC induction motor is connected to a pinion wheel through a drive shaft (DS), and a set of adjustable blades is mounted on a non-drive shaft (NDS), the other end of which is connected to a gear wheel. The numbers of teeth on the pinion wheel and the gear wheel are equal to 25 (N p = 25) and 38 (N g = 38), respectively. The length of each tooth is equal to 9 mm. The torque generated by the AC motor is transferred to the adjustable blades through the gearbox with a gear ratio of 25:38 (1:1.52). The multi-level tooth cut faults were seeded in one tooth of the gear wheel by cutting a percentage of the tooth length, as depicted in Figure 2. The MTCG fault types comprise a normal gear with no seeded fault (N) and tooth cut seeded gear defects of 6.6% (D1), 10% (D2), 20% (D3), 30% (D4), 40% (D5), and 50% (D6). For measuring the vibration characteristics of an MTCG gearbox in the normal and defective cases, the vibration sensor (an accelerometer 622B01 of IMI Sensor company) was installed at the end of the NDS, 72.5 mm from the gear wheel. In addition, the shaft rotation speeds are monitored by using a displacement transducer (a speed sensor) to track the seeded hole in the DS once per rotation.
The output signal from the vibration sensor was digitized using a PCI-based data acquisition board with a sampling frequency of 65,536 Hz continuously for one second. The data collection process was repeated 200 times to obtain 200 samples of 1-s length for each gear defect state (seven states) under each shaft rotation speed. Therefore, the total number of observed samples is 5600, each of one second duration. The detailed description of the MTCG gearbox dataset is given in Table 1.
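The sample counts quoted above follow directly from the acquisition settings. The short sketch below only restates that arithmetic; the constant names are illustrative and do not come from the original work.

```python
# Bookkeeping for the MTCG gearbox dataset as described in the text.
SAMPLING_RATE_HZ = 65_536      # DAQ sampling frequency
SAMPLE_DURATION_S = 1          # each recording lasts one second
GEAR_STATES = ["N", "D1", "D2", "D3", "D4", "D5", "D6"]  # normal + six cut levels
SHAFT_SPEEDS_RPM = [300, 600, 900, 1200]
REPEATS_PER_CONDITION = 200    # recordings per (state, speed) pair

points_per_sample = SAMPLING_RATE_HZ * SAMPLE_DURATION_S
samples_per_speed = REPEATS_PER_CONDITION * len(GEAR_STATES)
total_samples = samples_per_speed * len(SHAFT_SPEEDS_RPM)

print(points_per_sample, samples_per_speed, total_samples)
```

Each speed therefore contributes 1400 samples, and the four speeds together give the 5600 samples stated in the text.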

The Vibration Characteristics of the Gearbox System
The categories of gear defects can generally be split into three types: manufacturing defects (wheel eccentricity, tooth profile defects, etc.), installation defects (parallelism), and defects caused by long-term operation (cracked tooth, spalled tooth, case wear, tooth wear, etc.). In this work, the MTCG defects were created to simulate operational defects as multiple depth levels of a tooth cut seeded in the gear wheel of the gearbox system. The vibration characteristics of a gearbox system are analyzed for a healthy (defect-free) gear and a defective gear to identify the informative fault-related components in the vibration signal. The vibration signal of a defect-free gear is a linear and periodic signal that is calculated using the following formula [28]:

$$y_n(t) = \sum_{k=1}^{K} Y_k \cos\left(2\pi k f_h t + \partial_k\right) \tag{1}$$

where $y_n(t)$ is the vibration signal of a healthy gear; $K$ is the total number of meshing frequency harmonics in the observed frequency spectrum of the vibration signal; $Y_k$ and $\partial_k$ are the amplitude and phase of the $k$-th meshing frequency harmonic ($k = 1, \ldots, K$); and $f_h$ stands for the meshing frequency, which can be calculated using the parameters of the gear wheel ($f_h = f_g N_g$, where $f_g$ is the gear wheel rotation speed and $N_g$ is the number of gear teeth) or the parameters of the pinion wheel ($f_h = f_p N_p$, where $f_p$ is the pinion wheel rotation speed and $N_p$ is the number of pinion teeth). Figure 3a illustrates an example of a frequency spectrum in which the informative components appear as meshing frequency tones in the spectrum of the vibration signal of a defect-free gearbox. Compared to the vibration signal of a normal gear, the signal of a defective gear is more complex due to the occurrence of impulsive vibrations when the motion is transferred from the DS to the NDS by rotating the pinion wheel through the gear wheel at the defective tooth position during one rotation cycle.
Those periodic impulsive vibrations create a non-linear and non-stationary vibration signal that, from a signal processing point of view, takes the form of an amplitude- and phase-modulated signal [3]. The faulty gear vibration signal can be formulated [29] by Equation (2), and an example demonstrating the fault-related informative components is shown in Figure 3b:

$$y_d(t) = \sum_{k=1}^{K} Y_k \left[1 + \sigma_k(t)\right] \cos\left(2\pi k f_h t + \partial_k + \psi_k(t)\right) \tag{2}$$

Here, $\sigma_k(t) = \sum_{i=0}^{N} \Theta_{ki} \cos\left(2\pi i f_g t + \Omega_{ki}\right)$ and $\psi_k(t) = \sum_{i=0}^{N} \Psi_{ki} \cos\left(2\pi i f_g t + \xi_{ki}\right)$ are the amplitude and phase modulating components of the faulty gear vibration signal $y_d(t)$; $\Theta_{ki}$, $\Psi_{ki}$ are the amplitudes and $\Omega_{ki}$, $\xi_{ki}$ are the phases of the $i$-th sideband around the $k$-th order meshing frequency tone of the vibration signal $y_d(t)$.
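To make the two signal models more concrete, the following minimal sketch synthesizes a healthy signal as a sum of meshing-frequency harmonics and a faulty signal as its amplitude- and phase-modulated counterpart. It rests on several assumptions that are not stated in the source: the quoted RPM is taken to be the pinion (drive shaft) speed, every sideband shares one hypothetical modulation depth (`am_depth`, `pm_depth`), and all sideband phases are set to zero.

```python
import math

N_PINION, N_GEAR = 25, 38  # tooth counts from the experimental setup


def meshing_frequency(pinion_rpm):
    """f_h = f_p * N_p, with f_p the pinion rotation frequency in Hz."""
    return pinion_rpm / 60.0 * N_PINION


def gear_frequency(pinion_rpm):
    """f_g = f_h / N_g, the rotation frequency of the gear wheel in Hz."""
    return meshing_frequency(pinion_rpm) / N_GEAR


def healthy_signal(t, f_h, amplitudes, phases):
    """Sum of meshing-frequency harmonics k = 1..K at time t."""
    return sum(
        y_k * math.cos(2 * math.pi * k * f_h * t + phase_k)
        for k, (y_k, phase_k) in enumerate(zip(amplitudes, phases), start=1)
    )


def faulty_signal(t, f_h, f_g, amplitudes, phases, am_depth, pm_depth, n_sidebands):
    """Harmonics modulated in amplitude and phase at multiples of f_g.

    am_depth and pm_depth are illustrative: one shared depth per sideband.
    """
    total = 0.0
    for k, (y_k, phase_k) in enumerate(zip(amplitudes, phases), start=1):
        sigma = sum(am_depth * math.cos(2 * math.pi * i * f_g * t)
                    for i in range(n_sidebands + 1))
        psi = sum(pm_depth * math.cos(2 * math.pi * i * f_g * t)
                  for i in range(n_sidebands + 1))
        total += y_k * (1.0 + sigma) * math.cos(
            2 * math.pi * k * f_h * t + phase_k + psi)
    return total


f_h = meshing_frequency(300)
f_g = gear_frequency(300)
print(f_h, round(f_g, 3))
```

Under the stated assumption, a 300 RPM pinion gives a meshing frequency of 125 Hz, and the sidebands of a gear-wheel fault are spaced by the gear rotation frequency of roughly 3.29 Hz.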

The Incorporated Construction Model of the ANC and the SSA-DNN
The proposed sensitive and speed invariant model for diagnosing gearbox faults is presented in Figure 4. Three major function blocks are utilized in this model: the data collection system (sensors and DAQ), the ANC, and the SSA-DNN. The data collection system collects the vibration dataset of an MTCG gearbox system for each fault type (seven fault types in total) under variable shaft rotation speeds. It collects the vibration data samples and captures the gear defect behaviors in the vibration characteristics: each vibration sample is acquired over one second so that several complete rotation cycles of the defective gear are monitored. The ANC module then processes the raw vibration signals. First, it down-samples the signal by a factor of three and filters it with a low-pass filter to obtain the vibration subbands within the frequency range from 0 to 10 kHz, according to the real operating frequency range of the vibration sensor [18]. The multi-level gear defect types are expressed in the vibration characteristics through the magnitudes of the principal frequency tones. Therefore, the main function of the ANC is to optimize the vibration subbands by removing the redundant components along with noise while preserving the original fault-related components. The output of the ANC provides the optimized subband in the frequency domain (power spectrum density), which mostly contains the meshing frequency, its harmonics, and their distributed sideband gear frequency tones (i.e., the defect-related informative components). Under variable speed conditions, the positions of the principal frequency tones change as explained in Section 2. There exist components that represent the MTCG defects in a speed invariant way, in the form of latent features related to the ratios and proportions of the amplitudes and displacements in the optimized vibration subbands; these features are difficult to extract with traditional methodologies [30].
Nevertheless, owing to the unsupervised learning and hierarchical feature extraction of a deep neural architecture (DNA), the SSA-DNN can overcome this issue and automatically explore the most defect-relevant features from the components in the frequency spectra of the optimized subbands output by the ANC. Using these extracted features, the SSA-DNN identifies the defect types of an MTCG gearbox system and achieves a high classification result in the output layer.

Adaptive Noise Control (ANC)
ANC is a signal processing method used for reducing noise and preserving the fault-related informative elements in gearbox vibration characteristics. The ANC approach is a self-constructed and time-varying system that uses a recursive algorithm to optimize its parameters and obtain the desired optimized signal at its output [31]. A general ANC consists of a digital filter, an adaptive algorithm, and a reference signal generator. The adaptive algorithm updates the coefficients of the digital filter based on the feedback error signal between the filtered reference and the input signal to obtain the optimized denoised subband signal at the output of the ANC [32]. In this study, the ANC employs the adaptive noise reducer-based Gaussian reference signal (ANR-GRS), which has been elaborated in [18], for reducing noise and optimizing gearbox vibration signals. An adaptive noise control scheme contains two inputs (the desired input and a reference input) and one output. In this study, the vibration subband is used as the desired input carrying the observed signal, while the reference input carries a signal that imitates the parasitic noise in the observed signal. The function of the ANC approach can be described in detail by the following processes [18]:

1. Generating the reference signal to supply to the reference input of an ANC: Mainly, there are two types of noise present in the vibration signal: white noise and band noise. Hence, the reference signal generator creates output signals whose behavior resembles these noise types, namely Gaussian signals and white noise signals, as illustrated in Figure 5. The parameters of a Gaussian signal (a mean and a standard deviation value) can be adjusted based on the input variable of the shaft rotation speed.
The adjustable Gaussian window, a component for building the entire Gaussian signal, is drawn to adapt to the frequency space between two consecutive sideband gear frequencies, formulated as follows: where the adjustable parameters (mean value F o and standard deviation value σ) are functions of the shaft rotation frequency [18]. Concretely, F o is proportional to the frequency of faulty wheel ( f DG ) and can be computed as below: and by linearizing the Gaussian function, the standard deviation is approximated to the mean value as:  Also, the number of sideband segments N t is calculated using the known parameters such as the number of samples N s , sampling frequency F s , and fault wheel frequency. The formulation of sideband segments is presented below: where the frequency of a faulty wheel ( f DG ) is represented as a gear frequency ( f g ) which is defined in Section 2. Therefore, by adjusting the ratio coefficient ε, the Gaussian window can access the space between two consecutive sideband frequencies in the frequency spectrum of a vibration signal to reduce the presence of noise. According to specific conditions defined in [18], first, the coefficient ε is selected from the range of [0.25 0.75], and then, the Gaussian windows are created with the parameters chosen as shown below: (1) the mean value F o is assigned to be in the range: (2) the standard deviation of the Gaussian windows is selected in the following range: By limiting the adjusting values of the coefficient ε, each generated Gaussian window is positioned completely inside the area between two consecutive sideband frequencies during the optimization processes in the next steps. This ensures that the adaptive noise control technique performs reducing band-noise significantly whereas originally preserving the fault-related informative components as meshing frequencies, its harmonics, and sideband frequencies [18].
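The placement of the Gaussian windows between consecutive sidebands can be illustrated with a small sketch. It is not the formulation of [18]: it assumes that the window centre is `epsilon * f_dg` inside each sideband gap, and it uses a hypothetical `sigma_ratio` parameter for the window width, since the exact expressions for the mean and the standard deviation are not reproduced here.

```python
import math


def gaussian_window(freq, centre, sigma):
    """Unnormalised Gaussian window evaluated at one frequency."""
    return math.exp(-((freq - centre) ** 2) / (2.0 * sigma ** 2))


def gaussian_reference_spectrum(f_dg, epsilon, n_segments, df, sigma_ratio=0.1):
    """One Gaussian window per gap between consecutive sideband tones.

    f_dg        : rotation frequency of the faulty wheel (sideband spacing), Hz
    epsilon     : ratio coefficient, restricted to [0.25, 0.75] as in the text
    n_segments  : number of sideband gaps to cover
    df          : frequency resolution of the grid, Hz
    sigma_ratio : hypothetical width parameter (sigma = sigma_ratio * f_dg)
    """
    if not 0.25 <= epsilon <= 0.75:
        raise ValueError("epsilon must lie in [0.25, 0.75]")
    sigma = sigma_ratio * f_dg
    n_bins = int(round(n_segments * f_dg / df))
    freqs = [i * df for i in range(n_bins)]
    # The offset inside the current gap decides where the window peaks.
    spectrum = [gaussian_window(f % f_dg, epsilon * f_dg, sigma) for f in freqs]
    return freqs, spectrum


freqs, spectrum = gaussian_reference_spectrum(f_dg=4.0, epsilon=0.5,
                                              n_segments=3, df=0.5)
```

In this toy grid the reference spectrum peaks midway between the sideband positions at 0, 4, and 8 Hz and is close to zero at the sidebands themselves, which is the behaviour the restriction on ε is meant to guarantee.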

2. The construction of an adaptive filter: The adaptive filter is formed by combining an N-tap FIR digital filter (with the coefficient vector $c(n) \equiv [c_0, c_1, \ldots, c_{N-1}]^T$) and a least mean square (LMS) adaptive algorithm. The reference signals are used as the input to the digital filter, and its output signals are summed with the vibration subbands to calculate the output error signals. Based on this error, the LMS adaptive algorithm tunes the coefficient vectors according to the convergence criterion of the least mean square error to determine the optimal coefficient vector ($c_0$) and then identify the local optimal subbands. The operation of an adaptive filter is functionally described in Figure 6.
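The interplay between the FIR filter and the LMS update can be sketched as follows. This is a textbook LMS noise canceller under simplifying assumptions, not the ANR-GRS implementation of [18]: the step size `mu`, the tap count, and the toy signals (a sine tone buried in noise, with the noise itself available as the reference) are all illustrative choices.

```python
import math
import random


def lms_noise_canceller(desired, reference, n_taps=8, mu=0.005):
    """Adapt an n_taps FIR filter so that its output tracks the noise in `desired`.

    Returns the error signal (the denoised output) and the final coefficients.
    """
    coeffs = [0.0] * n_taps
    errors = []
    for n in range(len(desired)):
        # Most recent reference samples, zero-padded at the start.
        x = [reference[n - i] if n - i >= 0 else 0.0 for i in range(n_taps)]
        y = sum(c * xi for c, xi in zip(coeffs, x))      # filter output
        e = desired[n] - y                               # output error
        coeffs = [c + 2.0 * mu * e * xi for c, xi in zip(coeffs, x)]
        errors.append(e)
    return errors, coeffs


def mean_square(values):
    return sum(v * v for v in values) / len(values)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
tone = [0.5 * math.sin(2 * math.pi * 0.05 * n) for n in range(4000)]
desired = [s + 0.8 * v for s, v in zip(tone, noise)]

errors, coeffs = lms_noise_canceller(desired, noise)
print(mean_square(desired[-1000:]), mean_square(errors[-1000:]))
```

After convergence the first coefficient approaches the gain with which the noise enters the observed signal, so the error signal retains mostly the tone. In the setting described above, the reference would instead be the generated Gaussian signal and the desired input would be a vibration subband.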

3. The optimization process for selecting the optimal vibration subband: Each vibration subband, processed by an adaptive filter with a parameter-adjustable Gaussian reference signal as the reference input, results in many subbands at its output (termed local optimal subbands), each corresponding to a set of specific parameter values and the appropriate optimal coefficient vector. At this step, the ANC selects the subband with the minimum mean squared value as the output of the optimization process (termed the optimized subband), as illustrated in Figure 6. This optimized output subband is the final output of the ANC module and contains mostly the fault-related informative components along with trivial disturbances or redundant components.
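The selection rule in the last step amounts to a minimum search over the candidate outputs. The sketch below assumes the local optimal subbands are held in a dictionary keyed by the reference-signal parameters that produced them; that data layout and the function names are hypothetical.

```python
def mean_squared_value(subband):
    """Mean of the squared samples of one subband."""
    return sum(v * v for v in subband) / len(subband)


def select_optimized_subband(local_optimal_subbands):
    """Return the parameters and samples of the subband with the smallest mean square.

    local_optimal_subbands maps a parameter tuple (for example (epsilon, sigma))
    to the filtered subband obtained with those parameters.
    """
    best_params = min(
        local_optimal_subbands,
        key=lambda params: mean_squared_value(local_optimal_subbands[params]),
    )
    return best_params, local_optimal_subbands[best_params]


candidates = {
    (0.25, 0.1): [0.9, -1.1, 1.0, -0.8],
    (0.50, 0.1): [0.4, -0.5, 0.45, -0.35],
    (0.75, 0.1): [0.7, -0.6, 0.8, -0.75],
}
best_params, optimized_subband = select_optimized_subband(candidates)
print(best_params)
```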
In fact, the signal portions that reflect the gear states (the meshing frequency, the meshing frequency harmonics, and the gear sideband frequencies) are represented mostly in the frequency domain as magnitudes, tone amplitudes, oscillations, frequencies, and the ratios between them. Thus, it is suitable to use the frequency spectrum of the optimized subband as the input data to the SSA-DNN so that the deep network can explore and automatically extract the defect characteristic features from its inputs. Additionally, using the frequency spectrum of the vibration signal reduces the complexity of the DNN. Therefore, in this paper, the frequency spectrum of the optimized vibration subband calculated by the Fourier transform [33] is used as the input of the SSA-DNN module. The spectrum of the optimized subband ranges from 0 to 10 kHz due to the down-sampling of the raw one-second vibration samples.
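As a small illustration of turning a time-domain subband into the one-sided spectrum that feeds the network, the sketch below uses a naive discrete Fourier transform from the standard library. It is meant for short toy signals only; for the signal lengths discussed in this paper an FFT routine would be used in practice, and the amplitude scaling by the signal length is an assumed convention.

```python
import cmath
import math


def one_sided_magnitude_spectrum(samples):
    """Magnitudes of the non-negative-frequency half of the DFT, scaled by 1/n."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2):
        acc = sum(
            samples[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n)
        )
        spectrum.append(abs(acc) / n)
    return spectrum


# Toy signal: one second sampled at 64 Hz containing a 5 Hz tone.
fs = 64
signal = [math.cos(2 * math.pi * 5 * m / fs) for m in range(fs)]
spectrum = one_sided_magnitude_spectrum(signal)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(peak_bin)
```

With a one-second record the bin index equals the frequency in hertz, so the peak appears at bin 5.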

Stacked Autoencoder
A stacked autoencoder is a type of DNN, with more than one hidden layer, formed by stacking simple autoencoders for feature discrimination and classification. To understand the concept of a stacked autoencoder, a simple autoencoder should be discussed first. It is an unsupervised DNN based on a three-layer symmetrical architecture for learning a high-level representation of the data [34]. An autoencoder functions through two learning stages, encoding and decoding, as shown in Figure 7. In the encoding stage, it transforms the higher-dimensional input into a lower-dimensional one: the high-dimensional input data is compressed by the hidden layer of the DNN architecture [35]. Hence, the encoding path contributes to the principal goal of an autoencoder. Mathematically, the higher-dimensional input $s \in \mathbb{R}^N$ (i.e., $N$ dimensions) is encoded to a lower-dimensional space $h \in \mathbb{R}^K$ (i.e., $K$ dimensions), producing the output vector known as the latent space. The encoder function, or the latent space, can be represented as follows:

$$h = f_e\left(W_e s + b_e\right)$$

where $f_e$, $W_e$, and $b_e$ are the encoding activation function, weights, and bias of the network, respectively. From Figure 7, it can be seen that the decoding portion reconstructs the higher-dimensional input from the lower-dimensional representation produced by the encoding process. The reconstruction procedure can be expressed as follows:

$$\hat{s} = f_d\left(W_d h + b_d\right)$$

Here $f_d$, $W_d$, and $b_d$ are the decoding activation function, weights, and bias of the network, respectively. The key goal of the autoencoder is to minimize the reconstruction loss, which is the objective function of an autoencoder.
It can be expressed as follows [36]:

In this paper, the feature engineering and classification path of the sensitive and speed invariant gearbox fault diagnosis model is constructed by stacking multiple sparse autoencoders into a stacked sparse autoencoder (SSA) to capture the small feature differences between gear defect types, which are the basis for improving classification accuracy. In the next subsection, the sparse autoencoder algorithm is explained.
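A single autoencoder pass, with encoding, decoding, and a reconstruction loss, can be sketched in a few lines. The sigmoid activation for both stages, the mean squared error as the loss, and the random toy weights are assumptions made for illustration; the source does not fix these choices at this point.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Compute W v + b for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def encode(s, w_e, b_e):
    """Map the N-dimensional input to the K-dimensional latent vector h."""
    return [sigmoid(z) for z in affine(w_e, b_e, s)]


def decode(h, w_d, b_d):
    """Map the latent vector back to an N-dimensional reconstruction."""
    return [sigmoid(z) for z in affine(w_d, b_d, h)]


def reconstruction_mse(s, s_hat):
    """Mean squared error between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(s, s_hat)) / len(s)


rng = random.Random(0)
N, K = 6, 3  # toy input and latent dimensions
w_e = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(K)]
w_d = [[rng.uniform(-0.5, 0.5) for _ in range(K)] for _ in range(N)]
b_e, b_d = [0.0] * K, [0.0] * N

s = [rng.random() for _ in range(N)]
h = encode(s, w_e, b_e)
s_hat = decode(h, w_d, b_d)
loss = reconstruction_mse(s, s_hat)
print(len(h), len(s_hat))
```

Training would adjust the weights and biases to reduce `loss` over the whole dataset; that optimization loop is omitted here.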

Sparse Autoencoder
Sparsity is a special parameter of autoencoders that puts a constraint on the hidden layer and causes activation of inactive hidden units, so that the tiny differences in the decimated features of the data representation are discovered more sensitively and robustly than with the simple autoencoder architecture [37]. The constraint of a sparse autoencoder is usually embedded as a regularization term in the objective function. Therefore, the regularized objective function can be expressed as follows [36]: In Equation (12), β and γ refer to the L2 regularization coefficient and the sparsity penalty factor, respectively. In the training process of an autoencoder, it is sometimes observed that the value of γ changes inversely with the values of the weight parameters and proportionally to the latent space h (for example, the value of the sparsity penalty factor increases when the values of the weights decrease and the values of the latent code increase). Thus, L2 regularization is embedded in the cost function to solve this issue, which can be represented as follows [36]: where L, n, and k represent the number of hidden layers, the number of observations, and the number of variables in the input data, respectively. Consequently, the sparsity constraint $\Phi_{sparse}$ can be formulated as follows:

$$\Phi_{sparse} = \sum_{i=1}^{K} KL\left(\rho \,\|\, \hat{\rho}_i\right)$$

where

$$KL\left(\rho \,\|\, \hat{\rho}_i\right) = \rho \log\frac{\rho}{\hat{\rho}_i} + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_i}$$

Equation (14) is known as the Kullback-Leibler divergence [38]. $\Phi_{sparse}$ takes a higher value when the $i$-th neuron gives an average activation value $\hat{\rho}_i$ that deviates substantially from the desired value $\rho$.
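The behaviour of the sparsity constraint is easy to check numerically. The sketch below implements the standard Kullback-Leibler sparsity penalty for hidden units with sigmoid-like activations in (0, 1); the function names and the target value used in the example are illustrative.

```python
import math


def average_activations(hidden_batch):
    """Mean activation of each hidden unit over a batch of latent vectors."""
    n = len(hidden_batch)
    return [sum(column) / n for column in zip(*hidden_batch)]


def kl_sparsity_penalty(avg_activations, rho):
    """Sum over hidden units of KL(rho || rho_hat_i)."""
    return sum(
        rho * math.log(rho / rho_hat)
        + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat))
        for rho_hat in avg_activations
    )


rho = 0.08  # illustrative target average activation
batch = [[0.10, 0.50], [0.06, 0.70], [0.08, 0.60]]
rho_hat = average_activations(batch)
print(kl_sparsity_penalty(rho_hat, rho))
```

A unit whose average activation matches the target contributes nothing to the penalty, while a unit that is active most of the time, such as the second unit in the toy batch, dominates it.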
To establish the SSA, several individually trained sparse autoencoders are stacked so that the input layer is placed before the series of hidden layers, and a SoftMax classifier [39] forms the output layer of this network architecture. Hence, all the stacked sparse autoencoders form the DNA. Figure 8 depicts an example of a DNA with four hidden layers for visual understanding. This DNA first operates in an unsupervised learning manner, where all of the sparse autoencoders extract useful features. Then, in a supervised learning manner, the DNA executes fine-tuning employing a back-propagation algorithm based on stochastic gradient descent [40]. After the training process is completed, unseen data is used for evaluating the performance of the DNA.
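The forward pass of such a stacked architecture, with the encoders applied in sequence and a SoftMax layer on top, can be sketched as follows. The layer sizes are toy stand-ins, the weights are random rather than pre-trained, and the sigmoid activation is an assumption; only the seven output classes mirror the seven gear states of this study.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, bias, vector):
    """Compute W v + b for a weight matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable SoftMax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stacked_forward(x, encoder_layers, classifier):
    """Pass x through every encoder in turn, then through the SoftMax layer."""
    h = x
    for weights, bias in encoder_layers:
        h = [sigmoid(z) for z in dense(weights, bias, h)]
    w_c, b_c = classifier
    return softmax(dense(w_c, b_c, h))


def random_layer(n_out, n_in, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sizes = [12, 6, 4, 3, 2]  # toy input size followed by four hidden layers
encoder_layers = [random_layer(n_out, n_in, rng)
                  for n_in, n_out in zip(sizes, sizes[1:])]
classifier = random_layer(7, sizes[-1], rng)  # seven gear states

probabilities = stacked_forward([rng.random() for _ in range(sizes[0])],
                                encoder_layers, classifier)
print(len(probabilities))
```

In the procedure described above, each encoder would first be trained as part of its own sparse autoencoder, and the whole stack would then be fine-tuned with back-propagation.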

Experimental Setup and Tuning DNA Parameters
To validate the effectiveness of feature engineering and classification by the SSA-DNN in the proposed model, we perform a set of four experiments listed in Table 2. In these experiments, the SSA-DNN uses the input data as the samples of the frequency spectrum of the subbands that were optimized by the ANC. The four subsets of gearbox data were taken based on shaft rotation speed, i.e., each data subset contains 1400 samples in total for all defect states (200 samples for each class of seven defect states: N, D1, . . . , D6), which were acquired from the vibration sensor when the shaft rotates at the same speed. For each experiment trial, the proposed DNA was trained numerous times with diverse numbers of epochs using samples corresponding to one speed of the shaft and validated with the dataset collected under two other shaft speeds, then changing samples belonging to different speeds for all four experiments.

Tuning Parameters for the SSA-DNN
The parameters of the DNA play an important role in classification performance, so a tuning process for selecting their optimal values has to be performed [41]. To construct this model, we repeatedly tested the proposed model using various values of model parameters such as the length of the recipient input, the sparsity regularization term, the number of hidden layers, the number of hidden nodes, and the cost function to evaluate their effect on DNA performance. The following subsections explain the parameter tuning process in detail.

Exploration of the DNA Parameter Configurations
The length of the recipient input is the size of a single sample that is input to the DNA; it is also known as the dimension of the higher-dimensional representation in the input layer. According to [41], this parameter is the first important factor for recognizing the complex features that support the classification of MTCG fault types when building the sensitive gearbox fault diagnosis model. A larger recipient input length helps the DNA to extract better representative features. Nevertheless, a very large input increases the computational complexity of the model, while a reasonable input size can provide both a reasonable quality of feature extraction and a well-proportioned computational complexity. As mentioned in Section 3.1, the one-second raw vibration signals were sampled at a frequency of 65,536 Hz, resulting in 65,536 points in the time domain. This raw signal was preprocessed by down-sampling by a factor of three, accompanied by low-pass filtering, before entering the ANC module. Hence, there are 21,845 (65,536/3) data points in the optimized time-domain signals obtained at the output of the ANC module. By applying the Fourier transform to these signals, the symmetrical frequency spectrum of each optimized subband is obtained, consisting of a negative-frequency half and a positive-frequency half. The positive-frequency half, which represents the one-sided frequency spectrum of an optimized subband with 10,922 (21,845/2) data points, is used as the input to the DNA. Using a larger number of data points at the input layer would increase the computational complexity, while the effectiveness of fault identification might not improve significantly. On the contrary, a further reduction of the input size would reduce the frequency resolution and hence might cause challenges for the model when identifying the MTCG defect types.
Thus, the length of the recipient input with 10,922 points of an optimized subband represents a rational trade-off between the classification performance and computational complexity for the sensitive and speed invariant MTCG gearbox fault diagnosis model.
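The input length follows from two integer divisions, which the short sketch below reproduces. The variable names are illustrative, and the Nyquist frequency is included only as a consistency check against the stated 0 to 10 kHz range.

```python
SAMPLING_RATE_HZ = 65_536   # raw sampling frequency, one-second records
DECIMATION_FACTOR = 3       # down-sampling applied before the ANC module

n_time_points = SAMPLING_RATE_HZ // DECIMATION_FACTOR      # time-domain length
n_input_points = n_time_points // 2                        # one-sided spectrum
nyquist_hz = SAMPLING_RATE_HZ / DECIMATION_FACTOR / 2.0    # highest frequency kept

print(n_time_points, n_input_points, round(nyquist_hz, 1))
```

Because each record lasts about one second, neighbouring spectrum points are roughly 1 Hz apart, which is the frequency resolution that a shorter input would give up.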
Similarly, the number of neurons in the hidden layers also influences the performance of the DNA. Although there are no exact guidelines for selecting the number of neurons for a hidden layer of an autoencoder, this parameter directly impacts the process of feature extraction. Based on the functionality of the autoencoder, the number of nodes in the first hidden layer has to be smaller than the length of the recipient input in order to compress the higher-dimensional data. To adjust the number of nodes and the sparsity, in this paper we create a fine-tuning dataset formed by randomly picking 100 data instances for each class under each rotation speed condition. Hence, the fine-tuning dataset consists of N samp × N class × N speed = 100 × 7 × 4 = 2800 data instances in total. Figure 9 illustrates the relationship between the reconstruction error and the number of nodes in the first hidden layer, obtained while training the autoencoder on the fine-tuning dataset for 350 epochs. This curve demonstrates that using 3000 nodes in the hidden layer, which is more than 20% of the input size (10,922), leads to smaller reconstruction errors. A further increase in this number affects the reconstruction error only slightly, but the computational complexity would increase significantly. Thus, it is recommended to keep the number of nodes in the hidden layer below 35% of the input size. This criterion is applied to the remaining hidden layers in the proposed model, so the number of nodes in each consecutive hidden layer is in the range from 20% to 35% of the number of nodes in the previous layer. The sparsity penalty can be used for improving the forward learning process of an unsupervised autoencoder, whose purpose is to reveal the highly representative features.
To evaluate the effect of the sparsity penalty, the reconstruction error is examined while varying the value of the sparsity penalty parameter in the first autoencoder (with the number of hidden nodes in the first hidden layer set to 3000). Figure 10 demonstrates the relation between the value of the sparsity term and the reconstruction error, which is the mean square error (MSE) in this study, achieved when training the autoencoder on the fine-tuning dataset for 350 epochs. It is observed that sparsity penalty values in the range from 0.05 to 0.15 are better than the remaining values, and a value of 0.08 is the optimal one, leading to the minimum MSE. Hence, this value has been chosen as the penalty factor for all the hidden layers in the proposed model.

The number of hidden layers plays an important role in the learning process. There exists a general opinion that a higher number of hidden layers results in better accuracy but also reduces the generalization ability of the network [40]. In this work, a series of experiments to determine the number of hidden layers was performed while varying their number from three to six, as shown in Table 3. From this table, it can be observed that a number of hidden layers greater than three leads to the smallest reconstruction errors. With a higher number of hidden layers, the reconstruction error does not change significantly; however, the computational time can increase dramatically when the architecture is made deeper. Therefore, to select a suitable number of hidden layers, the time performance should also be considered. The computational complexity of the architecture can, in general, be measured as the average time required for one training cycle of the DNA. Figure 11 shows the time consumption of different SSA-DNN deep architectures with various numbers of hidden layers and nodes during the training process.
In this figure, the DNAs with higher numbers of hidden layers and nodes require more time for training due to the depth of the architecture.

Parameter Selections of the SSA-DNN Model
Through the experiments in the previous subsection, it was observed that as the complexity of the DNA architecture increased, the reconstruction error decreased while the time needed to train the deep architecture increased. However, Table 3 shows that beyond a certain number of hidden layers and nodes, further increases in architecture complexity yield only minor reductions in the reconstruction error. From this observation, it can be concluded that the number of highly representative features is limited; when the DNA attempts to extract more features from its input, the additional features tend to be redundant and unrepresentative and do not affect the reconstruction error significantly. The structure of a DNA should contain several hidden layers to adequately perform dimensionality reduction of the input data, where each hidden layer analyzes its input to perform both feature extraction and selection and thereby obtain higher-level representative features. These features are then used for discriminating the MTCG defect types during the classification process. Because constructing a speed invariant fault diagnosis model for MTCG gearbox systems is challenging, the parameters are selected to prioritize a small reconstruction error with an acceptable execution time. The architectures with five or six hidden layers achieve relatively small reconstruction errors in comparison with the other architectures; however, their training time is much higher, while the error of the four-layer architecture is not much larger. Therefore, in this study, four hidden layers are used, with 3000, 1000, 300, and 100 nodes (i.e., numbers of features) in the first, second, third, and fourth hidden layers, respectively. The finalized optimal parameters of the SSA-DNN model are listed in Table 4, and its architecture is shown in Figure 12.
Table 4. The optimal selected parameters for constructing the SSA-DNN model.
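As a rough illustration of the selected architecture, the following sketch lists the weight shapes and parameter counts implied by four hidden layers of 3000, 1000, 300, and 100 nodes followed by a seven-class output layer. The input dimension `input_dim` is a hypothetical placeholder, since the length of the input spectrum is not stated in this section, and the function names are illustrative only.

```python
# Sketch only: layer shapes implied by the selected hidden-layer sizes.
HIDDEN_SIZES = [3000, 1000, 300, 100]  # nodes per hidden layer, from the text
N_CLASSES = 7                          # defect states N, D1, ..., D6


def layer_shapes(input_dim, hidden_sizes=HIDDEN_SIZES, n_classes=N_CLASSES):
    """Return (fan_in, fan_out) for every weight matrix, input side first."""
    sizes = [input_dim, *hidden_sizes, n_classes]
    return list(zip(sizes[:-1], sizes[1:]))


def count_parameters(input_dim, **kwargs):
    """Total number of weights and biases in the stacked network."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in layer_shapes(input_dim, **kwargs))


if __name__ == "__main__":
    input_dim = 10_000  # hypothetical spectrum length, not taken from the paper
    for i, (fan_in, fan_out) in enumerate(layer_shapes(input_dim), start=1):
        print(f"layer {i}: {fan_in} -> {fan_out}")
    print("parameters:", count_parameters(input_dim))
```

Under this assumption the first layer dominates the parameter count, which is consistent with the observation that training time grows quickly as layers and nodes are added.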

Results and Discussion
The main function of the ANC is to reduce noise while preserving the fault-related components of the vibration signals. To capture the informative content of each vibration sample, i.e., the numerous fault-related components that are useful for designing a sensitive fault diagnosis model, the analog signals from the vibration acceleration sensor were digitized at a high sampling frequency of 65,536 Hz in one-second segments. Thus, each 1-s data sample covers several rotation cycles (from three to thirteen rotational cycles, depending on the rotation speed, which ranges from 300 RPM to 1200 RPM) and captures the fault-related vibration characteristics together with their particular oscillations. After data collection, the digitized vibration samples were passed through a digital low-pass filter with a cut-off frequency of 10,000 Hz, accompanied by downsampling, to remove the high-frequency components (i.e., components located above 10,000 Hz in the spectrum), which are outside the operating range of the acceleration sensor, and to preserve the vibration components that carry intrinsic fault-informative features under realistic operation of a gearbox system. This is the first preprocessing step, and it removes redundancy from the raw vibration signals. The vibration subbands output by the low-pass filter are then fed into the ANC module for fine-grained noise reduction. In the part of the frequency spectrum below 10 kHz, the ANC uses adaptive windows to access and remove the white noise and band noise remaining between two consecutive sideband frequencies. Figure 13 demonstrates the effectiveness of the ANC module in the de-noising process. Here, the red dotted circles indicate the noise frequency zones of the input signals, which were reduced significantly in the optimized subband output by the ANC.
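The low-pass filtering and downsampling step described above can be sketched as follows. This is a minimal illustration, not the filter used in the study: the Hamming-windowed sinc design, the 101-tap length, and the decimation factor of 2 are all assumptions, and the helper names are hypothetical. Only the 65,536 Hz sampling rate and the 10,000 Hz cut-off come from the text.

```python
import math

FS_HZ = 65_536       # sampling frequency, from the text
CUTOFF_HZ = 10_000   # low-pass cut-off, from the text
FACTOR = 2           # hypothetical decimation factor (new rate 32,768 Hz)


def lowpass_taps(cutoff_hz, fs_hz, numtaps=101):
    """Hamming-windowed sinc low-pass FIR taps, normalised to unit DC gain."""
    fc = cutoff_hz / fs_hz            # cut-off in cycles per sample
    mid = (numtaps - 1) / 2
    taps = []
    for n in range(numtaps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (numtaps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]


def filter_and_decimate(x, taps, factor):
    """Filter x and keep every factor-th output (fully overlapped region only).

    The taps are symmetric, so this sliding dot product equals a convolution.
    """
    n_taps = len(taps)
    return [sum(t * x[start + j] for j, t in enumerate(taps))
            for start in range(0, len(x) - n_taps + 1, factor)]


def tone(freq_hz, n_samples, fs_hz=FS_HZ):
    """Unit-amplitude sine used only to exercise the filter."""
    return [math.sin(2 * math.pi * freq_hz * i / fs_hz) for i in range(n_samples)]


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


if __name__ == "__main__":
    taps = lowpass_taps(CUTOFF_HZ, FS_HZ)
    kept = filter_and_decimate(tone(2_000, 4096), taps, FACTOR)
    removed = filter_and_decimate(tone(20_000, 4096), taps, FACTOR)
    print("2 kHz tone RMS after filtering:", round(rms(kept), 4))
    print("20 kHz tone RMS after filtering:", round(rms(removed), 4))
```

With a factor of 2 the new Nyquist frequency is 16,384 Hz, which stays above the 10,000 Hz cut-off and leaves room for the transition band of the filter.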
Moreover, the amplitudes of the sideband frequency tones, the meshing frequency, and its harmonics remain unchanged when the vibration subband passes through the ANC module (the dashed blue and black circles). The outputs of the ANC are the optimized vibration subbands, represented in the frequency domain to express their energy distribution. These spectra are used as inputs to the SSA-DNN module, which extracts the representative latent features using an unsupervised learning technique, the autoencoder, which is a part of the SSA-DNN module. Figure 14 illustrates feature spaces for seven defect types of an MTCG gearbox using some of the discriminative features extracted by sparse autoencoders from the frequency spectra of optimized subbands under different rotational speeds. This figure shows that the data instances corresponding to different signal classes are well separable in feature space. Here, the samples belonging to one defect type are placed close together, whereas the samples of different defect types are located separately in the visualized feature space. These distinct features are extracted by stacking the sparse autoencoder layers and are used to enhance the performance of the deep architecture using a back-propagation algorithm to minimize the reconstruction errors and then, finally, to classify gearbox defects.
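To make the quantity minimized by back-propagation more concrete, the sketch below evaluates the loss of a single sparse autoencoder layer as a reconstruction error plus a sparsity penalty. It uses the common Kullback-Leibler formulation of the penalty, which is an assumption here, as are the sigmoid encoder, the linear decoder, and the values of `rho` and `beta`; the objective actually used in the study may differ.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def kl_divergence(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparse_autoencoder_loss(batch, W_enc, b_enc, W_dec, b_dec, rho=0.05, beta=3.0):
    """Mean reconstruction error plus a sparsity penalty on the hidden units.

    rho is the target mean activation and beta weights the penalty; both
    defaults are illustrative assumptions, not values from the paper.
    """
    hidden = [[sigmoid(z) for z in affine(W_enc, b_enc, x)] for x in batch]
    recon = [affine(W_dec, b_dec, h) for h in hidden]  # linear decoder
    recon_error = sum(0.5 * sum((r - xi) ** 2 for r, xi in zip(rec, x))
                      for rec, x in zip(recon, batch)) / len(batch)
    penalty = 0.0
    for j in range(len(b_enc)):
        mean_act = sum(h[j] for h in hidden) / len(batch)
        mean_act = min(max(mean_act, 1e-8), 1.0 - 1e-8)  # keep the logs finite
        penalty += kl_divergence(rho, mean_act)
    return recon_error + beta * penalty
```

In a stacked arrangement, each layer would be trained on the hidden activations of the previous one before the whole network is fine-tuned for classification.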
For fault diagnosis performance evaluation, we compared the results of the proposed model with those of previous models: ANC and SVM [18] (model 1), ANC and ANN (model 2), a stacked denoising autoencoder [42] (model 3), and the spectra imaging of vibration signals [43] (model 4). These results are presented in Table 5. The performance is evaluated using the four experimental setups listed in Table 2.
The training dataset of each experiment contains 1400 vibration samples (200 vibration samples for each of the seven defect states N, D1, D2, D3, D4, D5, and D6) collected at a single rotational speed and is used to construct the deep architecture network model. Testing is performed on 2800 vibration samples from two different rotational speeds. Over the four experiments, the vibration samples of the four rotational speeds are used in turn as the training set, whereas in each experiment two datasets from rotational speeds different from the training speed are used for testing. In these experiments, models 1 and 2 use statistical features extracted from the time and frequency domains, whereas the remaining models use autonomous feature extraction methods based on the unsupervised learning approach (model 3) and the vibration imaging approach (model 4). Models 1 and 2 extract twenty-one feature parameters from the optimized subband output by an ANC module and then use these feature vectors to classify fault types with an SVM and an ANN, respectively. The manually extracted features in models 1 and 2 make it difficult to classify multi-level tooth cut gear defects. Their fault classification results were around 68% ± 10% for model 1 and 59.4% ± 10% for model 2, fluctuating over the four experiments. The DNA in model 3 is constructed by replacing the four sparse autoencoder hidden layers with two layers of denoising autoencoders, using the optimal regularization terms and parameters from [42], and by removing the ANC module from the proposed model. In model 3, the input data are the vibration subbands output by the downsampling and low-pass filtering process, with denoising and feature engineering performed by objective functions with embedded manifold regularization. The fault identification results achieved by this model were about 82.88% ± 8% over the four experiments.
The sensitive and speed invariant fault diagnosis model proposed in this study outperformed these models, with results around 97% ± 2% over the four experiments, showing small accuracy deviations when the shaft rotational speed of the MTCG gearbox system was varied.
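The train-on-one-speed, test-on-two-other-speeds protocol described above can be sketched as a simple split by rotational speed. The speed labels `S1` to `S4` and the choice of test speeds are hypothetical placeholders, because the pairings of Table 2 are not reproduced in this section; only the class names and the 200 samples per class come from the text.

```python
CLASSES = ["N", "D1", "D2", "D3", "D4", "D5", "D6"]  # defect states, from the text
SAMPLES_PER_CLASS = 200                              # per class and speed, from the text


def make_records(speeds):
    """Build placeholder (speed, label, index) records standing in for spectra."""
    return [(speed, label, i)
            for speed in speeds
            for label in CLASSES
            for i in range(SAMPLES_PER_CLASS)]


def split_by_speed(records, train_speed, test_speeds):
    """Train on one rotational speed and test on other, unseen speeds."""
    if train_speed in test_speeds:
        raise ValueError("test speeds must differ from the training speed")
    train = [r for r in records if r[0] == train_speed]
    test = [r for r in records if r[0] in test_speeds]
    return train, test


if __name__ == "__main__":
    records = make_records(["S1", "S2", "S3", "S4"])  # hypothetical speed labels
    train, test = split_by_speed(records, "S1", ["S2", "S3"])
    print(len(train), "training samples,", len(test), "testing samples")
```

Keeping the test speeds disjoint from the training speed is what allows the resulting accuracy to be read as evidence of speed invariance.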
Additionally, to verify the stability of the proposed algorithm, the experiments described above were performed five times. The classification accuracies and their averages computed over the five experimental trials are presented in Table 6. These results show that the proposed model delivers stable fault classification accuracy across independent trials in which the training and testing subsets contain samples collected under different operating conditions, i.e., rotating speeds. Controlling the noise embedded in the vibration signals is essential for the sensitive detection of multi-level tooth cut faults in gearbox systems. A high noise level can cause misidentification of fault types and thus reduce the fault classification accuracy. Noise reduction is a complex problem, and it is not always possible to resolve it completely by signal processing or feature engineering techniques. Therefore, the simultaneous use of the ANC and SSA-DNN methods is an efficient approach for significant noise reduction while preserving the original fault-related information of the gear vibration characteristics, which is useful for fault identification. The design of a sensitive and speed invariant model requires exploring representative features that can discriminate multi-level tooth cut gear defects while maintaining reliable performance under fluctuating operating speeds in the gearbox system. In general, manual feature extraction methods cannot satisfy these requirements; thus, unsupervised approaches based on deep neural networks are well suited to extracting the latent representative features by minimizing reconstruction errors through a back-propagation algorithm in the DNA.
The SSA-based DNN constructed in this research satisfies the requirements for constructing the proposed model, such as extracting the representative feature space, selecting the most defect-related useful features for classification, and finally, achieving high fault classification results.
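The averaging over repeated trials and the "mean ± deviation" figures quoted above can be computed along the following lines. This is a sketch under assumptions: the text does not state how the ± value is defined, so the largest absolute deviation from the mean is used here, and the function names are illustrative. No accuracy values from Table 5 or Table 6 are reproduced.

```python
from statistics import mean


def average_over_trials(table):
    """Average accuracies per experiment, where table[trial][experiment]."""
    return [mean(column) for column in zip(*table)]


def summarize(accuracies):
    """Return (mean, largest absolute deviation from the mean).

    Treating the +/- figure as the largest deviation is an assumption.
    """
    avg = mean(accuracies)
    spread = max(abs(a - avg) for a in accuracies)
    return avg, spread
```

Here `table` would hold one row per trial and one column per experiment, so `summarize(average_over_trials(table))` gives a single mean and spread across the four experiments.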

Conclusions
This study presents a novel method that combines an ANC and an SSA-DNN, exploiting the advantages of both to construct a sensitive and speed invariant fault identification model for gearbox systems with multi-level tooth cut gear defects. The ANC technique is built on the analysis of the vibration characteristics of a gearbox system and generates speed-dependent reference window signals with adjustable parameters, according to the noise types present in the raw vibration signals. These generated window series are then adaptively adjusted to access the space between two consecutive defect-related frequency tones and to remove the noise along the whole frequency range of the vibration signals. The ANC optimizes the input vibration signal and outputs the optimal subband, which contains mostly the defect-related frequency tones together with low-level background noise. The frequency spectra of these optimal subbands are then used as the input to the deep network architecture. This network is built by stacking sparse autoencoders as the hidden layers and using a Softmax activation function at the output layer; it extracts latent representative feature spaces and selects the most defect-related discriminative features for identifying the multi-level tooth cut fault types under various shaft rotational speeds. The effectiveness of the proposed model is validated by experiments performed on a vibration dataset containing MTCG gearbox defects collected under four different rotational speeds. To validate the speed invariance of the proposed model, the experiment was arranged as four sub-experiments using the datasets corresponding to each rotational speed. Each sub-experiment uses a one-speed dataset to construct and train the model. The trained model is then used for fault identification on two datasets collected under other speed conditions.
This procedure was performed four times, each time using a different speed dataset to build the model. The average classification result achieved over the four experiments was 97%, which outperforms the techniques used for comparison. Moreover, the classification results of the proposed model did not fluctuate significantly (2-3%) when it was applied to different speed datasets, which indicates that the proposed model is speed invariant and can be used for identifying multi-level tooth cut defects in a gearbox system under varying rotational speeds.