Fault Diagnosis for Rotating Machinery Using Vibration Measurement Deep Statistical Feature Learning

Fault diagnosis is important for the maintenance of rotating machinery. The detection of faults and fault patterns is a challenging part of machinery fault diagnosis. To tackle this problem, a model for deep statistical feature learning from vibration measurements of rotating machinery is presented in this paper. Vibration sensor signals collected from rotating mechanical systems are represented in the time, frequency, and time-frequency domains, each of which is then used to produce a statistical feature set. For learning statistical features, real-value Gaussian-Bernoulli restricted Boltzmann machines (GRBMs) are stacked to develop a Gaussian-Bernoulli deep Boltzmann machine (GDBM). The suggested approach is applied as a deep statistical feature learning tool for both gearbox and bearing systems. The fault classification performances in experiments using this approach are 95.17% for the gearbox, and 91.75% for the bearing system. The proposed approach is compared to such standard methods as a support vector machine, GRBM and a combination model. In experiments, the best fault classification rate was detected using the proposed model. The results show that deep learning with statistical feature extraction has an essential improvement potential for diagnosing rotating machinery faults.


Introduction
As one of the fundamental types of mechanical system, rotating machinery is widely applied in various fields. As a result of relative motion between mating surfaces, components of rotating machinery are prone to suffer from damage [1]. Effective fault diagnosis is thus important for maintaining the health of rotating machinery. One of the most challenging fault diagnosis tasks is the detection of faults and fault patterns, if any. Different methods have been developed for fault diagnosis in rotating components such as gearboxes and bearings [2][3][4]. Gao et al. [5,6] systematically reviewed the fault diagnosis with model-based, signal-based, knowledge-based, and hybrid/active approaches. The most successful methods have three main steps: determining the fault symptoms, extracting the sensitive features, and classifying the condition patterns. Various fault symptoms, including vibration measurements [7], thermal features [8], acoustic signals [9], oil debris [10], and other process parameters have been used as indices of the health of rotating systems. Vibration sensor signals have been proven effective for monitoring the health of rotating machinery.
Even in the vibration sensor category, different features sensitive to fault detection have been extracted in recent years. Most of these feature extractions are performed in the time domain, feature representation and classification, the GDBM is constructed in Section 2.3. More details can be found in the following sections.

Statistical Features of the Vibration Sensor Signals
For a vibration measurement x(t) of the rotating machinery, its spectral representation X(f ) can be calculated by: where the hat "ˆ" stands for the Fourier transform, t the time and f the frequency. For engineering applications, the collected vibration data are discrete values. Hence, the discrete version of Equation (1) (i.e., the discrete Fourier transform, DFT) should be used for the vibration data. There are several ways to calculate the DFT, among which the fast Fourier transform (FFT) is an efficient solution.
The time domain measurement x(t) and the frequency domain spectrum X(f ) are capable of describing the machinery vibration in terms of time and frequency separately. For jointly representing the machinery vibration, the wavelet transform provides a powerful mathematical tool for signal processing and analysis. As mentioned in the Introduction, the CWT, DWT and WPT are in general the most popular categories in the wavelet transform family. Although different wavelet transforms have been successively applied in the fault diagnosis community, this paper uses the WPT to generate the time-frequency statistical features because it has comparatively low dimensions of the decomposition numbers and enhanced signal decomposition capability in the high frequency region.
The WPT is an extension of the typical DWT, in which detailed information is further decomposed by the WPT in the high frequency region. In other words, the WPT decomposes x(t) into a set of wavelet packet (WP) nodes through a series of low-pass and high-pass filters recursively.
With the integral scale parameter j and translation parameter k (k = 0, . . . , 2 j´1 ; j = 0, . . . , J, which is the number of the decomposition levels), a WP function T n j,k ptq is defined by: T n j,k ptq " 2 j{2 T n p2 j t´kq (2) where n = 0, 1, . . . is the oscillation parameter [30]. The first two WP functions with j = k = 0 are the scaling function φ(t) and the mother wavelet function ψ(t), respectively. The remaining WP functions for n = 2, 3, . . . can be given by the WPT as: T 2n ptq " ? 2 ÿ k hpkqT n 1,k p2t´kq and T 2n`1 ptq " where the low-pass filter h(k) and the high-pass filter g(k) have the following forms: hpkq " 1 ? 2 ă ϕptq, ϕp2t´kq ą and gpkq " 1 where <*,*> represents the inner product operator. The WP coefficients P n j,k are therefore the inner product between the signal and the WP functions, i.e.: P n j,k "ă xptq, T n j,k ą" In this way, the signal x(t) is decomposed by the WPT into J levels. At the j-th (j = 0, . . . , J) level, there are 2 j packets with the order n = 1, 2, . . . , 2 j . For simplicity, we index the WP node as (j, n) whose coefficients are given by P n j,k . According to the above analyses, the vibration measurement of the rotating machinery can be represented in the time domain, the frequency domain and the time-frequency domain. This can be formulated by: p P R 1 , q " t P R n 0 , time domain Xp f q; p P R 1 , q " f P R tn 0 {2u , frequency domain rP 1 1,k , P 2 1,k , ..., P 2 j j,k s; p P R 2 J`1´1 , q P R n 0 {2 j , time-frequency domain (6) where n 0 is the length of x(t).
As the three representations M(p,q) are usually very long, statistical features can be used as healthy condition indicators for rotating machinery. Statistical features have been approved as simple and effective features in fault diagnostics [17]. Based on the aforementioned studies, one can use the following statistical features for the vibration signals: , F 2,p pMq " where N is the length of q for M(p,q), P(.) is the probability density [31], µ is the mean value, σ is standard deviation, and F 1,p , . . . , F 9,p stand for kurtosis, skewness factor, crest factor, clearance factor, shape factor, impulse indicator, variance, denominator of clearance factor (the square of the averaged square roots of absolute amplitude), and mean of absolute amplitude values of the p-th vector of M(p,q), respectively [32]. Note that there are nine statistical features for the time domain representation M(p,q) = x(t), 9 for the frequency domain representation M(p,q) = X(f ), and 9(2 j + 1´1 ) for the time-frequency domain representation M(p,q) = rP 1 1,k , P 2 1,k , ..., P 2 j j,k s. The feature set F is therefore given by: rF 1,1 pMq, ..., F 9,1 pMqs; time domain rF 1,1 pMq, ..., F 9,1 pMqs; frequency domain rF 1,1 pMq, ..., F 9,1 pMq, F 1,2 pMq, ..., F 9,2 pMq, ..., F 9,2 J`1´1pMqs; time-frequency domain (8)

Statistical Feature Representation by Unsupervised Boltzmann Machines
After determining the statistical features in the time domain, the frequency domain and the time-frequency domain, in this subsection the unsupervised Boltzmann machine is proposed for feature representation.
The deep learning is a promising branch of the machine learning. It was developed to simulate the working mechanism of the brain to make sense of such data as images, sounds, and texts. The composed single layer GRBM model is the core to construct the deep learning (GDBM) frameworks in this work, and is originated from restricted Boltzmann machine (RBM).
The Boltzmann machine is a log-linear energy based model, where the energy function is linear in its free parameters. To restrict the Boltzmann machines to those without visible-visible and hidden-hidden connections, the RBM was proposed by Hinton, the father of deep learning, to form deep learning networks [33]. Conventional RBMs define the state of each visible and hidden neuron as binary codes (0 or 1). For real-valued data, the RBM has to normalize the input variables into [0, 1] with treating them as probabilities. For regular cases where the real values data have limited values, e.g., [0, 255] for pixels in the image processing, the RBM works well [34]. However, our statistical features scatter in different ranges. For example, the minimal value for F 1 is 0, but that for F 5 will be a negative number. This means that the conventional RBM is difficult to cope with our statistical features for the fault diagnosis.
To accommodate the real-valued data, the binary visible neurons can be replaced by the Gaussian ones to generate the Gaussian-Bernoulli RBM (GRBM). Although with real-valued neurons, the GRBM exhibits same structure compared to its RBM counterpart as shown in Figure 1. Conventional RBMs define the state of each visible and hidden neuron as binary codes (0 or 1). For real-valued data, the RBM has to normalize the input variables into [0, 1] with treating them as probabilities. For regular cases where the real values data have limited values, e.g., [0,255] for pixels in the image processing, the RBM works well [34]. However, our statistical features scatter in different ranges. For example, the minimal value for F1 is 0, but that for F5 will be a negative number. This means that the conventional RBM is difficult to cope with our statistical features for the fault diagnosis.
To accommodate the real-valued data, the binary visible neurons can be replaced by the Gaussian ones to generate the Gaussian-Bernoulli RBM (GRBM). Although with real-valued neurons, the GRBM exhibits same structure compared to its RBM counterpart as shown in Figure 1. For the GRBM shown in Figure 1, the energy function E(v, h) is given by: where v and h denote the visible and the hidden neurons, bi and ci stand for the offsets of the visible layers, wij represents the weights for the connection matrix, σi is the standard deviation associated with a Gaussian visible neuron vi, and θ is the Gaussian parameter [35]. The traditional gradientbased training of the GRBM has difficulty learning σi, which is constrained to be positive. Hence, some algorithms fix σi as unity. With the improved energy function, Cho et al. [35] proposed conditional probabilities for the visible and the hidden neurons as follows: is the Gaussian probability density function with mean μ and variance σ 2 , and S(.) is a sigmoid function. The upgraded gradients with respect to the GRBM parameters are given by: where <.>d and <.>m represent the expectation computed over the data and the model distributions,  For the GRBM shown in Figure 1, the energy function E(v, h) is given by: where v and h denote the visible and the hidden neurons, b i and c i stand for the offsets of the visible layers, w ij represents the weights for the connection matrix, σ i is the standard deviation associated with a Gaussian visible neuron v i , and θ is the Gaussian parameter [35]. The traditional gradient-based training of the GRBM has difficulty learning σ i , which is constrained to be positive. Hence, some algorithms fix σ i as unity. With the improved energy function, Cho et al. [35] proposed conditional probabilities for the visible and the hidden neurons as follows: and where p.|µ, σ 2 q is the Gaussian probability density function with mean µ and variance σ 2 , and S(.) is a sigmoid function. The upgraded gradients with respect to the GRBM parameters are given by: where <.> d and <.> m represent the expectation computed over the data and the model distributions, respectively. When applying the GRBM for the unsupervised learning of the statistical features, the feature set F should be used as v and the GRBM results GR(F) = h. In this way, the n v statistical features are represented by n h neurons [36]. For condition monitoring and fault type classification, GRBM representations can be input to a classifier such as a support vector machine (SVM), decision tree, or random forest.
When applying the SVM as a classifier for fault diagnosis in rotating machinery, one should choose a multi-class SVM. The classical SVM is a binary classifier. Different methods have been proposed for using classical SVMs to compose multi-class SVMs. A pairwise coupling strategy was introduced by Hastie and Tibshirani [37] to perform multi-class classification by combining posterior probabilities provided by individual binary SVM classifiers.

Deep Statistical Feature Learning and Classification
After determining the statistical features in the time domain, the frequency domain and the time-frequency domain, in this subsection the unsupervised Boltzmann machine is proposed for feature representation.
In a common sense, an unsupervised mono-layer GRBM is inferior to a supervised multi-layer deep model. To stack several GRBMs on top of each other, a Gaussian-Bernoulli deep Boltzmann machine (GDBM) can be constructed for deep statistical feature learning of the machinery vibration signals. As an extension of the classical deep Boltzmann machine (DBM), the GDBM was introduced by Cho et al. [36]. Unlike other RBM-based deep models such as the deep belief network and the deep autoencoder, each neuron in the intermediate layers of the GDBM connects with both top-down and bottom-up information.
The GDBM structure used in this paper is shown in Figure 2a. The suggested GRBM is composed of three GRBMs (i.e., GRBM 1 , GRBM 2 , and GRBM 3 ). Each GRBM consists of one visible layer and one hidden layer, and the hidden layer of the previous GRBM is just the visible layer of the next GRBM. In this way, the first layer (data layer) and the second layer (hidden layer 1) forms the GRBM 1 , the second layer and the third layer (hidden layer 2) forms the GRBM 2 , the third layer and the last layer (output layer) forms the GRBM 3 , and the three GRBMs are stacked together to form the GDBM. representations can be input to a classifier such as a support vector machine (SVM), decision tree, or random forest. When applying the SVM as a classifier for fault diagnosis in rotating machinery, one should choose a multi-class SVM. The classical SVM is a binary classifier. Different methods have been proposed for using classical SVMs to compose multi-class SVMs. A pairwise coupling strategy was introduced by Hastie and Tibshirani [37] to perform multi-class classification by combining posterior probabilities provided by individual binary SVM classifiers.

Deep Statistical Feature Learning and Classification
After determining the statistical features in the time domain, the frequency domain and the timefrequency domain, in this subsection the unsupervised Boltzmann machine is proposed for feature representation.
In a common sense, an unsupervised mono-layer GRBM is inferior to a supervised multi-layer deep model. To stack several GRBMs on top of each other, a Gaussian-Bernoulli deep Boltzmann machine (GDBM) can be constructed for deep statistical feature learning of the machinery vibration signals. As an extension of the classical deep Boltzmann machine (DBM), the GDBM was introduced by Cho et al. [36]. Unlike other RBM-based deep models such as the deep belief network and the deep autoencoder, each neuron in the intermediate layers of the GDBM connects with both top-down and bottom-up information.
The GDBM structure used in this paper is shown in Figure 2a. The suggested GRBM is composed of three GRBMs (i.e., GRBM1, GRBM2, and GRBM3). Each GRBM consists of one visible layer and one hidden layer, and the hidden layer of the previous GRBM is just the visible layer of the next GRBM. In this way, the first layer (data layer) and the second layer (hidden layer 1) forms the GRBM1, the second layer and the third layer (hidden layer 2) forms the GRBM2, the third layer and the last layer (output layer) forms the GRBM3, and the three GRBMs are stacked together to form the GDBM. The GDBM and its constituting GRBMs can be pretrained using a greedy, layer-by-layer unsupervised learning algorithm [37]. During the pretraining period as shown in Figure 2b, special attention should be paid to the GDBM as the neurons in the intermediate layers receive information both from the upper and the lower layers. To cope with this particularity, Salakhutdinov [38] halved the pretrained weights in the intermediate layers and duplicated the visible and topmost layers for the pretraining. With this idea, Equation (10) should be revisited to calculate the energy of the visible layer for the GRBM as: where Nv = 2 corresponds to the duplication of the visible layer. Similarly, the energy for the topmost GRBML during the pretraining is given by: The GDBM and its constituting GRBMs can be pretrained using a greedy, layer-by-layer unsupervised learning algorithm [37]. During the pretraining period as shown in Figure 2b, special attention should be paid to the GDBM as the neurons in the intermediate layers receive information both from the upper and the lower layers. To cope with this particularity, Salakhutdinov [38] halved the pretrained weights in the intermediate layers and duplicated the visible and topmost layers for the pretraining. With this idea, Equation (10) should be revisited to calculate the energy of the visible layer for the GRBM as: Epv, h p1q |θq " where N v = 2 corresponds to the duplication of the visible layer. Similarly, the energy for the topmost GRBM L during the pretraining is given by: The aforementioned pretraining is an unsupervised, bottom-up procedure for the GDBM. This means that it cannot be applied for the classification after the pretraining. Compared to conventional unsupervised learning, fortunately, the GDBM requires an extra supervised, top-down fine-tuning procedure [39,40]. At the fine-tuning procedure, the output layer is replaced by a multilayer perceptron (MLP) with sigmoid functions. To fit the fault classification task, all the weights w can be discriminatively fine-tuned using a back-propagation (BP) algorithm [41]. The supervised BP method uses labeled data as an extra MLP layer of variables to train the GDBM model for the classification. Unlike the unsupervised training process considering one GRBM at a time, the BP training considers all the layers in a GDBM simultaneously, which is in the same way as for the standard feed forward neural networks [42]. In this way, the GDBM can be regarded as an improvement of the MLP, or neural networks. It is capable of dealing with the classification for nonlinear, abnormal (non-Gaussian) data using a "deeper" fashion [43]. Of course, this "deeper" learning is much more time-consuming than the conventional ones.
Having introduced the GDBM and its constituting components, the GRBMs, the procedure of applying the GDBM based classification for the fault diagnosis of the rotating machines is shown in Figure 3 and is summarized as follows: Step 1. Collect the vibration signals x(t), define the fault patterns and the diagnosis problems; Step 2. Calculate the statistical feature set F according to Equation (8) Step 4. Pretrain the GDBM model and its constituting GRBMs using the layer-by-layer unsupervised learning algorithm from the training dataset; Step 5. Fine-tune the GDBM weights using the BP algorithm from the training dataset; and Step 6. Diagnose the rotating machinery condition using the trained GDBM model.
The aforementioned pretraining is an unsupervised, bottom-up procedure for the GDBM. This means that it cannot be applied for the classification after the pretraining. Compared to conventional unsupervised learning, fortunately, the GDBM requires an extra supervised, top-down fine-tuning procedure [39,40]. At the fine-tuning procedure, the output layer is replaced by a multilayer perceptron (MLP) with sigmoid functions. To fit the fault classification task, all the weights w can be discriminatively fine-tuned using a back-propagation (BP) algorithm [41]. The supervised BP method uses labeled data as an extra MLP layer of variables to train the GDBM model for the classification. Unlike the unsupervised training process considering one GRBM at a time, the BP training considers all the layers in a GDBM simultaneously, which is in the same way as for the standard feed forward neural networks [42]. In this way, the GDBM can be regarded as an improvement of the MLP, or neural networks. It is capable of dealing with the classification for nonlinear, abnormal (non-Gaussian) data using a "deeper" fashion [43]. Of course, this "deeper" learning is much more time-consuming than the conventional ones.
Having introduced the GDBM and its constituting components, the GRBMs, the procedure of applying the GDBM based classification for the fault diagnosis of the rotating machines is shown in Figure 3 and is summarized as follows: Step 1. Collect the vibration signals x(t), define the fault patterns and the diagnosis problems; Step 2. Calculate the statistical feature set F according to Equation (8); Step 3. Develop the GDBM model with the stack of the GRBMs according to Figures 1 and 2; Step 4. Pretrain the GDBM model and its constituting GRBMs using the layer-by-layer unsupervised learning algorithm from the training dataset; Step 5. Fine-tune the GDBM weights using the BP algorithm from the training dataset; and Step 6. Diagnose the rotating machinery condition using the trained GDBM model.

Data Collection Experiments for the Fault Diagnosis
To validate the effectiveness of deep statistical feature learning for fault diagnosis, the proposed deep learning was applied to diagnose the health of two rotating mechanical systems. The experimental setups and procedures are detailed in the following two subsections.

Data Collection Experiments for the Fault Diagnosis
To validate the effectiveness of deep statistical feature learning for fault diagnosis, the proposed deep learning was applied to diagnose the health of two rotating mechanical systems. The experimental setups and procedures are detailed in the following two subsections.

Experimental Procedure for Gearbox Fault Diagnosis
The first experiments were carried out on a gearbox fault diagnosis system. As shown in Figure 4a, the output of a motor (3~, 2.0 HP, Siemens, Munich, Germany) was connected to the input shaft of a gearbox (fabricated by the lab of the Universidad Politécnica Salesiana, Cuenca, Ecuador) via a coupling. A 53-tooth pinion was installed on the input shaft of the gearbox, whose output shaft has an 80-tooth gear. An electromagnetic torque break (8.83 kW, Rosati, Monsano, Italy) was used as a load to connect with the output shaft of the gearbox via a belt transmission. The torque break was controlled by a controller (GEN 100-15-IS510, TDK-Lambda, Tokyo, Japan) which enabled the load to be adjusted manually. An accelerometer (ICP 353C03, PCB, Depew, NY, USA) was mounted on top of the gearbox to collect the vibration signals, which were sent to a laptop (Pavilion g4-2055la, HP, Palo Alto, CA, USA) through a data acquisition system (cDAQ-9234, NI, Austin, TX, USA). The laptop controlled an inverter (VLT 1.5 kW, Danfoss, Nordberg, Denmark) for adjusting the motor's rotation speed, which was monitored by a tachometer (VLS5/T/LSR optical sensor, Compact, Bolton, UK). The first experiments were carried out on a gearbox fault diagnosis system. As shown in Figure 4a, the output of a motor (3~, 2.0 HP, Siemens, Munich, Germany) was connected to the input shaft of a gearbox (fabricated by the lab of the Universidad Politécnica Salesiana, Cuenca, Ecuador) via a coupling. A 53-tooth pinion was installed on the input shaft of the gearbox, whose output shaft has an 80-tooth gear. An electromagnetic torque break (8.83 kW, Rosati, Monsano, Italy) was used as a load to connect with the output shaft of the gearbox via a belt transmission. The torque break was controlled by a controller (GEN 100-15-IS510, TDK-Lambda, Tokyo, Japan) which enabled the load to be adjusted manually. An accelerometer (ICP 353C03, PCB, Depew, NY, USA) was mounted on top of the gearbox to collect the vibration signals, which were sent to a laptop (Pavilion g4-2055la, HP, Palo Alto, CA, USA) through a data acquisition system (cDAQ-9234, NI, Austin, TX, USA). The laptop controlled an inverter (VLT 1.5 kW, Danfoss, Nordberg, Denmark) for adjusting the motor's rotation speed, which was monitored by a tachometer (VLS5/T/LSR optical sensor, Compact, Bolton, UK).  In the gearbox fault diagnosis experiments, in addition to one normal pinion and one normal gear, three different faulty gears and five different faulty pinions (shown in Figure 4b) were used to configure different condition patterns for the gearbox. The 10 different patterns shown in Table 1 were set for the collection of vibration signals.  In the gearbox fault diagnosis experiments, in addition to one normal pinion and one normal gear, three different faulty gears and five different faulty pinions (shown in Figure 4b) were used to configure different condition patterns for the gearbox. The 10 different patterns shown in Table 1 were set for the collection of vibration signals.
To challenge the fault diagnosis performance, three different load conditions (no load, small load, and large load), were manually set for each pattern. For each pattern and load condition, we collected 24 signals, each of which covered 0.4096 s, with a sampling frequency of 10 kHz. The experiments were repeated five times, so 3600 vibration signals corresponding to 10 condition patterns (with three different loads) were recorded. Each vibration signal was used to generate the temporal, spectral, and WPT representations M(p,q) given by Equations (6)-(8) were then used to generate the feature set F for the vibration signals. The 3600 feature sets were divided into a training dataset with 2400 samples and the testing dataset with 1200 samples. The unsupervised GRBM and the supervised GDBM were applied to learn the statistical features of the vibration signals. The statistical features represented by the unsupervised GRBM required an additional classifier for the pattern classification. Considering its excellent classification capability, the SVM was used as the classifier for the GRBM representations. For the GDBM, supervised deep learning as shown in Figure 3 was applied for the healthy condition pattern classification of the gearbox.

Experimental Procedure for Bearing Fault Diagnosis
To further challenge the deep statistical feature learning for fault diagnosis, we also carried out bearing fault diagnosis experiments. The gear fault patterns (displayed in Table 1) occupied areas of great damage, which introduced greater changes in the vibration measurements [44]. Compared to the vibration signal of the gear fault, an incipient bearing fault often has a smaller damage surface and thus generates weak vibration changes [45].
As shown in Figure 5a, a rolling element bearing test rig was constructed in the Universidad Politécnica Salesiana of Ecuador to collect the vibration measurements for different healthy conditions. The test rig was driven by a motor (3~, 2.0 HP, Siemens) controlled by an inverter (VLT 1.5 kW, Danfoss). The rotating speed of the motor was monitored by a tachometer (VLS5/T/LSR optical sensor, Compact). A steel shaft (φ30 mm) was connected to the motor via a coupling. The two ends of the shaft were supported by two bearings (bearing 1 and bearing 2, 1207 EKTN9/C3, SKF, Goteborg, Sweden). An accelerometer (ICP 353C03, PCB) was mounted on the housing (SNL 507-606, SKF) of bearing 2 for measuring the vibration signals, which were collected by a data acquisition box (cDAQ-9234, NI) that communicated with a laptop (Pavilion g4-2055la, HP). Two flywheels were installed on the shaft as the load of the system.
In addition to the normal bearings, as shown in Figure 5b, three different faulty bearings with an inner race fault, an outer race fault and a ball fault, were used in the experiments. Using combinations of bearings in different conditions, seven healthy condition patterns were set, as shown in Table 1. For each experiment with each pattern, there were respectively 0, 1 and 2 flywheels used as the load. For each pattern and load configuration, 48 signals were collected for 0.4096 s. Each experiment was repeated five times. This means that 5040 signals were finally obtained. The sampling frequency for the bearing fault diagnosis was also set at 10 kHz. Similar to the procedure described in the previous section, the statistical features were produced from the raw data of the bearing vibration signals. The unsupervised GRBM and the supervised GDBM were again applied for the fault diagnosis of the bearing system. The results of all the experiments are detailed in the next section. Similar to the procedure described in the previous section, the statistical features were produced from the raw data of the bearing vibration signals. The unsupervised GRBM and the supervised GDBM were again applied for the fault diagnosis of the bearing system. The results of all the experiments are detailed in the next section.    Similar to the procedure described in the previous section, the statistical features were produced from the raw data of the bearing vibration signals. The unsupervised GRBM and the supervised GDBM were again applied for the fault diagnosis of the bearing system. The results of all the experiments are detailed in the next section.  For generating the time-frequency domain representations, the WPT was applied to decompose the raw data up to four levels. There are 2, 4, 8 and 16 nodes for each level. Put all the nodes together, the WPT presentation and statistical features are displayed in Figure 8a,b, respectively. As the length  For generating the time-frequency domain representations, the WPT was applied to decompose the raw data up to four levels. There are 2, 4, 8 and 16 nodes for each level. Put all the nodes together, the WPT presentation and statistical features are displayed in Figure 8a,b, respectively. As the length of the raw signal is 4096 points, numbers of data points for a node at the four levels are 2052, 1030, 519 and 264, respectively. In this way, the number of data points as shown in Figure 8a is 16,600. For the WPT, there are 30 nodes each of which has nine features. This generates 270 features for the first signal as shown in Figure 8b.

Gearbox Fault Diagnosis Results
All the 3600 data have been disordered for the experiments. Among all the 3600 samples for each data, 2400 samples were random chosen as the training dataset F. To represent the statistical feature set F, we first applied the mon-layer GRBM with parameters as: number of the neurons in the hidden layer = 200, number of the learning epochs = 150, the initial learning rate = 0.001, its upper-bound = 0.001, and the weight decay = 0.005. As unsupervised learning of the GRBM does not have the classification function, a multi-class SVM classifier was applied to obtain the first fault diagnosis model (# 1 peer model). The reason for us to implement the model is to show the performance of the present deep learning. For # 1 peer model, the GRBM acts as the second feature representation tool (statistical features given by Equation (7) is the first one) for the vibration measurements. The outputs of the GRBM were fed into the SVM classifier. The supervised GDBM was subsequently applied for the same dataset F with parameters as: number of the neurons in the hidden layer 1 = 200, number of the neurons in the hidden layer 2 = 200, number of the pretraining epochs (for each constituent and the model) = 150, number of the fine-tuning epochs = 150, the initial learning rate = 0.001, its upper-bound = 0.001, and the weight decay = 0.005. In this way, we obtained the second fault diagnosis model (the proposed GDBM model). For comparison, the SVM classifiers for the original statistical features M(p,q), and the combination of M(p,q) and the GRBM representation were respectively developed as the third fault diagnosis model (#2 peer model) and the fourth one (#3 peer model). All the algorithms were realized using Matlab ® . One may note that in this work we have not employed more "shallow" learning models such as the decision tree, the random forest, and the neural network. The reason is that the SVM has been proven the prominent representative which outperforms most of the "shallow" learning members. of the raw signal is 4096 points, numbers of data points for a node at the four levels are 2052, 1030, 519 and 264, respectively. In this way, the number of data points as shown in Figure 8a is 16,600. For the WPT, there are 30 nodes each of which has nine features. This generates 270 features for the first signal as shown in Figure 8b. All the 3600 data have been disordered for the experiments. Among all the 3600 samples for each data, 2400 samples were random chosen as the training dataset F. To represent the statistical feature set F, we first applied the mon-layer GRBM with parameters as: number of the neurons in the hidden layer = 200, number of the learning epochs = 150, the initial learning rate = 0.001, its upper-bound = 0.001, and the weight decay = 0.005. As unsupervised learning of the GRBM does not have the classification function, a multi-class SVM classifier was applied to obtain the first fault diagnosis model (# 1 peer model). The reason for us to implement the model is to show the performance of the present deep learning. For # 1 peer model, the GRBM acts as the second feature representation tool (statistical features given by Equation (7) is the first one) for the vibration measurements. The outputs of the GRBM were fed into the SVM classifier. The supervised GDBM was subsequently applied for the same dataset F with parameters as: number of the neurons in the hidden layer 1 = 200, number of the neurons in the hidden layer 2 = 200, number of the pretraining epochs (for each constituent and the model) = 150, number of the fine-tuning epochs = 150, the initial learning rate = 0.001, its upperbound = 0.001, and the weight decay = 0.005. In this way, we obtained the second fault diagnosis model (the proposed GDBM model). For comparison, the SVM classifiers for the original statistical features M(p,q), and the combination of M(p,q) and the GRBM representation were respectively developed as the third fault diagnosis model (#2 peer model) and the fourth one (#3 peer model). All the algorithms were realized using Matlab ® . One may note that in this work we have not employed more "shallow" learning models such as the decision tree, the random forest, and the neural network. The reason is that the SVM has been proven the prominent representative which outperforms most of the "shallow" learning members.  With the trained models, the remaining 1200 samples (in the time, frequency, and time-frequency domains, respectively) were used to test the classification performances, which are displayed in Table 2.
From the diagnosis results shown in Table 2, it is clear that the classification rates for the time-frequency domain statistical features are higher (72.09% on average) than those for the time and frequency domains. This is due to the joint time and frequency representation of the WPT. When comparing the statistical features of the time and frequency domains, the time domain features are always the worst. Among all the models, deep statistical feature learning via the GDBM exhibits the best classification rate for the same data (62.58%, 91.75%, 95.17%, and 83.17% for the time, frequency, time-frequency domain statistical features, and the average, respectively). The best classification rate of 95.17% is seen with the GDBM model and time-frequency statistical features. Compared to supervised learning methods (e.g., the GDBM), the unsupervised GRBM displays the lowest classification rates (26.67%, 52.67%, 42.25%, and 41.33% for the time, frequency, time-frequency domain statistical features, and the average, respectively). Nevertheless, it should be noted that the GRBM used in this paper is an unsupervised algorithm, which shows that there is still some potential for fault diagnosis, if a fine-tuning procedure can be introduced for its learning process. As one of the most important "shallow" learning approaches, the SVM exhibited good classification results for the gearbox system. This result is similar to that of existing studies (e.g., [46]). When the GRBM representations are combined with the original statistical features M(p,q), a small increase in the classification rates can be seen (from 52.83% to 79.42% for the frequency domain, 69.50% to 78.42% for the time-frequency domain statistical features, and 61.05% to 64.56% on average). However, due to the "shallow" learning limit, it is very difficult to further improve the classification rate for the SVM. Our results indicate that deep statistical feature learning has the best performance for gearbox fault diagnosis. It should be noted that deep learning is much more time-consuming than classical learning methods.

Bearing Fault Diagnosis Results
For the bearing fault diagnosis experiments, 5040 vibration signals and their statistical features in the time domain are plotted in Figure 9a,b. The Fourier transform were then used to generate the frequency data and their statistical features as shown in Figure 9c Table 2. complex model with the largest number of parameters that must be estimated from the sample. It is an intrinsic drawback that "deeper" learning requires much more time than "shallow" learning does. As pointed out by LeCun et al. [48], the advent of fast graphics processing units (GPUs), which are convenient to program, allowed researchers to train deep networks 10 or 20 times faster. This indicates that parallel computation is helpful for reducing computation time. However, parallel computation is beyond the scope of this study. All the programs in this work were executed on a laptop. This resulted in much more computation time (hour-level) for the presented GDBM than its "shallow" counterparts (usually second-or minute-level on a laptop).

Remarks
Based on the fault diagnosis results as shown in the previous two subsections, one can see that the deep statistical feature learning holds the best classification performance comparing to the peer models. During the experiments, there were some very aberrant values (outliers) collected from the experimental setups, because the outliers are always unavoidable for real applications. It is obvious that the outliers may lead to deterioration of the fault diagnosis. However, we did not remove those outliers from the dataset, even if the removal of the outliers may increase the classification rates.
It should be noted that the given parameters also play important roles to the GDBM model. As indicated by Cho et al. [35], the training procedure of the GDBM can easily run into problems without careful selection of the learning parameters. Upon determining the network structure for different layers, therefore, the learning epochs for the pretraining and the fine-tuning will be directly related to the classification performance. In this subsection we will discuss the influence of the epochs for the pretraining and the fine-tuning procedure. We first adjusted the number of pretraining epochs (for the GRBMs and the presented model) with all the other parameters fixed. Figure 10a plots the change of the fault classification rates in response to the increase of the pretraining epochs for the WPT features. For the fault diagnosis of both the gearbox and the bearing systems, the number of A comparison of the feature performances in the different domains in Table 2 suggests that the time-frequency domain features exhibit the best performances (58.84%, 91.75%, 81.53% and 82.70% with the GRBM, GDBM, SVM and GRBM-SVM models, respectively), and the time domain features have the lowest classification rate (45.17% on average for all models). Compared with the gearbox fault diagnosis, the fault features for the rolling element bearings are more evident in the frequency domain, especially in the high frequency resonance band [47]. However, the model comparison results for the bearing fault diagnosis are almost same as those for the gearbox fault diagnosis. Among all the peer models, the deep statistical feature learning model (the GDBM) has the best classification rate (60.63% for the time domain, 87.57% for the frequency domain, 91.75% for the time-frequency domain, and 79.98% on average). This again validates the effectiveness of deep statistical feature learning for fault diagnosis in rotating machinery. Nevertheless, the improvement in fault diagnosis performance with deep learning is at the cost of complexity. The present GDBM is the most complex model with the largest number of parameters that must be estimated from the sample. It is an intrinsic drawback that "deeper" learning requires much more time than "shallow" learning does. As pointed out by LeCun et al. [48], the advent of fast graphics processing units (GPUs), which are convenient to program, allowed researchers to train deep networks 10 or 20 times faster. This indicates that parallel computation is helpful for reducing computation time. However, parallel computation is beyond the scope of this study. All the programs in this work were executed on a laptop. This resulted in much more computation time (hour-level) for the presented GDBM than its "shallow" counterparts (usually second-or minute-level on a laptop).

Remarks
Based on the fault diagnosis results as shown in the previous two subsections, one can see that the deep statistical feature learning holds the best classification performance comparing to the peer models. During the experiments, there were some very aberrant values (outliers) collected from the experimental setups, because the outliers are always unavoidable for real applications. It is obvious that the outliers may lead to deterioration of the fault diagnosis. However, we did not remove those outliers from the dataset, even if the removal of the outliers may increase the classification rates.
It should be noted that the given parameters also play important roles to the GDBM model. As indicated by Cho et al. [35], the training procedure of the GDBM can easily run into problems without careful selection of the learning parameters. Upon determining the network structure for different layers, therefore, the learning epochs for the pretraining and the fine-tuning will be directly related to the classification performance. In this subsection we will discuss the influence of the epochs for the pretraining and the fine-tuning procedure. We first adjusted the number of pretraining epochs (for the GRBMs and the presented model) with all the other parameters fixed. Figure 10a plots the change of the fault classification rates in response to the increase of the pretraining epochs for the WPT features. For the fault diagnosis of both the gearbox and the bearing systems, the number of pretraining epochs does not influence the classification very much. This means that even for a small number (e.g., 10) of the epochs, the pretraining can achieve good effect.

Remarks
Based on the fault diagnosis results as shown in the previous two subsections, one can see that the deep statistical feature learning holds the best classification performance comparing to the peer models. During the experiments, there were some very aberrant values (outliers) collected from the experimental setups, because the outliers are always unavoidable for real applications. It is obvious that the outliers may lead to deterioration of the fault diagnosis. However, we did not remove those outliers from the dataset, even if the removal of the outliers may increase the classification rates.
It should be noted that the given parameters also play important roles to the GDBM model. As indicated by Cho et al. [35], the training procedure of the GDBM can easily run into problems without careful selection of the learning parameters. Upon determining the network structure for different layers, therefore, the learning epochs for the pretraining and the fine-tuning will be directly related to the classification performance. In this subsection we will discuss the influence of the epochs for the pretraining and the fine-tuning procedure. We first adjusted the number of pretraining epochs (for the GRBMs and the presented model) with all the other parameters fixed. Figure 10a plots the change of the fault classification rates in response to the increase of the pretraining epochs for the WPT features. For the fault diagnosis of both the gearbox and the bearing systems, the number of pretraining epochs does not influence the classification very much. This means that even for a small number (e.g., 10) of the epochs, the pretraining can achieve good effect. The pretraining epochs were subsequently set at 150 to adjust the number of the fine-tuning epochs between 10 and 250. The fault classification rates for the two mechanical systems were displayed in Figure 10b. With the increase of the fine-tuning epochs, the classification rates for both experiments improve accordingly. For the gearbox diagnosis experiments, the improvement goes The pretraining epochs were subsequently set at 150 to adjust the number of the fine-tuning epochs between 10 and 250. The fault classification rates for the two mechanical systems were displayed in Figure 10b. With the increase of the fine-tuning epochs, the classification rates for both experiments improve accordingly. For the gearbox diagnosis experiments, the improvement goes slowly after 125 epochs. As for the bearing systems, the classification rate increases still evidently before 200 epochs. Figure 10a,b prove that the deep statistical feature learning using the GDBM is not very sensitive to the learning parameters. Even though, a careful selection of the model parameters will be helpful in improving the fault pattern classification for the rotating machinery. This means that the proposed method has an essential improvement potential for the fault diagnosis of the rotating machinery. Figure 11a-c plots the comparisons between the real fault patterns and the classified patterns for the bearing fault diagnosis in the time, the frequency and the time-frequency domains, respectively. It is shown that some signals are correctly classified by the proposed method in some domains, but are misclassified in other domains. Let's take 10 vibration signals (#286~#295) of the bearing fault diagnosis as an example. The time domain diagnosis misclassified #295 signal, while both the frequency domain and the time-frequency domain obtained right classification. The frequency domain diagnosis misclassified #290 signal, but all the other two domains are correct. The time-frequency domain diagnosis misclassified #286 signal, while all the rest domains are right. If one considers the three domains simultaneously, therefore, all the 10 vibration signals (#286~#295) can be right diagnosed. This shows that the combination of the diagnosis results may contribute better classification rates. Though out of our scope in this paper, this discussion encourages us that further potentials can be explored for the proposed fault diagnosis approach. slowly after 125 epochs. As for the bearing systems, the classification rate increases still evidently before 200 epochs. Figure 10a,b prove that the deep statistical feature learning using the GDBM is not very sensitive to the learning parameters. Even though, a careful selection of the model parameters will be helpful in improving the fault pattern classification for the rotating machinery. This means that the proposed method has an essential improvement potential for the fault diagnosis of the rotating machinery. Figure 11a-c plots the comparisons between the real fault patterns and the classified patterns for the bearing fault diagnosis in the time, the frequency and the time-frequency domains, respectively. It is shown that some signals are correctly classified by the proposed method in some domains, but are misclassified in other domains. Let's take 10 vibration signals (#286~#295) of the bearing fault diagnosis as an example. The time domain diagnosis misclassified #295 signal, while both the frequency domain and the time-frequency domain obtained right classification. The frequency domain diagnosis misclassified #290 signal, but all the other two domains are correct. The timefrequency domain diagnosis misclassified #286 signal, while all the rest domains are right. If one considers the three domains simultaneously, therefore, all the 10 vibration signals (#286~#295) can be right diagnosed. This shows that the combination of the diagnosis results may contribute better classification rates. Though out of our scope in this paper, this discussion encourages us that further potentials can be explored for the proposed fault diagnosis approach.

Conclusions
In this paper, a deep statistical feature learning for vibration measurement has been proposed to diagnose fault patterns in rotating machinery. The statistical feature set was first extracted from the time, frequency, and time-frequency domains of the vibration signals. The real-valued RBMs were then stacked to develop a GDBM to accommodate statistical feature learning. Two typical rotating machinery systems (a gearbox and bearing test rigs), were constructed to validate the proposed approach, which was used for fault classification in the three-domain feature sets. The results show

Conclusions
In this paper, a deep statistical feature learning for vibration measurement has been proposed to diagnose fault patterns in rotating machinery. The statistical feature set was first extracted from the time, frequency, and time-frequency domains of the vibration signals. The real-valued RBMs were then stacked to develop a GDBM to accommodate statistical feature learning. Two typical rotating machinery systems (a gearbox and bearing test rigs), were constructed to validate the proposed approach, which was used for fault classification in the three-domain feature sets. The results show that deep statistical feature learning is capable of classifying fault patterns at higher rates than other models. Compared with the unsupervised GRBM, the SVM and the combined SVM and GRBM models, the deep statistical feature learning by the GDBM consistently had clearly better performances. This means that deep learning with statistical feature representation is a feasible update of conventional methods. The results also reveal that the statistical features in the time, frequency and time-frequency domains have different representation capabilities for fault patterns. Our further work will focus on optimizing the statistical features in different domains for different diagnostic applications.