Bearing Fault Diagnosis under Variable Rotational Speeds Using Stockwell Transform-Based Vibration Imaging and Transfer Learning

In this paper, discrete orthonormal Stockwell transform (DOST)-based vibration imaging is proposed as a preprocessing step that supports load- and rotational speed-invariant scenarios for signals of various health conditions. For any health condition, features can easily be extracted from its generated health pattern. To automate the feature selection process, a convolutional neural network (CNN)-based transfer learning (TL) approach for diagnosis is also introduced. Transfer learning allows an established model to use feature knowledge obtained under one set of working conditions, stored in its hidden layers, to diagnose faults that occur under other working conditions. The network learns from the massive source dataset, and that knowledge is applied to the target data to identify faults. Using the bearing dataset of Case Western Reserve University, the proposed approach yields an average classification accuracy of 99.8% and, specifically, 99.99% for healthy condition (HC), 99.95% for inner race fault (IRF), 99.96% for ball fault (BF), 99.68% for outer race fault for 12 o’clock sensor position (ORF@12), 99.93% for outer race fault for 3 o’clock sensor position (ORF@3), and 99.89% for outer race fault for 6 o’clock sensor position (ORF@6). The proposed approach is compared with conventional artificial neural networks (ANNs), support vector machines (SVMs), hierarchical CNNs, and deep autoencoders. The proposed approach outperforms these conventional methods in accuracy under all working conditions.


Introduction
In electromechanical engineering, motion is usually produced by mechanical devices (e.g., rotating machines or induction motors), which account for nearly 70% of the gross energy consumption in modern manufacturing economies [1,2]. Rotating machinery with induction motors uses bearings to reduce friction, which saves otherwise wasted energy and increases the useful life of a machine. Nevertheless, harsh operating environments and cyclic loading can lead to substantial wear in bearings, which appears in the form of surface cracks [3]. If these surface cracks go undetected, they can lead to unexpected shutdowns, resulting in financial losses as well as human injuries [4-7].
Bearings play an important role in condition monitoring because they are the root cause of more than 50% of induction motor failures [2,8]. Health state diagnosis under variable operating conditions can be employed to keep machines operating smoothly. In real time, bearing fault diagnosis is performed on collected data (i.e., vibration acceleration signals, acoustic emission signals, and motor currents), and it has been an important area of study over the last few decades [6,9-11]. These fault diagnosis studies have shown that diagnosing bearing wear can reduce maintenance expenses and enhance machine reliability [6,12-16]. Bearing fault diagnosis studies have extensively investigated vibration signals and motor current analysis [17,18]. Multiple signature analysis of vibrations and motor currents has also been researched to improve reliability [19].
Bearing faults occur primarily due to localized imperfections, i.e., cracks or spalls. These flaws create shocks and excite high-frequency resonances of the bearings and machine assembly due to their recurring impact on spinning parts [17]. Signal analysis approaches, such as the fast Fourier transform (FFT) and the short-time Fourier transform (STFT), which convert time-domain signals into frequency-domain signals for further analysis, have been employed to address these issues [20]. However, inappropriate time-window adjustments degrade the performance of these methods. Another powerful signal processing approach that has been considered is wavelet analysis [21]. However, wavelet analysis provides good time but poor frequency resolution at high frequencies, and good frequency but poor time resolution at low frequencies. Adjusting the window size is often a solution for gathering important information from vibration signals [22]. Enveloping with energy kurtosis [23] is another approach. However, detecting the envelopes of real signals does not always rely on analytic signals [24]. Existing signal processing techniques have employed feature analysis approaches to extract statistical features, which are then classified by classical machine learning algorithms such as the support vector machine (SVM) and the proximal support vector machine (PSVM) [25,26]. In addition to these classical approaches, several deep learning techniques have been explored to extract features automatically through their hidden layer architectures for bearing fault classification. Examples include deep autoencoders [27,28], artificial neural networks (ANNs), and hierarchical convolutional networks [29]. These methods established several directions for fault classification without relying on handcrafted features. Due to the limited experimental data, however, these methods are not well suited to dynamic invariant scenarios.
In the present work, the feature extraction process is automated by using deep neural network techniques. In contrast to existing methods, this study focuses on a signal processing technique that creates an invariant scenario under different load and rpm levels, so that features can be extracted automatically by an advanced neural network mechanism. To this end, the Stockwell transform (S-transform) is employed in this paper [30]. The S-transform is a time-windowed Fourier transform that has the advantages of both the STFT and the wavelet transform. Moreover, the problem of selecting window sizes, i.e., selecting a long window for low-frequency components and a short window for high-frequency components, is balanced in the S-transform by an adjustable window size with a tunable parameter that can yield good frequency and time localization for each signal. Discrete orthonormal Stockwell transform (DOST)-based signal stacking is proposed for generating a pattern for each fault type of a bearing.
In short, this study presents a new approach for bearing diagnosis under variable speed and load conditions that employs vibration signals to address two key limitations of existing methods: (1) under different operational speeds, proper feature selection requires domain-level expertise, and (2) automating optimal feature selection requires special dynamic algorithms. Instead of selecting optimal features directly from the one-dimensional (1D) signals, two-dimensional (2D) vibration images are generated by employing DOST-based stacking to explore the patterns of different bearing states. The proposed 2D vibration imaging creates identical patterns for the same health type, so variable operating speeds do not affect the pattern of a given health state. From these 2D images, the feature selection process is automated by employing a transfer learning (TL)-based convolutional neural network (CNN). Compared to traditional methods, conventional neural networks make the feature selection process much easier through their convolutional encoding layers [17,31,32].
In [33], Zheng et al. introduced a TL-based artificial neural network (ANN) for a bearing's raw vibration signals. However, with a raw signal, this approach cannot discover the critical features that need to be transferred to the knowledge domain for further classification under different loads and speeds. Inadequate raw signal data extracted from mechanical sensors lead to incomplete observation of critical patterns by the neural networks. Because the available data are limited, preprocessing steps are necessary to create invariant patterns. Due to white noise in the signals, it is difficult to separate the exact information from additional properties mixed with the domain data. The learning from one working condition is later passed to another working condition through transfer learning, which is how the network can learn from both working conditions and be fine-tuned by adjusting its weights on top of the previous learning. In this study, after creating an invariant scenario with 2D DOST-based vibration imaging, a TL-based approach is executed to resolve these challenges. Details about the TL are discussed in the Methodology section along with a description of the proposed CNN architecture.
The main contributions of the current work are therefore: (1) the formation of identical health patterns for each health type by employing DOST-based vibration imaging to create load-invariant and rpm-invariant scenarios, and (2) a transfer learning-based convolutional neural network approach that automates feature extraction from those identical health patterns in a short amount of training time.
The rest of this paper is organized as follows. Section 2 describes the proposed methodology. Section 3 describes the dataset and discusses the experimental results analysis, including comparisons to establish the robustness of the proposed method. Finally, Section 4 concludes this paper.

Methodology
The proposed methodology comprises three major tasks: (a) the source task, (b) the element transfer, and (c) the target task. The source and target tasks each have three common steps that lead to the convolutional layers of the proposed neural network architecture: (a) data input, (b) vibration imaging, and (c) the convolutional neural network (CNN). Vibration imaging works as the preprocessing step that generates identical patterns from the input data, and the CNN is used to save and transfer knowledge between the source and target tasks to classify bearing faults. Figure 1 illustrates the overall proposed mechanism.

Vibration Imaging Using Discrete Orthonormal Stockwell Transform (DOST)
In transfer learning (TL), the network is first trained with one working condition (source task), and then that learning is passed to another working condition (target task). Processing the substantial amount of 1D vibration data from sensors requires massive computational cost and time. Therefore, an adjustable sliding window frame [33] mechanism is adopted with the goal of (a) creating a massive amount of source data to train the network efficiently, and (b) handling the large amount of data for further processing so that it fits the proposed network.
If the total length of the acquired vibration signal is $L_t$, then the total number of samples $n_t$ is given by Equation (1), where $L_f$ is the length of a single frame, selected based on the experimental requirements, and $L_s$ is the step size. The length of the vibration signal $L_t$ is fixed in this study. The overall process of adjusting the sliding window frame is shown in Figure 2.
With a 1D signal, it is very difficult to observe identical patterns for different health types. Moreover, recent studies [33,34] claim preprocessing-free transfer learning. Given the limited amount of data available for mechanical machines, however, network learning in a transfer scenario remains questionable. Large amounts of data with many variations can help a network gather intrinsic information, which can in turn help to classify data from other mechanical sensors. In this scenario, TL is used to reduce training time while boosting performance. However, TL alone cannot resolve the challenges of raw signals because the open-source data are insufficient. Also, because of additive white noise, the time-domain information is not always accurate. Preprocessing combined with a TL-based approach, rather than with classical deep learning, can result in more robust performance, because a limited portion of data supplemented with some additional information can help adjust the network weights to learn more accurate intrinsic feature information for classification. The current study focuses mainly on this idea. A 1D raw signal cannot create invariant, identical scenarios. Two-dimensional imaging is developed to achieve scalable feature engineering that processes heterogeneous data from systems with invariant working principles. In this study, discrete orthonormal Stockwell transform (DOST)-based stacking for vibration imaging is employed as a preprocessing step in the proposed approach.
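The sliding-window framing can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the common frame count $n_t = \lfloor (L_t - L_f)/L_s \rfloor + 1$, which may differ from the paper's Equation (1), and the function names `frame_count` and `sliding_frames` as well as the toy values are hypothetical.

```python
def frame_count(total_len, frame_len, step):
    """Number of full frames, assuming floor((L_t - L_f) / L_s) + 1."""
    if frame_len <= 0 or step <= 0:
        raise ValueError("frame_len and step must be positive")
    if total_len < frame_len:
        return 0
    return (total_len - frame_len) // step + 1


def sliding_frames(signal, frame_len, step):
    """Split a 1D sequence into (possibly overlapping) frames of equal length."""
    n = frame_count(len(signal), frame_len, step)
    return [signal[i * step : i * step + frame_len] for i in range(n)]


# Toy signal: 10 samples, frames of 4 samples, step of 2 (illustrative values only).
signal = list(range(10))
frames = sliding_frames(signal, frame_len=4, step=2)
```

Under this assumption, a smaller step size produces more overlapping frames from the same recording, which is one way a fixed-length signal can yield a large number of training samples.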
The Stockwell transform (ST) was first proposed by R. G. Stockwell [30]. It represents a signal in time and frequency using local spectral phase properties. The ST uniquely combines a frequency-dependent resolution of the time-frequency space with absolutely referenced local phase information. In other words, the ST is like a fusion of the Gabor transform and the wavelet transform. In practice, this time-frequency decomposition tool overcomes some of the drawbacks of the short-time Fourier transform (STFT), i.e., it provides better time-frequency resolution. In [30], R. G. Stockwell described a method to decompose a signal on a discrete orthogonal basis for the ST (DOST basis). In this study, the DOST basis is used to form 2D images with identical patterns. To perform the orthogonal transform, this method uses the discrete Fourier transform (DFT) [3].
First, the method performs a fast Fourier transform on each segmented windowed sample. Then, it partitions the resulting spectrum into dyadic frequency bands for the DOST, ranging from negative to positive frequencies. Next, it calculates the inverse Fourier transform on each bandwidth partition. Finally, it combines all these partitions to form the final DOST output. Figure 3 shows the process of this DOST basis method. If the transform of a 1D function of time $t$ at a constant frequency $f$ is represented as $V_s(\tau, f)$, then the continuous ST [30,35] of a function $h(t)$ is given by Equation (2), which describes how the amplitude and phase at this frequency change over time. In the discrete case, there are computational advantages to using the equivalent frequency-domain definition of the ST. However, as R. G. Stockwell pointed out, the standard ST is computationally expensive, with a cost of $O(N^2)$ [30].
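For reference, the continuous S-transform is commonly written in the literature in the form below. It is given here as an assumption about what the paper's Equation (2) states, using the symbol $V_s(\tau, f)$ from the text; the paper's exact notation may differ.

```latex
V_s(\tau, f) = \int_{-\infty}^{\infty} h(t)\,
  \frac{|f|}{\sqrt{2\pi}}\,
  e^{-\frac{(\tau - t)^2 f^2}{2}}\,
  e^{-i 2\pi f t}\, dt
```

In this form the Gaussian window has a standard deviation of $1/|f|$, so it is wide for low frequencies and narrow for high frequencies, which is the frequency-dependent resolution described above.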
In [36], U. Battisti et al. demonstrated that the DOST basis is not suited to a standard Gaussian window. Later, in [37], Y. Wang et al. proposed a fast algorithm for the ST with a time complexity of $O(N \log N)$. U. Battisti et al. [36] extended that work and provided an adapted basis to decompose the ST with a general admissible window. Consider a signal $f$ with finite energy, i.e., $f \in L^2(\mathbb{R})$. If $w$ is a window in $L^2(\mathbb{R})$, then the S-transform $S_w f$ is defined as a function of $b, \xi \in \mathbb{R}$. In [36], U. Battisti et al. assumed that one could find a basis $E^w_p$ of $L^2([0,1])$, where the window $w$ is a matter of choice. Then, $S_w f$ can be expressed in terms of this basis. They proved that an orthonormal basis of $L^2([0,1])$ can satisfy the condition that the ST is local in time and frequency, and that its coefficients can be computed by a fast algorithm with $O(N \log N)$ time complexity.
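The four construction steps of the DOST (Fourier transform, dyadic frequency partitioning, inverse transform per band, concatenation) can be sketched as follows. This is a simplified illustration under stated assumptions, not the window-adapted basis of [36]: it uses the plain dyadic band layout, omits the phase-correction factors that some formulations include, and replaces the FFT with a naive unitary DFT so that the example needs only the standard library. The helper names `dft`, `dyadic_bands`, and `dost` are hypothetical.

```python
import cmath
import math


def dft(x, inverse=False):
    """Unitary discrete Fourier transform (naive O(N^2); adequate for a sketch)."""
    n = len(x)
    sign = 1 if inverse else -1
    scale = 1 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dyadic_bands(n):
    """(start, stop) index ranges of dyadic bands over DFT bins 0..n-1.

    Positive side: [0], [1], [2, 3], [4..7], ...; the bin n/2 stands alone;
    the negative-frequency side mirrors the positive one.
    """
    if n < 4 or n & (n - 1):
        raise ValueError("n must be a power of two >= 4")
    pos = [(0, 1)]
    lo = 1
    while lo < n // 2:
        pos.append((lo, 2 * lo))
        lo *= 2
    neg = [(n - b + 1, n - a + 1) for (a, b) in reversed(pos[1:])]
    return pos + [(n // 2, n // 2 + 1)] + neg


def dost(x):
    """Transform, partition into dyadic bands, inverse-transform each band, concatenate."""
    spectrum = dft(x)
    coeffs = []
    for start, stop in dyadic_bands(len(x)):
        coeffs.extend(dft(spectrum[start:stop], inverse=True))
    return coeffs


# Toy frame of 16 points (the paper uses frames of 1024 points).
signal = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
coeffs = dost(signal)
```

Because every step in this sketch is unitary, the output has the same length and the same energy as the input frame. In practice, an FFT routine would replace the naive `dft` to reach the $O(N \log N)$ cost mentioned in the text.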
To form the identical patterns, the DOST outputs of the segmented signals of each type are stacked together. These stacks of preprocessed signals form identical patterns for each health type. When a huge number of segmented signals are stacked, the resulting images are large. To address this issue, the height of the stacked signal is fixed to generate identical images in the experiment. If the total number of segmented signals is $M$ and each segment has length $Q$, then the size of the image is $M \times Q$ (height × width). This height is very large and makes the image challenging to feed to the proposed network. To generate small images, $m$ of the $M$ segmented samples (where $m < M$ and $M > 8$) are considered at a time. Then, all the resized images are batched together and fed into the network. In this study, $m = 8$ generates good resolution. A larger $m$ is possible, but note that increasing $m$ reduces the total number of samples.
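The stacking step can be sketched as below. This is only an illustration: it assumes that consecutive, non-overlapping groups of $m$ transformed segments form one image and that leftover segments are discarded, neither of which is stated in the paper; the function name `stack_into_images` is hypothetical.

```python
def stack_into_images(segments, m=8):
    """Group consecutive transformed segments into images of m rows each.

    Leftover segments that do not fill a whole image are dropped (an assumption).
    """
    if m <= 0:
        raise ValueError("m must be positive")
    n_images = len(segments) // m
    return [segments[i * m : (i + 1) * m] for i in range(n_images)]


# Toy example: 20 segments of length 4 (the paper uses segments of 1024 points).
segments = [[float(i)] * 4 for i in range(20)]
images = stack_into_images(segments, m=8)
```

With 20 segments and m = 8, this yields two images of size 8 × 4 and drops four segments, which illustrates why a larger m reduces the number of available samples.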

Transfer Learning with Convolutional Neural Network (CNN)
The prime goal of transfer learning (TL) is to improve the performance of a target task by using the knowledge from a source task. The source and target tasks may be similar or different. In this study, the source and target domains have similar feature spaces because the data types are similar except for the rotational speed of the bearings. Transfer learning is employed here to reduce the overhead of training the network. In practice, TL creates a robust fault diagnosis process [38]. Figure 4 illustrates the TL concept. TL methods depend on machine-learning algorithms for learning the task. In this study, inductive learning is used. In inductive learning, TL is combined with an inference algorithm (i.e., a neural network) that acts as the classifier for a classical classification problem. These inference algorithms, that is, the deep neural networks, can automate the feature selection process through their hidden layers and then classify. In addition, they can preserve the knowledge of their learning and use it for further tasks. Convolutional neural networks (CNNs) are considered one of the most effective feed-forward supervised machine-learning networks. One more reason for choosing a CNN as the deep neural network for this study is that a large-scale CNN has the potential to be the most effective of the deep learning and classical methods. Moreover, using the transferred knowledge obtained from TL, the large-scale CNN can achieve excellent performance in the considered scenarios with only a small set of training data [33,34,38].
A CNN is used to automate the detailed feature selection process for a given set of data, as well as to save the source knowledge for further use in the target domain. A CNN has a hierarchical construction and is composed of several convolutional, subsampling, and fully connected layers. This network makes optimal use of local connections (instead of fully connected layers), weight sharing, and spatial or temporal subsampling to achieve invariance to shifting, scaling, and distortion of its inputs [39,40]. In this study, as TL and the CNN are designed together, the learning of the $u$ well-trained layers is passed to the $v$ layers of the target network, where $u < v$, $v = u + 1$, and layer number $u + 1$ denotes the output layer or last layer. Thus, only the last layer of the target network is trained, and its output accuracy builds on the learning from the source network [34].
For an intuitive mathematical overview of the CNN, let $P$ and $Q$ be the input and output vectors of the network, respectively. The model has three layers (input, hidden, and output), where the hidden vector is $H$. In the feed-forward pass, $H$ is computed from $P$, and $Q$ is computed from $H$, where $w_1$ is the weight matrix between the input and hidden layers, $w_2$ is the weight matrix between the hidden and output layers, $b_1$ and $b_2$ represent the bias vectors of the hidden and output layers, respectively, and $\sigma(\cdot)$ is the sigmoid activation function. The loss function $F_L$ measures the error between the network outputs and the targets, where $m_i$ represents the target vector and $n$ denotes the number of training samples. The aim of the CNN is to minimize the loss function $F_L$ through back propagation and gradient descent [33]. In this study, the architecture of the neural network contains nine layers along with one output layer (see Figure 5). The max-pooling layer subsamples the output of the convolution layer.
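The three-layer overview can be illustrated with a small numerical sketch. It assumes the usual feed-forward form $H = \sigma(w_1 P + b_1)$ and $Q = \sigma(w_2 H + b_2)$, and it assumes a mean squared error for $F_L$; the paper's exact loss expression is not reproduced here, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Sigmoid activation, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, vec, bias):
    """Compute weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def forward(p, w1, b1, w2, b2):
    """Feed-forward pass: H = sigma(w1 P + b1), Q = sigma(w2 H + b2)."""
    h = [sigmoid(z) for z in affine(w1, p, b1)]
    q = [sigmoid(z) for z in affine(w2, h, b2)]
    return h, q


def mse_loss(outputs, targets):
    """Assumed loss: mean over samples of the squared error between Q_i and m_i."""
    n = len(outputs)
    total = 0.0
    for q_i, m_i in zip(outputs, targets):
        total += sum((q - m) ** 2 for q, m in zip(q_i, m_i))
    return total / n


# Toy network: 2 inputs, 2 hidden units, 1 output, all parameters set to zero.
w1, b1 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
w2, b2 = [[0.0, 0.0]], [0.0]
h, q = forward([1.0, -1.0], w1, b1, w2, b2)
loss = mse_loss([q], [[1.0]])
```

With all-zero parameters every sigmoid returns 0.5, so the toy loss against a target of 1.0 is 0.25; gradient descent would then adjust $w_1$, $w_2$, $b_1$, and $b_2$ to reduce it.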
A dropout layer is added to the network to avoid over-fitting. In this study, the Adam optimizer is used for fine-tuning the network, and a SoftMax classifier is used to improve the classification performance. In TL, the learning of these hidden layers is then passed to the target task to boost its learning. For the neural network, this learning is stored in the form of weights. The layers that are transferred to the target network are listed in Table 1.
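The idea of passing stored weights to the target network can be sketched as follows. This is a schematic under stated assumptions, not the paper's training code: layers are represented as plain dictionaries, transferred layers are assumed to be frozen, and a plain gradient step stands in for the Adam optimizer to keep the example short. The names `transfer_layers` and `sgd_step` are hypothetical.

```python
import copy


def transfer_layers(source_layers, target_layers, n_transfer):
    """Copy the first n_transfer layers' weights from source to target and freeze them."""
    if n_transfer > min(len(source_layers), len(target_layers)):
        raise ValueError("n_transfer exceeds the number of layers")
    for src, dst in zip(source_layers[:n_transfer], target_layers[:n_transfer]):
        dst["weights"] = copy.deepcopy(src["weights"])
        dst["trainable"] = False  # assumption: transferred layers stay fixed
    for dst in target_layers[n_transfer:]:
        dst["trainable"] = True
    return target_layers


def sgd_step(layers, grads, lr=0.1):
    """Plain gradient step that skips frozen layers (stand-in for Adam)."""
    for layer, grad in zip(layers, grads):
        if layer["trainable"]:
            layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grad)]


# Toy networks with three layers; the first two are transferred.
source = [{"weights": [1.0, 2.0]}, {"weights": [3.0]}, {"weights": [4.0]}]
target = [{"weights": [0.0, 0.0]}, {"weights": [0.0]}, {"weights": [0.0]}]
transfer_layers(source, target, n_transfer=2)
sgd_step(target, grads=[[1.0, 1.0], [1.0], [1.0]])
```

After the step, the two transferred layers still hold the source weights and only the last layer has moved, which mirrors the description that the last layer of the target network is the one being trained.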

Dataset and Experimental Working Conditions
To evaluate the proposed method, the publicly available seeded fault bearing dataset from the Case Western Reserve University (CWRU) Bearing Data Center [41] is used. The data were collected using a 2-horsepower (hp) Reliance Electric motor with a torque transducer and a dynamometer to apply different loads, ranging from 0 hp to 3 hp. The rotational speed of the motor correspondingly varied from 1797 rpm to 1730 rpm. Drive-end bearings were seeded with defects on the inner raceways, outer raceways, and rolling elements by electro-discharge machining, as shown in Figure 6. The dataset consists of healthy condition (HC), inner raceway fault (IRF), ball fault (BF), and outer raceway fault (ORF) signals under the considered working conditions. The ORF has three variants: ORF at the center, ORF at the orthogonal, and ORF at the opposite positions. In this study, the variable-length vibration acceleration signals were recorded at 12,000 samples/second (Hz) for the drive-end bearings. To segment the signals with an adjustable sliding window, the proposed method uses 1024 data points for every frame. After that, the invariant conditions are established through vibration imaging via DOST-based stacking. As discussed earlier, after stacking, the dimension of an individual sample becomes 8 × 1024. To evaluate the performance of this speed-invariant experiment, four different working conditions are considered. Under each set of working conditions, six different health types are included. Table 2 presents the details of the working conditions. The network is trained for a total of 300 epochs for performance assessment.

Analysis of 2D Vibration Imaging
As described earlier, there are several reasons to employ 2D vibration imaging. One reason is to create an invariant scenario for different loads and rpms. For each health type (HC, IRF, BF, ORF at center, ORF at orthogonal, and ORF at opposite), the 2D imaging exhibits identical patterns across working conditions. Another reason is to provide the CNN with the full advantages of the 2D structure, with visibility of identical patterns. A further reason is to be able to visually distinguish between the source and target domains to reduce negative transfer learning. Figure 7a-f shows the output of vibration imaging for each health type under different working conditions. Using the S-transform, the frequency domain is divided into several bandwidths, providing micropatterns based on bandwidth. On close observation, the rpm and load do not affect the patterns of the signals under different working conditions. Across different rpm and load levels, the pattern for each health condition does not vary, which demonstrates the similarity among all the working conditions for each health type.

Diagnostic Performance of the Proposed TL-Based Method
To validate the performance of the proposed TL-based method, the available data are divided into four different working conditions (WC), denoted WC 1, 2, 3, and 4 (described in Table 2). These datasets contain different speeds and loads, but the same types of health condition. The rpm invariance of this approach is validated by examining four separate scenarios. In the first scenario, WC 1 is used for training the network and saving the knowledge, and WC 2, 3, and 4 use that learning to perform the classification tests. From WC 2, 3, and 4, 10% of the data are used for adjusting the network to use the prior knowledge. In the second scenario, WC 2 is used for gathering the knowledge, and WC 1, 3, and 4 are used for TL-based testing. In the same manner, in Scenario 3, WC 3 is used for knowledge gathering, and in Scenario 4, WC 4 is used. The corresponding other working conditions are used for testing and classification. In each scenario, one dataset is known to the network, and the TL makes the later learning faster and more efficient, and achieves better accuracy. These tests validate the proposed approach. Table 3 lists the details of the diagnostic performance of the proposed approach. To evaluate the performance of the proposed approach further, the same experiment is also conducted with different amounts of training data. In this study, 80% of the training data are considered from the source dataset for each scenario (Table 2). With different percentages of training data, the overall classification accuracy varies as provided in Table 4.
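The scenario layout can be written out programmatically. The sketch below is illustrative only: it pairs each working condition, taken as the source, with the remaining three as targets, and it assumes that the 10% adaptation subset is taken from the start of each target set and that the rest is used for testing. The function names and the sample count of 200 are hypothetical.

```python
def build_scenarios(conditions):
    """Pair each working condition (source) with all remaining ones (targets)."""
    return {src: [wc for wc in conditions if wc != src] for src in conditions}


def split_for_adaptation(samples, fraction=0.10):
    """Reserve the first `fraction` of target samples for fine-tuning, the rest for testing.

    A real experiment would typically shuffle the samples first.
    """
    k = int(len(samples) * fraction)
    return samples[:k], samples[k:]


scenarios = build_scenarios(["WC1", "WC2", "WC3", "WC4"])
adapt, test = split_for_adaptation(list(range(200)))
```

Here `scenarios["WC2"]` lists WC1, WC3, and WC4 as targets, and a target set of 200 samples contributes 20 samples for adapting the network.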
Table 4 clearly shows that the best performance is achieved when using 80% of the training data from the target dataset for the different scenarios. In the Methodology section, Table 1 describes the proposed neural network architecture for TL. The table also shows which layers or stages are transferred to the target dataset in the different scenarios. There are four stages in the network (Table 1), and the first three are transferred to perform TL. To evaluate the network's performance, the number of transferred stages is varied. In this experiment, 80% of the data are used for training, and only the number of transferred stages is changed. Table 5 gives the overall performance results. As shown in Table 5, the proposed approach performs best when Stages 1 to 3 (see Table 1) of the network are transferred to the target task.

Comparison Analysis
To evaluate the performance of the proposed approach, several experiments are performed with and without TL. In the experiment, WC 1 is considered as the source task and WC 2 as the target, and the performance on WC 2 with transfer learning (WTL) and without transfer learning (NTL) is evaluated. In the NTL scenario, the network is trained from scratch (training:testing = 60:40) without transferring any knowledge from the source task. As shown in Table 6, the accuracy does not improve much for WTL versus NTL. For the healthy condition (HC), the accuracy is the same, and for the inner race fault (IRF), the accuracy without TL is higher by 2.2%. For the other health types, WTL performs better than NTL. On average, WTL provides 1.04% additional accuracy over NTL in this experimental scenario.
However, the training time with TL is shorter than that without TL. As shown in Figure 8, the network reaches the highest training accuracy (100%) with TL after almost 80 epochs (in time t1), whereas NTL requires almost 160 epochs (in time t2) to reach the same performance. These results demonstrate that transfer learning roughly halves the number of training epochs in this experiment while maintaining the overall performance.
To validate the performance of the proposed method, it is compared with some recent deep learning approaches, including a raw signal-based TL approach [33], a hierarchical CNN-based approach [29], and an ensemble deep autoencoder approach [27]. In addition, the proposed TL-based network approach is compared with some classical machine learning approaches (e.g., SVM and ANN). These network architectures are analyzed with the same test scenarios (Table 2) to validate the performance improvement of the proposed TL-based CNN approach. Table 7 shows the comparison results in detail for each scenario of the experiment, and the proposed approach outperforms the conventional methods.

Conclusions
In this paper, a convolutional neural network-based transfer learning approach was proposed for automated feature extraction to improve the performance of bearing fault classification. To make the automated feature extraction more reliable and accurate, the discrete orthonormal Stockwell transform (DOST) was proposed as a preprocessing step for creating a load- and rpm-invariant scenario for signals of multiple health types. The theoretical and experimental analysis of this study demonstrated that TL boosts the performance of the network under invariant working conditions and brings the learning mechanism under one network architecture. The experimental analysis also established that DOST-based vibration imaging can help a two-dimensional CNN to learn features faster and with greater accuracy. Experimental results showed that the proposed method achieves an average classification accuracy of 99.8% for all health types (i.e., healthy condition, inner race fault, ball fault, and outer race fault for the 12, 3, and 6 o'clock sensor positions). In addition, the proposed method outperformed other state-of-the-art algorithms (i.e., ANN, SVM, hierarchical CNN, and deep autoencoders), showing 32.3%, 16.39%, 6.78%, and 1.58% improvements in accuracy, respectively.

Figure 1. Detailed process of the proposed method.

Figure 3. Discrete orthonormal Stockwell transform (DOST) basis construction process, where FFT stands for fast Fourier transform.

Figure 4. The left side shows the conventional learning process, while the right side shows the concept of transfer learning (TL).

Figure 5. Proposed architecture of the 2D CNN.

Figure 6. An overview of the experimental setup.

Figure 8. Training accuracy comparison with and without transfer learning.

Table 1. Proposed architecture of the CNN for transfer specification for target network.

Table 2. Details of the considered working conditions with the same health types.

Table 3. Diagnostic performance of the proposed model under different scenarios.

Table 4. Classification results for different sizes of training data.

Table 5. Impact on overall classification accuracy of different numbers of stages for TL.

Table 6. Comparison analysis of the classification accuracies of transfer learning-based model (WTL) vs. without transfer learning (NTL).

Table 7. Comparison of classification accuracy (proposed vs. existing methods).