A Two-Stage Framework for Time-Frequency Analysis and Fault Diagnosis of Planetary Gearboxes

Abstract: In the operation and maintenance of planetary gearboxes, monitoring data often accumulate faster than they can be analyzed and classified, and careful data analysis generally requires considerable expertise. Enabling machine learning algorithms to provide more information than the diagnosis conclusion alone is therefore a promising direction. This paper proposes a two-stage analysis and diagnosis framework based on time-frequency information. In the first stage, a U-net model is used for the semantic segmentation of the vibration time-frequency spectrum to highlight faulty feature regions. Shape features are then calculated to extract useful information from the segmented image. In the second stage, a decision tree algorithm classifies the health state of the planetary gearbox using the shape features as input. Real data from wind turbine planetary gearboxes and augmented data are used to verify the effectiveness and superiority of the proposed framework. The F1-score of segmentation and the classification accuracy reach 0.942 and 97.4%, respectively, while in the environmental robustness experiment they reach 0.747 and 83.1%. Equipping the two-stage framework with different analytical methods and diagnostic algorithms makes it possible to construct flexible diagnostic systems for similar problems in the community.


Introduction
Planetary gearboxes are widely used in many industrial sectors, such as wind power generation, mining, and metal forming, due to their compact structure, large transmission ratio, and stable operation [1,2]. In these applications, planetary gearboxes are usually installed in transmission chains that transmit large torque and are subject to harsh working environments such as dynamic loads and extreme temperatures. Hence, they are prone to various failures [3]. To ensure the safe operation of equipment, it is important to detect faults as early as possible through the analysis of sensor data so that unplanned shutdowns or catastrophic failures can be avoided [4-6].
As a well-established field, vibration analysis-based fault diagnosis has been widely applied to planetary gearboxes. The representation methods of vibration signals can generally be summarized into three categories: time domain, frequency domain, and time-frequency (T-F) domain [7]. T-F domain analysis is a powerful tool for dealing with non-stationary signals. Chen and Feng [8] proposed an iterative generalized T-F reassignment method that exploits the uniqueness of iterative generalized demodulation to decompose non-stationary, multi-component signals into mono-components of constant frequency, thus meeting the requirement of T-F reassignment for mono-components with linear instantaneous frequency and improving the T-F readability in planetary gearbox fault diagnosis. Considering the time-varying characteristics of planetary gearboxes, Han and Feng [9] used the local maximum mean discrepancy to evaluate the data distributions between relevant subclasses in source and target tasks and proposed a deep residual joint subclass alignment transfer network based on T-F features. Yuan et al. [10] extracted fault features from wavelet T-F images from the perspective of image texture analysis and proposed a novel fusion fault diagnosis framework combining the gray level co-occurrence matrix and label consistent K-SVD. Tu et al. [11] developed a generalized wavelet-based synchrosqueezing transform to deal with strongly modulated non-stationary signals, which outperforms traditional approaches in the energy concentration of T-F representations and the accuracy of mode reconstruction. In [12], Dhamande and Chaudhari considered a more complex but realistic situation, a compound gear-bearing fault, and proposed a fault diagnosis method based on T-F statistical features of the discrete wavelet transform and continuous wavelet transform.
T-F segmentation is one of the crucial means of analyzing a T-F distribution. Limited by the computing power available at the time, early investigations mainly focused on threshold segmentation methods represented by the Otsu method [13], edge detection [14], and the gray level histogram [15]. With the development of computer technology and machine learning, researchers have put forward many new ideas and methods for T-F analysis. Medical informatics is the field where T-F analysis technologies first achieved fruitful results. For example, Zhang et al. [16] combined the Hilbert transform with the Wigner-Ville distribution to obtain a hybrid T-F analysis and used ResNet to analyze cardiac arrhythmia via heartbeat classification. Cheng et al. [17] proposed a novel method based on T-F analysis and a CNN-LSTM cascade model to automatically detect atrial fibrillation, solving the problem of burst atrial fibrillation detection based on electrocardiograms. In the field of mechanical fault diagnosis, Yan et al. [18] proposed a fusion method based on multi-resolution T-F spectrum segmentation and sparse decomposition of vibration signals. Compared with traditional Gabor T-F atoms, the new method pursues the best atom faster and achieves higher approximation precision. Saulig et al. [19] combined K-means clustering with the local Rényi entropy to distinguish the basic structural differences between useful information content and noise components in the T-F plane, developing an unsupervised adaptive T-F analytical algorithm.
The T-F analysis task for planetary gearboxes is particularly challenging. Because of the complex structure and time-varying characteristics of the planetary gearbox, the characteristic regions in the T-F spectrum are often intertwined with the background, or their contrast and sharpness may be very low. Therefore, traditional image processing methods struggle to achieve satisfactory performance on such tasks.
Previous works have applied deep learning to planetary gearbox fault diagnosis based on T-F analysis. However, these studies usually regard T-F analysis and fault diagnosis as two separate tasks. Other works focus on developing an indivisible intelligent system to accomplish the gearbox fault diagnosis task based on T-F information [20-24]. This paper aims to incorporate T-F spectrum analysis and fault diagnosis based on T-F feature information into a continuous but stage-by-stage diagnosis framework, which more closely resembles the manual diagnosis process. The specific indication of fault feature regions, together with high-accuracy fault classification, can improve operators' confidence in the diagnosis conclusion because it provides richer and more comprehensive information about the health status of the equipment.
Based on the above views, this paper proposes a two-stage diagnosis framework for planetary gearboxes. Because traditional threshold-based algorithms are error-prone for high-resolution T-F spectra, we implement a fully convolutional network, U-net, for semantic segmentation; that is, to label the regions of interest (ROI) pixel by pixel. The U-net model separates the pixels belonging to fault feature regions from the background. Then, shape features are used to extract valuable information from the distribution of the feature regions. Finally, a decision tree algorithm is used to determine the health status of the planetary gearbox.
The main contributions of this paper are summarized as follows:

1. Compared with previous studies that considered T-F analysis or fault diagnosis alone, a more comprehensive two-stage framework is proposed to combine the two tasks. To the best of our knowledge, this is the first paper to discuss planetary gearbox fault diagnosis based on T-F information from the perspective of a two-stage analysis and diagnosis task.

2. A modified U-net model is adopted to process T-F images. It can utilize large-range context information to improve the segmentation accuracy and enhance the robustness against environmental variations.

3. The effectiveness of the proposed method is verified using real data from in-service wind turbines. T-F images and ground truth labels are available online [25].

The remainder of this paper is organized as follows. Section 2 briefly introduces the U-net, the feature extraction method, and the decision tree algorithm used in the research. Section 3 introduces the two-stage gearbox fault diagnosis framework proposed in this paper. Section 4 describes the experimental dataset and analyzes the results of running our method on this dataset. Conclusions are outlined in Section 5.

U-Net
In 2015, Ronneberger et al. [26] first proposed the U-net model for the problem of cell image segmentation. As an important variant of the fully convolutional neural network, U-net does not contain fully connected layers but uses symmetrically arranged convolutional layers to compress and reconstruct feature information, thereby exploiting context information at multiple scales and generating pixel-level image segmentation results. So far, U-net and its variants have seen the most applications in the field of medical image processing [27] and are expanding into machine vision [28], image-based fault detection [29], etc. Figure 1 illustrates the modified U-net architecture used in this paper. The left half of the U-net is the contracting path, also known as the encoding path, which consists of four repeated contracting operators. The contracting operator has the typical structure of a convolutional neural network: a 3 × 3 convolutional layer, a batch normalization layer followed by a rectified linear unit (ReLU), and a 2 × 2 pooling layer. The right half of the U-net is the expanding path, also known as the decoding path. Its overall structure is very similar to the contracting path, except that its expanding operators are more complex. The most crucial changes are the use of deconvolution instead of convolution and the addition of concatenation from the contracting path. Based on the above design, U-net presents a simple and elegant U-shaped structure, which is also the origin of its name.
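As a rough illustration of how the contracting and expanding paths change the feature-map resolution, the following sketch traces the side length of a square input through the network. It uses the 3 × 3 convolution and 2 × 2 pooling named above and the 2 × 2, stride-2 deconvolution described in this section. Several details are assumptions made only for this example: the input is taken to be 512 × 512 (the image size used in the case study), the convolutions are assumed to use a padding of 1 so that they preserve resolution, the pooling stride is assumed to be 2, and the expanding path is assumed to have four steps mirroring the contracting path.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding=0):
    """Output side length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel


def trace_unet(size=512, depth=4):
    """Return the side length after each contracting and expanding step."""
    sizes = [size]
    for _ in range(depth):
        size = conv_out(size, kernel=3, padding=1)    # 3x3 conv, padding 1 (assumed)
        size = conv_out(size, kernel=2, stride=2)     # 2x2 pooling, stride 2 (assumed)
        sizes.append(size)
    for _ in range(depth):
        size = deconv_out(size, kernel=2, stride=2)   # 2x2 deconvolution, stride 2
        size = conv_out(size, kernel=3, padding=1)    # 3x3 conv, padding 1 (assumed)
        sizes.append(size)
    return sizes


print(trace_unet())
```

Under these assumptions the resolution is halved four times and then doubled four times, so the output has the same resolution as the input, which is what a pixel-by-pixel labeling requires.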
To explain the U-net structure in more detail, several critical components of the network are described next:

1. The convolutional layer is composed of a set of convolutional kernels whose height and width can be configured. The learnable convolutional kernels enable the convolutional layer to generalize the feature information in the input and map it into a new feature space. In addition, weight sharing gives the convolutional layer a lower computational complexity than a fully connected layer. For each convolutional kernel, the output can be expressed as:

$$O_i = f\left(\sum_{j} X_j * \omega_{ij} + b_i\right)$$

where $X_j$ is the $j$th input channel, $O_i$ is the $i$th channel of the feature map, $*$ denotes the convolution operation, and $f(\cdot)$ is the activation function. $\omega_{ij}$ and $b_i$ denote the convolutional weight and bias term, respectively, both of which are trainable parameters.

2. Deconvolution is also known as up-convolution or transposed convolution. Its function in the U-net model is to increase the resolution of feature maps rather than to compute the true inverse of convolution. To obtain an expansive capability that matches the contracting path, we use a 2 × 2 deconvolutional kernel and set the deconvolutional stride to 2.

3. Activation functions enable convolutional neural networks to model nonlinear mappings hidden in data. Common activation functions include Sigmoid, Tanh, ELU, ReLU [30], etc. As a non-saturated activation function, ReLU can alleviate the problems of gradient vanishing and exploding while accelerating model learning. Therefore, ReLU is selected as the activation function in the contracting operator and the expanding operator.
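For batch training data $[x_1, x_2, \ldots, x_m]$, batch normalization computes:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

where $\mu_B$ and $\sigma_B^2$ denote the mean and variance of the batch training data, respectively, $\hat{x}_i$ is the normalized result after subtracting the batch mean and dividing by the batch standard deviation, and $y_i$ is the output calculated with the scaling factor $\gamma$ and translation factor $\beta$. Here $\epsilon$ is assumed to be the small constant customarily added for numerical stability.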
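The batch normalization step described in this section (batch mean and variance, followed by scaling with γ and shifting with β) can be sketched in a few lines of plain Python. This is a minimal illustration for a single feature over one batch, not the implementation used in the paper; the function name `batch_norm`, the default values of `gamma` and `beta`, and the stability constant `eps` are assumptions.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale and shift it.

    xs    : list of activations of a single feature over the batch
    gamma : scaling factor (learned during training; fixed here)
    beta  : translation factor (learned during training; fixed here)
    eps   : small constant for numerical stability (assumed value)
    """
    m = len(xs)
    mu = sum(xs) / m                                   # batch mean
    var = sum((x - mu) ** 2 for x in xs) / m           # batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]


ys = batch_norm([1.0, 2.0, 3.0, 4.0])
print(ys)
```

With `gamma = 1` and `beta = 0` the output has approximately zero mean and unit variance; during training the two factors are learned so that the network can undo the normalization where that is beneficial.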

Shape Feature
The segmented T-F image can be regarded as a binary image. It has the same resolution as the input image, so it is highly redundant as input to the subsequent fault classification algorithm. Therefore, this paper uses a region-based shape feature extraction method to reduce the dimensionality of the data.
The geometric moment [31] is a concise and effective region-based shape feature. For a general function $f(x, y)$, its $(p, q)$-order geometric moment is defined as follows:

$$M_{pq} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^p y^q f(x, y)\,dx\,dy, \quad p, q = 0, 1, 2, \ldots \tag{7}$$

Some important properties of shapes can be derived from geometric moments: $M_{00}$ defines the mass of a shape. $(M_{10}/M_{00}, M_{01}/M_{00})$ defines the centroid of a shape. $(M_{20}, M_{02})$ defines the moments of inertia of a shape, which describe the mass distribution of the shape relative to the coordinate axes.
The above properties are useful for representing the distribution of characteristic patches in the segmented T-F image.
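For a digital image, the integral in the geometric moment becomes a sum over pixels. The sketch below computes the mass, centroid, and moments of inertia of a small binary mask. It is only an illustration: the function names are hypothetical, foreground pixels are assumed to be stored as 1 (rather than the grayscale value 255), and the column index is taken as x and the row index as y.

```python
def geometric_moment(image, p, q):
    """Discrete (p, q)-order geometric moment of a 2-D image (list of rows)."""
    return sum(
        (x ** p) * (y ** q) * value
        for y, row in enumerate(image)      # y = row index
        for x, value in enumerate(row)      # x = column index
    )


def shape_features(mask):
    """Mass, centroid, and moments of inertia of a binary mask."""
    m00 = geometric_moment(mask, 0, 0)
    centroid = None
    if m00:  # the centroid is undefined for an empty mask
        centroid = (geometric_moment(mask, 1, 0) / m00,
                    geometric_moment(mask, 0, 1) / m00)
    return {
        "M00": m00,
        "centroid": centroid,
        "M20": geometric_moment(mask, 2, 0),
        "M02": geometric_moment(mask, 0, 2),
    }


# A 2 x 2 foreground patch inside a 4 x 4 image.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(shape_features(mask))
```

Whatever the image resolution, the mask is reduced to a handful of numbers, which is the dimensionality reduction the text refers to.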

Decision Tree
The decision tree method introduced by Quinlan [32] is a powerful machine learning algorithm that constructs a knowledge-based leaf-branch system by inductive inference from historical data. So far, researchers have proposed various algorithms to induce decision tree-based diagnosis models, such as CART [33], BOAT [34], and SPRINT [35]. This paper selects CART because of its popularity and simplicity. Generally, developing a fault classifier with CART can be summarized into two phases: the building phase and the pruning phase.

1. Building phase: The CART-based decision tree is a binary tree; namely, each split generates exactly two branches. A test attribute x and a test threshold t_x divide the training set into two subsets. In the CART model, the attribute-threshold domain (X, T) is searched to obtain the combination that produces the purest subsets. This process is repeated to split the subsets and the subsets of subsets until the algorithm cannot find a new split that yields purer subsets or the preset maximum depth is reached. A fully grown, binary, tree-like structure makes the identification of crucial variables quite easy.

2. Pruning phase: If the decision tree grows to its maximum size without restriction, developing as a nonparametric model, there will usually be an overfitting problem that reduces the accuracy of the decision tree over the whole instance space. Therefore, it is necessary to delete unreliable branches or limit the model's degrees of freedom. Setting the maximum depth of the tree structure is the most common pruning method. Other hyperparameter choices include the minimum number of samples per leaf node, the maximum number of leaf nodes, and the complexity parameter. For more information about the CART algorithm, see [36-38].
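The search over the attribute-threshold domain in the building phase can be sketched as follows. The text only states that the purest subsets are sought; this example assumes the Gini impurity as the purity measure, which is the usual choice in CART, and takes midpoints between consecutive distinct attribute values as candidate thresholds. The function names and the toy feature values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Search every (attribute, threshold) pair for the purest binary split.

    Returns (weighted impurity, attribute index, threshold).
    """
    n = len(rows)
    best = None
    for attr in range(len(rows[0])):
        values = sorted({row[attr] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[attr] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[attr] > t]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, attr, t)
    return best


# Toy data: two hypothetical shape features per sample, two health states.
rows = [(1.0, 5.0), (2.0, 6.0), (8.0, 5.5), (9.0, 6.5)]
labels = ["NS", "NS", "RP", "RP"]
print(best_split(rows, labels))
```

Applying `best_split` recursively to the left and right subsets, and stopping at a preset maximum depth, gives the building phase together with the simplest form of pruning described above.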

Architecture of the Two-Stage Framework
This paper aims to develop a general tool for the automatic analysis and diagnosis of planetary gearbox failures from T-F data. For this purpose, a two-stage framework for T-F spectrum analysis and fault diagnosis is proposed. The first stage uses U-net to label the characteristic patches representing faults in the T-F spectrum, a task also known as semantic segmentation. Then, the well-trained U-net is used as a segmentation tool, and geometric moment features of the segmented T-F image are extracted to facilitate the second-stage calculation. In the second stage, a decision tree is trained to analyze the shape features and determine the health status of the gearbox. The overall architecture of the two-stage framework is shown in Figure 2.
The proposed framework is built using the Python machine learning library PyTorch [39]. The structure of U-net mainly follows Ronneberger et al. [26], with some modifications made to adapt it to the T-F analysis task, including the redesign of the network structure and the removal of the overlap-tile input strategy. The decision tree algorithm is mainly based on the DecisionTreeClassifier model in the Scikit-Learn package [40]. Because of its outstanding computational efficiency, Adam is used as the optimization algorithm for the U-net model. The learning rate is set to 1 × 10^−3, and the weight decay rate is set to 1 × 10^−5. The model uses categorical cross-entropy as the loss function. Training stops when the loss on the validation set has not improved significantly for five successive epochs or when the maximum number of epochs (50) is reached.
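The stopping criterion can be made concrete with a small sketch. The text does not quantify what counts as a significant improvement, so the threshold `min_delta` below is a hypothetical parameter, as are the function name and the example loss curve; only the patience of five epochs and the maximum of 50 epochs come from the description above.

```python
def epochs_until_stop(val_losses, patience=5, max_epochs=50, min_delta=1e-4):
    """Return the number of epochs run before training stops.

    val_losses : validation loss recorded after each epoch
    patience   : successive epochs without significant improvement allowed
    max_epochs : hard limit on the number of epochs
    min_delta  : smallest decrease that counts as an improvement (assumed)
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if best - loss > min_delta:
            best = loss      # significant improvement: reset the counter
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return epoch     # stop: no improvement for `patience` epochs
    return min(len(val_losses), max_epochs)


# The loss improves for three epochs and then stays flat.
losses = [1.0, 0.8, 0.7] + [0.7] * 10
print(epochs_until_stop(losses))
```

In a real training loop the same counter would be updated after each validation pass, and the model weights from the best epoch would typically be kept.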

Evaluation Metrics of Performance
In this paper, three evaluation metrics are used to quantitatively evaluate the ability of different algorithms to label the ROI in the T-F spectrum: precision (P), recall (R), and F1-score (also known as the Dice coefficient). The joint use of the three indicators allows the algorithm performance to be analyzed from a more comprehensive perspective: precision reflects the ability of the model to accurately select valuable information, recall reflects the ability of the model to avoid missing valuable information, and the F1-score reflects the balance between these two abilities and is therefore the more rigorous metric.

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

where TP represents the number of true positive pixels, FP the number of false positive pixels, and FN the number of false negative pixels.
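The three metrics can be computed directly from a predicted mask and its ground truth. The following sketch counts TP, FP, and FN pixel by pixel; the function name and the two tiny example masks are hypothetical, and any non-zero pixel value is treated as foreground.

```python
def segmentation_metrics(pred, truth):
    """Pixel-wise precision, recall, and F1-score of a binary segmentation."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # labeled as ROI and truly ROI
            elif p and not t:
                fp += 1          # labeled as ROI but background
            elif t and not p:
                fn += 1          # ROI pixel that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = [[1, 1, 0],
         [0, 1, 0]]
pred = [[1, 0, 0],
        [0, 1, 1]]
print(segmentation_metrics(pred, truth))
```

In this example two of the three predicted pixels are correct and two of the three true pixels are found, so all three metrics equal 2/3.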

Case Study

Data Collection and Augmentation
The vibration signals used in the case study were collected from 1.5 MW pitch-controlled wind turbines located on a wind farm in northeast China, as shown in Figure 3. The transmission chain mainly includes a bladed rotor, a planetary gearbox with a ratio of 100.48:1, and a doubly-fed induction generator. The planetary gearbox contains two planetary transmission mechanisms and one parallel transmission mechanism. The accelerometer was attached magnetically to the ring of the second-stage planetary mechanism. Figure 4 illustrates the internal structure of the gearbox and the location of the accelerometer. Table 1 lists the teeth number information of the second-stage planetary mechanism. The case study involves four health conditions: normal (NS), ring gear tooth pitting (RP), planetary gear meshing misalignment (MA), and pitting-misalignment concurrence (CO). All the above failures occur in the second-stage planetary mechanism. Each health condition corresponds to four planetary gearboxes; thus, 16 gearboxes are used in this experiment. During the data collection, the speed of the gearboxes varied continuously. The varying-speed operation of wind turbines and the changeable operating environment bring additional challenges to fault analysis and diagnosis.
The piezoelectric accelerometer samples at a frequency of 16,384 Hz, and each signal lasts for 10 s. Ten segments are randomly intercepted from each signal to calculate the T-F spectrum. In this paper, the generalized S-transform [41] (generalized factor = 2) is used to generate 512 × 512 T-F images from the signal segments. Figure 5 shows the vibration signals and T-F spectrums corresponding to the four health conditions.
An insufficient dataset scale, with the accompanying risk of overfitting, is one of the common problems in model training. The 10 × 4 × 4 = 160 samples available in this case study are insufficient to train deep learning models. Therefore, the Augmentor toolkit [42] is used to mirror or distort the original images to obtain 800 additional training samples. After data augmentation, the total number of experimental samples reaches 960. According to similar studies [43], datasets of this scale can effectively inhibit over-fitting.
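The dataset bookkeeping can be checked with a few lines of arithmetic. The counts and sampling parameters below are taken from the text; the 4:1 split and the minibatch size of 24 come from the performance validation subsection. Applying the split to all 960 samples is an assumption of this sketch, since the text does not state how augmented samples are distributed between the training and test sets, and the variable names are illustrative.

```python
CONDITIONS = ["NS", "RP", "MA", "CO"]   # four health conditions
GEARBOXES_PER_CONDITION = 4
SEGMENTS_PER_SIGNAL = 10                # segments intercepted from each signal
AUGMENTED_SAMPLES = 800                 # produced by mirroring or distortion

SAMPLING_RATE_HZ = 16_384
SIGNAL_DURATION_S = 10

gearboxes = GEARBOXES_PER_CONDITION * len(CONDITIONS)
original_samples = SEGMENTS_PER_SIGNAL * gearboxes
total_samples = original_samples + AUGMENTED_SAMPLES
points_per_signal = SAMPLING_RATE_HZ * SIGNAL_DURATION_S

# 4:1 split into training and test data (assumed to apply to all samples).
train_samples = total_samples * 4 // 5
test_samples = total_samples - train_samples
batches_per_epoch = train_samples // 24  # minibatch size of 24

print(gearboxes, original_samples, total_samples, points_per_signal)
print(train_samples, test_samples, batches_per_epoch)
```

Under these assumptions the training set holds 768 samples and the test set 192, and one epoch consists of 32 minibatches.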

Performance Validation
In semantic segmentation methods, each pixel in the sample is given a label. The characteristic patches or bands in the T-F image are highlighted with a grayscale of 255, and the grayscale of all other parts is set to 0. The experimental data are divided into two non-overlapping parts in a ratio of 4:1, which are used as training and test datasets, respectively. The minibatch size in training is set to 24; that is, 24 training samples jointly contribute to each parameter update of the model. A computer with an Intel Core i5-10400 CPU, 16 GB of RAM, and an NVIDIA GTX 1660 GPU with 6 GB of memory is used to conduct all experiments in this section.
Figure 6 illustrates the results of identifying fault feature regions using the U-net method on the test dataset. In general, the U-net model identifies the characteristic patches or characteristic bands well for the test samples in all four health states, revealing the gear meshing phenomenon or impact phenomenon contained in the T-F spectrum. Tooth surface pitting faults and pitting-misalignment concurrent faults appear to be the two more difficult cases: there are a small number of noise pixels and erroneous boundaries in the network output. One possible reason is that patch-like areas are easily confused with intense background noise. To illustrate the performance of the U-net method and analyze the differences between the four types of test samples, Table 2 lists the detailed evaluation metrics. The Precision, Recall, and F1-score of U-net are all satisfactory. The F1-score on the whole test dataset reaches 0.942. By comparison, the tooth surface pitting fault and the pitting-misalignment concurrent fault perform worse, with F1-scores of 0.891 and 0.87, respectively. This is consistent with our observations in Figure 6.
Furthermore, the Otsu method and Fourier filtering are used to process the samples in the test dataset to compare the proposed method with traditional methods, as shown in Figure 7. The time consumed to process each T-F image is also shown in the figure. The U-net method is more successful than the two classical algorithms thanks to its strong feature learning ability and context information processing ability. Although the Otsu method gives the highest computational efficiency, with a time consumption per sample of approximately 6 ms, there is an apparent imbalance between its Precision and Recall metrics, suggesting too many false positive pixels in the results. The Fourier filtering method performs better than the Otsu method, with an F1-score of 0.858. However, because the size of the mask in the filter needs to be optimized during image processing, the computational efficiency of Fourier filtering is degraded. U-net, as a deep learning method, can automatically capture multi-scale features without manual tuning of model parameters. In addition, a trained model can process T-F samples quickly, which meets the real-time requirements of fault analysis. In conclusion, as the first stage of the proposed framework, the U-net method is competent for the T-F analysis task in planetary gearbox fault diagnosis.
Besides analyzing the T-F information, the ultimate goal of presenting a high-accuracy analytical method is to realize the fast classification of planetary gearbox faults. The labeled fault feature regions are used in the second stage of the proposed framework. This paper uses three geometric moments to extract valuable information from segmented images: the image mass M_00 and the moments of inertia in two directions (M_20, M_02). In our conception, the former distinguishes between characteristic patches and characteristic bands, while the latter two indicate their relative position in the T-F spectrum.
The significantly compressed feature space makes it possible to use a simple classifier. In this experiment, we choose the decision tree algorithm to classify the four health states of planetary gearboxes quickly. The simplicity of the model, the convenience of training, and the high interpretability of the model are the factors we focus on. The decision tree algorithm is therefore a suitable choice for revealing the fault information in T-F images.
Figure 8 shows the classification confusion matrix of the decision tree algorithm on the test dataset. Overall, the average accuracy of the diagnosis reaches 97.4%. The limited amount of real data and the small feature space have no adverse impact on the diagnosis results. Specifically, the meshing misalignment fault (MA) and the pitting-misalignment concurrent fault (CO) are easily confused because both are characterized by high energy in the double meshing frequency bins. The ring gear tooth pitting (RP) fault is the health condition with the lowest diagnosis accuracy. As observed in the segmentation results above, it is more polluted by noise. Figure 9 shows the trained decision tree in the experiment. In summary, the decision tree algorithm combined with the shape feature extraction method can effectively extract the status features in T-F images and accurately classify the health states of planetary gearboxes.

Robustness against Variable Operating Environment
In the engineering application of fault diagnosis, a variable operating environment greatly influences the system's performance. For example, wind turbines operate under continuously variable speed and variable loads. In the same wind farm, the operating conditions of wind turbines differ with the local geographical environment. In addition, there are more difficult diagnostic scenarios, such as the same type of planetary gearbox mounted in different wind farms. In these cases, diagnostic systems require high robustness against environmental changes. In this subsection, we sequentially take the subset belonging to one wind turbine from the original dataset as the robustness test data and use the remaining data to train the two-stage framework as described in Section 4.1. The performance metrics and fault classification accuracy of the trained model on the robustness test dataset are shown in Figure 10. The segmentation and classification performance in each health state is affected to some degree by the variable operating environment. The proposed two-stage framework achieves an average F1-score of 0.747 and a classification accuracy of 83.1% on the robustness test dataset. The normal state gives the best segmentation and classification performance on the robustness test dataset, reaching an average F1-score of 0.91 and a classification accuracy of 99.18%. Pitting-misalignment concurrent faults give the worst robustness. A possible reason is that more complex fault causes lead to less similarity between different machines. In addition, the experimental results show that, across the four health states, the classification accuracy does not depend on any particular segmentation performance metric, which indicates that the influence of errors in the analytical stage on the geometric moments is uncertain. To sum up, the proposed framework presents satisfactory robustness against variable operating environments. Therefore, it is suitable for application in actual industrial settings with variable operating environments.
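The robustness protocol, in which the data of one wind turbine at a time is held out for testing while the rest is used for training, amounts to a leave-one-group-out split. The sketch below illustrates the idea with placeholder samples; the function name, the representation of a sample as a `(turbine_id, payload)` pair, and the use of only the 160 original segments are assumptions made for the example.

```python
def leave_one_turbine_out(samples):
    """Yield (held_out_id, train, test) once for every turbine.

    samples : list of (turbine_id, payload) pairs
    """
    turbine_ids = sorted({turbine_id for turbine_id, _ in samples})
    for held_out in turbine_ids:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# 16 turbines with 10 segments each; the payload is just a placeholder index.
samples = [(turbine, segment) for turbine in range(16) for segment in range(10)]
folds = list(leave_one_turbine_out(samples))

for held_out, train, test in folds[:2]:
    print(held_out, len(train), len(test))
```

Because no sample from the held-out turbine appears in the training data, the resulting scores reflect how well the framework transfers to a machine it has not seen, rather than how well it fits the machines it was trained on.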

Conclusions
This paper presents a fault diagnosis framework for planetary gearboxes, comprising a U-net-based T-F spectrum analytical method in the first stage and a decision tree-based health state classification method in the second stage. Both the specific indication of fault feature regions and the white-box property of the decision tree model can improve operators' confidence in the fault diagnosis conclusion, which is lacking in conventional deep learning models. Using real data collected from wind turbines, we show that the U-net model outperforms the traditional image processing methods in the T-F spectrum analytical task for planetary gearboxes, with an F1-score of 0.942. With a decision tree classifier, the final accuracy of planetary gearbox fault diagnosis reaches 97.4%. Further experiments show that the proposed framework remains robust under a variable operating environment, with an F1-score of 0.747 and a classification accuracy of 83.1%. Moreover, it is worth noting that our method provides a flexible and open paradigm for solving similar problems.
In future work, the authors plan to improve the proposed method in two respects. First, knowledge generation based on interpretable machine learning is a promising direction for improvement. Second, the authors will follow up on advanced semantic segmentation methods to enhance the robustness of time-frequency analysis.

Figure 2. Architecture of the proposed two-stage framework.

Figure 3. Planetary gearbox in the case study.

Figure 4. Internal structure of the planetary gearbox and the location of the accelerometer.

Figure 5. Vibration signals and T-F spectrums in the four health conditions: (a,e) NS marked by characteristic band in meshing frequency bin; (b,f) RP marked by characteristic patches occurring with meshing frequency in meshing frequency bin; (c,g) MA marked by high energy in 2× meshing frequency bin; (d,h) CO marked by characteristic patches occurring with meshing frequency in 2× meshing frequency bin.

Figure 6. The results of identifying fault feature regions using U-net. White represents the characteristic area, and black represents the background.

Figure 7. Comparison of performance and time consumption between different methods.

Figure 9. Trained decision tree in the experiment.

Figure 10. The segmentation performance metrics and fault classification accuracy of the trained model on the robustness test dataset: (a) NS; (b) RP; (c) MA; (d) CO.

4. Batch normalization is an effective tool to deal with the problem of feature distribution drift during batch training. Severe feature distribution drift reduces the stability of neural network training and aggravates over-fitting, which is particularly obvious in deep neural networks. Batch normalization standardizes the neuron activations of each batch to zero mean and unit variance. For batch training data $[x_1, x_2, \ldots, x_m]$, the normalized result is obtained by subtracting the batch mean, dividing by the batch standard deviation, and then applying a learnable scaling factor $\gamma$ and translation factor $\beta$.

Table 1. Teeth number information of the second-stage planetary mechanism.

Table 2. Performance metrics of the U-net.