Compound Fault Diagnosis of Planetary Gearbox Based on Improved LTSS-BoW Model and Capsule Network

The identification of compound fault components of a planetary gearbox is especially important for keeping the mechanical equipment working safely. However, the recognition performance of existing deep learning-based methods is limited by insufficient compound fault samples and single label classification principles. To solve the issue, a capsule neural network with an improved feature extractor, named LTSS-BoW-CapsNet, is proposed for the intelligent recognition of compound fault components. Firstly, a feature extractor is constructed to extract fault feature vectors from raw signals, which is based on local temporal self-similarity coupled with bag-of-words models (LTSS-BoW). Then, a multi-label classifier based on a capsule network (CapsNet) is designed, in which the dynamic routing algorithm and average threshold are adopted. The effectiveness of the proposed LTSS-BoW-CapsNet method is validated by processing three compound fault diagnosis tasks. The experimental results demonstrate that our method can via decoupling effectively identify the multi-fault components of different compound fault patterns. The testing accuracy is more than 97%, which is better than the other four traditional classification models.


Introduction 1.Literature Review
Planetary gearboxes play an important role in mechanical equipment such as wind turbine, helicopter and construction machinery, which generally work under time-varying load conditions.The key parts of a planetary gearbox are prone to multiple structural damages such as wear, broke, pitting and crack, etc. due to the influence of long-term alternating stresses.The service performance of a planetary gearbox further endangers the operation safety of the entire mechanical equipment.Therefore, it is significant to diagnose the potential faults of a planetary gearbox [1,2].
The internal components of a planetary gearbox are varied and generally work together with a complex coupling relationship.The fault characteristics are coupled in that the failure of several components may simultaneously occur to different degrees.Moreover, the fault features could be impacted by multi-source excitations such as random impacts, timevarying load, strong noise, multi-interface attenuation and so on.As a result, it is very difficult to identify the compound fault of a planetary gearbox [3,4].
A series of studies have been carried out for the signal processing methods of a compound fault diagnosis.Since the fault signal is highly unstable with complex frequency components, time-frequency methods such as the wavelet transform, ensemble empirical mode decomposition (EEMD), symplectic geometric mode decomposition (SGMD), local mean decomposition (LMD), local characteristic-scale decomposition (LCD), variational mode decomposition (VMD), etc. are mainly used.Teng et al. [5] proposed a modulation model based on wavelet transform, which provided an effective tool for wind power gearbox compound fault diagnosis; Zhao et al. [6] used the EEMD and feature fusion methods to diagnose the composite fault of rolling bearing; Pan et al. [7] proposed an SGMD signal decomposition algorithm to decompose the compound fault signals of rotating machinery; Huang et al. [8] combined recursive least squares (RLS) with LMD to diagnose the early fault of bearings; Wang et al. [9] proposed an improved LCD method to extract the early fault characteristics of bearings; Zhang et al. [10] combined VMD with adaptive maximum correlated kurtosis deconvolution (AMCKD) to detect the wind turbine rolling bearing faults.Above all, these methods mainly focus on the improvement of the signal decomposition ability.However, the subsequent compound fault separation and identification rely heavily on expert experience and knowledge, resulting in low recognition accuracy.
Deep learning (DL) has been used increasingly in the intelligent diagnosis of mechanical equipment.The typical DL-based methods for the intelligent diagnosis of a planetary gearbox include: Deep Belief Network (DBN) [11,12], Generative Adversarial Network (GAN) [13], Convolutional Neural Network (CNN) [14][15][16], Long Short-Term Memory (LSTM) [17], etc.In terms of compound fault diagnosis, Shao et al. [12] combined adaptive DBN and CNN to diagnose the multiple faults of rolling bearings; Zhao et al. [13] proposed a GAN model to improve the diagnosis performance under data imbalance conditions; Zhang et al. [14] combined fast spectral kurtosis (FSK) with multi-branch CNN for multi-fault diagnosis of wind turbine gearboxes.
However, most classification models treat compound fault as a new fault class and output single label, which cannot provide a true sense of decoupling identification of the compound fault.In fact, the compound fault is not exactly a new fault class since its fault information consists of corresponding fault characteristics of single faults.In addition, the training process of a DL-based model requires a large number of training samples.However, the fault samples are relatively rare in practice, and random combinations of different single faults can generate various compound faults.It is impractical to collect sufficient compound fault samples for classification model training.Therefore, it is necessary to propose a new intelligent diagnosis method, which is especially suitable for compound fault diagnosis, and has the following functions: (1) only the single fault samples are required for model training, and the trained model can use the fault knowledge learned from the labeled single fault samples to identify the fault components of compound fault test samples; (2) the model can predict multi-labels for test samples when making classification decisions.
The typical multi-label classification methods include binary relevance, multi-label K-NN, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Transformer structures [18,19], etc. Capsule Network (CapsNet) is a novel type of network proposed by Hinton et al. [20] in 2017.It utilizes capsule vectors rather than scalar neurons as the input and output of the network layers, which overcomes the problem that traditional networks cannot extract the spatial feature information.Meanwhile, it cancels the pooling layer to avoid the loss of valuable information, and can conduct multi-label outputs.It has been increasingly used in the fields of electroencephalography (EEG) emotion recognition [21], image and text classification [22,23], etc.In particular, it can identify and separate the overlapped objects [22], which is an important feature for the identification of compound faults.
In terms of single fault diagnosis, Liu et al. [24] proposed an improved multi-scale residual generative adversarial network (GAN) and feature enhancement-driven capsule network to solve the imbalanced fault data problem.Li et al. [25] proposed a dual convolutional-CapsNet for the fault diagnosis of a planetary gearbox under different rotation conditions.In terms of compound fault diagnosis, Liang et al. [26] integrated CapsNet with stockwell transform (ST) and data augmentation generative adversarial networks (DAGANs) to diagnose the single and simultaneous faults for a wind turbine gearbox.Xu et al. [27] developed an improved deep convolutional-CapsNet to diagnose the sun Sensors 2024, 24, 940 3 of 20 gear-planet compound faults of an RV gearbox; Huang et al. [28] adopted a convolutional-CapsNet model with a multi-label classifier to decouple the gear-bearing compound faults of automotive transmission.Then, Huang et al. [29] combined deep CapsNet and ensemble learning to improve the compound fault identification accuracy of an automotive gearbox.
To achieve accurate identification, sufficient feature information needs to be fed into the primary layer to ensure the CapsNet is working efficiently, which depends on the feature extractor of the model.However, most models use the convolutional network to extract features from raw signals; the feature extraction process is considered as a "black box", which has the limitation in compound fault feature extraction.As a result, the classification performance of the diagnosis model has been limited.This motivates us to develop a more suitable feature extractor to directly obtain high representation feature vectors from raw signals and ultimately increase the classification ability.
To address this issue, a new feature extractor is constructed to optimize CapsNet, which combines local temporal self-similarity (LTSS) and bag-of-words (BoW) methods, named LTSS-BoW, for feature extraction.This model is improved from the temporal selfsimilarity method [30], which has been successfully used to recognize the image action sequences due to its advantage of cross-view structural stability.In order to reduce the feature dimension and increase the computing efficiency, a sliding window is utilized to divide the raw time-series into a local subseries.On this basis, LTSS matrices of the subseries are constructed and the gradients of LTSS matrices are calculated.Then, the multi-dimensional LTSS feature vectors are obtained by moving the sliding window with a fixed step size to traverse the entire sample signal.The LTSS feature extraction leads to much data redundance, and thus brings a large computation burden.Therefore, BoW is utilized to further improve computing efficiency, which has the advantages of strong anti-noise ability and good robustness [31,32].Finally, the histogram feature vectors are treated as the inputs of the CapsNet layers.

Main Contributions of This Paper
The main contributions of this work are summarized as follows: (1) A novel framework, named LTSS-BoW-CapsNet, is proposed to intelligently identify the fault components contained in the compound fault signals of the planetary gearbox.(2) An LTSS-BoW-based feature extraction method is presented to increase the identification performance, which can be used to directly obtain high representation feature vectors from raw signals.
(3) A multi-label classifier based on CapsNet is designed to predict multi-labels for compound fault classification decisions, in which the dynamic routing algorithm and average threshold are adopted.(4) Verification experiments are conducted to demonstrate the advantages of the proposed method.

Structure of the Rest of This Paper
The rest of this paper is organized as follows: Section 2 introduces the basic theoretical background of the LTSS-BoW feature extractor and CapsNet, and presents the overall diagnostic scheme of the proposed LTSS-BoW-CapsNet method in detail.Section 3 shows the experimental verification and comparative analysis results.The conclusions are drawn in Section 4.

Compound Fault Description
Based on the author's previous research works [33,34], Figure 1 exhibits the simulated vibration signals of a planetary gear set with the planet gear crack, sun gear crack and compound gear cracks, respectively.For the single crack fault cases, a series of abnormal impulses with a fixed period appear in the time-domain as the cracked tooth engages with the matching gear.As shown in Figure 1d, the compound fault-related features contain the information of two kinds of single fault-induced features.However, it is not simply the superposition of single fault-induced features.When two cracked teeth are engaged simultaneously, the two types of anomalous pulses will be coupled and form new fault features; meanwhile, the single fault-induced features are also deformed.Therefore, it is difficult to identify the fault components contained in the compound fault signal.
Sensors 2024, 24, x FOR PEER REVIEW 4 of 22 and compound gear cracks, respectively.For the single crack fault cases, a series of abnormal impulses with a fixed period appear in the time-domain as the cracked tooth engages with the matching gear.As shown in Figure 1d, the compound fault-related features contain the information of two kinds of single fault-induced features.However, it is not simply the superposition of single fault-induced features.When two cracked teeth are engaged simultaneously, the two types of anomalous pulses will be coupled and form new fault features; meanwhile, the single fault-induced features are also deformed.Therefore, it is difficult to identify the fault components contained in the compound fault signal.The main limitations of existing compound fault diagnosis models are: (1) most researchers label compound faults as a new fault class, and the compound fault samples need to be fed into the network with other single fault samples for model training [12][13][14].Therefore, the proposed network cannot work effectively if the compound fault training samples are insufficient.However, compound faults are not exactly a new fault class since it contains the fault information of single fault components.Hence, it is possible to use the fault knowledge learned from the labeled single fault samples to identify the fault components of the compound fault; (2) the traditional neural networks generally use a softmax classifier, which only identifies the most obvious fault class [26,27].Therefore, it cannot output multiple independent labels at the same time, so it is unable to identify via decoupling the fault components contained inside the compound fault.In addition, it is more difficult to distinguish the fault components in the condition that the single fault-related features are close to each other.

Overall Framework of the Proposed Method
In order to intelligently identify the fault components contained in the compound fault of the planetary gearbox, this work proposed a novel diagnostic framework, which mainly includes two parts: the LTSS-BoW model is the feature extractor, and CapsNet makes multi- The main limitations of existing compound fault diagnosis models are: (1) most researchers label compound faults as a new fault class, and the compound fault samples need to be fed into the network with other single fault samples for model training [12][13][14].Therefore, the proposed network cannot work effectively if the compound fault training samples are insufficient.However, compound faults are not exactly a new fault class since it contains the fault information of single fault components.Hence, it is possible to use the fault knowledge learned from the labeled single fault samples to identify the fault components of the compound fault; (2) the traditional neural networks generally use a softmax classifier, which only identifies the most obvious fault class [26,27].Therefore, it cannot output multiple independent labels at the same time, so it is unable to identify via decoupling the fault components contained inside the compound fault.In addition, it is more difficult to distinguish the fault components in the condition that the single fault-related features are close to each other.

Overall Framework of the Proposed Method
In order to intelligently identify the fault components contained in the compound fault of the planetary gearbox, this work proposed a novel diagnostic framework, which mainly includes two parts: the LTSS-BoW model is the feature extractor, and CapsNet makes multi-label classification decisions.The overall framework of the proposed method is shown in Figure 2.Each part is described in the following.

LTSS Feature Extraction
LTSS is an optimized data representation method that utilizes local structural information of time-series.Compared with conventional statistical features, LTSS feature matrices contain more useful information including sequential characteristics and change trends.
Figure 3 shows the proposed LTSS feature extraction scheme, which includes three steps: (1) Construct the LTSS matrix from the raw signal.At first, the sliding window is utilized to divide the raw signal sample into a local subseries.On this basis, LTSS matrices of the subseries are constructed.( 2 Then, the Euclidean distance of each two data points in ( ) y t is calculated to con- struct the LTSS matrix ( ) where ij d denotes the Euclidean distance between the ith element and the jth element in ( ) y t , and m is the length of ( ) Step 2: Extract the gradient feature of the LTSS matrix Step 1: Construct the LTSS matrix from the raw signal The raw vibration signals measured under each health condition are divided into non-overlapping and equal-length signal samples.The sliding window with a length of m = 2∆t + 1 is utilized to collect the data points around time point t from time step t − ∆t to t + ∆t.The dataset y(t) can be denoted as: Then, the Euclidean distance of each two data points in y(t) is calculated to construct the LTSS matrix D(t): where d ij denotes the Euclidean distance between the ith element and the jth element in y(t), and m is the length of y(t), i.e., 2∆t + 1.
Step 2: Extract the gradient feature of the LTSS matrix Since the LTSS matrix is symmetrical, only the upper right triangular elements are considered in order to save computing resources.A block-based descriptor is employed to capture the structural information hidden in the LTSS matrix.At first, the whole matrix is divided into several blocks with a size of n × n; Then, the gradient of each block is calculated to obtain column vector b q ; Finally, all these vectors are concatenated as a multi-dimensional vector p t , which is named as the upper right triangular block-based descriptor: where T stands for transpose; q is the number of blocks.
Sensors 2024, 24, 940 As shown in Figure 3, the detailed procedure to calculate the gradient vector b q is as follows.Taking a block with a size of n × n as an example, the gradient in x direction p x is defined as: where l xn is the column vector of p x ; r n is the column vector of the block.The similar calculation can be used to get the gradient in y direction p y .
Then, an 8-bin histogram-based gradient direction is defined as: The gradient vector b q is formed by counting the number of elements within the range of each gradient direction.
Step 3: Transform the signal sample to a sequence of LTSS feature vectors A sequence of feature vectors p t can be obtained by moving the sliding window with a fixed step size to traverse the entire sample signal, and repeating the above steps.Then, the signal sample can be transformed to a sequence of LTSS feature vectors as: where p j is the feature vector extracted from the jth sliding window, and k is the number of sliding windows.

BoW Model
The LTSS feature extraction leads to much data redundance, and thus brings a large computation burden.The BoW model is a simplifying assumption to construct a global representation from local features, which can be used to improve computing efficiency, making it common in many fields such as natural language processing and image/videobased action recognition [31,32].As shown in Figure 4, for the learning phase, the BoW model performs an adaptive k-means clustering algorithm to sort all original LTSS feature vectors generating a codebook.For a new feature sample, a histogram-based encoding (HBE) strategy is used to encode it based on the codebook.The statistical feature histogram is subsequently computed as the inputs of the CapsNet layers.

BoW Model
The LTSS feature extraction leads to much data redundance, and thus brings a large computation burden.The BoW model is a simplifying assumption to construct a global representation from local features, which can be used to improve computing efficiency, making it common in many fields such as natural language processing and image/videobased action recognition [31,32].As shown in Figure 4, for the learning phase, the BoW model performs an adaptive k-means clustering algorithm to sort all original LTSS feature vectors generating a codebook.For a new feature sample, a histogram-based encoding (HBE) strategy is used to encode it based on the codebook.The statistical feature histogram is subsequently computed as the inputs of the CapsNet layers.Step 1: Form the codebook using clustering algorithm.
Based on Section 2.3.1, the signal samples are represented as a sequence of LTSS feature vectors i Z , and all these samples are then gathered together to form a feature space: Step 1: Form the codebook using clustering algorithm.
Based on Section 2.3.1, the signal samples are represented as a sequence of LTSS feature vectors Z i , and all these samples are then gathered together to form a feature space: where s is the number of samples.In this paper, the typical k-means clustering algorithm is employed to automatically learn the most representative words, i.e., codewords, which are determined by the cluster centers.Further, all these codewords form a codebook with a size of K, i.e., {C i }, i = 1, 2, • • • , K, where K is the cluster number, which has great influence in clustering results.
Several approaches have been proposed to select the appropriate cluster number [35].Among them, the Davis-Bouldin (DB) index is a promising method because of its simplicity, which is defined as the average similarity measure of each cluster with its most similar cluster, and is expressed as: where the similarity D i,j is the ratio of within-cluster distances to between-cluster distances (see Figure 4), and the expression is: where d i denotes the average distance of all points in the i-th cluster to the cluster center, d j denotes the average distance of all points in the j-th cluster to the cluster center, d i,j denotes the Euclidean distance between the i-th and j-th cluster centers.
According to Equations ( 10) and (11), the smaller the DB index, the better the clustering results.So, the best cluster number corresponds to the minimum DB index.
Step 2: Encode the feature sample using histogram-based encoding (HBE) strategy Histogram is an accurate representation of the distribution of numerical data, which has been widely employed in image processing, quality evaluation and time-series processing [29].Assume that a new signal sample Y has been expressed as the LTSS feature vector form, for each point of the LTSS feature vector, the Euclidean distance between it and all the codewords in the codebook are calculated, and the codeword with minimum Euclidean distance is assigned to replace this point.Thus, the LTSS feature vector is described by a series of nearest codewords, and the frequency of each codeword is gathered to construct the statistical feature histogram, i.e., H = {h 1 , h 2 , • • • , h K }.

Capsule Network for Decision-Making
The framework of CapsNet is shown in Figure 5. CapsNet usually contains a primary capsule layer and digital capsule layer.Different from the traditional neural network, the main improvements of CapsNet are: (1) traditional scalar neurons are replaced by capsule vectors to further mine the spatial information of features; (2) the dynamic routing algorithm is adopted to transmit information between the primary capsule layer and digital capsule layer, which effectively reduces the loss of feature information.
The specific parameters of CapsNet include the number and dimension of the capsules.For the primary capsule layer, the number of capsules is determined by the best cluster number K of the BoW model, and the dimension of the capsules is the same as the number of sliding windows used in the LTSS model.For the digital capsule layer, the number of digital capsules is the number of classifications, and the module length of the digital capsule vector represents the classification probability.Since the digital capsules are independent of Sensors 2024, 24, 940 9 of 20 each other, it can predict multi-labels for test samples when making classification decisions; therefore, CapsNet is adopted as the fault classifier.
The entire process of the dynamic routing algorithm can be divided into four steps as follows.The margin loss function and average threshold adopted for decision-making are described in Sections 2.5 and 2.6, respectively.
capsule layer, which effectively reduces the loss of feature information.
The specific parameters of CapsNet include the number and dimension of the capsules.For the primary capsule layer, the number of capsules is determined by the best cluster number K of the BoW model, and the dimension of the capsules is the same as the number of sliding windows used in the LTSS model.For the digital capsule layer, the number of digital capsules is the number of classifications, and the module length of the digital capsule vector represents the classification probability.Since the digital capsules are independent of each other, it can predict multi-labels for test samples when making classification decisions; therefore, CapsNet is adopted as the fault classifier.
The entire process of the dynamic routing algorithm can be divided into four steps as follows.The margin loss function and average threshold adopted for decision-making are described in Sections 2.5 and 2.6, respectively.Step 1: The input vectors u i of the primary capsule layer are the extracted feature vectors by the previous LTSS-BoW model.Each primary capsule is multiplied by an independent weight matrix to predict the high-level capsule, which can be expressed as: where the subscripts i and j denote the ith primary capsule and jth digital capsule, respectively.W i,j is the weight matrix, and u ⟨j|i⟩ denotes the prediction vector.
Step 2: The output vector s j is obtained by the weighted sum of all the intermediate prediction vectors u ⟨j|i⟩ , which can be expressed as: where c i,j is the coupling coefficient determined by the softmax function, which can be regarded as the connection probability that u ⟨j|i⟩ should be coupled to s j .The process can be expressed as: where k is the number of digital capsules.b i,j is the prior probability of c i,j .In the forward propagation, b i,j is initialized to zero and updated by dynamic routing as Algorithm 1.
Step 3: The final output vector h j of the digital capsule layer can be obtained by the nonlinear mapping of s j using the squashing function.The squashing function can compress the vector modulus length within the range of [0, 1) without changing its orientation, which can be expressed as: Step 4: The dynamic routing process is executed as shown in Algorithm 1 to update b i,j : Sensors 2024, 24, 940 10 of 20 where the dot product h j • u ⟨j|i⟩ is used to evaluate the similarity between the intermediate prediction vector u ⟨j|i⟩ and the output vector h j .The higher the similarity, the larger the values of b i,j and c i,j .The optimal solution of the coupling coefficient c i,j can be obtained by continuous updating.
Ultimately, the final output vector h j is returned, and the modulus length of the vector represents the classification probability p 6:

Margin Loss Function
The margin loss function is adopted as the objective function to calculate the loss value.Compared with the cross-entropy loss function, the boundary loss function can directly measure the similarity between different classes of samples based on the Euclidean distance, which can expand the inter-class differences and reduce the intra-class differences.The expression is: In the formula, k is the number of fault classes.T j is the classification indicator function and T j = 1 represents that the input sample belongs to class j, otherwise T j = 0. p pred j is the predicted probability that the input sample belongs to class j. m + denotes the expected upper bound of the predicted probability when the sample belongs to class j. m − denotes the expected lower bound of the predicted probability when the sample does not belong to class j. λ is the weight penalty factor.

Average Threshold
An adaptive average threshold φ is set to limit the number of output labels.If p pred j is greater than the threshold φ, the jth class label output is 1, which means the jth class exists.Otherwise, the jth class label output is 0, which means the jth class does not exist.The process can be expressed as: where L j is the output label of the jth class.L denotes the set of all the predicted class labels.

Diagnosis Process
Taking advantages of LTSS-BoW-based vibration feature extraction, coupled with CapsNet-based decision-making, a novel framework is proposed to diagnose the compound fault of a planetary gearbox.To summarize, the detailed steps are given below and shown in Figure 6. 1, , , where j L is the output label of the jth class.L denotes the set of all the predicted class labels.

Diagnosis Process
Taking advantages of LTSS-BoW-based vibration feature extraction, coupled with CapsNet-based decision-making, a novel framework is proposed to diagnose the compound fault of a planetary gearbox.To summarize, the detailed steps are given below and shown in Figure 6.

Experimental Verification
To evaluate the effectiveness of the proposed LTSS-BoW-CapsNet diagnosis method, a series of experiments were conducted on our planetary gearbox test rig.The experimental results are analyzed in three aspects: (1) demonstrate the multi-label output results of CapsNet; (2) compare the diagnosis results of our proposed method with other methods; (3) perform feature visualization to further evaluate the feature learning ability of the proposed method on the compound fault diagnosis tasks.

Experimental Setup and Data Description
As shown in Figure 7, the test rig for a planetary gearbox consists of the drive motor, planetary reducer, magnetic powder brake, three-axis acceleration sensor installed on the gearbox and the Dewetron acquisition system.As shown in Figure 8, four kinds of single fault patterns were separately seeded on the planetary gearbox, which are sun gear tooth crack, planet tooth crack, planet tooth surface pitting and ring gear tooth crack, denoted as SC, PC, PP and RC, respectively.Three kinds of compound fault patterns were simulated in the experiments, which are SC-PC, SC-PP and SC-RC, respectively.

Experimental Setup and Data Description
As shown in Figure 7, the test rig for a planetary gearbox consists of the drive motor, planetary reducer, magnetic powder brake, three-axis acceleration sensor installed on the gearbox and the Dewetron acquisition system.As shown in Figure 8, four kinds of single fault patterns were separately seeded on the planetary gearbox, which are sun gear tooth crack, planet tooth crack, planet tooth surface pitting and ring gear tooth crack, denoted as SC, PC, PP and RC, respectively.Three kinds of compound fault patterns were simulated in the experiments, which are SC-PC, SC-PP and SC-RC, respectively.In the experiment, the sun gear is the input component, and the carrier is the output.The rotation speed is set to 1200 rpm, and the load torque is 5 N•m.Setting the sampling frequency to 10,240 Hz, the vibration signal for each normal or faulty pattern is collected with a sampling time of 30 s.The raw signal is divided into non-overlapping signal samples and each sample has 2048 data points.
In order to reduce the impact of raw data on the diagnostic model, the normalization regularization method is adopted to normalize the raw data to between 0 and 1, and the corresponding formula is as follows: ( ) ( )

Experimental Setup and Data Description
As shown in Figure 7, the test rig for a planetary gearbox consists of the drive motor, planetary reducer, magnetic powder brake, three-axis acceleration sensor installed on the gearbox and the Dewetron acquisition system.As shown in Figure 8, four kinds of single fault patterns were separately seeded on the planetary gearbox, which are sun gear tooth crack, planet tooth crack, planet tooth surface pitting and ring gear tooth crack, denoted as SC, PC, PP and RC, respectively.Three kinds of compound fault patterns were simulated in the experiments, which are SC-PC, SC-PP and SC-RC, respectively.In the experiment, the sun gear is the input component, and the carrier is the output.The rotation speed is set to 1200 rpm, and the load torque is 5 N•m.Setting the sampling frequency to 10,240 Hz, the vibration signal for each normal or faulty pattern is collected with a sampling time of 30 s.The raw signal is divided into non-overlapping signal samples and each sample has 2048 data points.
In order to reduce the impact of raw data on the diagnostic model, the normalization regularization method is adopted to normalize the raw data to between 0 and 1, and the corresponding formula is as follows: ( ) ( ) In the experiment, the sun gear is the input component, and the carrier is the output.The rotation speed is set to 1200 rpm, and the load torque is 5 N•m.Setting the sampling frequency to 10,240 Hz, the vibration signal for each normal or faulty pattern is collected with a sampling time of 30 s.The raw signal is divided into non-overlapping signal samples and each sample has 2048 data points.
In order to reduce the impact of raw data on the diagnostic model, the normalization regularization method is adopted to normalize the raw data to between 0 and 1, and the corresponding formula is as follows: Figure 9 shows the normalized time-domain signal for each fault pattern.It can be observed that the gear fault can induce impacts with a fixed period, i.e., t s , t p , t r , in the time-domain signal.Compared with the single fault patterns, the coupling effect between multiple faults makes the vibration characteristics more complicated in compound fault cases.It is worth noting that new fault features occur due to the coupling effects, i.e., t sp and t sr ; meanwhile, the single fault-induced features are also deformed.Therefore, it is difficult to manually identify the compound fault components from the raw signal.
time-domain signal.Compared with the single fault patterns, the coupling effect between multiple faults makes the vibration characteristics more complicated in compound fault cases.It is worth noting that new fault features occur due to the coupling effects, i.e., sp t and s r t ; meanwhile, the single fault-induced features are also deformed.Therefore, it is difficult to manually identify the compound fault components from the raw signal.For the proposed diagnostic approach based on LTSS-BoW and CapsNet, three compound fault diagnosis tasks shown in Table 1 are set up.The normal and single fault samples are used for training based on the 5-fold cross-check method.After the model training process, the compound fault samples are used for testing.The trained model needs to predict multiple fault labels for the compound fault test samples based on the knowledge learned from the single fault samples.The length of sliding window m = 2∆t + 1 and cluster center number K have a great influence in the calculation efficiency and accuracy.Considering the calculation complexity of the LTSS matrix, the parameter ∆t is set to take value from the range [4, 16] with a step size of 3. K is adaptively determined by DB index as described in Section 2.3.2.The smaller the DB index, the better the clustering results, so the best cluster number is set to 125 as shown in Figure 10.For each sample, the output matrix size is 125 × 5 after extracting basic features through the LTSS-BoW model.
pound fault diagnosis tasks shown in Table 1 are set up.The normal and single fault samples are used for training based on the 5-fold cross-check method.After the model training process, the compound fault samples are used for testing.The trained model needs to predict multiple fault labels for the compound fault test samples based on the knowledge learned from the single fault samples.

Parameters of CapsNet
The extracted features are fed into the primary capsule layer as inputs.The number of digital capsules is determined by the number of categories to be classified.During the training process, the Adam optimizer with the initial learning rate of 0.001 is adopted to update the parameters.The iteration of dynamic routing r is set to 3. The batch size is set to 10.The margin loss function is adopted to calculate the loss value.The structural parameters of the network in this paper are greatly reduced, which is more conducive to improve the training speed.The specific parameters used for the LTSS-BoW model and CapsNet are summarized in Table 2.
Our approach is based on the Pytorch framework and trained on an NVIDIA RTX3070 GPU.

Parameters of CapsNet
The extracted features are fed into the primary capsule layer as inputs.The number of digital capsules is determined by the number of categories to be classified.During the training process, the Adam optimizer with the initial learning rate of 0.001 is adopted to update the parameters.The iteration of dynamic routing r is set to 3. The batch size is set to 10.The margin loss function is adopted to calculate the loss value.The structural parameters of the network in this paper are greatly reduced, which is more conducive to improve the training speed.The specific parameters used for the LTSS-BoW model and CapsNet are summarized in Table 2. Our approach is based on the Pytorch framework and trained on an NVIDIA RTX3070 GPU.

Comparative Analysis
To validate the effectiveness of the proposed LTSS-BoW-CapsNet method, four models were selected to compare the diagnosis performance.(iii).CNN-CapsNet model [26].This method uses a convolution network as a feature extractor, and a capsule network as the classifier.The parameter settings are described in Ref. [26].
In our experiments, each diagnosis task was performed independently ten times to obtain the average accuracy.Table 4 lists the accuracies of the four models on the compound fault diagnosis tasks.The results show that the SVM-based, kNN-based and CNNbased models failed in all three tasks due to the limitation of the classification principle.(iii).CNN-CapsNet model [26].This method uses a convolution network as a feature extractor, and a capsule network as the classifier.The parameter settings are described in Ref. [26].
In our experiments, each diagnosis task was performed independently ten times to obtain the average accuracy.Table 4 lists the accuracies of the four models on the compound fault diagnosis tasks.The results show that the SVM-based, kNN-based and CNN-based models failed in all three tasks due to the limitation of the classification principle.The CNN-CapsNet model only identifies the fault components of SC-RC and failed in the tasks SC-PC and SC-PP.The proposed LTSS-BoW-CapsNet method performed well in all tasks with an accuracy of more than 97%.This demonstrates that the proposed method can identify via decoupling the fault components and have better stability for different types of compound fault diagnosis.To further visually analyze the diagnosis results, the confusion matrices and the label outputs of four diagnosis methods (LTSS-BoW-SVM, CNN, CNN-CapsNet and the proposed LTSS-BoW-CapsNet) for three diagnosis tasks are displayed in Figures 12 and 13   Additionally, a t-SNE visual diagram is used to downscale the deep feature embedding and obtain the feature distribution of the CNN-CapsNet and LTSS-BoW-CapsNet models.As shown in Figure 14, comparing the feature results extracted by the high-level capsule layer in task 1, it can be seen that the class spacing between the SC-PC pattern and SC pattern is much closer than the class spacing with the PC pattern for the CNN-CapsNet method, which makes it easy to mistakenly identify the SC-PC class as SC.However, the class spacing distribution between the SC-PC pattern and two single fault patterns is uniform and clear for the LTSS-BoW-CapsNet method, so the fault components contained in the compound fault can be effectively identified.It indicates that the use of LTSS-BoW can enhance the ability of the network model to extract coupling features.It can be clearly seen that the LTSS-BoW-SVM and CNN models only output single class labels.The reason is that the traditional classifiers identify the most obvious features and output the most likely single fault label for compound fault diagnosis task.Therefore, the traditional classifiers cannot output multiple independent labels at the same time; it is unable to identify via decoupling the fault components contained in the compound fault.
CNN-CapsNet identifies the SC and RC fault components contained in the SC-RC compound fault, while wrongly identifying the SC-PC and SC-PP compound faults as an SC single fault.The reason could be the fault features of SC are more obvious than PC and PP faults in the compound fault signals.Therefore, the CNN has the limitation in compound fault feature extraction, especially in the case that one fault component has greater influence than the other one.
The proposed LTSS-BoW-CapsNet model successfully identifies the fault components contained in the compound faults in three tasks, which indicates that the LTSS-BoW model has better feature extraction ability.Moreover, the CapsNet model can output multi-labels due to its unique classification principle.Above all, the proposed model has significant advantages in compound fault diagnosis.
Additionally, a t-SNE visual diagram is used to downscale the deep feature embedding and obtain the feature distribution of the CNN-CapsNet and LTSS-BoW-CapsNet models.As shown in Figure 14, comparing the feature results extracted by the high-level capsule layer in task 1, it can be seen that the class spacing between the SC-PC pattern and SC pattern is much closer than the class spacing with the PC pattern for the CNN-CapsNet method, which makes it easy to mistakenly identify the SC-PC class as SC.However, the class spacing distribution between the SC-PC pattern and two single fault patterns is uniform and clear for the LTSS-BoW-CapsNet method, so the fault components contained in the compound fault can be effectively identified.It indicates that the use of LTSS-BoW can enhance the ability of the network model to extract coupling features.

Conclusions
In this paper, a novel LTSS-BoW-CapsNet framework is proposed to diagnose the compound fault of a planetary gearbox.An improved LTSS-BoW feature extractor is conto feature vectors, which has the advantages of high feature extraction efficiency and strong robustness.Then, a multi-label classifier based on CapsNet is designed.The dynamic routing algorithm and average threshold are adopted to predict multi-labels for compound fault components recognition.
The effectiveness of the proposed LTSS-BoW-CapsNet method is evaluated by processing three compound fault diagnosis tasks.The experimental results demonstrate that our proposed approach can effectively identify via decoupling the multi-fault components contained in the compound fault signals of planetary gearbox.The testing accuracy is more than 97%, which is better than the other four traditional classification models.The trained model can only use the fault knowledge learned from the labeled single fault training samples to identify the fault components of compound fault test samples.Therefore, it can solve the problem that the compound fault samples are insufficient in practice.
This research only realized the diagnosis of compound faults containing two types of faults.However, the compound fault of a planetary gearbox could be more complex in practice.Therefore, in future work, the proposed method would be improved by using multi-channel signal fusion and feature fusion, so that it can identify more fault components via decoupling and achieve better identification performance.

Figure 1 .
Figure 1.The simulated vibration signals of a planetary gear set with gear cracks.(a) Planetary gear set; (b) planet gear crack; (c) sun gear crack; (d) compound gear cracks.

Figure 1 .
Figure 1.The simulated vibration signals of a planetary gear set with gear cracks.(a) Planetary gear set; (b) planet gear crack; (c) sun gear crack; (d) compound gear cracks.
24,  x FOR PEER REVIEW 5 of 22 label classification decisions.The overall framework of the proposed method is shown in Figure2.Each part is described in the following.

Figure 2 .
Figure 2. Overall framework of the proposed method.
) Extract the gradient feature of the LTSS matrix.The upper right triangular elements of the LTSS matrix are divided into several blocks.Then, the gradient of each block is calculated to construct the block-based descriptor.(3) Transform the signal sample to a sequence of LTSS feature vectors.The signal samples are represented as a sequence of LTSS feature vectors, and all these samples are then gathered together to form a feature space.The detailed steps are given as follows.

Figure 2 .
Figure 2. Overall framework of the proposed method.

2. 3 .
LTSS-BoW Feature Extractor 2.3.1.LTSS Feature Extraction LTSS is an optimized data representation method that utilizes local structural information of time-series.Compared with conventional statistical features, LTSS feature matrices contain more useful information including sequential characteristics and change trends.

Figure 3
shows the proposed LTSS feature extraction scheme, which includes three steps: (1) Construct the LTSS matrix from the raw signal.At first, the sliding window is utilized to divide the raw signal sample into a local subseries.On this basis, LTSS matrices of the subseries are constructed.(2) Extract the gradient feature of the LTSS matrix.The upper right triangular elements of the LTSS matrix are divided into several blocks.Then, the gradient of each block is calculated to construct the block-based descriptor.(3) Transform the signal sample to a sequence of LTSS feature vectors.The signal samples are represented as a sequence of LTSS feature vectors, and all these samples are then gathered together to form a feature space.The detailed steps are given as follows.

Step 1 :
Construct the LTSS matrix from the raw signal The raw vibration signals measured under each health condition are divided into non-overlapping and equal-length signal samples.The sliding window with a length of 2 1 m t = Δ + is utilized to collect the data points around time point t from time step t t − Δ to t t + Δ .The dataset ( ) y t can be denoted as:

Figure 4 .
Figure 4.The flowchart of BoW model.

Figure 4 .
Figure 4.The flowchart of BoW model.

3 :
Set the number of iterations T 4: For r = 1 to T do 5: c r i,j = softmax b r−1 i,j

( 1 )
Collect the vibration signals of the planetary gearbox in different health states, divide the raw signal into equal-length signal samples and normalize the data samples.(2) Divide the dataset into a training dataset and a test dataset.Note that the training dataset only contains the normal and single fault samples.The test dataset is composed of compound fault samples.(3) Design the LTSS-BoW feature extractor and convert all the samples into feature matrices.(4) Train the CapsNet model based on the training dataset.The trained model is used to identify the fault components of the test samples and output the predicted probability of each fault class.(5) Compare the predicted probability of each fault class with the average threshold for class label output.

( 1 )
Collect the vibration signals of the planetary gearbox in different health states, divide the raw signal into equal-length signal samples and normalize the data samples.(2) Divide the dataset into a training dataset and a test dataset.Note that the training dataset only contains the normal and single fault samples.The test dataset is composed of compound fault samples.(3) Design the LTSS-BoW feature extractor and convert all the samples into feature matrices.(4) Train the CapsNet model based on the training dataset.The trained model is used to identify the fault components of the test samples and output the predicted probability of each fault class.(5) Compare the predicted probability of each fault class with the average threshold for class label output.

Figure 6 .
Figure 6.The diagnosis flowchart of proposed method.Figure 6.The diagnosis flowchart of proposed method.

Figure 6 .
Figure 6.The diagnosis flowchart of proposed method.Figure 6.The diagnosis flowchart of proposed method.
and cluster center number K have a great influence in the calculation efficiency and accuracy.Considering the calculation complexity of the LTSS matrix, the parameter t Δ is set to take value from the range [ ] 416 ， with a step size of 3. K is adaptively determined by DB index as described in Section 2.3.2.The smaller the DB index, the better the clustering results, so the best cluster number is set to 125 as shown in Figure 10.For each sample, the output matrix size is 125 5 × after extracting basic features through the LTSS-BoW model.

Figure 11 .
Figure 11.The predicted probability values for each pattern in task 1.
The comparison models include an SVM-based model, a kNN-based model, a CNN-based model and a CNN-CapsNet model, which are briefly described below: (i).SVM-based and kNN-based models.To compare the effect of a classifier, two widely used classifiers SVM and kNN are used for making classification decisions.These two methods extract features based on the same LTSS-BoW model.(ii).CNN-based model.CNN is a typical neural network with convolution and pooling operations.The classical LetNet5 model is used here for comparison.

Figure 11 .
Figure 11.The predicted probability values for each pattern in task 1.

3. 3
.2. Comparative Analysis To validate the effectiveness of the proposed LTSS-BoW-CapsNet method, four models were selected to compare the diagnosis performance.The comparison models include an SVM-based model, a kNN-based model, a CNN-based model and a CNN-CapsNet model, which are briefly described below: (i).SVM-based and kNN-based models.To compare the effect of a classifier, two widely used classifiers SVM and kNN are used for making classification decisions.These two methods extract features based on the same LTSS-BoW model.(ii).CNN-based model.CNN is a typical neural network with convolution and pooling operations.The classical LetNet5 model is used here for comparison.

Table 4 .
Diagnosis accuracies of all the models.