Rolling Bearing Fault Diagnosis Based on Time-Frequency Compression Fusion and Residual Time-Frequency Mixed Attention Network

Abstract: Traditional rolling bearing diagnosis algorithms suffer from insufficient information in the time-frequency images and poor feature extraction ability of the diagnosis model, which limits the improvement of diagnosis performance. In this article, both the time-frequency image input and the intelligent diagnosis algorithm are optimized. First, the characteristics of two advanced time-frequency analysis algorithms, the multisynchrosqueezing transform (MSST) and the time-reassigned multisynchrosqueezing transform (TMSST), are analyzed in depth. Then, we propose time-frequency compression fusion (TFCF) and a residual time-frequency mixed attention network (RTFANet). TFCF stacks the two time-frequency images into a dual-channel image, which makes full use of the multi-channel feature fusion performed by the convolution kernels of a convolutional neural network. RTFANet assigns attention weights to the channel, time and frequency dimensions of the time-frequency images, so that the model attends to crucial time-frequency information. Meanwhile, a residual connection is introduced into the attention weighting to reduce the information loss of the feature maps. Experimental results show that the method converges after seven epochs and reaches a recognition rate of 99.86%. Compared with other methods, the proposed method has better robustness and precision.


Introduction
Bearings are among the essential parts of rotating machinery, and bearing damage can cause serious machine failures with incalculable consequences. Bearing fault diagnosis has therefore become a research hot spot. However, under actual working conditions, the fault signals of rotating machinery are difficult to identify accurately because of the complexity of the working conditions and the influence of noise.
At present, a series of time-frequency analysis methods have been proposed to solve these problems, for example, the short-time Fourier transform (STFT) [1], the continuous wavelet transform (CWT) [2] and the S-transform (ST) [3]. The essence of time-frequency analysis is to transform a one-dimensional time-domain signal into a two-dimensional time-frequency image that reflects how each frequency component of the signal varies with time. Many scholars have applied it to fault diagnosis. Ma et al. [4] presented a condition monitoring method based on a deep belief network (DBN) optimized by the multi-order fractional Fourier transform (FRFT) and the sparrow search algorithm (SSA). First, they used a fractional Fourier transform based on curve feature segmentation to filter the fault vibration signals and extract the fault characteristic frequencies. Then, the fault features were input into the SSA-DBN model for training, and the bearing fault features were classified, recognized and diagnosed. Zhu et al. [5] extracted the time-frequency characteristics of bearing signals [...] rolling bearing fault diagnosis is still how to establish a high-precision and high-efficiency fault diagnosis model [41,42]. Therefore, it is crucial to integrate the information of multiple time-frequency images and to design diagnostic models with stronger feature extraction ability and better performance. Meanwhile, the above methods have great advantages at constant speed, but their advantages are less obvious at variable speed. This paper proposes a fault diagnosis algorithm for the variable-speed case.
Given the above problems, this paper proposes time-frequency compression fusion (TFCF) and a residual time-frequency mixed attention network (RTFANet). First, the two time-frequency images obtained by TMSST and MSST are fused, which transforms the vibration signals into dual-channel time-frequency images. Then, an attention mechanism is introduced along the channel, time and frequency dimensions and combined with a residual connection. Within the framework of a convolutional neural network, the model can selectively focus on essential time-frequency information, avoid information overload, and extract effective features, which addresses the weak generalization ability of the model.
Figure 1 shows the overall framework of the proposed method, and the following subsections provide details of TFCF and RTFANet. As can be seen from Figure 1, the input of the RTFANet model is a TFCF dual-channel time-frequency image, and the output is the probability that this image belongs to the healthy, inner race fault or outer race fault class. RTFANet first applies a convolution operation to the input image. After each convolution operation, the ReLU activation function improves the nonlinear expression ability of the model, and max pooling reduces the size of the feature map to 1/2 of the original. Then, the residual time-frequency mixed attention (RTFA) module enhances the vital information of the feature map. After RTFA, the second convolution operation is carried out and the tensor is reshaped. The result is fed into a fully connected (FC) layer, and the softmax classifier outputs the probability of each class. The details of TFCF and RTFA in Figure 1 are described in subsequent sections.
TMSST redistributes the coefficients of the time-frequency points to the time positions indicated by the group delay estimate, which completes one synchrosqueezing (synchronous compression) transformation and yields a new time-frequency plane on which the time redistribution is repeated. The group delay estimate of the signal can be expressed as:

where τ is the time shift factor and v is the frequency shift factor. When the signal does not satisfy the ideal pulse signal model, there is some error between the group delay estimate calculated by Equation (1) and the real time. Fortunately, it has been shown in the literature [12] that this error can be reduced by computing a new group delay estimate in which the original estimate t̂(τ, v) is substituted for τ.
where t indicates the actual time of the signal. Because t̂(t̂(τ, v), v) is an iteration, this operation can be performed several times. As the number of iterations increases, the estimated group delay gets closer to the actual time.
After the new group delay estimate is obtained, the time-frequency coefficients of the traditional STFT can be redistributed. The process can be expressed as:
where TSTFT(t, v) is the traditional short-time Fourier transform of the model signal.
TMSST^[N](τ, v) is the final time-frequency representation after N iterations of the time-redistribution synchrosqueezing transformation. The larger N is, the more times the coefficients are compressed and the better the energy concentration of the time-frequency image.
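As a rough illustration of the iterate-then-redistribute procedure described above, the following Python sketch works on a discrete time-frequency grid. It is not the authors' implementation: the function names, the integer-index estimator table `est_idx` and the toy pulse example are assumptions made purely for illustration, and a real TMSST would compute the group delay estimate from the STFT rather than from a hand-built table.

```python
import numpy as np

def iterate_estimator(est_idx, n_iter):
    """Apply a time-index estimator to itself n_iter times in total.

    est_idx[i, k] is a (hypothetical) rounded group delay estimate, given as
    a time index, for the grid point with time index i and frequency index k.
    """
    freq = np.arange(est_idx.shape[1])[None, :]
    current = est_idx.copy()
    for _ in range(n_iter - 1):
        # Discrete analogue of substituting the previous estimate for tau.
        current = est_idx[current, freq]
    return current

def reassign_along_time(coeffs, target_idx):
    """Move every coefficient to its target time index within its frequency bin."""
    out = np.zeros_like(coeffs)
    freq = np.broadcast_to(np.arange(coeffs.shape[1]), coeffs.shape)
    np.add.at(out, (target_idx, freq), coeffs)  # accumulate coincident targets
    return out

# Toy example: a pulse at time index 12 whose energy is smeared over all
# 25 time indices; the toy estimator halves the distance to the pulse.
n_time, n_freq, pulse = 25, 3, 12
offset = np.arange(n_time) - pulse
est_1d = pulse + np.sign(offset) * (np.abs(offset) // 2)
est_idx = np.repeat(est_1d[:, None], n_freq, axis=1)
coeffs = np.ones((n_time, n_freq))

once = reassign_along_time(coeffs, iterate_estimator(est_idx, 1))
four = reassign_along_time(coeffs, iterate_estimator(est_idx, 4))
print(np.count_nonzero(once[:, 0]), np.count_nonzero(four[:, 0]))
```

In this toy case a single redistribution still leaves the energy spread over several time indices, whereas four iterations concentrate it at the pulse position, which mirrors the statement that a larger N gives better energy concentration. The total of the coefficients is unchanged by the redistribution.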

Multisynchrosqueezing Transform
Unlike TMSST, MSST redistributes the coefficients of the time-frequency points to the frequency positions indicated by the instantaneous frequency estimate, which completes one synchrosqueezing (synchronous compression) transformation and yields a new time-frequency plane on which the frequency redistribution is repeated. The instantaneous frequency estimate of the signal can be expressed as:
When the signal does not satisfy the complex sinusoid model, there is some error between the instantaneous frequency estimate calculated by Equation (5) and the actual frequency. Fortunately, the literature [11] has demonstrated that this error can be reduced by computing a new instantaneous frequency estimate in which the original estimate ω̂(τ, v) is substituted for v.
where ω indicates the actual frequency. The iteration ω̂(τ, ω̂(τ, v)) can be performed multiple times. As the number of iterations increases, the estimated instantaneous frequency gets closer to the real frequency. The iterative process is similar to that of TMSST.
After the new instantaneous frequency estimate is obtained, the time-frequency coefficients of the improved STFT can be redistributed. The process can be expressed as follows:
where MSTFT(τ, ω) is the improved short-time Fourier transform of the model signal.
MSST^[N](τ, v) is the final time-frequency representation after N iterations of the frequency-redistribution synchrosqueezing transformation. The larger N is, the more times the coefficients are compressed and the better the energy concentration of the time-frequency image.

Comparison of the Two Methods
TMSST and MSST adopt different forms of the short-time Fourier transform, which can be understood from the perspective of signal reconstruction. Let the direct current component of the window function be nonzero, that is, G*(0) ≠ 0, where * denotes the complex conjugate. The inverse transformation formula of TMSST can be expressed as:
where F^{−1}{·} represents the inverse Fourier transform operator. Assuming g*(0) ≠ 0, the inverse transformation formula of MSST can be expressed as:
The proof process of combined Equations (1) and (6) is as follows:
The proof process of combined Equations (3) and (6) is as follows:
From the derivation of Equations (7) and (8), it can be seen that whether a reassignment transform can be reconstructed depends on the short-time Fourier transform it selects. The reconstruction of the traditional STFT is completed by integrating along the time axis. Redistributing the time-frequency coefficients along the time axis before reconstruction therefore does not affect the final reconstruction result, so the traditional STFT is suitable for TMSST. The reconstruction of the improved STFT is completed by integrating along the frequency axis. Redistributing the time-frequency coefficients along the frequency axis before reconstruction does not affect the final reconstruction result, so the improved STFT is suitable for MSST. The reconstruction properties of TMSST and MSST thus determine their redistribution modes: they redistribute the time-frequency coefficients of the time-frequency image to the group delay estimate and to the instantaneous frequency estimate of the signal, respectively. The more iterations there are, the closer the group delay estimate and the instantaneous frequency estimate are to the time-frequency ridge. However, the time-frequency ridges of different signals take different forms in the time-frequency image. The ridge of a slowly varying component (SCSVF) tends toward the horizontal, while that of a rapidly varying component (SCRVF) tends toward the vertical.
To further investigate the applicable scenarios of TMSST and MSST, we define a simulation signal that includes fast and slow variable components. The simulation signal x_e(t) is defined as follows:
Figure 2 shows the time-domain waveform and related time-frequency images of the simulation signal. Figure 2a,b are the time-domain waveform and the STFT time-frequency image of the simulation signal. Compared with the time-domain waveform, the time-frequency image clearly describes how the signal frequency changes over time and therefore expresses the characteristics of the signal better. Figure 2c,f are the TMSST and MSST time-frequency images of the simulation signal, respectively. The labels 1 and 2 denote the two regions marked by red square boxes in the figure, and the small red arrows represent the compression direction. The frequency changes slowly in region 1 and quickly in region 2. Time redistribution moves each time-frequency coefficient in Figure 2b to a new position along the time axis according to the group delay estimate calculated at that time-frequency point, which converts Figure 2b into Figure 2c. Frequency redistribution moves each time-frequency coefficient along the frequency axis according to the instantaneous frequency estimate, as shown in Figure 2f. Comparing Figure 2c,f shows that time redistribution compresses well in region 2 but disperses the time-frequency energy in region 1, while frequency redistribution does the opposite. Although both methods suffer from energy dispersion, their advantages and disadvantages are complementary, so the two methods can be combined for signal analysis.

Time-Frequency Compression Fusion
According to the above analysis, the reconstruction method of each STFT determines the redistribution method to which it applies. The traditional STFT is reconstructed by integrating along the time axis, so it is suitable for the time-reassigned multisynchrosqueezing transform. The improved STFT is reconstructed by integrating along the frequency axis, so it is suitable for the frequency-redistribution multisynchrosqueezing transform. At the same time, the two redistribution methods apply to different signal components. Time redistribution compresses horizontally on a time-frequency image and applies to SCRVF. Frequency redistribution compresses vertically on a time-frequency image and is more suitable for SCSVF. The two methods complement each other, which enhances their application value. Therefore, we propose a time-frequency compression fusion method that fuses the information of the two time-frequency images obtained by TMSST and MSST. Since the scale ranges of the time-frequency coefficients of the two images are not consistent, the coefficients are normalized before fusion.
where f = ω/2π, TFI_min and TFI_max are the minimum and maximum values of all time-frequency coefficients in the time-frequency image, respectively, and TFI(t, f) is the result after normalization. The fusion method is named time-frequency compression fusion (TFCF). TFCF stacks the two time-frequency images into a dual-channel image, which makes full use of the multi-channel feature fusion performed by the convolution kernels of a convolutional neural network and is therefore suitable for deep learning diagnosis methods.
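The normalization and fusion steps can be sketched as follows. This is a minimal illustration that assumes min-max scaling, as the definitions of the minimum and maximum coefficients suggest, and a channel-first array layout; the function names, the random stand-in images and the 64 × 64 size are hypothetical and do not come from the paper.

```python
import numpy as np

def min_max_normalize(tfi):
    """Scale the coefficients of one time-frequency image to the range [0, 1]."""
    tfi = np.asarray(tfi, dtype=float)
    lo, hi = tfi.min(), tfi.max()
    if hi == lo:
        # A constant image carries no contrast; avoid division by zero.
        return np.zeros_like(tfi)
    return (tfi - lo) / (hi - lo)

def tfcf(tmsst_img, msst_img):
    """Stack the two normalized images into one dual-channel image."""
    return np.stack(
        [min_max_normalize(tmsst_img), min_max_normalize(msst_img)], axis=0
    )

# Stand-in magnitude images with deliberately different scale ranges.
rng = np.random.default_rng(0)
tmsst_img = 50.0 * rng.random((64, 64))
msst_img = 0.2 * rng.random((64, 64))

fused = tfcf(tmsst_img, msst_img)
print(fused.shape, fused.min(), fused.max())
```

Normalizing each image separately before stacking keeps the channel with the larger raw scale from dominating the other one when both are passed to the same convolution kernels.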

Residual Time-Frequency Mixed Attention Network
The convolutional neural network has shown excellent performance in image feature extraction. However, as the complexity of the network model increases, gradient explosion or vanishing easily occurs and degrades the model performance. The residual network structure [43] is widely used in various network models because its skip connections effectively alleviate gradient explosion and vanishing. A deep convolutional neural network has many parameters and performs image classification tasks well when the sample size is sufficient, whereas insufficient sample size is a common problem in practical engineering. The sample size of the vibration signals of variable-speed rolling bearings used in this article is small, with only 1200 samples for each health condition. Feeding them directly into a deep network for learning easily causes overfitting and poor performance on the test set. Moreover, in a classification task, only a few important contents of the image contribute to the recognition result, and the remaining redundant information easily interferes with network learning and reduces network performance. Therefore, the RTFANet model is proposed. The residual time-frequency mixed attention (RTFA) module is designed and embedded into the convolutional neural network to fully extract important time-frequency features and improve the classification accuracy of the model.

Residual Time-Frequency Mixed Attention Module
The attention mechanism was first proposed by Bahdanau et al. [44] based on the observation rules of the visual system. In essence, it is a mechanism for allocating resources to the object of attention, that is, allocating resources according to the importance of the object. The critical parts need to be allocated more than the other parts. In deep learning, the resources allocated by the attention mechanism are reflected in weight. The information related to the recognition task is weighted more heavily, while the irrelevant information is weighted less [45].
Introducing the attention mechanism into the convolutional neural network makes the network model pay more attention to the region of interest in the input, so that the model ignores irrelevant features and focuses on the essential features to be extracted. The residual time-frequency mixed attention module proposed in this paper covers the channel, time and frequency dimensions. As can be seen from Figure 1, RCA, RTA and RFA are the three components of the residual time-frequency mixed attention module, and the module stacks the three components in sequence. Given an input feature map, RCA attends to the time-frequency images of different channels. Then, RTA and RFA attend to SCRVF and SCSVF, respectively, and ignore unimportant interference information. The residual time-frequency mixed attention module does not increase the depth of the model but expands its width, which further improves the performance of the network model. The details of the RCA, RTA and RFA components are shown in Figure 3. To improve the recognition performance of the convolutional neural network on the feature maps of TFCF dual-channel time-frequency images, RCA, RTA and RFA are proposed to focus on the valuable information in the feature maps. First, the three dimensions of the input feature map M are defined as C, T and F, corresponding to the time-frequency image channel, time and frequency, respectively. As can be seen from Figure 3, RCA, RFA and RTA differ only in their input and output. RCA does not change the dimension order of the input feature map, while RFA and RTA perform a generalized transpose of the input feature map to realize the attention mechanism over C, T and F. Take RCA as an example, with input feature map M ∈ R^{T×F×C}. Global average pooling and global max pooling are performed on M, and the element-wise sum of the two results fuses the information of the entire time-frequency plane into a channel descriptor M_c ∈ R^{1×1×C}.
Then, to further extract the effective information of M_c, this paper processes it with two convolution operations and adds the ReLU function after each convolution operation to improve the nonlinear expression ability of the attention module. The first convolution layer is mainly used for dimensionality reduction, with the reduction ratio set to r = 2. The second convolution layer restores the dimension. Finally, the sigmoid function constrains the element values of the channel descriptor M_c to between 0 and 1, which gives the channel weight information A_c ∈ R^{1×1×C}. The channel attention feature matrix U_c ∈ R^{T×F×C} is obtained by weighting the input feature map M with A_c. The above process can be expressed as:
where R(·) represents the ReLU function, W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)} represent the weights of the two convolution operations, respectively, and σ(·) represents the sigmoid activation function. Finally, to reduce the information loss of the channel attention feature matrix after channel weighting, the residual structure fuses the channel attention feature matrix with the input feature map to obtain the final channel attention feature map U_c ∈ R^{T×F×C}.
The other components, RTA and RFA, differ from RCA only in the dimension positions of input features and output features after generalized transpose of input feature mapping, but the computing process is consistent.
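The computation described above can be sketched in plain NumPy. This is an illustrative reading of the text rather than the authors' code: the function name `residual_attention`, the random weights and the small tensor sizes are assumptions, biases are omitted, and each 1 × 1 convolution on a 1 × 1 descriptor is written as a matrix product, to which it is equivalent. The residual fusion is assumed to be an element-wise addition.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual_attention(m, w1, w2, axis):
    """Residual attention along one axis of a (T, F, C) feature map.

    axis=2 plays the role of RCA, axis=0 of RTA and axis=1 of RFA; moving
    the attended axis to the last position stands in for the generalized
    transpose mentioned in the text.
    """
    x = np.moveaxis(m, axis, -1)
    # Global average pooling plus global max pooling over the other two axes.
    descriptor = x.mean(axis=(0, 1)) + x.max(axis=(0, 1))
    # Reduce by ratio r, restore, ReLU after each step, then sigmoid.
    weights = sigmoid(relu(w2 @ relu(w1 @ descriptor)))
    # Weight the input and add the input back (residual connection).
    out = x * weights + x
    return np.moveaxis(out, -1, axis), weights

rng = np.random.default_rng(0)
T, F, C, r = 8, 6, 4, 2
m = rng.standard_normal((T, F, C))

def random_weights(n):
    # Hypothetical untrained weights with the shapes given in the text.
    return rng.standard_normal((n // r, n)), rng.standard_normal((n, n // r))

u, a_c = residual_attention(m, *random_weights(C), axis=2)  # channel
u, a_t = residual_attention(u, *random_weights(T), axis=0)  # time
u, a_f = residual_attention(u, *random_weights(F), axis=1)  # frequency
print(u.shape, a_c.shape, a_t.shape, a_f.shape)
```

One consequence of reading the text literally is that, because a ReLU precedes the sigmoid, every attention weight lies between 0.5 and 1, so the module rescales features rather than suppressing them completely.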

Loss Function
Compared to other loss functions, the cross-entropy loss can avoid the gradient dispersion in gradient descent that slows down learning. The cross-entropy loss is therefore a common objective function, and it can be divided into binary and multi-class cross-entropy loss functions. The proposed network model realizes multi-class fault diagnosis based on TFCF time-frequency images of rolling bearing vibration signals. Therefore, the multi-class cross-entropy loss function is adopted, and its expression is as follows:
where N indicates the number of types of rolling bearing faults, y_i indicates the actual label value of category i, and ŷ_i indicates the predicted value of category i.
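For a single sample, the multi-class cross-entropy in its standard form is the negative sum over the N categories of y_i · log(ŷ_i). The sketch below assumes this standard form, a one-hot label and predicted probabilities that already come from a softmax; the function name, the clipping constant `eps` and the example values are illustrative.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Multi-class cross-entropy for one sample.

    y_true is a one-hot label and y_pred holds predicted probabilities;
    eps guards against log(0) and is an implementation convenience.
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))

# Three classes: normal, inner race fault, outer race fault.
label = [0, 1, 0]
confident = cross_entropy(label, [0.05, 0.90, 0.05])
uncertain = cross_entropy(label, [0.40, 0.30, 0.30])
print(round(confident, 4), round(uncertain, 4))
```

With a one-hot label only the predicted probability of the true class contributes, so the loss grows as that probability falls.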

Experiments and Results
Firstly, the collected experimental data are sampled and sorted out, and three types of fault signal are selected for time-frequency analysis. Time-frequency images are input into the proposed neural network, and then ablation experiments and comparative analysis are carried out. Finally, to verify the robustness of the proposed algorithm, tests are carried out under different sample sizes, sampling frequencies and sampling time.

The Experimental Data
The data set used is the bearing dataset of the University of Ottawa [46]. The sampling frequency of the test bench is 200 kHz. The encoder and the acceleration sensor measure the speed and the bearing vibration signals, respectively. The measured data include normal, inner race fault and outer race fault conditions. There are four speed-varying schemes: acceleration (↑), deceleration (↓), acceleration then deceleration (↑↓) and deceleration then acceleration (↓↑). The minimum speed during data collection is 9.9 Hz, and the original signal is segmented into samples of 20,000 sampling points so that each sample contains, as far as possible, one rotation period. The length of each sample is then reduced to 800 points. This gives a total of 3600 samples, 1200 for each condition, which are randomly divided into the training set, validation set and test set in a 6:2:2 ratio. More detailed information is shown in Table 1. To verify the superiority of our proposed method, the ratio between the training set and the test set is kept at 3:1, the test set samples are the same in each experiment, and the training set is randomly selected in proportion from the rest of the samples. In other words, the total sample size is 3600, the test set is fixed at 720 samples (240 for each condition), and in each experiment the 2160 training samples are randomly and evenly selected from the remaining 2880 samples. The time-domain waveforms of some samples are shown in Figure 4a-c. It can be seen that the time-domain waveforms of the vibration signals of variable-speed rolling bearings are complex, and the different fault types contain signal components with frequency transients, which makes it difficult to extract features directly.
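The sample counts quoted above can be checked with a few lines of arithmetic. The sketch only restates the numbers given in the text; the variable names are illustrative.

```python
total_samples = 3600
n_conditions = 3          # normal, inner race fault, outer race fault
ratio = (6, 2, 2)         # training : validation : test

per_condition = total_samples // n_conditions
train, val, test = (total_samples * part // sum(ratio) for part in ratio)
remaining = total_samples - test          # pool the training set is drawn from
test_per_condition = test // n_conditions

# One raw segment: 20,000 points at 200 kHz, compared with one rotation
# period at the minimum speed of 9.9 Hz.
segment_seconds = 20_000 / 200_000
period_seconds = 1 / 9.9

print(per_condition, train, val, test, remaining, test_per_condition)
print(segment_seconds, round(period_seconds, 3))
```

The 6:2:2 split yields 2160, 720 and 720 samples, which matches the 3:1 ratio between training and test sets, and a 0.1 s segment is just under one rotation period (about 0.101 s) at the minimum speed.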

Time-Frequency Image of Vibration Signal
To improve the model's performance, TFCF is used to transform the vibration signals into dual-channel time-frequency images containing both SCRVF and SCSVF. A sample is randomly selected from each fault type, and their STFT, TMSST and MSST time-frequency images are shown in Figure 4. It can be seen that a two-dimensional time-frequency image converted from a one-dimensional vibration signal by a time-frequency analysis algorithm reflects more intuitively how the various frequency components of the vibration signal vary with time.

Model Parameter Setting
For the classification of TFCF time-frequency images of rolling bearing vibration signals, the hyperparameters used to train the network model are set as follows. The model is trained with mini-batch stochastic gradient descent, with a mini-batch size of 8. The Adam algorithm is used to optimize each weight update, and the initial learning rate is set to 0.001. At the same time, L2 regularization is introduced to penalize the weight parameters, with a penalty factor of 0.0001. A step decay (equal-interval attenuation) strategy adjusts the learning rate during training: the adjustment interval is five epochs and the multiplier gamma is 0.5. Other parameters are given in Table 2. The kernel size of the two convolution layers in the middle of the RCA, RTA and RFA modules is 1 × 1, and the corresponding settings in Table 2 refer to the number of output channels of these two convolution layers. In addition, all models are trained and tested using the PyTorch deep learning framework and an NVIDIA GeForce GTX 1650 GPU.
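The learning rate schedule described above can be written out directly. The sketch below is a plain-Python restatement under the assumption that the rate is multiplied by gamma once every five epochs, starting from epoch 0; the function name is illustrative. In PyTorch this presumably corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=5` and `gamma=0.5`, although the paper does not name the scheduler it used.

```python
import math

def step_lr(epoch, base_lr=0.001, step_size=5, gamma=0.5):
    """Learning rate at a given epoch under equal-interval (step) decay."""
    return base_lr * gamma ** (epoch // step_size)

schedule = [step_lr(epoch) for epoch in range(15)]
for epoch in (0, 4, 5, 9, 10, 14):
    print(epoch, schedule[epoch])
```

Under this schedule the model is still training at the initial rate of 0.001 for epochs 0 to 4 and at 0.0005 for epochs 5 to 9.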

Different Time-Frequency Image Input
TFCF images with rich time-frequency information are proposed as the input of our diagnostic model. To verify their superiority, the time-frequency images of STFT, TMSST, MSST and TFCF are each input into RTFANet. Because the STFT, TMSST and MSST images are single-channel, they are duplicated into dual-channel images to keep the model parameters consistent. Ten experiments are conducted for each input; in each experiment the training samples are randomly selected and the diagnostic model is trained until convergence. For the best model of the ten experiments, the training-set loss and test-set accuracy over the iterations are shown in Figure 5. In addition, the average accuracy and standard deviation of the ten experimental results are recorded in Table 3. It can be seen from Figure 5 that, whichever time-frequency image is used as input, the model eventually converges to a small loss value. This indicates that the model has a strong fitting ability, but the final recognition accuracy on the test set differs between inputs. It suggests that the information quality of the different time-frequency images differs, which directly affects the recognition results. Combined with Table 3, it can be seen that STFT performs worst, mainly because its time-frequency energy is too diffuse. MSST and TMSST compress the time-frequency energy, but the time-frequency information of some signal components is lost in compression, so the recognition effect is not good. In contrast, TFCF directly stacks MSST and TMSST into a dual-channel image without eliminating any time-frequency information and achieves the highest average recognition accuracy.
To further explore why TFCF input gives the best result with RTFANet, we investigate the gradient-weighted class activation mapping of the three types of TFCF images [47]. As shown in Figure 6, regardless of the fault type, the SCRVF, the SCSVF and the partially dispersed time-frequency information in the TFCF image all contribute to the final decision of the model. Therefore, it is more advantageous to use a TFCF image, which carries more time-frequency information, as the input of the network model.

Different Model Combinations
To verify the effectiveness of the proposed method, different module combinations are used to identify TFCF images. The average recognition accuracy and standard deviation of ten experiments are recorded in Table 4, where a tick mark under a module indicates that the module is adopted in the recognition method. The training-set loss and validation-set accuracy curves of the best models obtained by the different methods in the ten experiments are shown in Figure 7. It can be seen that the proposed model converges after only 7 epochs, which is faster than the other methods.
Overall, the proposed method has the best recognition effect and the fastest convergence speed. Comparing the experimental results of methods 4 and 5 shows that the recognition effect is greatly improved after CNN is added to the neural network. The same time-frequency energy appearing at different positions of a time-frequency image is essentially different, but the traditional neural network directly reshapes the tensor and completely ignores the position information of the image. Since RCA introduces a residual structure to reduce the information loss of the feature matrix after channel weighting, methods 2 and 3 show that RCA performs better than the channel attention mechanism of the traditional SENet [48]. Comparing the experimental results of method 1 and method 2 shows that using only the channel attention mechanism is not as effective as adding attention in all three dimensions. The main signal components of each fault type are different, and the time and frequency dimensions correspond to SCRVF and SCSVF, respectively, so the network also needs to assign weights along time and frequency. Figure 8 shows the confusion matrix of the best RTFANet model of the ten experiments on the test set. The accuracy on the test set is 99.86%: only one sample with an inner race fault is incorrectly identified as the normal state, and all other samples are correctly identified. Therefore, it can be verified that the model has good generalization ability.
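The reported accuracy is consistent with the test set size given earlier. The confusion matrix below is reconstructed from the text alone (720 test samples, 240 per class, one inner race fault sample predicted as normal) and is an assumption about what Figure 8 contains; the class order is chosen for illustration.

```python
# Rows are true classes, columns are predicted classes,
# in the order: normal, inner race fault, outer race fault.
confusion = [
    [240, 0, 0],
    [1, 239, 0],
    [0, 0, 240],
]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy_percent = 100 * correct / total
print(correct, total, round(accuracy_percent, 2))
```

A single misclassified sample out of 720 gives 719/720, which rounds to the 99.86% quoted in the text.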

Comparisons with Other Methods
To verify the superiority of the proposed algorithm, Table 5 shows the recognition accuracy of different rolling bearing fault diagnosis methods. As can be seen from Table 5, the proposed method achieves the highest average accuracy, 99.86%, under the same working conditions. The other methods in Table 5 fail to extract the complete time-frequency information, and some even take the original signal directly as input, which leads to information overload and increases the training time. Moreover, such models also learn irrelevant information, which affects the recognition accuracy. The proposed method introduces an attention mechanism along the channel, time and frequency dimensions, combined with a residual connection, which obtains useful time-frequency information more effectively and facilitates the subsequent diagnosis.

Model Performance Test
To further test the performance of the proposed method, different sample sizes, sampling times and sampling frequencies of each sample are investigated. The detailed experimental design is shown in Table 6. Ten test experiments are conducted under each design, and the training set and test set are randomly assigned in each experiment in a fixed proportion. The experimental results are shown in Figure 9. Combining Table 6 and Figure 9, in general, a smaller training set, a shorter sampling time or a reduced sampling frequency affects the model performance, but the average accuracy is still no less than 98%. Above a specific threshold condition, the average recognition accuracy of the model is higher than 99.70%, and experiment C is the best, with an average recognition accuracy of 99.90% and a standard deviation of only 0.01%. In addition, according to experiments A, B, C and D, the accuracy decreases noticeably when the sample size of the training set is less than 180. According to experiments C, E and F, the accuracy decreases noticeably when the sampling time is less than 0.05 s. According to experiments E, G and H, the accuracy decreases when the sampling frequency is lower than 4 kHz. In other words, even when there are only 60 training samples of each fault type, the sampling time is only half of the rotation period of the lowest-speed signal, and each sample contains only 200 sampling points, the model still maintains good performance.
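The threshold conditions quoted above are tied together by simple arithmetic, sketched below. The values are taken from the text; the variable names are illustrative, and the even split of the 180 training samples over three conditions is an assumption.

```python
min_speed_hz = 9.9            # lowest shaft speed in the data set
sampling_time_s = 0.05        # threshold sampling time
sampling_freq_hz = 4000       # threshold sampling frequency
train_size = 180              # threshold training set size
n_conditions = 3

rotation_period_s = 1 / min_speed_hz
fraction_of_period = sampling_time_s / rotation_period_s
points_per_sample = round(sampling_time_s * sampling_freq_hz)
train_per_condition = train_size // n_conditions

print(round(rotation_period_s, 3), round(fraction_of_period, 3))
print(points_per_sample, train_per_condition)
```

At the thresholds, one sample covers roughly half a rotation period at the lowest speed and contains 200 points, and each condition contributes 60 training samples.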

Figure 9. Experimental results of model performance test (average accuracy versus experiment serial number).

Conclusions
Fault diagnosis from the vibration signals of rolling bearings suffers from information overload in the time-frequency images and from the difficulty of the diagnosis itself. To solve these problems, we propose a fault diagnosis method based on time-frequency compression fusion and a residual time-frequency mixed attention network. The proposed method is verified on the bearing dataset of the University of Ottawa, and performance tests are carried out under different sample sizes, sampling times and sampling frequencies. The experimental results show that the time-frequency information of fast, slow and diffuse signal components all contributes to the fault identification of the model, and that the TFCF time-frequency image gives full play to the performance of the diagnosis model. The residual time-frequency mixed attention module reduces the information loss after feature matrix weighting and focuses on the important time-frequency information along the channel, time and frequency dimensions of the TFCF image, which accelerates the convergence of model training and improves the recognition accuracy to 99.86%. The proposed diagnosis model not only handles fault diagnosis under normal working conditions but also maintains good performance with a small sample size, a short sampling time and a low sampling frequency, and therefore has broad application prospects.

Conflicts of Interest:
The authors declare no conflict of interest.