Remaining Useful Life Prediction for Rolling Bearings Based on TCN–Transformer Networks Using Vibration Signals

Xiaochao Jin; Yaping Ji; Shiteng Li; Kailang Lv; Jianzheng Xu; Haonan Jiang; Shengnan Fu

doi:10.3390/s25113571

¹ Xi’an Key Laboratory of Extreme Environment and Protection Technology, School of Aerospace Engineering, Xi’an Jiaotong University, Xi’an 710049, China

² Xi’an Institute of Electromechanical Information Technology, Xi’an 710065, China

³ Xi’an Modern Control Technology Research Institute, Xi’an 710065, China

* Authors to whom correspondence should be addressed.

Sensors 2025, 25(11), 3571; https://doi.org/10.3390/s25113571

This article belongs to the Section Fault Diagnosis & Sensors

Abstract

Remaining useful life (RUL) prediction plays a core role in industrial prognostics and health management (PHM), requiring data-driven models with higher predictive capability for accurate long time series prediction. Developing reliable deep learning-based models based on multi-sensor monitoring data is fundamental for accurately predicting vibration trends during bearing operation and is crucial for bearing fault diagnosis and RUL prediction. In this work, a method for constructing a health index based on vibration signal is developed to describe the performance features of rolling bearings, which mainly includes feature extraction, sensitive feature index selection, dimensionality reduction, and normalization methods. In addition, a new RUL prediction method, TCN–Transformer, is developed which can efficiently learn and integrate local and global features of vibration signals, addressing the long time series prediction problem in RUL prediction. The TCN extracts local features, while the Transformer learns global features, both of which are seamlessly integrated through a specially designed feature fusion attention module. Both the health indicator (HI) constructed from extracted time domain and frequency domain feature parameters and the RUL prediction method were rigorously validated using the IEEE PHM 2012 Data Challenge dataset for rolling bearing prognostics. By employing the proposed HI construction method, the average comprehensive bearing performance index, used to evaluate RUL prediction accuracy, is improved by 8.69% across the entire dataset compared to the original feature-based composite index. The proposed RUL prediction model can more accurately predict the RUL of rolling bearings under different conditions, reducing the RMSE and MAE by 14.62% and 9.26%, respectively, and improving the SCORE by 13.04%. These results underscore the efficacy and superiority of our approach in RUL prediction of rotating machinery across varying conditions.

Keywords: rolling bearings; deep learning; remaining useful life prediction; health index; performance degradation

1. Introduction

Deep learning methods have demonstrated exceptional performance across the industrial sector, particularly in health monitoring and the intelligent operation and maintenance of critical industrial components. Prognostics and health management (PHM), which includes fault detection, diagnosis, and remaining useful life (RUL) prediction, has recently attracted great research interest [,,,]. PHM based on deep learning methods has great potential because deploying it provides the opportunity to set efficient, just-in-time, and just-right maintenance strategies [,]. Rolling bearings are key components in rotating machinery, directly affecting the safety of the entire mechanical system. Vibration monitoring is crucial for early fault detection, localization, and differentiation [,,,,]. Developing reliable deep learning-based models based on multi-sensor monitoring data is fundamental for accurately predicting vibration trends during bearing operation and is crucial for bearing fault diagnosis and RUL prediction [,,,].

Generally, PHM primarily employs various techniques to analyze monitoring data, extract discriminative knowledge, and assess the health status of mechanical equipment. It is generally expected to achieve three functions: health status monitoring, fault diagnosis, and RUL prediction. Among them, RUL estimation is considered to be the most challenging task because the continuous use time of mechanical equipment is inconsistent, and it is difficult to accurately extract sensitive degradation features under different degradation modes [,,]. Constructing a health index (HI) to describe performance features from continuous operational signals is a critical prerequisite for effective RUL prediction using data-driven methods. Deep learning-based methods, such as Recurrent Neural Networks (RNNs) and their variants, are increasingly being utilized to extract identifiable degradation features through their specialized cyclic memory structures, establishing themselves as a prominent area of research [].

Current RUL prediction is generally divided into three categories: physical model-based methods, data-driven techniques, and hybrid strategies []. Physical model-based methods are established according to component damage mechanisms and deterioration laws of specific failure modes, with prominent examples including Fatigue Crack Growth (FCG) [] and Fatigue Spall Progression Life (FSPL) models []. These approaches describe structural degradation evolution through physical mechanisms but typically need substantial prior knowledge, making accurate degradation estimation difficult in complex conditions [,,]. In contrast, data-driven methods construct models based on sensor data without depending on particular degradation patterns, using extensive historical data for empirical learning []. With the progress of machine learning, data-driven approaches are being applied more frequently in industrial applications to learn empirical patterns from historical data.

The advent of machine learning technologies has significantly influenced RUL prediction development. Bearing sensor monitoring data are time series data; therefore, the problem of bearing degradation trend prediction is essentially a time series regression problem []. Consequently, the ability of the constructed model to learn effective temporal information is crucial to the final prediction result. Traditional machine learning-based prediction methods usually require a feature extraction process before prediction: a set of features is first extracted from condition monitoring data and then input into a machine learning model to perform the RUL prediction task. These methods do not consider the correlations between time series signals that reflect the changes in the health state of mechanical equipment. Additionally, they typically rely on manually extracting features from raw sensor data, estimating health indicators and degradation states, and predicting RUL using failure thresholds. With the advancements in deep learning technologies, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers [], the application of these methods has garnered increasing attention. Deep learning techniques can analyze high-dimensional data and automatically extract features. They have achieved remarkable results across various fields, primarily due to their robust capability to map the relationship between degradation paths and measurement data and their ability to automatically learn degradation features, thus eliminating the need for manual feature extraction and expert knowledge of mechanical systems. Among these methods, the RNN and Long Short-Term Memory (LSTM) models have been particularly prominent in RUL prediction tasks, effectively utilizing temporal information []. Additionally, both BiLSTM and LSTM models demonstrate strong performance in time domain health monitoring applications [,].

However, as the service life of mechanical equipment continues to extend, long-term degradation behavior prediction becomes increasingly essential, and the shortcomings of RNN-based prediction frameworks are gradually exposed, mainly in the following aspects: (1) RNNs’ inability to process time series in parallel, necessitating strict chronological order; (2) difficulties in memorizing long-term historical data, leading to error accumulation in predictions; and (3) increased computational complexity due to the intricate gating structures of RNN variants like LSTM and Gated Recurrent Units (GRUs) []. Therefore, how to process long time series efficiently and accurately has become an urgent problem that needs to be solved.

Furthermore, those existing models have not fully succeeded in effectively learning both local correlation features and global features, which is the key to RUL prediction. Recently, Transformer-based models have successfully learned global features for different types of data, including time series []. Their attention mechanisms enhance training speed, support parallel computing, and improve accuracy compared to RNNs. The unique output mechanism of Transformer-based models can greatly reduce the error accumulation in the prediction process. In response to the unique challenges in time series modeling tasks, many variants of the Transformer model have been developed and successfully applied to a variety of time series tasks, including but not limited to prediction, anomaly identification, and classification problems. However, directly applying these models to multivariate time series data for bearing vibration prediction may not fully utilize the inherent characteristics of the data, such as temporal dynamics and the relationship between different dimensions. Such direct application often fails to capture the overall feature distribution of the time series, which limits the prediction effect.

In contrast, CNN-based architectures demonstrate superior local pattern extraction capabilities through their hierarchical filter structures []. Temporal Convolutional Networks (TCNs) [] further enhance this advantage by incorporating dilated convolutional layers, preserving CNN’s inherent local feature extraction while effectively capturing long-range temporal dependencies in sequential data []. This complementary functionality motivates the integration of Transformer architectures with TCNs for enhanced time series representation learning. While existing studies have employed serial arrangements of these models for temporal data processing [], such sequential architectures often neglect the intrinsic interplay between local and global features in vibration signals. A more effective approach requires independent learning of hierarchical features, followed by deliberate fusion through optimized integration mechanisms.

The main contributions of this study are summarized as follows:

A method for constructing a health index based on vibration signal (HIVS) is developed to describe the performance features of rolling bearings. The bearing vibration signals can be decomposed into eight wave packets using wavelet transformation, resulting in an initial feature set comprising 32 feature indexes that capture the signal characteristics. These indexes are derived from the original vibration signals across the time domain, frequency domain, and time–frequency domain. Subsequently, irrelevant and redundant features are filtered out, retaining eight key sensitive feature indexes. Finally, using the principal component analysis (PCA) method, these sensitive feature indexes are reduced from high-dimensional space to one dimension and then normalized, thereby constructing a HIVS that can significantly indicate the performance status of the bearings.

A new RUL prediction model, named TCN–Transformer, has been developed to efficiently learn and integrate both local and global features from bearing vibration signals, thereby addressing the challenges associated with long time series predictions in RUL estimation. The model utilizes TCN to integrate signals from different frequency domains, while leveraging the Transformer’s capabilities to process time domain signals, effectively handling the complexities of bearing life evolution.

The subsequent sections of this paper are structured as follows: In Section 2, the methods of feature extraction and HI construction for rolling bearings using vibration signals are proposed, including feature extraction, sensitive feature index selection, dimensionality reduction, and normalization. In Section 3, the RUL prediction model, TCN–Transformer, is proposed, and the RUL process for rolling bearings is introduced. In Section 4, both the HI construction and RUL prediction methods are tested and verified on the IEEE PHM 2012 Data Challenge dataset for RUL prediction of rolling bearings. Finally, conclusions are summarized in Section 5.

2. Feature Extraction and Health Index Construction Method

Rolling bearings generate complex vibration signals during operation, which contain a large amount of information about their health status. Through effective feature extraction, key indexes describing the bearing degradation state, such as vibration amplitude, frequency distribution, etc., can be identified from these signals. These indexes can intuitively reflect the working status of the bearing and are important indicators for evaluating bearing performance.

This work first extracts 32 feature indexes in the time domain, frequency domain, and time–frequency domain from the original vibration signals to construct an initial feature set. Then, based on the four feature evaluation indices of monotonicity, correlation, predictability, and robustness, irrelevant and redundant features are filtered out from the original feature set, and eight key sensitive feature indexes that can accurately reflect the performance of rolling bearings are retained. Finally, these sensitive feature indexes are reduced from high-dimensional space to one dimension and then normalized, thereby constructing a HI that can significantly indicate the performance status of the bearings.

2.1. Feature Extraction Method

In our previous work [], it was demonstrated that vibration signals of rolling bearings can be decomposed into multiple characteristic waveforms in the frequency domain through empirical functional decomposition.

As bearing performance degrades over time, both frequency domain and time domain characteristics exhibit temporal variations. Considering that the four primary failure modes of bearings, fatigue pitting, plastic deformation, wear, and cage damage, are mutually independent, selecting several dominant features to characterize this model is of significant importance for reducing model parameters.

(1) Time domain feature extraction

Usually, feature extraction methods include three categories: time domain analysis, frequency domain analysis, and time–frequency domain analysis. Time domain features include basic statistics (such as mean, variance, peak, etc.) and advanced statistical indicators (such as skewness, kurtosis, etc.). These features can describe the distribution characteristics and change patterns of vibration signals from different perspectives. However, the reliability of many parameters will decrease as the fault progresses, after reaching a certain level. By extracting and analyzing these time domain features, a bearing HI can be constructed, thereby realizing early identification and prediction of bearing performance degradation. In this work, 10 dimensional and 9 dimensionless time domain features are extracted from the original vibration signals of rolling bearings, as listed in Table 1.

Table 1. Dimensional and dimensionless time domain feature indexes used in this work.

(2) Frequency domain feature extraction

Frequency domain feature extraction of rolling bearings is one of the key techniques for understanding and analyzing the health state. By converting the vibration signal from the time domain to the frequency domain, the main frequency components and their amplitudes in the signal can be identified, which is extremely important for detecting specific fault types of bearings. Frequency domain analysis is usually implemented with the help of Fourier transform, which can decompose time series data into a series of frequency components, thereby revealing the characteristic frequencies and harmonics of the working bearings. Frequency domain analysis has obvious advantages over time domain analysis in identifying complex fault modes. It can more accurately distinguish and locate early signs of bearing failure, especially small changes in the background of noise. In this work, five frequency domain features are extracted from the original vibration signals of rolling bearings, as listed in Table 2.

Table 2. Frequency domain feature indexes used in this work.

(3) Time–frequency domain feature extraction

When the performance of rolling bearings degrades, the energy of each node after wavelet packet decomposition will also change accordingly. Therefore, the wavelet packet energy of the vibration signal after wavelet packet decomposition can be used to select certain specific sub-band energies as characteristic indexes to characterize the degradation of rolling bearings. According to the literature [], the Haar wavelet is used as the basis function to perform a three-layer wavelet packet decomposition on the vibration signals of the rolling bearing in this work. After decomposition, eight sub-bands are obtained, and the energy ratios of the eight sub-bands are used as the time–frequency feature indexes (S25–S32). The energy of the wavelet packet sub-band is defined as follows:

E_{j}^{i}(t) = \sum_{n=1}^{N} [x_{j,n}^{i}(t)]^{2}    (1)

where E_{j}^{i}(t) is the wavelet packet energy, and N is the length of the node signal x_{j}^{i}(t) after wavelet packet decomposition.

The wavelet packet energy ratio reflects the performance degradation of the rolling bearing by calculating the energy ratio of the wavelet packet reconstructed signal at different time scales. The energy ratio p_{j}^{i} of each sub-band after wavelet packet decomposition is defined as follows:

p_{j}^{i} = \frac{E_{j}^{i}(t)}{\sum_{j} E_{j}^{i}(t)}    (2)
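As an illustration of Equations (1) and (2), the following minimal sketch computes the eight sub-band energy ratios of a three-layer Haar wavelet packet decomposition. It is written in plain Python for self-containment and is not the authors' implementation; the function names and the test signal are hypothetical, and the signal length is assumed to be divisible by 8.

```python
import math

def haar_split(x):
    """One Haar analysis step: return (approximation, detail) at half length."""
    s = math.sqrt(2.0)
    approx = [(x[k] + x[k + 1]) / s for k in range(0, len(x) - 1, 2)]
    detail = [(x[k] - x[k + 1]) / s for k in range(0, len(x) - 1, 2)]
    return approx, detail

def wavelet_packet_nodes(x, levels=3):
    """Full wavelet packet tree: every node is split again at each level."""
    nodes = [list(x)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            next_nodes.extend(haar_split(node))
        nodes = next_nodes
    return nodes  # 2**levels terminal nodes, in natural (not frequency) order

def energy_ratios(x, levels=3):
    """Eq. (1): node energy is the sum of squared coefficients.
    Eq. (2): each node's share of the total energy."""
    energies = [sum(c * c for c in node) for node in wavelet_packet_nodes(x, levels)]
    total = sum(energies)
    return [e / total for e in energies]

# Hypothetical test signal: two sinusoids, 256 samples.
signal = [math.sin(2 * math.pi * 5 * n / 256) + 0.5 * math.sin(2 * math.pi * 60 * n / 256)
          for n in range(256)]
ratios = energy_ratios(signal)
```

Because the Haar transform is orthonormal, the eight ratios sum to one, which is a convenient sanity check when swapping in a library implementation.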

2.2. Constructing the Sensitive Feature Set for Rolling Bearings

The goal of feature selection is to find a feature subset that is effective for performance assessment, so that prediction performance is maintained at a good level while the feature dimension is reduced. The purpose of constructing a sensitive feature set is to select, from a large number of original or derived features, those that are most sensitive to changes in the bearing state and best reflect its health status. This not only significantly reduces the dimensionality of the data and the computational cost of model training but also improves the robustness and interpretability of prediction models. By eliminating redundant and irrelevant features, feature selection helps focus on the most representative indexes, thereby providing a more accurate assessment of bearing performance. This work defines four evaluation indices for effective feature selection, namely the monotonicity index, correlation index, predictive index, and robustness index, and formulates the selection of optimal features as a combinatorial optimization problem to construct a sensitive feature set for rolling bearings.

Monotonicity refers to the unidirectional change tendency of features as performance degrades. This work uses Spearman’s rank correlation coefficient as the monotonicity index [], and the formula is as follows:

\mathrm{Mon}(Y) = \left| 1 - \frac{6 \sum_{i=1}^{N} d_{i}^{2}}{N (N^{2} - 1)} \right|    (3)

where d_{i} is the difference between the ranks of the two variables for the i-th observation, and N is the total number of monitoring times during the entire performance degradation process.
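A minimal sketch of Equation (3) follows. It assumes that the two ranked variables are the feature value and its monitoring time, and that there are no tied values (the closed-form Spearman expression is exact only in that case); the helper names are hypothetical.

```python
def ranks(values):
    """Rank values from 1 to N (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def monotonicity(y):
    """Mon(Y) of Eq. (3): absolute Spearman correlation between a feature
    sequence and its (already increasing) monitoring-time index."""
    n = len(y)
    feature_ranks = ranks(y)
    time_ranks = range(1, n + 1)
    d2 = sum((a - b) ** 2 for a, b in zip(feature_ranks, time_ranks))
    return abs(1 - 6 * d2 / (n * (n * n - 1)))
```

A strictly increasing or strictly decreasing feature scores 1, while a feature that fluctuates without a trend scores close to 0.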

Temporal correlation emphasizes the dependence of features on time series. This work uses the Pearson correlation coefficient to describe the correlation between features and time series. The formula is as follows:

\mathrm{Corr}(Y, T) = \frac{\left| \sum_{i=1}^{N} (y_{i} - \bar{y}) (t_{i} - \bar{t}) \right|}{\sqrt{\sum_{i=1}^{N} (y_{i} - \bar{y})^{2} \sum_{i=1}^{N} (t_{i} - \bar{t})^{2}}}    (4)

where Y = [y_{1}, y_{2}, \dots, y_{N}] is the performance feature sequence, T = [t_{1}, t_{2}, \dots, t_{N}] is the monitoring time sequence with t_{i} the i-th monitoring moment, \bar{t} and \bar{y} are their means, and N is the total number of monitoring times in the entire performance process.
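The correlation index can be sketched as the absolute Pearson coefficient between a feature sequence and its monitoring times. The function name is hypothetical, and the sketch assumes that neither sequence is constant (otherwise the denominator is zero).

```python
import math

def correlation_index(y, t):
    """Absolute Pearson correlation between feature values y and times t."""
    n = len(y)
    y_mean = sum(y) / n
    t_mean = sum(t) / n
    cov = sum((a - y_mean) * (b - t_mean) for a, b in zip(y, t))
    var_y = sum((a - y_mean) ** 2 for a in y)
    var_t = sum((b - t_mean) ** 2 for b in t)
    return abs(cov) / math.sqrt(var_y * var_t)
```

Unlike the rank-based monotonicity index, this measure rewards features that change linearly with time, so the two indices are complementary rather than redundant.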

Predictability means that the features can provide enough information to predict the future state. The predictive index is defined as follows:

\mathrm{Pre}(Y) = \exp \left( - \frac{\sigma(y_{f})}{|\bar{y}_{f} - \bar{y}_{s}|} \right)    (5)

where y is the performance characteristic, \bar{y}_{s} is the mean at the initial moment, \bar{y}_{f} is the mean at the failure moment, and \sigma(y_{f}) is the standard deviation at the failure moment.
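The text does not state how the values "at the initial moment" and "at the failure moment" are sampled, so the sketch below makes an explicit assumption: the first and last `window` points of the sequence stand in for the two moments, and the population standard deviation is used. Both choices, and the function name, are illustrative only.

```python
import math
from statistics import mean, pstdev

def predictability(y, window=10):
    """Pre(Y) of Eq. (5), with the initial and failure 'moments' approximated
    by the first and last `window` samples (an assumption of this sketch)."""
    start, final = y[:window], y[-window:]
    gap = abs(mean(final) - mean(start))
    if gap == 0:
        # No separation between start and end states: treat as unpredictable.
        return 0.0
    return math.exp(-pstdev(final) / gap)
```

The index approaches 1 when the failure-stage values are tightly clustered and far from the initial level, and decays toward 0 as the scatter at failure grows relative to that separation.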

Robustness refers to the ability of a feature set to maintain its predictive performance in the face of various disturbances and noise. The robustness index is adopted as follows:

\mathrm{Rob}(Y) = \frac{1}{N} \sum_{i=1}^{N} \exp \left( - \left| \frac{y_{i} - \tilde{y}_{i}}{y_{i}} \right| \right)    (6)

where Y = [y_{1}, y_{2}, \dots, y_{N}] is the performance feature sequence, N is the total number of monitoring times in the entire performance process, and \tilde{Y} = [\tilde{y}_{1}, \tilde{y}_{2}, \dots, \tilde{y}_{N}] is the trend sequence of the corresponding performance feature.
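Equation (6) can be sketched as below. How the trend sequence is obtained is not specified in the text, so the function simply takes it as an argument; the sketch also assumes every y_i is nonzero, since the relative residual is undefined otherwise.

```python
import math

def robustness(y, trend):
    """Rob(Y) of Eq. (6): mean of exp(-|relative residual|) between a feature
    sequence and its trend sequence. Assumes all y values are nonzero."""
    if len(y) != len(trend):
        raise ValueError("feature and trend sequences must have equal length")
    return sum(math.exp(-abs((a - b) / a)) for a, b in zip(y, trend)) / len(y)
```

A feature that coincides with its trend scores exactly 1, and larger fluctuations around the trend pull the score toward 0.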

Relying only on a single evaluation indicator to evaluate the features is often incomplete for performance degradation assessment. To fully utilize multiple evaluation indices, and assuming that the discrete features can be approximated linearly, a weighted linear combination model integrating multiple evaluation indices is constructed to determine performance degradation features, in view of the important role of monotonicity, correlation, predictability, and robustness in the performance prediction of rolling bearings. A comprehensive index is introduced and defined as follows:

J = w_{1} \mathrm{Mon}(Y) + w_{2} \mathrm{Corr}(Y, T) + w_{3} \mathrm{Pre}(Y) + w_{4} \mathrm{Rob}(Y), \quad w_{i} > 0, \quad \sum_{i=1}^{4} w_{i} = 1    (7)

For the characterization and prediction of the performance of rolling bearings, the key is to select features that reflect the overall degradation trend because the degradation of rolling bearings is monotonic and irreversible. Therefore, in the comprehensive index, the importance of monotonicity should be fully considered, and its weight is relatively high. Performance degradation is a continuous process, and its changes show certain regularity in time series. Considering temporal correlation can help understand the dynamic degradation process and extract features that can reflect the degradation trend, thereby improving the time sensitivity and accuracy of the prediction model. However, in experiments, it was found that the extracted features usually have higher robustness, resulting in a decrease in robustness discrimination, so the weight is lower. By referring to the parameter settings and experimental results in the literature, the weights w_{1}, w_{2}, w_{3}, and w_{4} are set to 0.4, 0.4, 0.1, and 0.1, respectively [].
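The weighted combination of Equation (7) with the stated weights can be sketched as follows. The text frames feature selection as a combinatorial optimization problem without giving the procedure, so ranking features by J and keeping the top k is a simplifying assumption of this sketch; the function names and the score dictionary layout are hypothetical.

```python
WEIGHTS = (0.4, 0.4, 0.1, 0.1)  # w1..w4 as given in the text

def comprehensive_index(mon, corr, pre, rob, weights=WEIGHTS):
    """J of Eq. (7): weighted sum of the four evaluation indices."""
    if any(w <= 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to 1")
    return sum(w * s for w, s in zip(weights, (mon, corr, pre, rob)))

def select_sensitive(scores, k=8):
    """scores maps a feature name to its (Mon, Corr, Pre, Rob) tuple.
    Returns the k feature names with the highest J (a top-k simplification)."""
    ranked = sorted(scores, key=lambda name: comprehensive_index(*scores[name]),
                    reverse=True)
    return ranked[:k]
```

With these weights, monotonicity and temporal correlation together account for 80% of J, so a feature with weak trend behaviour cannot be rescued by high robustness alone.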

2.3. Dimensionality Reduction Method for Sensitive Feature Index

PCA, as a widely adopted statistical technique, can be used for data dimensionality reduction and feature extraction []. It converts multiple variables that may be correlated into a series of linearly independent variables through orthogonal transformation. These new variables are called principal components. In the application of dimensionality reduction in sensitive feature indexes, PCA shows significant advantages. By using PCA to reduce dimensionality, the complexity of the model can be effectively reduced, the computing efficiency can be improved, while overfitting can be avoided, and the generalization ability of the model can be enhanced.

The dimensionality reduction process of sensitive feature indexes based on PCA is shown in Figure 1, and the details are as follows:

Figure 1. Flowchart of dimensionality reduction in sensitive feature indexes based on principal component analysis method.

(1) Feature extraction of vibration signals. There are 32 feature indexes used here, namely 10 dimensional time domain indexes (S1–S10), 9 dimensionless time domain indexes (S11–S19), 5 frequency domain indexes (S20–S24), and 8 time–frequency domain indexes (S25–S32) describing the energy ratios of the sub-bands. These 32 feature indexes are extracted from the original vibration signal to form the original feature set.
(2) Sensitive feature index selection based on evaluation indices. Based on the comprehensive index that takes into account the monotonicity, correlation, predictability, and robustness of the features, eight sensitive feature indexes of rolling bearings are selected.
(3) Dimensionality reduction in sensitive feature indexes. The selected sensitive degradation features are input into the PCA algorithm, and the first principal component is extracted as the rolling bearing performance degradation feature index after dimensionality reduction.
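The final step, projecting the selected features onto the first principal component, can be sketched without external libraries by running power iteration on the sample covariance matrix. This is an illustrative stand-in for a library PCA routine rather than the authors' code; whether the features are standardized beforehand is not stated in the text, so the sketch only centres them, and the function name is hypothetical.

```python
import math

def first_principal_component(rows, iters=200):
    """rows: list of samples, each a list of d feature values.
    Returns the projection of every centred sample onto the dominant
    eigenvector of the sample covariance matrix (sign is arbitrary)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration; the uneven start vector avoids the most obvious
    # cases of starting orthogonal to the dominant eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:  # zero variance everywhere: nothing to project onto
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]

# Hypothetical usage: two perfectly correlated features over ten time steps.
scores = first_principal_component([[float(t), 2.0 * t] for t in range(10)])
```

In this toy example the two features collapse onto a single axis, so the returned scores are simply a rescaled copy of the time index.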

2.4. Constructing the Health Index for Rolling Bearing

After selection, the eight feature indexes are spliced column-wise into a 2D array, in which each column represents a feature vector. This 2D array is input into the PCA algorithm, and the first principal component is extracted as the performance degradation feature index of the rolling bearing after dimensionality reduction. Then, outliers are removed from the obtained performance feature index, which is then normalized by the min–max normalization approach to obtain the health index. Finally, the health index is smoothed by the moving average method to obtain the final performance degradation trend of the rolling bearing. The normalization formula is as follows:

x_{nor} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}    (8)

where x_{nor} is the normalized value, and x_{\max} and x_{\min} are the maximum and minimum values of the sequence, respectively. The moving average formula is as follows:

y_{t} = \frac{\sum_{i=1}^{n} (x_{t-i} + x_{t+i}) + x_{t}}{2n + 1}    (9)

where y_{t} is the smoothed value obtained at time t, x_{t} is the value of the original sequence at time t, and n is the half-width of the sliding window, so that each average spans 2n + 1 points.
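Equations (8) and (9) translate directly into a short sketch. The treatment of the first and last n points, where a full window does not fit, is not described in the text; shrinking the window symmetrically at the edges is an assumption made here, and the function names are hypothetical.

```python
def min_max_normalize(x):
    """Eq. (8): rescale a sequence to [0, 1]. Assumes max(x) > min(x)."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]

def moving_average(x, n):
    """Eq. (9): centred moving average over 2n + 1 points.
    Near the ends the window shrinks symmetrically (sketch assumption)."""
    out = []
    for t in range(len(x)):
        k = min(n, t, len(x) - 1 - t)  # largest half-width that fits at t
        window = x[t - k:t + k + 1]
        out.append(sum(window) / (2 * k + 1))
    return out
```

For example, `moving_average([0, 3, 0, 3, 0], 1)` averages each interior point with its two neighbours and leaves the two end points unchanged.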

3. TCN–Transformer Networks and RUL Prediction

As a structural innovation based on CNNs, the TCN achieves parallel processing of long sequences by using a unified filter in each layer. Compared with traditional recurrent network models, such as LSTM and GRU, the TCN is more concise and clearer in structural design, while also improving accuracy []. The TCN can effectively adjust the size of the receptive field by stacking more causal convolution layers, increasing the dilation factor, and increasing the number of filters, thereby flexibly controlling the memory usage of the model. Faced with the common problem of exploding or vanishing gradients in RNNs, the TCN avoids this problem through its backpropagation path, which differs from the temporal direction of the sequence, and it can handle sequences of different lengths. In addition, the TCN can significantly shorten the training cycle due to its low memory requirements when processing long sequences.

Given the Transformer’s limitations in effectively utilizing information across multi-parameter temporal domains, this study adopts two key strategies: (1) manual selection of critical feature parameters, and (2) processing of cross-feature temporal variations through a dedicated TCN module. In this architecture, the Transformer’s role is specifically focused on applying attention mechanisms to perform RUL (remaining useful life) prediction using these processed features.

(1) Causal convolution

The TCN needs to meet two major requirements: (1) the network output length must be consistent with the input length, which is achieved by using a one-dimensional fully convolutional network (FCN) and keeping the length of each layer unchanged through zero padding; (2) future inputs must not affect past outputs, which is achieved by using causal convolution to eliminate the interference of future elements, so that the output at any time t depends only on the current and previous input elements.

(2) Dilated convolution

When processing historical data, causal convolution requires more hidden layers as the depth of historical data increases, which requires a deeper network structure or more filters. To solve this problem, dilated convolution technology was introduced []. By adding holes to standard convolution, dilated convolution can effectively expand the size of the receptive field, so that the output data can cover a wider range of information without losing data in the pooling layer.
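To make the two ideas concrete, the following sketch implements a single-channel dilated causal convolution and stacks three layers with dilations 1, 2, and 4. It is a didactic, dependency-free illustration (real TCN layers are multi-channel and trained); the kernel weights and function names are hypothetical.

```python
def causal_dilated_conv(x, weights, dilation=1):
    """y[t] = sum_k weights[k] * x[t - k * dilation].
    Indices before the start of the sequence count as zero (left padding),
    so the output has the same length as the input and never looks ahead."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack with one convolution per dilation value."""
    return 1 + (kernel_size - 1) * sum(dilations)

def stack(x):
    """Three layers with kernel size 2 and dilations 1, 2, 4."""
    h = x
    for d in (1, 2, 4):
        h = causal_dilated_conv(h, [0.5, 0.5], dilation=d)
    return h

x = [float(v) for v in range(1, 9)]
out = stack(x)
```

With kernel size 2, the three layers cover 8 time steps, whereas three undilated layers would cover only 4; this is how dilation widens the receptive field without adding depth.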

(3) Residual module

To solve the performance degradation problem caused by the increase in network depth, the residual module is introduced. In the residual module, the rectified linear unit (ReLU) is used as the activation function, and weight normalization is applied to the weights of the convolution filters. At the same time, to further enhance the generalization ability of the model, a spatial dropout step is added for regularization after each dilated convolution operation; that is, the output of an entire channel is randomly set to zero at each step of the training process. In addition, to deal with the shape mismatch between input and output, an additional 1 × 1 convolution layer is introduced in the TCN to ensure that tensors of the same shape can be transferred between different layers, as shown in Figure 2.

Figure 2. Framework of TCN model.

3.1. Construction of TCN–Transformer Networks

The TCN–Transformer architecture employs a hierarchical parallelization approach to integrate TCN and Transformer components. In the design, the TCN layer extracts local temporal features from the input sequence, while the Transformer module captures global data patterns. These learned representations are subsequently combined using a multi-head feature fusion attention module. The parallel structure concludes by concatenating both branches’ outputs, followed by a fully connected layer that projects them to the target dimension while maintaining the original time series structure. Compared with conventional TCN and Transformer models, the TCN–Transformer introduces two key innovations: (1) a hierarchical parallel architecture that simultaneously utilizes the Transformer block’s local window self-attention and the TCN block’s deep convolutional operations; (2) the incorporation of a specialized multi-head feature fusion attention module for effective branch feature integration. Figure 3 illustrates the framework of the TCN–Transformer network.

Figure 3. Framework of proposed TCN–Transformer networks.

(1) Hierarchical parallel design

As shown in Figure 3, the TCN–Transformer network comprises parallel computation flows for the Temporal Convolutional Network (TCN) and Transformer (Trans) modules. Inside the TCN flow, there are two dilated causal convolution layers with weight normalization, which constitute a hidden layer of the TCN model. The input of the TCN layer is either the preprocessed rolling bearing vibration signal data or the fused local and global features from the previous layer, and the output of the TCN layer is the local features of the bearing signal. The Transformer module has a multi-head self-attention mechanism and a feedforward neural network, both equipped with layer normalization. Similarly, the input to this module is either the preprocessed rolling bearing vibration signal data or the fused local and global features from the prior layer. The output of the Transformer module captures the global features of the bearing’s full-life vibration signal [].

(2): Multi-head feature fusion attention module

The multi-head feature fusion attention module contains two attention mechanisms, which aim to establish an interaction between the two parallel branches to fuse local and global features. As shown on the right side of Figure 3, the output of the TCN layer

{\tilde{Y}}_{i} \in \mathbb{R}^{t \times c}

and the output of the Transformer module

{\tilde{Z}}_{i} \in \mathbb{R}^{t \times c}

interact in the multi-head feature fusion attention module to bidirectionally fuse local features

{\tilde{Y}}_{i}

and global features

{\tilde{Z}}_{i}

. Specifically, the output value of the TCN layer is updated by residual connection with the multi-head feature fusion attention module to obtain the new TCN layer output value

Y_{i + 1}

, as described below:

Y_{i + 1} = {\tilde{Y}}_{i} + A_{{\tilde{Z}}_{i} \to {\tilde{Y}}_{i}} \cdot Z_{i} W_{e v}

(10)

where

W_{e v}

is the learnable parameter of embedding layer, and

A_{{\tilde{Z}}_{i} \to {\tilde{Y}}_{i}}

is the fusion matrix from Transformer to TCN, which can be calculated by matrix multiplication and Softmax function:

A_{{\tilde{Z}}_{i} \to {\tilde{Y}}_{i}} = \mathrm{softmax}\left(\frac{{\tilde{Y}}_{i} W_{e q} \cdot {({\tilde{Z}}_{i} W_{e k})}^{T}}{\sqrt{c}}\right)

(11)

where

W_{e q}

and

W_{e k}

are the learnable parameters of the two linear layers. Similarly, the output value of the updated Transformer module is defined as follows:

Z_{i + 1} = {\tilde{Z}}_{i} + A_{{\tilde{Y}}_{i} \to {\tilde{Z}}_{i}} \cdot Y_{i} W_{d v}

(12)

A_{{\tilde{Y}}_{i} \to {\tilde{Z}}_{i}} = \mathrm{softmax}\left(\frac{{\tilde{Z}}_{i} W_{d q} \cdot {({\tilde{Y}}_{i} W_{d k})}^{T}}{\sqrt{c}}\right)

(13)

where

W_{d v}

,

W_{d q}

, and

W_{d k}

are the learnable parameters of the three linear layers.

A_{{\tilde{Y}}_{i} \to {\tilde{Z}}_{i}}

is the fusion matrix from TCN to Transformer.
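The bidirectional fusion in Eqs. (10)–(13) can be sketched in NumPy as follows. This is a single-head illustration, not the authors’ implementation: the function names, the dictionary of weight matrices, and the random inputs are hypothetical, and the value terms are assumed to use the same branch outputs that supply the keys.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse(Y, Z, W):
    """Single-head feature fusion of TCN output Y and Transformer output Z.

    Y, Z: (t, c) arrays; W: dict of (c, c) weight matrices (hypothetical names).
    """
    c = Y.shape[1]
    # Eq. (11): queries from the TCN branch, keys from the Transformer branch.
    A_zy = softmax((Y @ W["eq"]) @ (Z @ W["ek"]).T / np.sqrt(c))
    # Eq. (13): queries from the Transformer branch, keys from the TCN branch.
    A_yz = softmax((Z @ W["dq"]) @ (Y @ W["dk"]).T / np.sqrt(c))
    # Eqs. (10) and (12): residual updates of each branch.
    Y_next = Y + A_zy @ (Z @ W["ev"])
    Z_next = Z + A_yz @ (Y @ W["dv"])
    return Y_next, Z_next, A_zy, A_yz


rng = np.random.default_rng(0)
t, c = 12, 8
Y = rng.normal(size=(t, c))
Z = rng.normal(size=(t, c))
W = {name: rng.normal(size=(c, c)) / np.sqrt(c)
     for name in ("eq", "ek", "ev", "dq", "dk", "dv")}
Y_next, Z_next, A_zy, A_yz = fuse(Y, Z, W)
```

Each row of a fusion matrix sums to one, so every time step of one branch receives a convex combination of the projected features of the other branch before the residual addition. A multi-head variant would split the c channels into several heads and apply the same computation per head.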

3.2. RUL Prediction Based on TCN–Transformer

This section describes in detail the RUL prediction process for rolling bearings based on the TCN–Transformer networks. The flowchart is shown in Figure 4. The specific steps are as follows:

Figure 4. The remaining useful life prediction process of rolling bearings based on the TCN–Transformer networks.

(1): Data input. The original vibration signal data of the rolling bearing is processed. According to the proposed method, features are extracted from the original vibration signal in the time domain, frequency domain, and time–frequency domain. Subsequently, sensitive features are selected to construct a feature set. The selected sensitive degradation feature data is input into the model to train it for remaining life prediction.
(2): Dataset division. Referring to the most commonly used dataset division method, the dataset is divided into training set, validation set, and test set in a ratio of 7:1:2.
(3): Model training. The training set data is input into the constructed TCN–Transformer networks, which are trained through forward propagation, backpropagation, and parameter optimization. The TCN–Transformer network with the optimal parameters is thereby obtained.
(4): Model prediction. Input the test set data into the optimal model trained in the third step and finally output the RUL prediction result of the rolling bearing.
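The 7:1:2 division in step (2) can be sketched as below. The text does not state whether the split is random or chronological; this sketch assumes a seeded random permutation of sample indices, and the function name `split_indices` and the sample count are hypothetical.

```python
import numpy as np


def split_indices(n, ratios=(7, 1, 2), seed=0):
    """Split n sample indices into train/validation/test sets by integer ratios."""
    total = sum(ratios)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)          # assumption: random rather than chronological
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
```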

4. Results and Discussion

4.1. Verification of Feature Extraction and Health Index Construction

The IEEE PHM 2012 Data Challenge published a rolling bearing full-lifecycle dataset, which was collected on the PRONOSTIA experimental system designed to test bearing fault detection, diagnosis, and RUL prediction methods []. The main goal of PRONOSTIA is to provide experimental data that describe the degradation process of rolling bearings throughout their service life. The IEEE PHM 2012 dataset provides accelerated degradation test data for a total of 17 bearings, namely, 7 bearings for condition 1 (4000 N, 1800 rpm), 7 bearings for condition 2 (4200 N, 1650 rpm), and 3 bearings for condition 3 (5000 N, 1500 rpm). In this work, all 17 bearings were used as a benchmark to test our prediction method, but only the results for the seven bearings under condition 1, marked as Bearing 1-1, 1-2, 1-3, 1-4, 1-5, 1-6, and 1-7, were selected for further investigation. When vibration signals are used to track the degradation state of rolling bearings, the horizontal vibration signal usually carries more degradation information than the vertical vibration signal []. Therefore, this work uses only the horizontal experimental data.

The 32 feature indexes introduced above are extracted to form the original dataset for the seven bearings, and sensitive indexes are subsequently selected from them. Considering that the failure vibration signals of the bearings can be represented by eight wave packets, the assessment of bearing failure utilizes all 32 indexes, which together are comprehensive in nature. For the dataset of each bearing, the monotonicity index, correlation index, predictive index, robustness index, and comprehensive index are calculated. The statistical results of the feature indexes are shown in Figures 5–9. Only the data of bearings 1-1, 1-2, and 1-3 are shown in the charts. In particular, Figure 9 shows the comprehensive index statistics of the three bearings and also gives the average values of the comprehensive index of the seven bearings under working condition 1.

Figure 5. A statistical chart of the monotonicity index of the 32 feature indexes.

Figure 6. A statistical chart of the correlation index of the 32 feature indexes.

Figure 7. A statistical chart of the predictive index of the 32 feature indexes.

Figure 8. A statistical chart of the robustness index of the 32 feature indexes.

Figure 9. A statistical chart of the comprehensive indexes of the 32 feature indexes.

The feature indexes are sorted according to the average value of the comprehensive index, as shown in Figure 10. Only the eight feature indexes with the highest average comprehensive index are selected: the energy ratio of the third frequency sub-band (S27), the energy ratio of the seventh frequency sub-band (S31), the energy ratio of the fourth frequency sub-band (S28), the energy ratio of the eighth frequency sub-band (S32), the kurtosis index (S14), the energy ratio of the second frequency sub-band (S26), the center of gravity frequency (S20), and the minimum value (S3). For further illustration, the degradation trends of the eight selected sensitive feature indexes over time are shown in Figure 11.

Figure 10. Ranking of the values of the comprehensive indexes of the 32 feature indexes.

Figure 11. The selected sensitive feature indexes for Bearing 1-1: (a) the energy ratio of the third frequency sub-band (S27), (b) the energy ratio of the seventh frequency sub-band (S31), (c) the energy ratio of the fourth frequency sub-band (S28), (d) the energy ratio of the eighth frequency sub-band (S32), (e) the kurtosis index (S14), (f) the energy ratio of the second frequency sub-band (S26), (g) the center of gravity frequency (S20), and (h) the minimum value (S3).

Figure 12 shows the health index trend of the seven bearings obtained using the feature extraction, selection, and PCA dimensionality reduction methods. The health index obtained with the proposed method exhibits good monotonicity and time series correlation. Subsequently, the monotonicity, correlation, predictability, robustness, and comprehensive indexes of the selected feature indexes for the seven datasets are calculated, as given in Table 3. All of these indexes are relatively good, each above 0.8. Meanwhile, the PCA result indicates that the seven main features can account for 91.387% of the variance in the data, which is sufficient explanatory power for industrial problems. In addition, all the comprehensive indexes obtained in this work are higher than the highest original comprehensive index for each feature set. The average comprehensive index of the seven bearings is 0.8995, an improvement of 8.69%.

Figure 12. The health index trend over time of the seven bearings obtained using the proposed method.

Table 3. Calculated evaluation indexes using the eight sensitive feature indexes for the seven rolling bearings.
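The health index construction discussed in this subsection (selected features, PCA dimensionality reduction, normalization) can be illustrated with a minimal NumPy sketch. It is not the authors’ implementation: the synthetic data, the standardization step, the use of the first principal component alone, the min-max scaling, and the function name `pca_health_index` are all assumptions, and outlier removal is omitted.

```python
import numpy as np


def pca_health_index(features):
    """Reduce (n_samples, n_features) data to a 1-D index scaled to [0, 1].

    Returns the index and the explained-variance ratio of each component.
    Assumes every feature column has non-zero variance.
    """
    # Standardize each feature so that scale differences do not dominate.
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    # PCA via singular value decomposition of the standardized data.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # Project onto the first principal component (its sign is arbitrary).
    pc1 = z @ vt[0]
    hi = (pc1 - pc1.min()) / (pc1.max() - pc1.min())
    return hi, explained


# Synthetic stand-in for eight sensitive features sharing one degradation trend.
rng = np.random.default_rng(0)
trend = np.linspace(0.0, 1.0, 200)
features = trend[:, None] * rng.uniform(0.5, 2.0, size=8) \
    + 0.05 * rng.normal(size=(200, 8))
hi, explained = pca_health_index(features)
```

The cumulative sum of `explained` is the quantity that corresponds to a statement such as "the main components account for a given percentage of the variance".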

4.2. Experiment and Verification of the TCN–Transformer Networks

This section uses the IEEE PHM 2012 rolling bearing full life dataset as the verification dataset for the proposed method. The detailed information of the dataset used in this section is shown in Table 4. It shows information such as working conditions and actual life. In order to better verify the effectiveness of the proposed method, this paper refers to the task setting of previous literature [] and carefully sets up six groups of RUL estimation tasks to evaluate the prediction performance of the proposed method. The specific arrangement is shown in Table 5.

Table 4. The details of the dataset used in this section selected from the IEEE PHM 2012 dataset.

Table 5. Remaining useful life prediction task description using IEEE PHM 2012 dataset.

The training of the neural network based on 32 feature indexes was conducted on an Ubuntu 18.04 server equipped with four NVIDIA 2080 Ti 11 GB graphics cards, using PyTorch 1.0 as the deep learning framework.

First, the data are preprocessed. Although deep learning models can inherently extract features directly from raw vibration signals (an approach widely adopted in PHM applications), the high computational demands and compromised prediction accuracy associated with processing full-lifecycle raw data make this approach impractical. To address these challenges in long-sequence RUL prediction, we employ the optimized eight-feature degradation set as the model input, which significantly reduces data dimensionality while preserving degradation signatures and eliminates noise interference from the raw signals. This preprocessing strategy balances computational efficiency with predictive performance.

Next, RUL labeling and normalization are implemented to standardize the prognostic framework. Since operational conditions vary significantly, the absolute RUL values (measured in cycles) naturally differ in magnitude. Direct use of these unprocessed values as training labels would adversely affect both the model’s convergence rate during training and its generalization performance during deployment. To address this, we adopt normalized RUL values, a well-established practice in prognostic modeling []. The normalization computes the relative degradation state as the ratio of the current RUL to the total lifetime, expressed as follows:

\mathrm{RUL}_{t} = T_{\mathrm{total}} - t

(14)

\mathrm{RUL}_{t}^{\mathrm{norm}} = \frac{\mathrm{RUL}_{t}}{T_{\mathrm{total}}}

(15)

where

T_{\mathrm{total}}

is the total lifetime, and the normalized remaining life

\mathrm{RUL}_{t}^{\mathrm{norm}}

is between 0 and 1.
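Eqs. (14) and (15) translate directly into code. The sketch below assumes that time steps are indexed from 0 to the total lifetime inclusive; the function name and the lifetime value are hypothetical.

```python
import numpy as np


def normalized_rul_labels(total_life):
    """Normalized RUL labels for time steps t = 0, 1, ..., total_life."""
    t = np.arange(total_life + 1)
    rul = total_life - t              # Eq. (14)
    return rul / total_life           # Eq. (15)


labels = normalized_rul_labels(2800)  # hypothetical lifetime in time steps
```

The labels decrease linearly from 1 at the start of the test to 0 at failure, regardless of the absolute lifetime of the bearing.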

Effective sample embedding strongly influences the predictive capability of data-driven prognostic models. Using isolated single-time-step data as model input fails to capture the essential temporal dependencies between current and historical degradation states. To address this limitation, we implement a temporal window embedding strategy [] that explicitly models degradation continuity through causal relationships. This method employs a fixed time window W_L to sequentially concatenate multiple time steps of monitoring data and treats each window as an independent input. The embedded sample consists of the MS of the current time step and the MSs of the previous L − 1 time steps, noted as follows:

\mathrm{MS}_{\mathrm{Input}}^{t} = {(\mathrm{MS}_{t - L + 1}, \dots, \mathrm{MS}_{t - 2}, \mathrm{MS}_{t - 1}, \mathrm{MS}_{t})}_{W_{L}}

(16)

where L is the time window length. In order to obtain as many training samples as possible, the moving step of the time window is generally set to 1.
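The window embedding of Eq. (16) can be sketched as a sliding window over the feature sequence. The function name `window_embed` and the toy data are hypothetical; the window ending at time step t would be paired with the RUL label at t.

```python
import numpy as np


def window_embed(ms, L, step=1):
    """Stack sliding windows of length L over an (T, d) sequence.

    Returns an array of shape (n_windows, L, d); window j ends at
    time step j * step + L - 1 and contains the L most recent samples.
    """
    T = ms.shape[0]
    if T < L:
        raise ValueError("sequence is shorter than the window length")
    starts = range(0, T - L + 1, step)
    return np.stack([ms[s:s + L] for s in starts])


ms = np.arange(20, dtype=float).reshape(10, 2)   # 10 time steps, 2 features
windows = window_embed(ms, L=3)                  # moving step of 1
```

With a moving step of 1, a sequence of T time steps yields T − L + 1 samples, which is the largest number of windows obtainable for a given L.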

To compare the proposed model with existing advanced prediction models, three criteria are used to evaluate prediction performance: root mean square error (RMSE), mean absolute error (MAE), and a scoring function (SCORE). The first two evaluate the fitting ability and prediction accuracy of each prediction model and are calculated as follows:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} ({\hat{y}}_{i} - y_{i})^{2}}

(17)

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} |{\hat{y}}_{i} - y_{i}|

(18)

where

{\hat{y}}_{i}

is the predicted value at time i,

y_{i}

is the true value at time i, and n represents the total number of samples. The last criterion evaluates the rationality of the prediction results; that is, early and late predictions are not judged by the same standard. In an actual service environment, the risk brought by an early prediction (underestimating the RUL) is much smaller than that of a late prediction (overestimating the RUL). A good RUL prediction therefore tends to be conservative, and the scoring function imposes a heavier penalty on late predictions. SCORE is defined as follows []:

\mathrm{SCORE} = \frac{1}{N - 1} \sum_{i = 1}^{N - 1} A_{i}; \quad A_{i} = \left\{\begin{array}{ll} e^{- \ln (0.5) \cdot (Er_{i} / 5)}, & Er_{i} \leq 0 \\ e^{+ \ln (0.5) \cdot (Er_{i} / 20)}, & Er_{i} > 0 \end{array}\right.

(19)

Er_{i} = \frac{y_{i} - {\hat{y}}_{i}}{y_{i}} \times 100\%

(20)

where N is the length of the prediction data, and

Er_{i}

is the percentage error.
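The three criteria in Eqs. (17)–(20) can be sketched in NumPy as follows. The function names and the toy values are hypothetical. The `score` function averages over whatever points it is given, so it assumes that the caller passes only points with a non-zero true RUL (the N − 1 terms of Eq. (19)).

```python
import numpy as np


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))      # Eq. (17)


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_pred - y_true)))              # Eq. (18)


def score(y_true, y_pred):
    er = 100.0 * (y_true - y_pred) / y_true                     # Eq. (20)
    late = np.exp(-np.log(0.5) * er / 5.0)     # branch for Er <= 0 (late prediction)
    early = np.exp(np.log(0.5) * er / 20.0)    # branch for Er > 0 (early prediction)
    return float(np.mean(np.where(er <= 0, late, early)))       # Eq. (19)


y_true = np.array([1.0, 1.0, 1.0])
y_pred = np.array([1.0, 0.8, 1.05])   # exact, 20% early, 5% late
metrics = (rmse(y_true, y_pred), mae(y_true, y_pred), score(y_true, y_pred))
```

In this toy case the 20% early prediction and the 5% late prediction each contribute a value of 0.5 to the score, which illustrates how much more heavily late predictions are penalized.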

In order to further clarify the advantages of TCN–Transformer in RUL prediction, this work makes two comparisons. One is to compare the RUL prediction performance of TCN–Transformer networks, TCN model, and Transformer model, and the other is to compare the proposed model with several baseline models and advanced prediction models. To explore the degree of improvement of the TCN–Transformer networks on each task, the improvement index (IMP) is calculated as follows:

\mathrm{IMP} = \left(1 - \frac{\mathrm{TT}}{\mathrm{CMT}}\right) \times 100\%

(21)

where TT represents the evaluation index value of the TCN–Transformer networks, and CMT represents the evaluation index value of the current optimal model. The hyperparameter settings used during model training are shown in Table 6. In addition, 10 cross-validation experiments are performed, in which 70% of the data is used for training, 10% for validation, and 20% for testing. The results of each task are averaged to reduce the randomness of the prediction results.
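As a worked illustration of Eq. (21), the sketch below computes the IMP for a pair of hypothetical error values; the numbers are invented for illustration and are not results from this work.

```python
def improvement(tt, cmt):
    """Improvement index of Eq. (21), in percent."""
    return (1.0 - tt / cmt) * 100.0


# Hypothetical RMSE values: 0.08 for the proposed model, 0.10 for the best competitor.
imp_rmse = improvement(0.08, 0.10)
```

As written, Eq. (21) is positive when TT is smaller than CMT, which matches error metrics such as RMSE and MAE, for which lower values are better. How the formula is applied to SCORE, for which higher values are better, is not spelled out in the text.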

Table 6. Hyperparameter values of TCN–Transformer networks.

Figure 13 displays the comparative results between the predicted and actual RUL values for the six tasks, including detailed error analysis. The TCN–Transformer demonstrates remarkable tracking capability, with its prediction curve remaining close to the true RUL, thereby capturing the bearing’s degradation characteristics and providing initial validation of the model’s efficacy. Particularly noteworthy is the model’s superior prediction performance over approximately 80% of the lifespan, while most deviations emerge in the final degradation stage. This may originate from the traditional linear degradation assumption in RUL prediction, which is valid during stable operation. However, the actual failure process often exhibits complex nonlinear behavior, especially during accelerated deterioration stages, where degradation rates follow exponential growth patterns. Moreover, during the advanced stages of degradation, degradation rates vary significantly among different bearings, which indicates a discrepancy between the assigned training labels and the actual degradation states. Furthermore, the cumulative effect of these errors amplifies the observed divergence.

Figure 13. RUL prediction results for rolling bearings: (a) Task A; (b) Task B; (c) Task C; (d) Task D; (e) Task E; (f) Task F.

Table 7 shows the comparison between the TCN–Transformer networks and other advanced models. The baseline models compared here are RNN, LSTM, and GRU, and the five advanced prediction models are Dual-LSTM [], LSTM-AON [], BiGRU-GSA [], TCN-RSA [], and TFT []. Compared with the baseline and state-of-the-art models (including the current best-performing TFT model), the TCN–Transformer networks proposed in this work achieve the best results on all evaluation indices. The IMP values of RMSE, MAE, and SCORE are 14.62%, 9.26%, and 13.04%, respectively, which shows that the proposed model is more competitive in RUL prediction. In terms of the rationality of RUL prediction in particular, the SCORE of the TCN–Transformer networks is significantly higher than that of the other models, which is key to evaluating the reliability of RUL prediction. Figure 14 provides a visual comparison between the TCN–Transformer and the other models, which shows more intuitively that the proposed model has the best results on the three evaluation indices. The above comparative analysis further verifies the advantages of the TCN–Transformer.

Table 7. Comparison of results of TCN–Transformer and other advanced models.

Figure 14. Comparison of evaluation indices of TCN–Transformer and other models.

4.3. Ablation Experiment and Results of TCN–Transformer Networks

The TCN–Transformer networks are compared with the TCN model and the Transformer model on the six tasks. Table 8 shows the results of the ablation experiment, and Table 5 presents the composition of the experimental dataset. Compared with the TCN model and the Transformer model, the prediction performance of the TCN–Transformer networks improves on all tasks. The average IMP values of RMSE, MAE, and SCORE reach 39.67%, 38.07%, and 26.63%, respectively, which shows that the proposed TCN–Transformer networks clearly improve the accuracy and rationality of RUL prediction. In addition, the IMP values of RMSE and MAE are significantly higher than that of SCORE, which indicates that the RUL prediction curve of the proposed TCN–Transformer networks is more consistent with the real RUL curve. The comparison shows that the proposed TCN–Transformer network is superior to the original TCN and Transformer models and supports the conclusion that the hierarchical parallel design and the multi-head feature fusion attention module are reasonable and effective.

Table 8. Results of TCN–Transformer ablation experiment.

The main purpose of this paper is to develop an advanced technical framework to optimize the performance monitoring and RUL prediction of rolling bearings, contributing to the health monitoring and maintenance methods of key mechanical components in intelligent manufacturing. First, by extracting and selecting key features, an HI set that can accurately reflect bearing performance degradation is constructed, effectively solving the problem of characterizing the performance degradation state. Second, by combining the TCN model and the Transformer model, TCN–Transformer networks are proposed, which can efficiently learn and integrate local and global features, providing a new solution for RUL prediction. These methods not only demonstrate the great potential of deep learning in complex system monitoring and prediction but also provide support for maintenance decisions in practical applications, especially in the field of preventive maintenance. In addition, these methods have broad application prospects that are not limited to rolling bearings or rotating machinery. They can also be extended to the performance monitoring and life prediction of other key industrial components, thereby bringing a wider impact to the field of intelligent manufacturing.

Building on the innovations of this work, future research will focus on aspects such as the mathematical interpretability of the model, parameter optimization, and improving the performance of individual modules. The frequency of the sensor is also of great significance for studying the model’s performance, which requires a more systematic experimental platform and data analysis.

5. Conclusions

In this work, a method for constructing a HIVS is first developed to describe the performance features of rolling bearings. Then, a new RUL prediction model, TCN–Transformer, is developed to efficiently solve the long time series prediction problem in RUL prediction. The conclusions are summarized as follows:

The method for constructing a HIVS was developed to describe the performance features of rolling bearings. Eight sensitive feature indexes that accurately reflect the performance of rolling bearings were selected from the 32 indexes to construct the feature set; after dimensionality reduction, the resulting index was processed to remove outliers and then normalized to obtain the HI. The average comprehensive index of the bearings improved by 8.69%.
The TCN–Transformer employs a hierarchical parallel architecture combining TCN and Transformer modules, achieving higher computational efficiency and a more compact network scale. Compared with classical standalone TCN or Transformer networks, our approach significantly reduces the required number of channels through feature compression. The outputs from the TCN and Transformer modules interact through a novel multi-head feature fusion attention mechanism, enabling bidirectional integration of local temporal patterns (captured by TCN) and global dependencies (learned by Transformer). This specialized attention module dynamically prioritizes the most discriminative features extracted by both sub-networks, ensuring precise focus on performance-critical characteristics for RUL prediction.
Compared with existing methods, the proposed TCN–Transformer demonstrates superior accuracy in predicting the RUL of rolling bearings across diverse operating conditions. Specifically, in ablation studies, TCN–Transformer outperforms both the standalone TCN and Transformer models, achieving consistent improvements across all evaluation tasks. When compared with state-of-the-art methods, TCN–Transformer reduces RMSE and MAE by 14.62% and 9.26%, respectively, while improving the SCORE metric by 13.04%. These results conclusively validate the superiority of our approach in RUL prediction.

Author Contributions

Conceptualization, X.J., Y.J., and S.F.; Methodology, X.J. and Y.J.; Validation, H.J. and S.F.; Formal Analysis, Y.J., S.L. and K.L.; Investigation, X.J., Y.J., S.L., K.L. and J.X.; Writing—Original Draft, X.J.; Writing—Review and Editing, X.J. and S.F.; Visualization, J.X.; Supervision, H.J. and S.F.; Funding Acquisition, X.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (12102320) and the National Major Science and Technology Project (J2019-IV-0003-0070). The APC was funded by the National Natural Science Foundation of China (12102320).

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Zhao, Z.; Liang, B.; Wang, X.; Lu, W. Remaining useful life prediction of aircraft engine based on degradation pattern learning. Reliab. Eng. Syst. Saf. 2017, 164, 74–83.
AlShorman, O.; Irfan, M.; Saad, N.; Zhen, D.; Haider, N.; Glowacz, A.; AlShorman, A. A review of artificial intelligence methods for condition monitoring and fault diagnosis of rolling element bearings for induction motor. Shock Vib. 2020, 2020, 1–20.
Li, Y.; Wang, S.; Li, N.; Deng, Z. Multiscale symbolic diversity entropy: A novel measurement approach for time-series analysis and its application in fault diagnosis of planetary gearboxes. IEEE Trans. Industr. Inform. 2022, 18, 1121–1131.
Ma, M.; Mao, Z. Deep wavelet sequence-based gated recurrent units for the prognosis of rotating machinery. Struct. Health Monit. 2021, 20, 1794–1804.
Zio, E. Prognostics and Health Management (PHM): Where are we and where do we (need to) go in theory and practice. Reliab. Eng. Syst. Saf. 2022, 218 Pt A, 108119.
Ahang, M.; Jalayer, M.; Shojaeinasab, A.; Ogunfowora, O.; Charter, T.; Najjaran, H. Synthesizing Rolling Bearing Fault Samples in New Conditions: A Framework Based on a Modified CGAN. Sensors 2022, 22, 5413.
Forest, F.; Fink, O. Calibrated Adaptive Teacher for Domain-Adaptive Intelligent Fault Diagnosis. Sensors 2024, 24, 7539.
Lv, K.L.; Jiang, H.N.; Fu, S.N.; Du, T.C.; Jin, X.C.; Fan, X.L. A predictive analytics framework for rolling bearing vibration signal using deep learning and time series techniques. Comput. Electr. Eng. 2024, 117, 109314.
Giraudo, L.; Di Maggio, L.G.; Giorio, L.; Delprete, C. Dynamic Multibody Modeling of Spherical Roller Bearings with Localized Defects for Large-Scale Rotating Machinery. Sensors 2025, 25, 2419.
Liu, X.R.; Yan, C.F.; Ming Lv Wu, L.X. Multi-rolling element faults diagnosis of rolling bearing based on time-frequency analysis and multi-curves extraction. Meas. Sci. Technol. 2024, 35, 106113.
Jiang, L.L.; Shi, C.Z.; Sheng, H.S.; Li, X.J.; Yang, T.G. Lightweight CNN architecture design for rolling bearing fault diagnosis. Meas. Sci. Technol. 2024, 35, 126142.
Wang, T.; Li, X.; Wang, W.; Du, J.; Yang, X. A spatiotemporal feature learning-based RUL estimation method for predictive maintenance. Measurement 2023, 214, 112824.
Guo, J.X.; Zhang, T.Y.; Xue, K.L.; Liu, J.H.; Wu, J.; Zhao, Y.D. Fault diagnosis of rolling bearing based on parameter-adaptive re-constraint VMD optimized by SABO. Meas. Sci. Technol. 2025, 36, 016174.
Kiakojouri, A.; Wang, L. A Generalized Convolutional Neural Network Model Trained on Simulated Data for Fault Diagnosis in a Wide Range of Bearing Designs. Sensors 2025, 25, 2378.
Wang, H.; An, J.; Yang, J.; Xu, S.; Wang, Z.M.; Cao, Y.; Yuan, W.Q. Remaining useful life prediction method of bearings based on the interactive learning strategy. Comput. Electr. Eng. 2025, 121, 109853.
Xu, Z.; Guo, Y.; Saleh, J.H. Accurate remaining useful life prediction with uncertainty quantification: A deep learning and nonstationary gaussian process approach. IEEE Trans. Reliab. 2021, 71, 443–456.
Hai, B.; Jiang, H.K.; Yao, P.; Wang, K.B.; Yao, R.H. Rolling bearing fault feature extraction using non-convex periodic group sparse method. Meas. Sci. Technol. 2021, 32, 105005.
Hu, C.F.; Liu, Z.J.; Xiao, X.W.; Jin, Y.F.; Wang, T.; Zhou, L.H.; Su, L. A degradation evaluation method with the convolutional neural network for the cyclic symmetry rolling bearing. Meas. Sci. Technol. 2025, 36, 016188.
Chang, Y.; Chen, Q.; Chen, J.; He, S.; Li, F.; Zhou, Z. Intelligent fault diagnosis scheme via multi-module supervised-learning network with essential features capture-regulation strategy. ISA Trans. 2022, 129, 459–475.
Zhu, J.; Chen, N.; Shen, C. A new data-driven transferable remaining useful life prediction approach for bearing under different working conditions. Mech. Syst. Signal Process. 2020, 139, 106602.
Shih, Y.S.; Chen, J.J. Analysis of fatigue crack growth on a cracked shaft. Int. J. Fatigue 1997, 19, 477–485.
Choi, Y.; Liu, C.R. Spall progression life model for rolling contact verified by finish hard machined surfaces. Wear 2007, 262, 24–35.
Chen, C.; Liu, Y.; Wang, S.; Sun, X.; Cairano-Gilfedder, C.; Titmus, S.; Syntetos, A.A. Predictive maintenance using cox proportional hazard deep learning. Adv. Eng. Inform. 2020, 44, 101054.
Aremu, O.O.; Hyland-Wood, D.; McAree, P.R. A Relative Entropy Weibull-SAX framework for health indices construction and health stage division in degradation modeling of multivariate time series asset data. Adv. Eng. Inform. 2019, 40, 121–134.
Caravaca, C.F.; Flamant, Q.; Anglada, M.; Gremillard, L.; Chevalier, J. Impact of sandblasting on the mechanical properties and aging resistance of alumina and zirconia based ceramics. J. Eur. Ceram. Soc. 2018, 38, 915–925.
Wu, R.T.; Jahanshahi, M.R. Data fusion approaches for structural health monitoring and system identification: Past, present, and future. Struct. Health Monit. 2020, 19, 552–586.
Ye, R.; Dai, Q. A novel transfer learning framework for time series forecasting. Knowl.-Based Syst. 2018, 156, 74–99.
Dang, D.Z.; Su, B.Y.; Wang, Y.W.; Ao, W.K.; Ni, Y.Q. A pencil lead break-triggered, adversarial autoencoder-based approach for rapid and robust rail damage detection. Eng. Appl. Artif. Intel. 2025, 150, 110637.
Mao, W.; He, J.; Tang, J.; Li, Y. Predicting remaining useful life of rolling bearings based on deep feature representation and long short-term memory neural network. Adv. Mech. Eng. 2018, 10, 1–18.
Wang, Y.L.; Lu, Y.; Tan, Y.K.; Ao, W.K.; Ni, Y.-Q.; Tang, Q.-C. Bayesian optimization bidirectional LSTM approach for the condition assessment of underground-operating trains. J. Civ. Struct. Health Monit. 2025.
Silka, J.; Wieczorek, M.; Wozniak, M. Recurrent neural network model for high-speed train vibration prediction from time series. Neural Comput. Appl. 2022, 34, 13305–13318.
Caterini, A.L.; Chang, D.E. Recurrent neural networks. In Deep Neural Networks in a Mathematical Framework; SpringerBriefs in Computer Science; Springer: Cham, Switzerland, 2018.
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
Andrew, G.; Menglong, Z. Efficient convolutional neural networks for mobile vision applications, mobilenets. arXiv 2017, arXiv:1704.04861.
Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271.
Lin, L.; Xu, B.; Wu, W.; Richardson, T.W.; Bernal, E.A. Medical Time Series Classification with Hierarchical Attention-based Temporal Convolutional Networks: A Case Study of Myotonic Dystrophy Diagnosis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 83–86.
Zheng, X.; Qian, Y.; Wang, S. GRU prediction for performance degradation of rolling bearings based on optimal wavelet packet and Mahalanobis distance. J. Vib. Shock 2020, 39, 9–46+63. (In Chinese)
Moradi, M.; Broer, A.; Chiachío, J.; Benedictus, R.; Loutas, T.H.; Zarouchas, D. Intelligent health indicator construction for prognostics of composite structures utilizing a semi-supervised deep neural network and SHM data. Eng. Appl. Artif. Intel. 2023, 117, 105502.
Wei, X.P. Deep Learning Based Health State Assessment and Remaining Life Prediction of Rolling Bearings. Master’s Thesis, Southwest Jiaotong University, Chengdu, China, 2021. (In Chinese)
Ao, W.K.; Hester, D.; O’Higgins, C.; Brownjohn, J. Tracking long-term modal behaviour of a footbridge and identifying potential SHM approaches. J. Civil Struct. Health Monit. 2025, 14, 1311–1337.
Liu, Y.; Wijewickrema, S.; Li, A.; Bester, C.; O’Leary, S.; Bailey, J. Time-transformer: Integrating local and global features for better time series generation. arXiv 2023, arXiv:2312.11714.
Nectoux, P.; Gouriveau, R.; Medjaher, K.; Ramasso, E.; Morello, B.; Zerhouni, N.; Varnier, C. PRONOSTIA: An experimental platform for bearings accelerated degradation tests. In Proceedings of the IEEE International Conference on Prognostics and Health Management, Denver, CO, USA, 18–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1–8. Available online: https://hal.science/hal-00719503v1 (accessed on 2 June 2025).
Soualhi, A.; Medjaher, K.; Zerhouni, N. Bearing health monitoring based on hilbert-huang transform, support vector machine, and regression. IEEE Trans. Instrum. Meas. 2015, 64, 52–62.
Chang, Y.; Li, F.; Chen, J.; Liu, Y.; Li, Z. Efficient temporal flow Transformer accompanied with multi-head probsparse self-attention mechanism for remaining useful life prognostics. Reliab. Eng. Syst. Saf. 2022, 226, 108701.
Cao, Y.; Ding, Y.; Jia, M.; Tian, R. A novel temporal convolutional network with residual self-attention mechanism for remaining useful life prediction of rolling bearings. Reliab. Eng. Syst. Saf. 2021, 215, 107813.
Shi, Z.; Chehade, A. A dual-LSTM framework combining change point detection and remaining useful life prediction. Reliab. Eng. Syst. Saf. 2021, 205, 107257.
Xiang, S.; Qin, Y.; Zhu, C.; Wang, Y.; Chen, H. LSTM networks based on attention ordered neurons for gear remaining life prediction. ISA Trans. 2020, 106, 343–354.
Chang, Y.; Chen, J.; Lv, H.; Liu, S. Heterogeneous bi-directional recurrent neural network combining fusion health indicator for predictive analytics of rotating machinery. ISA Trans. 2022, 122, 409–423. [Google Scholar] [CrossRef]

Figure 1. Flowchart of the dimensionality reduction of the sensitive feature indexes based on principal component analysis.

Figure 2. Framework of TCN model.

Figure 3. Framework of proposed TCN–Transformer networks.

Figure 4. The remaining useful life prediction process of rolling bearings based on the TCN–Transformer networks.

Figure 5. A statistical chart of the monotonicity index of the 32 feature indexes.

Figure 6. A statistical chart of the correlation index of the 32 feature indexes.

Figure 7. A statistical chart of the predictive index of the 32 feature indexes.

Figure 8. A statistical chart of the robustness index of the 32 feature indexes.

Figure 9. A statistical chart of the comprehensive index of the 32 feature indexes.

Figure 10. Ranking of the values of the comprehensive indexes of the 32 feature indexes.

Figure 11. The selected sensitive feature indexes for Bearing 1-1: (a) the energy ratio of the third frequency sub-band (S27), (b) the energy ratio of the seventh frequency sub-band (S31), (c) the energy ratio of the fourth frequency sub-band (S28), (d) the energy ratio of the eighth frequency sub-band (S32), (e) the kurtosis factor (S14), (f) the energy ratio of the second frequency sub-band (S26), (g) the centroid frequency (S20), and (h) the minimum value (S3).

Figure 12. The health index trend over time of the seven bearings obtained using the proposed method.

Figure 13. RUL prediction results for rolling bearings: (a) Task A; (b) Task B; (c) Task C; (d) Task D; (e) Task E; (f) Task F.

Figure 14. Comparison of evaluation indices of TCN–Transformer and other models.

Table 1. Dimensional and dimensionless time domain feature indexes used in this work.

Dimensional Index	Function	Dimensionless Index	Function
Mean absolute value (S1)	$x_{av} = \frac{1}{N} \sum_{i=1}^{N} |x_{i}|$	Skewness (S11)	$x_{ske} = \frac{\sum_{i=1}^{N} (x_{i} - \bar{x})^{3}}{(N - 1) x_{\sigma}^{3}}$
Peak (S2)	$x_{p} = \max |x_{i}|$	Kurtosis (S12)	$x_{kur} = \frac{\sum_{i=1}^{N} (x_{i} - \bar{x})^{4}}{(N - 1) x_{\sigma}^{4}}$
Minimum (S3)	$x_{\min} = \min |x_{i}|$	Skewness factor (S13)	$\alpha = \frac{x_{ske}}{x_{rms}^{3}}$
Mean value (S4)	$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_{i}$	Kurtosis factor (S14)	$\beta = \frac{x_{kur}}{x_{rms}^{4}}$
Maximum (S5)	$x_{\max} = \max |x_{i}|$	Crest factor (S15)	$C_{f} = \frac{x_{p}}{x_{rms}}$
Root mean square (S6)	$x_{rms} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_{i}^{2}}$	Shape factor (S16)	$S_{f} = \frac{x_{rms}}{x_{av}}$
Root amplitude (S7)	$x_{r} = \left(\frac{1}{N} \sum_{i=1}^{N} \sqrt{|x_{i}|}\right)^{2}$	Impulse factor (S17)	$I_{f} = \frac{x_{p}}{x_{av}}$
Variance (S8)	$D_{x} = \frac{1}{N} \sum_{i=1}^{N} (x_{i} - \bar{x})^{2}$	Clearance factor (S18)	$CL_{f} = \frac{x_{p}}{x_{r}}$
Standard deviation (S9)	$x_{\sigma} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_{i} - \bar{x})^{2}}$	Coefficient of variation (S19)	$K_{v} = \sqrt{D_{x}} / x_{av}$
Maximum to minimum difference (S10)	$x_{p - p} = \max x_{i} - \min x_{i}$

Note: $x_{i}$ denotes the vibration signal sequence collected by the sensor, $x_{i} = [x_{1}, x_{2}, \dots, x_{N}]$, and $N$ denotes the number of data points.
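
As an illustration of how the indexes in Table 1 could be evaluated, the following minimal sketch computes S1 to S19 for one signal segment using only the Python standard library. The function name `time_domain_features` and the test signal are hypothetical and are not taken from the paper; the formulas follow the table as printed, so S2 and S5 coincide.

```python
import math


def time_domain_features(x):
    """Evaluate the Table 1 indexes (S1-S19) for a sequence of samples x.

    Hypothetical helper for illustration; a constant signal is not handled
    (its standard deviation is zero, so S11-S14 would divide by zero).
    """
    n = len(x)
    abs_x = [abs(v) for v in x]
    mean = sum(x) / n                                   # S4
    x_av = sum(abs_x) / n                               # S1
    x_p = max(abs_x)                                    # S2
    x_rms = math.sqrt(sum(v * v for v in x) / n)        # S6
    x_r = (sum(math.sqrt(a) for a in abs_x) / n) ** 2   # S7
    d_x = sum((v - mean) ** 2 for v in x) / n           # S8
    x_sigma = math.sqrt(d_x)                            # S9
    x_ske = sum((v - mean) ** 3 for v in x) / ((n - 1) * x_sigma ** 3)
    x_kur = sum((v - mean) ** 4 for v in x) / ((n - 1) * x_sigma ** 4)
    return {
        "S1": x_av, "S2": x_p, "S3": min(abs_x), "S4": mean,
        "S5": max(abs_x), "S6": x_rms, "S7": x_r, "S8": d_x,
        "S9": x_sigma, "S10": max(x) - min(x),
        "S11": x_ske, "S12": x_kur,
        "S13": x_ske / x_rms ** 3,     # skewness factor
        "S14": x_kur / x_rms ** 4,     # kurtosis factor
        "S15": x_p / x_rms,            # crest factor
        "S16": x_rms / x_av,           # shape factor
        "S17": x_p / x_av,             # impulse factor
        "S18": x_p / x_r,              # clearance factor
        "S19": math.sqrt(d_x) / x_av,  # coefficient of variation
    }


# Toy segment standing in for one vibration snapshot.
signal = [1.0, -1.0, 1.0, -1.0]
features = time_domain_features(signal)
```

In practice each recorded vibration snapshot would be passed through such a function, giving one value per index per time step.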

Table 2. Frequency domain feature indexes used in this work.

Index	Function
Centroid frequency (S20)	$f_{c} = \frac{\sum_{k=0}^{N-1} f_{k} X(k)}{\sum_{k=0}^{N-1} X(k)}$
Average frequency (S21)	$f_{m} = \frac{1}{N} \sum_{k=0}^{N-1} X(k)$
Standard deviation of frequency (S22)	$\sigma_{f} = \sqrt{\frac{\sum_{k=0}^{N-1} (f_{k} - f_{m})^{2} X(k)}{\sum_{k=0}^{N-1} X(k)}}$
Root mean square of frequency (S23)	$f_{rms} = \sqrt{\frac{\sum_{k=0}^{N-1} f_{k}^{2}}{N}}$
Variance of frequency (S24)	$\sigma_{f}^{2} = \frac{\sum_{k=0}^{N-1} (f_{k} - f_{m})^{2} X(k)}{\sum_{k=0}^{N-1} X(k)}$
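
The frequency domain indexes in Table 2 can be sketched in the same way. The snippet below assumes that $X(k)$ is the amplitude of the $k$-th spectral line and $f_{k}$ its frequency, which the table does not state explicitly; the function name `frequency_domain_features` is hypothetical, and the formulas are implemented exactly as tabulated.

```python
import math


def frequency_domain_features(freqs, spectrum):
    """Evaluate the Table 2 indexes (S20-S24) as tabulated.

    freqs    : frequency f_k of each spectral line (assumed meaning)
    spectrum : amplitude X(k) of each spectral line (assumed meaning)
    """
    n = len(spectrum)
    total = sum(spectrum)
    f_c = sum(f * a for f, a in zip(freqs, spectrum)) / total   # S20
    f_m = total / n                                             # S21
    var_f = sum((f - f_m) ** 2 * a
                for f, a in zip(freqs, spectrum)) / total       # S24
    f_rms = math.sqrt(sum(f * f for f in freqs) / n)            # S23
    return {
        "S20": f_c,
        "S21": f_m,
        "S22": math.sqrt(var_f),
        "S23": f_rms,
        "S24": var_f,
    }


# Toy flat spectrum with four lines at 0, 1, 2 and 3 Hz.
freq_features = frequency_domain_features([0.0, 1.0, 2.0, 3.0],
                                          [1.0, 1.0, 1.0, 1.0])
```

The spectrum itself would normally come from a discrete Fourier transform of each vibration snapshot; that step is omitted here to keep the sketch dependency-free.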

Table 3. Calculated evaluation indexes using the eight sensitive feature indexes for the seven rolling bearings.

Bearing	Monotonicity Index	Correlation Index	Predictive Index	Robustness Index	Comprehensive Index	Original Maximum Comprehensive Index	Improvement
1-1	0.9458	0.9654	0.9428	0.8943	0.9482	0.8904	6.5%
1-2	0.8664	0.8628	0.9428	0.8805	0.8720	0.8173	6.7%
1-3	0.9593	0.9573	0.9428	0.8801	0.9490	0.8524	11.3%
1-4	0.8033	0.7707	0.9428	0.9097	0.8149	0.7518	8.4%
1-5	0.8779	0.8952	0.9428	0.8893	0.8924	0.8218	8.6%
1-6	0.9017	0.9132	0.9428	0.8982	0.9140	0.8096	12.9%
1-7	0.9167	0.8978	0.9428	0.8587	0.9059	0.8512	6.4%
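
The Improvement column in Table 3 is consistent with the relative gain of the comprehensive index over the original maximum comprehensive index. The short sketch below recomputes it from the tabulated values; the variable names are illustrative only.

```python
# (comprehensive index, original maximum comprehensive index) from Table 3
table3 = {
    "1-1": (0.9482, 0.8904),
    "1-2": (0.8720, 0.8173),
    "1-3": (0.9490, 0.8524),
    "1-4": (0.8149, 0.7518),
    "1-5": (0.8924, 0.8218),
    "1-6": (0.9140, 0.8096),
    "1-7": (0.9059, 0.8512),
}

# Relative improvement per bearing, in percent.
improvement = {
    bearing: (new - old) / old * 100.0
    for bearing, (new, old) in table3.items()
}

# Mean improvement over the seven bearings.
average_improvement = sum(improvement.values()) / len(improvement)

for bearing, value in improvement.items():
    print(f"Bearing {bearing}: {value:.1f}%")
print(f"Average: {average_improvement:.2f}%")
```

The mean of the seven values reproduces the 8.69% average improvement quoted in the abstract.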

Table 4. The details of the dataset used in this section selected from the IEEE PHM 2012 dataset.

Dataset 1: Load (N)	Dataset 1: Rotation Speed (rpm)	Dataset 2: Load (N)	Dataset 2: Rotation Speed (rpm)
4000	1800	4200	1650
Bearing	Actual Life	Bearing	Actual Life
Bearing 1-1	7 h 47 min 00 s	Bearing 2-1	2 h 31 min 40 s
Bearing 1-2	2 h 25 min 00 s	Bearing 2-2	2 h 12 min 40 s
Bearing 1-3	5 h 00 min 10 s	Bearing 2-3	3 h 20 min 10 s
Bearing 1-4	3 h 09 min 40 s	Bearing 2-4	1 h 41 min 50 s
Bearing 1-5	6 h 23 min 29 s	Bearing 2-5	5 h 33 min 30 s
Bearing 1-6	4 h 10 min 11 s	Bearing 2-6	1 h 35 min 10 s

Table 5. Remaining useful life prediction task description using IEEE PHM 2012 dataset.

Task	Training Bearing	Test Bearing
A	Bearing 1-1, 1-2, 1-3	Bearing 1-4
B	Bearing 1-1, 1-2, 1-3	Bearing 1-5
C	Bearing 1-1, 1-2, 1-3	Bearing 1-6
D	Bearing 2-1, 2-2, 2-3	Bearing 2-4
E	Bearing 2-1, 2-2, 2-3	Bearing 2-5
F	Bearing 2-1, 2-2, 2-3	Bearing 2-6

Table 6. Hyperparameter values of TCN–Transformer networks.

Hyperparameter	Value	Hyperparameter	Value
Batch Size	32	Epochs	10
Activation Function	GELU	Learning Rate	0.0001
Embedding Dimension	64	Hidden Unit Dimension	256
Temporal Window Length	30	Loss Function	MSE
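
To make the role of the temporal window length concrete, the sketch below stores the Table 6 values in a plain dictionary and slices a health index series into overlapping windows of 30 steps. Pairing each window with the value that follows it is an assumption made for illustration; the table does not specify the exact input and target pairing, and the names `HYPERPARAMETERS` and `make_windows` are hypothetical.

```python
# Values copied from Table 6; the dictionary keys are illustrative names.
HYPERPARAMETERS = {
    "batch_size": 32,
    "epochs": 10,
    "activation": "GELU",
    "learning_rate": 1e-4,
    "embedding_dim": 64,
    "hidden_dim": 256,
    "temporal_window_length": 30,
    "loss": "MSE",
}


def make_windows(series, window_length):
    """Return (window, next_value) pairs from a 1-D series.

    Assumed framing: each window of `window_length` consecutive points
    is paired with the point immediately after it.
    """
    pairs = []
    for start in range(len(series) - window_length):
        window = series[start:start + window_length]
        target = series[start + window_length]
        pairs.append((window, target))
    return pairs


# Synthetic, linearly increasing health index with 35 points.
health_index = [i / 34 for i in range(35)]
pairs = make_windows(health_index,
                     HYPERPARAMETERS["temporal_window_length"])
```

With 35 points and a window of 30, this yields five training pairs; a real degradation record would produce one pair per time step after the first 30.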

Table 7. Comparison of results of TCN–Transformer and other advanced models.

Model	Average RMSE	Average MAE	Average SCORE
RNN	0.1093	0.0901	0.2129
LSTM	0.0969	0.0815	0.2004
GRU	0.0991	0.0831	0.2496
Dual-LSTM	0.1055	0.0728	0.2714
LSTM-AON	0.0873	0.0695	0.3286
BiGRU-GSA	0.0852	0.0634	0.4553
TCN-RSA	0.0765	0.0529	0.5070
TFT	0.0602	0.0432	0.5614
TCN–Transformer	0.0514	0.0392	0.6346
IMP	14.62%	9.26%	13.04%
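
The IMP row in Table 7 matches the relative improvement of TCN–Transformer over the best competing model, TFT. The following sketch recomputes it from the tabulated averages; lower is better for RMSE and MAE, higher is better for SCORE, and the variable names are illustrative only.

```python
# Average metrics from Table 7.
tft = {"RMSE": 0.0602, "MAE": 0.0432, "SCORE": 0.5614}
proposed = {"RMSE": 0.0514, "MAE": 0.0392, "SCORE": 0.6346}

# Relative improvement over TFT, in percent.
imp = {
    # Error metrics: improvement is the relative reduction.
    "RMSE": (tft["RMSE"] - proposed["RMSE"]) / tft["RMSE"] * 100.0,
    "MAE": (tft["MAE"] - proposed["MAE"]) / tft["MAE"] * 100.0,
    # SCORE: improvement is the relative increase.
    "SCORE": (proposed["SCORE"] - tft["SCORE"]) / tft["SCORE"] * 100.0,
}

for metric, value in imp.items():
    print(f"{metric}: {value:.2f}%")
```

The three results agree with the 14.62%, 9.26%, and 13.04% reported in the table and in the abstract.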

Table 8. Results of TCN–Transformer ablation experiment.

Task	RMSE: TCN–Transformer	RMSE: Transformer	RMSE: TCN	MAE: TCN–Transformer	MAE: Transformer	MAE: TCN	SCORE: TCN–Transformer	SCORE: Transformer	SCORE: TCN
A	0.0312	0.0177	0.6225	0.0227	0.0135	0.5131	0.6356	0.5290	0.1226
B	0.0660	0.0491	1.0182	0.0483	0.0406	0.8467	0.5673	0.5384	0.0724
C	0.0607	0.1208	0.9761	0.0530	0.1077	0.8184	0.5916	0.4122	0.0700
D	0.0275	0.1278	0.5127	0.0226	0.0624	0.4201	0.6764	0.5465	0.1961
E	0.0800	0.1257	1.0087	0.0436	0.0894	0.8567	0.6178	0.3828	0.0491
F	0.0430	0.0703	0.4298	0.0448	0.0662	0.3498	0.7188	0.5743	0.2675
Average	0.0514	0.0852	0.7613	0.0392	0.0633	0.6341	0.6346	0.4972	0.1296
IMP	39.67%			38.07%			26.63%

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
