
Bearing Lifespan Reliability Prediction Method Based on Multiscale Feature Extraction and Dual Attention Mechanism

1 College of Biomedical Engineering, Sichuan University, Chengdu 610065, China
2 Institute of Regulatory Science for Medical Devices, Sichuan University, Chengdu 610065, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3662; https://doi.org/10.3390/app15073662
Submission received: 14 February 2025 / Revised: 15 March 2025 / Accepted: 23 March 2025 / Published: 27 March 2025

Abstract
Accurate prediction of the remaining useful life (RUL) of rolling bearings is crucial for ensuring the safe operation of machinery and reducing maintenance losses. However, owing to the high nonlinearity and complexity of mechanical systems, traditional methods fail to meet the requirements of medium- and long-term prediction tasks. To address this issue, this paper proposed a recurrent neural network with a dual attention model. By employing path weight selection, the Discrete Fourier Transform, and a TopK selection mechanism, the prediction accuracy and generalization ability in complex time-series analysis were significantly improved. Evaluation results based on mean absolute error (MAE) and root mean square error (RMSE) indicated that the dual attention mechanism effectively focused on key features, optimized feature extraction, and improved prediction performance. An end-to-end RUL prediction model was established based on the MS-DAN network, and the effectiveness of the method was validated on the IEEE PHM 2012 Data Challenge dataset, providing more accurate decision support for equipment maintenance engineers.

1. Introduction

Accurate prediction of the remaining useful life (RUL) of rolling bearings is crucial for ensuring the safe operation of machinery and reducing maintenance costs [1,2,3]. Modern industrial equipment operates in complex and dynamic environments, where its reliability directly impacts production efficiency and safety [4,5,6]; accurate reliability assessment is therefore essential. However, the actual structure of the equipment is influenced by various internal and external time-varying effects, complex operational disturbances, measurement noise, and other factors, making it challenging to ensure the accuracy of long-term predictions [7,8]. To address this issue, this paper analyzed real operational data and developed new methods and tools for reliability evaluation.
In recent years, deep learning has attracted widespread attention in fields such as natural language processing, transfer learning, and computer vision [9,10,11,12]. Gomez et al. [13] proposed an improved temporal fusion Transformer method, which replaces the traditional long short-term memory (LSTM) [14] with a bidirectional long short-term memory (Bi-LSTM) encoder–decoder to enhance the capture of time-series features. This method integrates Bayesian optimization based on a tree-structured Parzen estimator for the state-of-health and RUL prediction of lithium batteries. Niazi et al. [15] developed a parallel neural network framework that employs multi-channel processing combined with the Time Transformer [16], LSTM, and other methods to capture spatiotemporal dependencies, improving efficiency and accuracy in handling multidimensional features. Zhu et al. [17] proposed the TACT model constrained by L2 regularization for RUL prediction. The model combines a multi-scale convolutional neural network (CNN) and a Transformer to extract local and global features simultaneously and introduces delayed prediction constraints to optimize training. Lin et al. [18] introduced a nonlinear multi-stage degradation model based on the Wiener process, incorporating a stage division method to automatically determine the number of stages, change-point locations, and drift model forms. Variational Bayesian methods were used to adaptively estimate parameters and derive the RUL analytically. Traditional RUL prediction models [1,2] had inherent limitations in extracting critical information; they lacked the ability to effectively represent significant features, particularly when dealing with multi-directional data [19]. With deepening research on equipment degradation processes, attention mechanisms have gradually been introduced into RUL prediction.
This mechanism helps dynamically capture key features and improves the model’s prediction performance in complex degradation processes. For example, Zhang et al. [20] proposed an algorithm that integrates an improved self-attention mechanism, temporal convolutional network (TCN) [21], and squeeze-and-excitation mechanism to weight the contributions of input features across both the time-step and channel dimensions, highlighting key features highly relevant to the RUL. Xu et al. [22] proposed an RUL prediction method based on an improved Transformer model that integrates attention mechanisms and deep learning, comprehensively considering spatiotemporal characteristics and various operating conditions. Ding et al. [23] designed a multi-scale convolution module [24] combined with the Swish activation function, embedding local feature learning into global sequence modeling. This approach simultaneously extracts local dependencies and global interaction information from raw time signals and transforms them into trainable class labels. Zhao et al. [25] proposed a novel gated attention mechanism called Capsule Neural Network [26].
These methods achieved significant results in equipment reliability assessment and RUL prediction but also exhibited notable shortcomings. First, the importance of different features may vary across degradation stages: certain features can be critical to the degradation process at specific stages, yet traditional methods often failed to dynamically weight or focus on features as the degradation stage changed, leading to the neglect of key features. Second, the models lacked adaptability: many traditional models relied on fixed parameters or feature extraction methods, making it difficult to adjust flexibly to varying degradation stages or environmental conditions. Although attention mechanisms could capture long-term dependencies, they still lacked sufficient adaptability in adjusting their focus. To address these issues, this paper proposed a time-series analysis method based on multi-scale feature extraction and path weight selection. As shown in Figure 1, this method employed the Discrete Fourier Transform (DFT) [27] to extract periodic components from time-series data and used a TopK selection mechanism [28] to retain the most critical path weights. This allowed the importance of feature paths to be dynamically adjusted according to the degradation stage of the equipment. In the healthy stage, when degradation signs were not yet apparent, the path weight selection mechanism automatically reduced the influence of paths with lower contributions; in the degradation stage, when key features became more critical, it enhanced the weights of these paths, improving the extraction of degradation-related information. After extracting path features, the model first used the proposed attention model, EM-Net, to extract initial features, ensuring an effective representation of the data in the feature space.
Subsequently, the model employed a recurrent neural network (RNN) [29] to progressively capture the temporal dependencies of the signals. Finally, the processed features were passed to the RUL prediction module, which incorporated activation functions and dropout regularization to prevent overfitting, ultimately generating accurate RUL predictions to provide precise lifespan estimations for the equipment.
This study made significant contributions to effectively capturing information during the equipment degradation process, extracting key information, and enhancing the model’s robustness against interference. Compared to existing methods, our model demonstrated superior prediction accuracy and generalization capability. In particular, by integrating path weight selection with an innovative approach based on one-dimensional convolution and spatial attention mechanisms, the model further improved its ability to predict the degradation process of bearings. The path weight selection mechanism dynamically adjusted the weights of feature paths according to different stages of bearing degradation, enabling more precise capture of degradation information. Experimental results showed that the proposed method outperformed other existing methods on both the PHM 2012 [30] bearing dataset and real-world equipment data, validating its effectiveness and superiority.
The main contributions of this paper were as follows:
1.
A path weight selection mechanism was proposed, which could dynamically adjust the weights of feature paths according to different stages of bearing degradation, thereby capturing degradation information more accurately;
2.
A dual attention mechanism was constructed, capable of flexibly capturing dependencies between channels and automatically adjusting the importance of each channel, which effectively enhanced the model’s feature representation capability;
3.
The MS-DAN prediction method was proposed, which enhanced the feature extraction capability during the equipment degradation process and demonstrated excellent performance in prediction accuracy.

2. Materials and Methods

2.1. Materials

This paper used the dataset provided by the IEEE PHM2012 Challenge to verify the effectiveness of the proposed RUL prediction method. The dataset was collected by the PRONOSTIA experimental platform, and the collection device is shown in Figure 2. The PRONOSTIA platform is an experimental setup designed and implemented by the French FEMTO-ST Institute, specifically for testing and verifying bearing fault detection, diagnosis, and prediction methods. The platform operated under three different working conditions (speed 1800 rpm/load 4000 N, speed 1650 rpm/load 4200 N, speed 1500 rpm/load 5000 N), with accelerometers installed on both the vertical and horizontal axes to measure the vibration signals of the rolling bearings. Vibration data were collected every 10 s for 0.1 s, with a sampling frequency of 25.6 kHz to obtain online health monitoring data (such as speed, load, temperature, and vibration). A total of 2560 sets of sample data were collected every 10 s. The deep groove ball bearing was chosen as the experimental bearing primarily because of its excellent load-bearing capacity and wide applicability, especially in environments where variable loads needed to be supported.
The PHM2012 dataset contained 17 full life cycle tests of bearings. It was divided into 6 training sets and 11 test sets. Both the training and test sets covered different speed and load conditions to verify the effectiveness of bearing fault diagnosis and prediction methods. Bearings 1-1 and 2-1 represented different loads and speeds to simulate the degradation process under actual operating conditions, as shown in Figure 3 and Table 1. Through the PRONOSTIA platform, researchers were able to obtain complete data from normal operation to failure of the bearing in a controlled environment, providing experimental data for training and validation of machine learning models. According to relevant studies in the literature [31,32], horizontal vibration signals typically provide more useful information than vertical vibration signals for tracking bearing degradation. Therefore, this paper only used horizontal vibration signals for the experiments.

2.2. Methods

2.2.1. Overview

The MS-DAN model, as shown in Figure 4, was built on the traditional RNN model with the addition of the attention mechanism and feature selection proposed in this paper. Based on the different stages of bearing degradation, it dynamically adjusted the weights of each feature path to more accurately capture degradation information. The MS-DAN model first performed feature selection and converted the time-domain signal into the frequency-domain signal through DFT to extract useful features. Then, the frequency-domain features were downsampled using kernel average pooling, and important features were highlighted by SoftMax weighting. The most important K features were retained, and the frequency-domain features were converted back to the time domain through Inverse Discrete Fourier Transform (IDFT). Next, the data entered the feature learning stage, where the features were further optimized through the feature extractor and enhanced feature modules. Through the attention mechanism, the model automatically assigned weights to different features, strengthened the areas containing more degradation information, and helped the model focus on important features. The cross-attention mechanism further enhanced the interactive information between features. In the RNN module, the model used recurrent neural networks to capture long-term dependencies in the time series. After passing through the RNN model, the loss L was calculated, and the model parameters were optimized through backpropagation.
Ultimately, this method not only enhanced the accuracy of RUL predictions but also improved the model’s adaptability to various operating conditions, providing more reliable support for maintenance decision-making of mechanical equipment. The pseudocode of the algorithm is shown in Algorithm 1.
Algorithm 1: Applying the proposed method to RUL prediction
Input: A set of dataset samples Learning_set = {(X1), (X2), …, (Xn)}. The Full_Test_Set is the test set. The number of learning epochs is M.
Output: The optimal model and its predicted RUL
1. Load the training set and validation set;
2. Begin:
3. Initialize all weights and biases;
4. For m = 1, 2, …, M do
5.   Extract features through the multiscale model → FR;
6.   Input FR to the MS-DAN;
7.   Calculate the output of the MS-DAN;
8.   Input the feature FMS into sequence X, and input it to the RNN;
9.   Calculate the output of the RNN layer;
10.  Calculate the RUL;
11.  Model Fit (Adam, (train X)) → M(m);
12.  Model Evaluate (M(m), (Val X)) → Rmae(m);
13. End For
14. Save the optimal model with the minimum Rmae over the M epochs;
15. End
16. Load the testing set;
17. Load the optimal model in terms of RUL performance.

2.2.2. Multi-Scale Partitioning

Multi-scale partitioning could be easily extended to the multivariate case by considering each variable independently. In the multi-scale module, we defined a set S = {S1, …, SM} containing M patch size values, where each patch size S corresponded to a patch partitioning operation. For the input time series Xi ∈ RH×d, where H is the length of the time series and d is the dimensionality of the features, the partitioning operation for a patch size S divided X into P patches (X1, X2, …, XP), where each patch Xi ∈ RS×d contained S time steps, as shown in Figure 5.
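As a concrete illustration, the patch partitioning above can be sketched in a few lines of plain Python (the function name and the choice to drop an incomplete trailing patch are ours, not from the paper):

```python
def partition_patches(x, patch_size):
    """Split a time series (a list of time steps, each a scalar or a
    feature vector) into non-overlapping patches of `patch_size` steps.
    A trailing fragment shorter than patch_size is dropped here; the
    paper does not specify how such a remainder is handled."""
    usable = (len(x) // patch_size) * patch_size
    return [x[i:i + patch_size] for i in range(0, usable, patch_size)]
```

For example, a series of length H = 10 with S = 3 yields P = 3 patches of 3 steps each.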
The extraction of periodic components was primarily achieved through the DFT. By applying the DFT to the input time series X, it was converted from the time domain to the frequency domain. fk represents a single frequency component.
$X_f = \mathrm{DFT}(X) = \{f_1, f_2, \ldots, f_x\}$
The amplitudes of various frequency components were computed, and the Top Kf frequencies with the largest amplitudes were selected. This selection process not only ensured the sparsity of the frequency domain but also effectively retained the most important frequency components, thereby reducing redundant information. Kf represented the number of selected frequency components.
$X_f = \mathrm{TopK}(\{X_f\}, K_f)$
The retained frequency components were then reconstructed in the time domain as a superposition of cosine terms, expressed by the following formula.
$X_f(t) = \sum_{k=1}^{K} A_k \cos(2\pi f_k t + \varphi_k)$
$A_k$ and $\varphi_k$ represent the amplitude and phase of the selected frequencies, respectively.
Using the IDFT, the selected frequency components were converted back to the time domain to obtain the periodic part Xf, reconstructing the periodic fluctuations in the time series, as described by the following formula.
$X_s = \mathrm{IDFT}(X_f, K_f)$
The residual component was then averaged through pooling using kernels of different sizes. Multiple convolution operations were performed on the residual component with different pooling kernels, and the output for each kernel was computed. Subsequently, the SoftMax function was applied to determine the weight of each kernel. Yk represents the output value after the pooling operation.
$Y_k = \mathrm{AvgPool}(X_s, \mathrm{kernel})$
The features of each segment were normalized using the SoftMax function to obtain the corresponding weights.
$W_k = \frac{\exp(Y_k)}{\sum_{i=1}^{p} \exp(Y_i)}$
The periodic components captured cyclical fluctuations, while the trend components reflected long-term changes in the time series. The TopK selection mechanism allowed us to retain the most important path weights, thereby optimizing the feature extraction process. The pseudocode of feature selection is shown in Algorithm 2.
$F = \sum_{k=1}^{K} W_k Y_k$
$W_{path} = \mathrm{softmax}(F)$
$\mathrm{TopK}(W_{path}, K)$
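The SoftMax weighting and TopK path selection in the equations above reduce to the following sketch (pure Python; names are ours, and a numerically stable SoftMax is assumed):

```python
import math

def softmax(values):
    """Numerically stable SoftMax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def topk_paths(scores, k):
    """Return the indices of the k paths with the largest SoftMax weight."""
    weights = softmax(scores)
    return sorted(range(len(weights)),
                  key=lambda i: weights[i], reverse=True)[:k]
```

Because SoftMax is monotone, TopK over the weights equals TopK over the raw scores; the weights themselves are still needed for the weighted sum F.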
Algorithm 2: Feature Selection Using DFT
Input: Time series X ∈ RH×d. Number of selected components K.
Output: Selected_K
1. Perform DFT on the input time series X to obtain the frequency components:
2.   Xf = DFT(X);
3.   Xf = TopK({Xf}, K);
4.   Xs = IDFT(Xf);
5. Pooling(Xs, kernel):
6.   For each size in kernel
7.     Apply the pooling operation to the input Xs;
8.   Return Xpooled;
9. For each pooling kernel output
10.   Calculate its SoftMax weight;
11. Selected_K = TopK(Xpooled, K);
12. Return Selected_K.

2.2.3. Attention Mechanism

The introduction of the attention mechanism in deep learning significantly improved model performance, especially in handling sequence data and natural language processing tasks [33,34,35]. The attention mechanism assigned different weights to each element in the input sequence, allowing the model to focus on the most relevant parts for the current task when calculating the output. This enabled the model to allocate weighted attention across different input positions, thereby improving the efficiency of information utilization. This paper designed an EM-Net, which included an efficient channel attention network (ECA-Net) [36] and convolutional block attention module (CBAM) [37]. ECA-Net replaced the channel attention module in CBAM by generating channel attention through the weighted cross-channel information and used a simple 1D convolution to model the relationships between channels. This approach not only reduced the number of parameters but also improved computational efficiency, as shown in Figure 6.
Global average pooling was first performed on the input feature X1 ∈ RW×H×C. The pooling operation averaged along the spatial dimensions (W, H) of each channel, producing a feature g(X), where c is the number of channels.
$g(X) = \frac{1}{W \times H} \sum_{w=1}^{W} \sum_{h=1}^{H} X_1(w, h, c)$
The formula for calculating the kernel size k is based on the number of channels c.
$k = \left| \frac{\log_2(c)}{\gamma} + \frac{b}{\gamma} \right|$
To prevent the computed kernel size from being 1, which would make it ineffective at extracting inter-channel information, b = 1 and γ = 2 were set. A Conv1D with the computed kernel size k was then applied to g(X), and the convolution result was normalized by the Sigmoid function to obtain the weighting coefficient Fc for each channel.
$\sigma(x) = \frac{1}{1 + e^{-x}}$
$F_c = \sigma\left(\mathrm{Conv1D}_k(g(X))\right)$
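The adaptive kernel-size rule can be sketched as follows (this follows the original ECA-Net formulation, which additionally rounds k to the nearest odd integer; that rounding convention is our assumption here, not stated in the text):

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D-convolution kernel size from the channel count c:
    k = |log2(c)/gamma + b/gamma|, bumped to the next odd integer so
    the kernel is centered (per the original ECA-Net paper)."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1
```

With γ = 2 and b = 1 as set above, c = 64 channels give k = 3 and c = 256 give k = 5, so the receptive field grows slowly with channel count.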
The two pooled features were concatenated to form a joint feature containing information from the pooling operations. Then, a 7 × 7 convolutional layer was used to convolve the concatenated feature. The purpose of this step was to extract richer spatial features through the convolution operation and generate a spatial attention matrix Fd, which contained attention weights for each spatial location. This helped the model focus more on important regions, improving its ability to perceive and enhance the features.
$F_d = \sigma\left(\mathrm{Conv2D}_{7\times7}\left([\mathrm{AvgPool}(F_c \times X),\ \mathrm{MaxPool}(F_c \times X)]\right)\right)$
$X_2 = X_1 \times F_c \times F_d$
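The two pooled maps that feed the 7 × 7 convolution can be computed as below (a minimal pure-Python sketch of only the pooling half of the spatial attention; the convolution and Sigmoid steps are omitted, and the names are ours):

```python
def channelwise_pools(x):
    """Average- and max-pool an H x W x C feature (nested lists) along
    the channel axis, producing the two H x W maps that are concatenated
    before the 7x7 convolution in the spatial-attention branch."""
    height, width = len(x), len(x[0])
    avg = [[sum(x[h][w]) / len(x[h][w]) for w in range(width)]
           for h in range(height)]
    mx = [[max(x[h][w]) for w in range(width)]
          for h in range(height)]
    return avg, mx
```

The average map summarizes overall channel activity at each location, while the max map highlights the strongest single response; the convolution then learns where to attend from both.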

2.2.4. RNN

RNN was a type of neural network model capable of processing sequential data. This enabled it to excel in tasks such as speech recognition, natural language processing, and time series prediction [38,39]. Especially in RUL prediction, RNN could capture the temporal relationships in data by passing information through hidden states across time steps. To better extract features from equipment condition monitoring data, the independent recurrent neural network (IndRNN) [40] network was employed. IndRNN achieved this by making the update of each neuron independent, meaning that each neuron was updated based solely on its own state and input, without relying on the states of other neurons. This approach effectively avoided the gradient vanishing and explosion issues present [41,42,43] in RNN and LSTM. Since each neuron’s computation was independent, the information transfer over long time steps became more stable.
As shown in Figure 7, the hidden state ht was updated from the input Xt and the previous hidden state ht−1 at step t.
$h_t = \sigma(W X_t + u \odot h_{t-1} + b)$
The hidden state update of each neuron was calculated based on the current input and the hidden state from the previous time step. For the n neuron, the hidden state hn,t at time step t was given by the following equation.
$h_{n,t} = \sigma(W_n X_t + u_n h_{n,t-1} + b_n)$
Xt ∈ RM represents the input at time step t, ht−1 ∈ RN was the hidden state from the previous time step, Wn and un were the weights for the current input and the previous hidden state, respectively, bn was the bias term, and σ was the activation function. In this equation, the current hidden state was influenced not only by the input but also by the hidden state from the previous time step, reflecting the temporal dependency characteristic of recurrent neural networks. In IndRNN, the design of the loss function was crucial for the optimization process. A commonly used loss function was the mean absolute error, given by the following form.
$L = \frac{1}{M} \sum_{t=1}^{M} |y_t - \hat{y}_t|$
Here, $y_t$ represented the actual target output, while $\hat{y}_t$ was the predicted output of the network at time step t. The loss function was computed over the errors at all time steps, reflecting the difference between the network's output and the actual target. By minimizing this loss, IndRNN continuously adjusted its parameters to improve prediction accuracy. To minimize the loss function and optimize the network parameters, IndRNN used the backpropagation algorithm. The core idea of backpropagation was to gradually update the weights and biases in the network based on the gradient of the loss function with respect to each neuron.
$\frac{\partial J_n}{\partial h_{n,t}} = \frac{\partial J_n}{\partial h_{n,T}} \frac{\partial h_{n,T}}{\partial h_{n,t}} = \frac{\partial J_n}{\partial h_{n,T}} \prod_{k=t}^{T-1} \sigma'_{n,k+1} u_n = \frac{\partial J_n}{\partial h_{n,T}} u_n^{T-t} \prod_{k=t}^{T-1} \sigma'_{n,k+1}$
Since the hidden state update of each neuron was independent, IndRNN could accelerate the training process by performing independent backpropagation operations when computing gradients.
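A single IndRNN update, following the per-neuron equation above, can be sketched in pure Python (names are ours; tanh is assumed as the activation σ, which the paper does not specify):

```python
import math

def indrnn_step(x_t, h_prev, W, u, b, act=math.tanh):
    """One IndRNN time step: neuron n combines the full input x_t
    (through its weight row W[n]) with only its OWN previous state
    h_prev[n], scaled by the scalar recurrent weight u[n]."""
    return [act(sum(W[n][m] * x_t[m] for m in range(len(x_t)))
                + u[n] * h_prev[n] + b[n])
            for n in range(len(h_prev))]
```

Because the recurrent term is an element-wise scalar product rather than a full matrix multiplication, the gradient through time factors into powers of u_n, which is what keeps it from vanishing or exploding over long sequences.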

3. Results

3.1. Evaluation Criteria

This paper implemented the code in TensorFlow [44] to predict the remaining useful life of the PHM bearing data. Mean absolute error (MAE) [45] and root mean square error (RMSE) [46] were used as evaluation metrics, with lower values preferred. A smaller MAE indicated lower prediction error and thus higher prediction accuracy, while a smaller RMSE indicated greater stability in the predictions.
$\mathrm{MAE} = \frac{1}{m} \sum_{t=1}^{m} |y_t - \hat{y}_t|$
$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{t=1}^{m} (y_t - \hat{y}_t)^2}$
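The two metrics translate directly into code (a minimal sketch; function names are ours):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error over paired true/predicted sequences."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error; penalizes large deviations more than MAE."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(y_true, y_pred)) / len(y_true))
```

Because RMSE squares each error before averaging, a few large mistakes raise RMSE faster than MAE, which is why the pair together indicates both accuracy and stability.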

3.2. Experimental Setup and Performance

As shown in Table 2, the proposed method exhibited a clear advantage over CNN, TCN, the gated recurrent unit (GRU), bidirectional GRU (BiGRU), and BiLSTM on both the Bearing 1 and Bearing 2 test sets, demonstrating lower MAE and RMSE than the other models. The network model outperformed the baseline models in direct prediction, highlighting its enhanced generalization. Adaptability was a main advantage of the proposed architecture, allowing it to select different scales for various temporal dynamics and thereby capture the complex temporal patterns present in different datasets. Table 3 and Figure 8 show the performance comparison of the model under different parameters.

3.3. Ablation Experiment

In order to validate the effectiveness of the proposed improved attention mechanism, ablation experiments were conducted with the following configurations: base model, base model + CBAM, and ours (Multiscale + CBAM + IndRNN). As shown in Table 4, the proposed model (ours) achieved the best performance across multiple metrics, demonstrating the effectiveness of integrating multiscale, CBAM, and IndRNN. Furthermore, the proposed model maintained a competitive parameter size, balancing performance and computational efficiency effectively.

3.4. Comparison of Different Modules

The evaluation results were presented. The main advantage of the model was its adaptability, which allowed it to select different scales according to varying temporal dynamics. Through this adaptive mechanism, the model was able to identify and capture complex temporal dependencies and dynamic changes in various time series data. This adaptability had great potential in real-world applications, enabling the model to better handle a wide range of time series problems with high prediction accuracy and stability, as shown in Table 5.

4. Discussion

To effectively predict the remaining useful life (RUL) from condition monitoring data, this paper proposed a multi-stage RUL prediction method. First, the real-time monitoring signals were preprocessed using path weight selection, whose purpose was to choose the most representative and important features from the monitoring signals so as to better reflect the degradation state of the equipment. As shown in Table 4, the ablation results indicated that applying path weight selection, compared to (base + EM-NET), improved performance on all bearing data, with an average reduction of 20% in MAE and 15% in RMSE. This improvement suggested that path weight selection effectively optimized the data features, thereby enhancing the model's prediction accuracy. In addition, this paper introduced an improved attention mechanism module, which strengthened the traditional attention mechanism to capture more degradation information and preserve more detailed features. The features processed by this module were then input into the RNN, which effectively captured long-term dependencies in the time series [19,47], allowing the model to better learn the degradation patterns and trends of the equipment. By combining path weight selection with the dual attention mechanism, the model was able to focus on regions containing more degradation information, emphasizing the features crucial for prediction. This integrated approach further improved the model's accuracy and robustness. Finally, the MS-DAN model demonstrated significant performance improvements on the PHM bearing dataset. As shown in Table 5, compared with newer models, the MS-DAN model reduced RMSE by an average of 13.51% and MAE by an average of 10.14%.
These results indicated that the MS-DAN model exhibited notable improvements in both prediction accuracy and generalization ability, providing a more reliable solution for condition monitoring and RUL prediction.
Multiscale selected the top K patch sizes for combination to adapt to different time series samples. The impact of different K values on the prediction results was evaluated, as shown in Figure 9. The results showed that performance with K = 2 and K = 3 outperformed K = 1 and K = 6, highlighting the advantage of adaptively modeling critical multi-scale features to enhance accuracy. Furthermore, different time series samples benefited from feature extraction with various patch sizes, but not all patch sizes were equally effective. These findings highlighted the adaptability of the model, emphasizing its ability to identify and apply optimal combinations of patch sizes to address the diverse periodic and trend patterns present in the samples.
Table 3 presents the hyperparameter settings for model training across six experiments, all using 50 epochs. Comparing Exp2 and Exp4 showed that, with other parameters held constant, a larger batch size led to better model performance. Comparing Exp1, Exp3, and Exp4, using SGD or RMSprop as the optimizer was less efficient in this complex task than Adam. Finally, comparing Exp4 and Exp6, a smaller learning rate allowed more precise model adjustments and prevented skipping over optimal weights.

5. Conclusions

This study developed a novel model that incorporates an improved attention mechanism, utilizing efficient one-dimensional convolution to generate channel weights. This design significantly reduces the number of parameters while avoiding dimensionality reduction operations, thereby enhancing the model's efficiency. The features extracted through path weight selection were input into the RNN and the enhanced attention module to further improve the model's predictive performance. Experimental results demonstrated that the model achieved excellent predictive performance on the condition monitoring dataset. In the future, this study plans to adopt adaptive techniques to further optimize the network model; for example, a multi-scale adaptive attention mechanism [48] could be introduced to extract information from both minor fluctuations and the long-term stability of equipment, thereby enhancing the model's capability to predict stability under different operating conditions. Additionally, the research team plans to collaborate further with West China Hospital to collect operational data from ventilators, monitors, and portable extracorporeal devices and apply the proposed method to the prediction of medical equipment stability.

Author Contributions

Conceptualization, X.L. and M.W.; methodology, X.L. and M.W.; software, X.L.; formal analysis, X.L.; investigation, X.L. and M.W.; resources, M.W.; data curation, X.L.; writing—original draft preparation, X.L.; writing—review and editing, X.L. and M.W.; visualization, X.L.; supervision, M.W.; project administration, M.W.; funding acquisition, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Project No. 2022YFC2407600, Project No. 2022YFC3601000).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The research in this paper uses a publicly available dataset. The dataset can be downloaded from the following link: [https://github.com/wkzs111/phm-ieee-2012-data-challenge-dataset] (accessed on 23 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MAE: Mean absolute error
RMSE: Root mean square error
RUL: Remaining useful life
LSTM: Long short-term memory
Bi-LSTM: Bidirectional long short-term memory
CNN: Convolutional neural network
TCN: Temporal convolutional network
DFT: Discrete Fourier Transform
RNN: Recurrent neural network
ECA-Net: Efficient channel attention network
CBAM: Convolutional block attention module
IndRNN: Independent recurrent neural network
GRU: Gated recurrent unit
BiGRU: Bidirectional gated recurrent unit

References

  1. Wang, Y.; Zhao, Y.; Addepalli, S. Remaining useful life prediction using deep learning approaches: A review. Procedia Manuf. 2020, 49, 81–88. [Google Scholar]
  2. Ferreira, C.; Gonçalves, G. Remaining Useful Life prediction and challenges: A literature review on the use of Machine Learning Methods. J. Manuf. Syst. 2022, 63, 550–562. [Google Scholar]
  3. Zhang, Y.; Fang, L.; Qi, Z.; Deng, H. A review of remaining useful life prediction approaches for mechanical equipment. IEEE Sens. J. 2023, 23, 29991–30006. [Google Scholar]
  4. Zio, E. Some challenges and opportunities in reliability engineering. IEEE Trans. Reliab. 2016, 65, 1769–1782. [Google Scholar]
  5. Wang, Q.; Liu, W.; Xin, Z.; Yang, J.; Yuan, Q. Development and application of equipment maintenance and safety integrity management system. J. Loss Prev. Process Ind. 2011, 24, 321–332. [Google Scholar]
  6. Cepin, M.; Radim, B. Safety and Reliability. Theory and Applications; CRC Press: Boca Raton, FL, USA, 2017. [Google Scholar]
  7. Bagri, I.; Tahiry, K.; Hraiba, A.; Touil, A.; Mousrij, A. Vibration Signal Analysis for Intelligent Rotating Machinery Diagnosis and Prognosis: A Comprehensive Systematic Literature Review. Vibration 2024, 7, 1013–1062. [Google Scholar] [CrossRef]
  8. Zhang, P.; Chen, R.; Xu, X.; Yang, L.; Ran, M. Recent progress and prospective evaluation of fault diagnosis strategies for electrified drive powertrains: A comprehensive review. Measurement 2023, 222, 113711. [Google Scholar]
  9. Alyafeai, Z.; AlShaibani, M.S.; Ahmad, I. A survey on transfer learning in natural language processing. arXiv 2020, arXiv:2007.04239. [Google Scholar]
  10. Chai, J.; Zeng, H.; Li, A.; Ngai, E.W. Deep learning in computer vision: A critical review of emerging techniques and application scenarios. Mach. Learn. Appl. 2021, 6, 100134. [Google Scholar]
  11. Reza, M.; Mannan, M.; Mansor, M.; Ker, P.J.; Mahlia, T.M.I.; Hannan, M. Recent advancement of remaining useful life prediction of lithium-ion battery in electric vehicle applications: A review of modelling mechanisms, network configurations, factors, and outstanding issues. Energy Rep. 2024, 11, 4824–4848. [Google Scholar]
  12. Song, L.; Jin, Y.; Lin, T.; Zhao, S.; Wei, Z.; Wang, H. Remaining useful life prediction method based on the spatiotemporal graph and GCN nested parallel route model. IEEE Trans. Instrum. Meas. 2024, 73, 1–12. [Google Scholar]
  13. Gomez, W.; Wang, F.K.; Chou, J.H. Li-ion battery capacity prediction using improved temporal fusion transformer model. Energy 2024, 296, 131114. [Google Scholar]
  14. Hochreiter, S.; Schmidhuber, J. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [PubMed]
  15. Niazi, S.G.; Huang, T.; Zhou, H.; Bai, S.; Huang, H.-Z. Multi-scale time series analysis using TT-ConvLSTM technique for bearing remaining useful life prediction. Mech. Syst. Signal Process. 2024, 206, 110888. [Google Scholar]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  17. Zhu, J.; Ma, J.; Wu, J. A regularized constrained two-stream convolution augmented transformer for aircraft engine remaining useful life prediction. Eng. Appl. Artif. Intell. 2024, 133, 108161. [Google Scholar]
  18. Lin, W.; Chai, Y.; Fan, L.; Zhang, K. Remaining useful life prediction using nonlinear multi-phase Wiener process and variational Bayesian approach. Reliab. Eng. Syst. Saf. 2024, 242, 109800. [Google Scholar]
  19. Kumar, A.; Parkash, C.; Vashishtha, G.; Tang, H.; Kundu, P.; Xiang, J. State-space modeling and novel entropy-based health indicator for dynamic degradation monitoring of rolling element bearing. Reliab. Eng. Syst. Saf. 2022, 221, 108356. [Google Scholar]
  20. Zhang, Q.; Liu, Q.; Ye, Q. An attention-based temporal convolutional network method for predicting remaining useful life of aero-engine. Eng. Appl. Artif. Intell. 2024, 127, 107241. [Google Scholar]
  21. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
  22. Xu, D.; Xiao, X.; Liu, J.; Sui, S. Spatio-temporal degradation modeling and remaining useful life prediction under multiple operating conditions based on attention mechanism and deep learning. Reliab. Eng. Syst. Saf. 2023, 229, 108886. [Google Scholar]
  23. Ding, Y.; Jia, M. Convolutional transformer: An enhanced attention mechanism architecture for remaining useful life estimation of bearings. IEEE Trans. Instrum. Meas. 2022, 71, 3515010. [Google Scholar]
  24. Cai, Z.; Fan, Q.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; Volume 14, pp. 354–370. [Google Scholar]
  25. Zhao, C.; Huang, X.; Li, Y.; Li, S. A novel remaining useful life prediction method based on gated attention mechanism capsule neural network. Measurement 2022, 189, 110637. [Google Scholar]
  26. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic routing between capsules. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
  27. Wang, Z. Fast algorithms for the discrete W transform and for the discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 803–816. [Google Scholar]
  28. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. arXiv 2022, arXiv:2211.14730. [Google Scholar]
  29. Elman, J.L. Finding structure in time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar]
  30. Nectoux, P.; Gouriveau, R.; Medjaher, K. An experimental platform for bearings accelerated degradation tests. In Proceedings of the IEEE International Conference on Prognostics and Health Management IEEE, Beijing, China, 18–21 June 2012; pp. 23–25. [Google Scholar]
  31. Soualhi, A.; Medjaher, K.; Zerhouni, N. Bearing health monitoring based on Hilbert–Huang transform, support vector machine, and regression. IEEE Trans. Instrum. Meas. 2014, 64, 52–62. [Google Scholar]
  32. Singleton, R.K.; Strangas, E.G.; Aviyente, S. Extended Kalman filtering for remaining-useful-life estimation of bearings. IEEE Trans. Ind. Electron. 2014, 62, 1781–1790. [Google Scholar]
  33. Xu, L.; Huang, J.; Nitanda, A.; Asaoka, R.; Yamanishi, K. A Novel Global Spatial Attention Mechanism in Convolutional Neural Network for Medical Image Classification. arXiv 2020, arXiv:2007.15897. [Google Scholar]
  34. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual Attention Network for Image Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
  35. Liu, C.; Huang, L.; Wei, Z.; Zhang, W. Subtler mixed attention network on fine-grained image classification. Appl. Intell. 2021, 51, 7903–7916. [Google Scholar]
  36. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  37. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  38. Shastry, K.A.; Shastry, A. An integrated deep learning and natural language processing approach for continuous remote monitoring in digital health. Decis. Anal. J. 2023, 8, 100301. [Google Scholar]
  39. Wei, D.; Wang, B.; Lin, G.; Liu, D.; Dong, Z.; Liu, H.; Liu, Y. Research on unstructured text data mining and fault classification based on RNN-LSTM with malfunction inspection report. Energies 2017, 10, 406. [Google Scholar] [CrossRef]
  40. Li, S.; Li, W.; Cook, C.; Zhu, C.; Gao, Y. Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  41. Zhao, B.; Li, S.; Gao, Y. IndRNN based long-term temporal recognition in the spatial and frequency domain. In Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and the 2020 ACM International Symposium on Wearable Computers; Association for Computing Machinery: New York, NY, USA, 2020; pp. 368–372. [Google Scholar]
  42. Zhang, P.; Meng, J.; Luan, Y.; Liu, C. Plant miRNA-lncRNA Interaction Prediction with the Ensemble of CNN and IndRNN. Interdiscip. Sci. 2020, 12, 82–89. [Google Scholar] [PubMed]
  43. Liao, H. Image Classification Based on IndCRNN Module. In Proceedings of the ICVISP 2020: 2020 4th International Conference on Vision, Image and Signal Processing, Bangkok, Thailand, 9–11 December 2020; pp. 1–6. [Google Scholar]
  44. Pang, B.; Nijkamp, E.; Wu, Y.N. Deep learning with tensorflow: A review. J. Educ. Behav. Stat. 2020, 45, 227–248. [Google Scholar]
  45. Qiao, C.; Li, D.; Guo, Y.; Liu, C.; Jiang, T.; Dai, Q.; Li, D. Evaluation and development of deep neural networks for image super-resolution in optical microscopy. Nat. Methods 2021, 18, 194–202. [Google Scholar]
  46. Hodson, T. Root mean square error (RMSE) or mean absolute error (MAE): When to use them or not. Geosci. Model Dev. Discuss. 2022, 15, 5481–5487. [Google Scholar]
  47. Wei, G.; Zhao, J.; Feng, Y.; He, A.; Yu, J. A novel hybrid feature selection method based on dynamic feature importance. Appl. Soft Comput. 2020, 93, 106337. [Google Scholar]
  48. Shao, X.; Kim, C.-S. Adaptive multi-scale attention convolution neural network for cross-domain fault diagnosis. Expert Syst. Appl. 2024, 236, 121216. [Google Scholar]
Figure 1. The schematic illustration of the proposed model.
Figure 2. The PRONOSTIA platform.
Figure 3. Dataset of bearing.
Figure 4. The flowchart of the proposed method.
Figure 5. Multi-scale partitioning.
Figure 6. Layered architecture of EM-Net.
Figure 7. Basic structure of IndRNN.
Figure 8. Results of different hyperparameters of the experiment.
Figure 9. Network performance with different K. (a) MAE for bearings 1-3 to 1-7 with different K values; (b) RMSE for bearings 1-3 to 1-7 with different K values.
Table 1. Operating condition information of PHM2012 dataset.

Operating Condition | Radial Force/N | Rotational Speed/(r·min⁻¹) | Training Set | Testing Set
Condition 1 | 4000 | 1800 | Bearing 1-1, Bearing 1-2 | Bearing 1-3, Bearing 1-4, Bearing 1-5, Bearing 1-6, Bearing 1-7
Condition 2 | 4200 | 1650 | Bearing 2-1, Bearing 2-2 | Bearing 2-3, Bearing 2-4, Bearing 2-5, Bearing 2-6, Bearing 2-7
Condition 3 | 4400 | 1500 | Bearing 3-1, Bearing 3-2 | Bearing 3-3
Table 2. Results of metrics using different models.

Bearing | CNN MAE | CNN RMSE | TCN MAE | TCN RMSE | GRU MAE | GRU RMSE
1-3 | 0.161 | 0.193 | 0.108 | 0.122 | 0.102 | 0.133
1-4 | 0.105 | 0.128 | 0.105 | 0.142 | 0.096 | 0.135
1-5 | 0.162 | 0.193 | 0.185 | 0.251 | 0.153 | 0.228
1-6 | 0.145 | 0.168 | 0.155 | 0.186 | 0.198 | 0.265
1-7 | 0.125 | 0.149 | 0.172 | 0.256 | 0.182 | 0.236
2-3 | 0.154 | 0.195 | 0.196 | 0.235 | 0.205 | 0.221
2-4 | 0.112 | 0.158 | 0.093 | 0.131 | 0.087 | 0.132
2-5 | 0.151 | 0.186 | 0.189 | 0.215 | 0.191 | 0.238
2-6 | 0.179 | 0.203 | 0.205 | 0.218 | 0.212 | 0.256
2-7 | 0.184 | 0.216 | 0.195 | 0.232 | 0.196 | 0.245

Bearing | BiGRU MAE | BiGRU RMSE | BiLSTM MAE | BiLSTM RMSE | Proposed (ours) MAE | Proposed (ours) RMSE
1-3 | 0.089 | 0.108 | 0.079 | 0.085 | 0.089 | 0.105
1-4 | 0.095 | 0.115 | 0.084 | 0.103 | 0.058 | 0.075
1-5 | 0.128 | 0.156 | 0.106 | 0.132 | 0.072 | 0.084
1-6 | 0.102 | 0.139 | 0.133 | 0.195 | 0.085 | 0.103
1-7 | 0.106 | 0.122 | 0.085 | 0.105 | 0.048 | 0.059
2-3 | 0.152 | 0.187 | 0.126 | 0.156 | 0.065 | 0.075
2-4 | 0.128 | 0.153 | 0.066 | 0.085 | 0.097 | 0.109
2-5 | 0.151 | 0.208 | 0.132 | 0.166 | 0.080 | 0.098
2-6 | 0.165 | 0.212 | 0.092 | 0.108 | 0.092 | 0.102
2-7 | 0.159 | 0.195 | 0.132 | 0.176 | 0.112 | 0.119
Table 3. Experiment hyperparameters of the model.

Hyperparameters | Exp1 | Exp2 | Exp3 | Exp4 | Exp5 | Exp6
Epochs | 50 | 50 | 50 | 50 | 50 | 50
Batch size | 256 | 128 | 256 | 256 | 128 | 256
Optimizer | RMSprop | Adam | SGD | Adam | Adam | Adam
Learning rate | 10⁻³ | 10⁻³ | 10⁻³ | 10⁻³ | 10⁻³ | 10⁻⁴
Table 4. Ablation experiment of modules.

Model | Evaluated Metrics | Bearing 1-5 | Bearing 2-3 | Bearing 2-5
Base model | MAE | 0.128 | 0.121 | 0.133
Base model | RMSE | 0.145 | 0.133 | 0.177
Base model + EM-Net | MAE | 0.093 | 0.081 | 0.103
Base model + EM-Net | RMSE | 0.102 | 0.096 | 0.112
Ours (Base model + EM-Net + Multiscale) | MAE | 0.072 | 0.065 | 0.080
Ours (Base model + EM-Net + Multiscale) | RMSE | 0.084 | 0.075 | 0.098
Table 5. Comparison of prediction performance on Bearing 1-3.

Methods | MAE | RMSE
TCN + Hybrid Attention Mechanism | 0.095 | 0.109
Patch + PAS + Multiscale | 0.093 | 0.105
DMW-Trans | 0.099 | 0.137
MLP + Transformer | 0.111 | 0.140
Ours | 0.089 | 0.105
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Luo, X.; Wang, M. Bearing Lifespan Reliability Prediction Method Based on Multiscale Feature Extraction and Dual Attention Mechanism. Appl. Sci. 2025, 15, 3662. https://doi.org/10.3390/app15073662
