Article

RUL Prediction Based on xLSTM–Transformer Neural Network for Rolling Element Bearings Under Different Working Conditions

1 School of Mechanical Engineering, Guangxi University, Nanning 530004, China
2 Liuzhou Wuling Automobile Industry Co., Ltd., Liuzhou 545007, China
3 Guangxi Key Laboratory of Manufacturing System and Advanced Manufacturing Technology, Guangxi University, Nanning 530004, China
* Authors to whom correspondence should be addressed.
Sensors 2026, 26(5), 1578; https://doi.org/10.3390/s26051578
Submission received: 12 January 2026 / Revised: 10 February 2026 / Accepted: 25 February 2026 / Published: 3 March 2026
(This article belongs to the Special Issue Sensor-Based Fault Diagnosis and Prognosis)

Abstract

Remaining useful life (RUL) prediction of rolling bearings is a crucial issue in intelligent predictive maintenance, as it helps ensure equipment safety and reduce maintenance costs. To address the challenge that traditional deep learning models struggle to simultaneously capture local temporal features and global degradation trends when processing degradation health indicators (HI), this paper proposes a hybrid RUL prediction model based on extended Long Short-Term Memory (xLSTM) and Transformer networks. The model employs an encoder–decoder architecture that integrates the Multi-Head Attention mechanism with the xLSTM module, simultaneously enhancing the modeling of short-term dynamic features and the capture of long-term degradation patterns. Validation was conducted on the XJTU-SY and PHM2012 datasets. The proposed model outperformed the comparative models across evaluation metrics including Root Mean Square Error (RMSE), Coefficient of Determination (R²), and the Score, achieving a significant improvement in prediction accuracy and multi-dataset generalization. The proposed network provides a more accurate and generalizable solution for bearing health assessment and RUL prediction and demonstrates significant potential for intelligent health management of industrial equipment.

1. Introduction

As critical components in modern rotating machinery, rolling element bearings are widely utilized in various industrial fields. However, bearings are prone to failure due to fatigue and wear as operating time accumulates. Although such failures occur at the component level, they often lead to more severe equipment breakdowns [1]. According to statistics from the Institute of Electrical and Electronics Engineers (IEEE), bearing failures are the most common type of fault in low-voltage permanent magnet synchronous motors, accounting for approximately 41% of all failures [2]. Considering the operating environment of rolling bearings, frequent shutdowns for maintenance are not an optimal strategy. Therefore, it is meaningful to perform remaining useful life (RUL) prediction for bearings so that scheduled maintenance can be arranged before failure occurs, reducing unnecessary losses caused by equipment damage or production downtime.
RUL prediction of bearings aims to estimate the remaining operational time of a bearing at a future moment based on its current condition and historical information. Traditional physics-based methods for RUL prediction have been widely applied to bearing life estimation [3,4]. However, such models struggle to capture the complex nonlinear relationships involved in equipment degradation, and accurate modeling under complex operating conditions and across individual differences is difficult, which limits their application in real industrial scenarios.
With the widespread adoption of sensor technologies and the advent of the industrial big data era, data-driven methods have become increasingly prominent. Among them, deep learning-based approaches can leverage the vast amounts of data collected by sensors to uncover hidden features and nonlinear relationships in the degradation process. These methods have demonstrated enhanced accuracy and applicability, which have attracted much more attention in recent years.
Deep learning methodologies have evolved significantly in RUL prediction. Early research utilized Convolutional Neural Networks [5,6] to extract spatial features. Recurrent Neural Networks (RNNs), such as LSTM and GRU [7,8,9,10,11], are used to model temporal dependencies. However, these single-architecture models struggle to balance local feature extraction with long-term dependency modeling.
To address this problem, the Transformer network [12] introduces the self-attention mechanism, which enables global dependency modeling. Yet, the lack of a recurrent structure makes the Transformer less sensitive to local ordering in time-series data. Therefore, hybrid architectures have emerged as an effective solution. Recent research has explored hybrid architectures that synergize the global receptive field of Transformers with the local temporal sensitivity of RNNs. Lei et al. integrated hierarchical clustering with a GRU–Transformer architecture to mitigate feature redundancy [13]. Ali and Kamal enhanced the robustness of such hybrid models by employing a hybrid HHO-WHO meta-heuristic algorithm for hyperparameter optimization [14]. Jin et al. combined EMD with a BiLSTM–Transformer network to process non-stationary vibration signals [15].
Although these studies demonstrate superior predictive accuracy compared to single models, their architectural designs are predominantly characterized by simple network stacking. The prevalent strategy of sequentially connecting heterogeneous modules lacks deep internal integration or adaptive feature fusion mechanisms. Furthermore, the recurrent components in these hybrid architectures, such as standard LSTM and GRU, are fundamentally constrained by their sigmoid gating and scalar memory cells.
Overall, these constraints affect bearing RUL prediction performance in two ways. First, standard LSTM units rely on sigmoid gating, which compresses inputs into the range [0, 1]. In the final stage of bearing life, the HI typically exhibits an exponential rise with a wide dynamic range; the sigmoid function tends to saturate when processing these rapidly increasing values, losing sensitivity to the severity of degradation. Second, the degradation of a bearing is a long-term process. Standard LSTMs utilize scalar memory cells, which have limited capacity to encode the accumulation of damage over long historical windows. This restricts the ability of the RUL prediction model to capture the subtle trends occurring before the accelerated failure point.
To overcome these limitations, this study proposes an xLSTM–Transformer hybrid network. Unlike existing hybrid networks that simply stack standard modules, this study conceptually upgrades the recurrent unit by integrating xLSTM into the Transformer encoder. This integration introduces two key mechanisms absent in conventional hybrid networks. First, by replacing sigmoid gating with exponential gating, the xLSTM module prevents saturation. This is particularly crucial for tracking the accelerated degradation trend of the HI: the exponential mechanism allows the model to retain sensitivity to large values in the late failure stage, ensuring accurate RUL estimation when degradation accelerates sharply. Second, the xLSTM upgrades the memory from scalar form, an inherent limitation of standard LSTM, to matrix form. This upgrade provides superior capacity to model the degradation trajectory and enables the hybrid encoder to robustly memorize the long-term evolution of the HI sequence, effectively bridging the long temporal gap between early degradation signs and final failure.
The primary contributions of this work are in the following three points: (1) A novel hybrid architecture xLSTM–Transformer is proposed, which effectively fills the gap between global trend modeling and high-dynamic local feature extraction in HI sequences. (2) The saturation issue in the final degradation phase is addressed by the exponential gating of xLSTM, ensuring the prediction sensitivity in the late failure stage. (3) Extensive experiments on bearing datasets demonstrate that the proposed method significantly outperforms standard LSTM–Transformer and other baseline models in RUL prediction.
The rest of the paper is arranged as follows. Section 2 introduces the theoretical background and details the proposed xLSTM–Transformer hybrid RUL prediction model. Section 3 describes the experimental setup and datasets and presents the results and discussion. Finally, Section 4 provides the concluding remarks of this study.

2. Methods

2.1. ISOMAP

The Isometric Feature Mapping (ISOMAP) algorithm is a nonlinear dimensionality reduction method based on manifold learning [16]. This method replaces the Euclidean distance with the geodesic distance and applies it to the Multidimensional Scaling (MDS) framework to uncover the intrinsic low-dimensional structure within the data. The main computational steps of the ISOMAP algorithm are as follows:
(1)
Construct an adjacency graph G for all data points in the high-dimensional space, where the distance between nodes is defined by the Euclidean metric. Adjacency is defined as an ε-neighborhood or a k-nearest neighbor criterion.
(2)
The Dijkstra or Floyd–Warshall algorithm is employed to compute the shortest path between any pair of nodes x_i and x_j on the graph G. The shortest-path length is taken as the geodesic distance d_G(x_i, x_j) on the manifold, yielding the shortest-path matrix of the graph, D_G = {d_G(x_i, x_j)}.
(3)
The matrix D_G is used as the input to the classical MDS algorithm, as described below. Compute the squared distance matrix S with S_ij = D_ij^2. Next, apply double centering: τ(D) = −HSH/2, where H_ij = δ_ij − 1/N. Select the d largest eigenvalues λ_1, λ_2, …, λ_d of τ(D) and their corresponding eigenvectors v_1, v_2, …, v_d to form the matrix V = [v_1, v_2, …, v_d]. The low-dimensional embedding coordinates are then given by Y = diag(√λ_1, …, √λ_d)·V^T.
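The three steps above can be sketched compactly in Python. This is a minimal illustration using NumPy and SciPy; the function name `isomap` and its defaults are ours, not a library API:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(X, n_neighbors=10, n_components=1):
    # Step 1: k-NN adjacency graph with Euclidean edge weights
    D = cdist(X, X)
    N = len(X)
    G = np.full((N, N), np.inf)                      # inf = no edge
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(N):
        G[i, idx[i]] = D[i, idx[i]]
    G = np.minimum(G, G.T)                           # symmetrize the graph
    # Step 2: geodesic distances via Dijkstra shortest paths
    DG = shortest_path(G, method="D", directed=False)
    # Step 3: classical MDS on the geodesic distance matrix
    S = DG ** 2
    H = np.eye(N) - np.ones((N, N)) / N              # centering matrix
    tau = -H @ S @ H / 2                             # double centering
    vals, vecs = np.linalg.eigh(tau)
    order = np.argsort(vals)[::-1][:n_components]    # d largest eigenvalues
    lam, V = vals[order], vecs[:, order]
    return V * np.sqrt(np.maximum(lam, 0))           # Y = diag(sqrt(lam)) V^T
```

With `n_neighbors=10` and `n_components=1`, this matches the settings reported later in Section 3.2.2.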

2.2. xLSTM

The xLSTM is an improved version of the LSTM [17], which overcomes certain limitations of the standard LSTM and provides optimized performance. The xLSTM consists of an input layer, an LSTM layer, and an output layer. The input layer receives sequential data, which is processed and then fed into the LSTM layer. The LSTM layer comprises two types of LSTM modules: the scalar LSTM (sLSTM) and the matrix LSTM (mLSTM). After being processed by the LSTM layer, the data is then transmitted to the output layer. The structure of the xLSTM is illustrated in Figure 1.

2.2.1. sLSTM

The sLSTM enhances the standard LSTM by introducing exponential gating to prevent saturation and a stabilizer state that permits revisiting storage decisions. Given an input x_t at time step t, the sLSTM computes the exponential input gate i_t, forget gate f_t, and output gate o_t as follows:
i_t = exp(ĩ_t),  ĩ_t = w_i^T x_t + r_i h_{t−1} + b_i
f_t = exp(f̃_t),  f̃_t = w_f^T x_t + r_f h_{t−1} + b_f
o_t = σ(w_o^T x_t + r_o h_{t−1} + b_o)
Unlike the standard LSTM, the cell state c_t and the normalizer state n_t are updated via:
c_t = f_t c_{t−1} + i_t z_t
n_t = f_t n_{t−1} + i_t
where z_t = tanh(w_z^T x_t + r_z h_{t−1} + b_z) represents the input transformation.
Finally, the hidden state h_t is normalized to ensure stability:
h_t = o_t ⊙ (c_t / n_t)
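The sLSTM update above can be condensed into a single NumPy step. This is a hedged sketch with scalar hidden, cell, and normalizer states for clarity; the `params` layout is ours, not the paper's implementation:

```python
import numpy as np

def slstm_step(x_t, h_prev, c_prev, n_prev, params):
    """One sLSTM step (hedged sketch; scalar hidden/cell states for clarity)."""
    w_i, r_i, b_i = params["i"]
    w_f, r_f, b_f = params["f"]
    w_o, r_o, b_o = params["o"]
    w_z, r_z, b_z = params["z"]
    i_t = np.exp(w_i @ x_t + r_i * h_prev + b_i)   # exponential input gate
    f_t = np.exp(w_f @ x_t + r_f * h_prev + b_f)   # exponential forget gate
    o_t = 1.0 / (1.0 + np.exp(-(w_o @ x_t + r_o * h_prev + b_o)))  # sigmoid output gate
    z_t = np.tanh(w_z @ x_t + r_z * h_prev + b_z)  # input transformation
    c_t = f_t * c_prev + i_t * z_t                 # cell state update
    n_t = f_t * n_prev + i_t                       # normalizer state update
    h_t = o_t * (c_t / n_t)                        # normalized hidden state
    return h_t, c_t, n_t
```

Note how c_t/n_t is a normalized average of past z values, so the exponential gates can grow without the hidden state blowing up.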

2.2.2. mLSTM

The mLSTM replaces the scalar memory cell with a matrix memory cell C_t ∈ R^{d×d}, which enables the model to strictly store more information than the scalar cells of standard LSTMs. It also abandons memory mixing to achieve parallelizability.
First, the input x t is projected into query ( q t ), key ( k t ) and value ( v t ) vectors:
q_t = W_q x_t + b_q,  k_t = (1/√d) W_k x_t + b_k,  v_t = W_v x_t + b_v
The core innovation is the covariance update rule for the matrix memory cell C_t. Combined with the exponential gates i_t and f_t, the update rule is formulated as:
C_t = f_t C_{t−1} + i_t v_t k_t^T
where v_t k_t^T denotes the outer product of the value and key vectors, yielding a matrix update that encodes the key–value relationship.
To retrieve information, the model queries the memory matrix using q t :
h̃_t = C_t q_t
Finally, the output is obtained by normalizing the retrieved signal, where the normalizer state n_t = f_t n_{t−1} + i_t k_t accumulates the gated keys:
h_t = o_t ⊙ (h̃_t / max(|n_t^T q_t|, 1))
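Analogously, one mLSTM step can be sketched in NumPy. This is a hedged illustration: biases are omitted, the weight dictionary `W` is ours, and the normalizer-state update follows the xLSTM paper's formulation:

```python
import numpy as np

def mlstm_step(x_t, C_prev, n_prev, W, d):
    """One mLSTM step with matrix memory C_t (hedged sketch; biases omitted)."""
    q = W["q"] @ x_t                               # query projection
    k = (W["k"] @ x_t) / np.sqrt(d)                # key projection, scaled by 1/sqrt(d)
    v = W["v"] @ x_t                               # value projection
    i_t = np.exp(W["wi"] @ x_t)                    # exponential input gate (scalar)
    f_t = np.exp(W["wf"] @ x_t)                    # exponential forget gate (scalar)
    o_t = 1.0 / (1.0 + np.exp(-(W["wo"] @ x_t)))   # sigmoid output gate
    C_t = f_t * C_prev + i_t * np.outer(v, k)      # covariance update rule
    n_t = f_t * n_prev + i_t * k                   # normalizer state
    h_tilde = C_t @ q                              # memory retrieval by query
    h_t = o_t * h_tilde / max(abs(n_t @ q), 1.0)   # stabilized output
    return h_t, C_t, n_t
```

Because the update uses only the previous state and the current projections, steps over a sequence can be parallelized, unlike memory-mixing recurrences.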

2.3. Transformer

The Transformer is a neural network model based on the attention mechanism. By employing Multi-Head Attention and positional encoding, it overcomes the limitation that RNN cannot perform parallel computations. It effectively addresses the poor performance and low efficiency of RNN when processing long sequential data [18].
Figure 2 illustrates the structure of the Transformer network. The Transformer mainly consists of three components: the encoder, the decoder, and positional encoding. Both the encoder and decoder are stacks of multiple identical layers. Each encoder layer contains two sub-layers: a Multi-Head Attention mechanism and a feedforward neural network. Each decoder layer contains three sub-layers: a Masked Multi-Head Attention mechanism, an encoder–decoder Multi-Head Attention mechanism, and a feedforward neural network. The encoder processes the input sequence to generate an intermediate representation containing the sequence information, which the decoder then decodes to generate the output sequence. Positional encoding adds positional information to the input data, addressing the issue that attention mechanisms cannot by themselves learn positional information in temporal sequences.
The Transformer network is built upon the attention mechanism, utilizing self-attention to capture the most relevant information within the input itself, thereby reducing the network’s reliance on external information. The introduction of the attention mechanism enables the dynamic generation of a unique context vector at each step of the decoding process. This mechanism allows the decoder to directly access all the output states of the encoder and automatically learn to assign attention to the parts of the input sequence most relevant to the current output through a differentiable weighted sum process. The Multi-Head Attention mechanism is illustrated in Figure 3.
The computation process of the Multi-Head Attention mechanism is as follows. Let the input sequence be denoted as X ∈ R^{B×F×E}, where B is the batch size, F is the input sequence length, and T is the maximum length of the output sequence. The number of attention heads is N, the output dimensionality of each head is H, and the embedding dimension satisfies E = N·H. The weight matrices in each attention head satisfy W^Q ∈ R^{E×H}, W^K ∈ R^{E×H}, and W^V ∈ R^{E×H}.
The linear transformations of the input data in the Multi-Head Attention mechanism are given by:
Q = linear(X) = X·W^Q,  (W^Q ∈ R^{E×H}, Q ∈ R^{B×F×H})
K = linear(X) = X·W^K,  (W^K ∈ R^{E×H}, K ∈ R^{B×F×H})
V = linear(X) = X·W^V,  (W^V ∈ R^{E×H}, V ∈ R^{B×F×H})
The scaled dot-product attention is defined as:
out_n = Attention(Q, K, V) = softmax(QK^T / √H)·V,  n = 1, 2, …, N
where the softmax function maps a vector (v_1, v_2, …, v_l) to (y_1, y_2, …, y_l) with
y_i = e^{v_i} / Σ_{j=1}^{l} e^{v_j},  so that Σ_{i=1}^{l} y_i = 1
After the attention-layer computation, the outputs {out_1, out_2, …, out_N} are obtained. These outputs are then concatenated as:
out_Concat = Concat(out_1, out_2, …, out_N)
After a linear transformation, the final output of the Multi-Head Attention mechanism is given by:
MultiHead(Q, K, V) = out_Concat·W^O,  (W^O ∈ R^{E×E})
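The multi-head computation above can be sketched in NumPy for a single sequence. This is a hedged illustration: the weight matrices are simplified to full E×E arrays sliced per head, and all names are ours:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-Head Attention for one sequence X of shape (F, E), with E = n_heads * H."""
    F, E = X.shape
    H = E // n_heads
    outs = []
    for n in range(n_heads):
        s = slice(n * H, (n + 1) * H)          # this head's weight columns
        Q, K, V = X @ Wq[:, s], X @ Wk[:, s], X @ Wv[:, s]
        A = softmax(Q @ K.T / np.sqrt(H))      # scaled dot-product attention
        outs.append(A @ V)                     # out_n for head n
    concat = np.concatenate(outs, axis=-1)     # Concat(out_1, ..., out_N)
    return concat @ Wo                         # final linear projection with W^O
```

Each row of A sums to 1, so every output position is a convex combination of the value vectors, weighted by query–key similarity.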

2.4. The Structure of the Proposed xLSTM-Transformer

From the above analysis, owing to the multi-branch gating mechanisms of sLSTM and mLSTM, xLSTM is capable of capturing complex nonlinear relationships between the input and the hidden state within a single time step. All these characteristics make it more suitable for bearing RUL prediction tasks. Moreover, its recurrent structure enables the xLSTM to achieve stable convergence even when the available training samples are limited. However, because of its recursive structure, xLSTM suffers from information decay in long sequences, resulting in difficulty capturing global degradation trends.
Through its Multi-Head Attention mechanism, Transformer can simultaneously attend to interactions across multiple time steps within a single forward pass, thereby effectively capturing the global degradation trends throughout the entire degradation process. However, unlike recursive models, Transformer does not inherently encode temporal order. It will easily lose sequential consistency in small-sample scenarios, which could result in oscillations or instability in the predicted curves. Furthermore, since the attention mechanism is a weighted mechanism, local short-term fluctuations may be averaged out, leading to the loss of fine-grained temporal features.
Therefore, this study proposes an xLSTM–Transformer hybrid network, which aims to achieve a synergistic integration of local temporal prediction and global trend modeling. In this hybrid network, the xLSTM dynamic gating structure is incorporated into the encoder layers of the standard Transformer to enable a synergistic prediction of local temporal features and global dependencies.
In the encoder of the network, the Multi-Head Attention layer first captures global dependencies across long time steps. To further enhance short-term nonlinear dynamics and local predictive capability, an xLSTM module is embedded as the nonlinear transition block, positioned between the Multi-Head Attention layer and the subsequent residual connection and layer normalization stage. In this hybrid encoder architecture, an input sequence of length L with feature dimension D, denoted as X_in ∈ R^{L×D}, is first fed into the Multi-Head Attention layer. The global degradation features are extracted through the attention mechanism, which is formulated as:
X_Att = Attention(X_in, X_in, X_in),  X_Att ∈ R^{L×D}
To ensure feature integration and dimension alignment, the xLSTM module employs an internal projection and expansion strategy:
  • Up-projection: The attention output X A t t is first mapped into a high-dimensional feature space via a linear projection layer, where p denotes the expansion factor.
  • Nonlinear processing: The exponential gating mechanism of the xLSTM is utilized to perform nonlinear degradation modeling on these high-dimensional features.
  • Down-projection: The processed features are then projected back to the original dimension D through a second linear layer.
This process yields the final output of the xLSTM layer, X x L S T M , as shown in Equations (16)–(18):
X_up = Linear_up(X_Att),  X_up ∈ R^{L×(p·D)}
X_p = xLSTM(X_up),  X_p ∈ R^{L×(p·D)}
X_xLSTM = Linear_down(X_p),  X_xLSTM ∈ R^{L×D}
This process is realized through the internal symmetric projection architecture of the xLSTM, where the expansion scale is determined by the projection factor p. Such a design not only enhances the model’s capacity to capture complex local dynamics but also ensures dimensional alignment between the xLSTM output and the residual paths within the Transformer encoder. Consequently, this architecture effectively balances the strengths of the Transformer in long-range prediction with the advantages of xLSTM in local temporal forecasting.
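As a structural illustration, one such hybrid encoder layer can be sketched in PyTorch, the framework used later in Section 3. This is a hedged sketch, not the paper's implementation: PyTorch ships no xLSTM module, so an `nn.LSTM` stands in for the xLSTM block, and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class HybridEncoderLayer(nn.Module):
    """Sketch of one encoder layer: attention -> up-projection ->
    recurrent nonlinear block -> down-projection -> residual + norm.
    nn.LSTM stands in for the xLSTM block, which PyTorch does not provide."""
    def __init__(self, d_model=32, n_heads=4, p=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.up = nn.Linear(d_model, p * d_model)        # up-projection, Eq. (16)
        self.rnn = nn.LSTM(p * d_model, p * d_model,
                           batch_first=True)             # xLSTM stand-in, Eq. (17)
        self.down = nn.Linear(p * d_model, d_model)      # down-projection, Eq. (18)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, L, D)
        a, _ = self.attn(x, x, x)         # global dependencies via attention
        x = self.norm1(x + a)             # residual connection + normalization
        z, _ = self.rnn(self.up(x))       # local nonlinear dynamics in expanded space
        return self.norm2(x + self.down(z))
```

The symmetric up/down projections keep the block's output at dimension D, so it drops into the Transformer residual path without any shape adaptation.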
The decoder of the network shares the same input embedding as the encoder, with the purpose of reconstructing and reintegrating the encoded features rather than performing autoregressive prediction. Compared to the original Transformer network, the proposed network incorporates an additional normalization layer. With the combined effect of multi-layer residual connections and normalization, the model exhibits improved gradient stability and enhanced generalization capability. Finally, the decoder results are flattened and fed into a linear layer to produce the predicted RUL of the equipment.
In summary, compared to the standard Transformer, the proposed xLSTM–Transformer hybrid network integrates xLSTM submodules to achieve precise prediction of both local dynamics and global trends. This enhances the network’s capability to characterize the degradation process, thereby improving the accuracy and stability of RUL prediction. The structure of the proposed xLSTM–Transformer hybrid network is illustrated in Figure 4.
The workflow for bearing RUL prediction in this study is illustrated in Figure 5, and the detailed procedure is as follows:
Step 1 Feature extraction. Based on existing expert knowledge, the raw vibration signals of the rolling bearings are analyzed in the time, frequency, and time–frequency domains to compute commonly used statistical features for RUL prediction. Evaluation metrics are used to conduct quantitative analysis of the strengths and weaknesses of each feature. Features that meet predefined thresholds are selected, ultimately forming the feature set used to train the bearing RUL prediction neural network.
Step 2 Neural network construction. Based on the conventional Transformer network, xLSTM layers are embedded within the encoder layers of Transformer to construct the proposed xLSTM–Transformer hybrid neural network.
Step 3 Test set validation. After splitting the feature set into training and test subsets, the data are fed into the neural network. The bearing RUL prediction performance of the network is evaluated by comparing the prediction effectiveness of various networks on the test set.

3. Experimental Validation

3.1. Dataset Description

To verify the superiority of the proposed xLSTM–Transformer hybrid network, this study conducts experimental validation using the following two datasets.

3.1.1. XJTU-SY Dataset

The XJTU-SY dataset is an open-access accelerated life test dataset for rolling bearings provided by Xi'an Jiaotong University [19]. The experimental platform is shown in Figure 6. The dataset contains complete vibration data of 15 bearings under three operating conditions, spanning from healthy operation to failure. The data were sampled at 25.6 kHz, with one sample recorded per minute and each sample lasting 1.28 s. The horizontal vibration signals from this dataset were selected for subsequent experiments. The usage details of the XJTU-SY dataset in this study are summarized in Table 1.

3.1.2. PHM2012 Dataset

The PHM2012 dataset is a publicly available dataset released by the IEEE [20]. It was collected using the PRONOSTIA experimental platform, which is shown in Figure 7. The experimental platform conducted accelerated degradation tests on 17 rolling bearings under three different operating conditions. Accelerometers were used to collect the full-life raw vibration signals of all 17 bearings in both horizontal and vertical directions. The data were sampled at 25.6 kHz, with one sample recorded every 10 s, and each sample lasting 0.1 s. The horizontal vibration signals from this dataset were selected for subsequent experiments. The usage details of the PHM2012 dataset in this study are summarized in Table 2.

3.2. Health Index Construction

3.2.1. Feature Extraction

The raw vibration signals generated during the operation of rolling bearings can reflect the health condition of the bearings to some extent. However, the raw vibration signal data are large in volume and contain noise interference, making them unsuitable for direct use in neural network training. By exploring the latent information within the raw vibration signals, features that effectively characterize the health condition of rolling bearings can be identified. In particular, when a bearing develops a fault, the waveform, frequency components, and energy at each frequency of the vibration signals undergo changes. Therefore, it is crucial to extract time-domain, frequency-domain, and time–frequency features from the raw vibration signals using feature extraction techniques.
In this study, 17 time-domain features are selected, including the maximum, minimum, median, peak-to-peak value, mean of absolute values, variance, standard deviation, kurtosis, skewness, root mean square (RMS), mean square value, square root amplitude, waveform factor, peak factor, impulse factor, margin factor, and others; let p_i (i = 1, 2, …, 17) denote these time-domain features. Five frequency-domain features are selected, including the spectral centroid, mean square frequency, root mean square frequency, and frequency variance; let p_i (i = 18, 19, …, 22) denote these frequency-domain features. Four spectral kurtosis-related features are selected, namely the mean, standard deviation, skewness, and kurtosis of the spectrum; let p_i (i = 23, 24, 25, 26) denote these spectral kurtosis features. Eight time–frequency features are selected, namely the energy proportions of the eight nodes at the third level of wavelet packet decomposition; let p_i (i = 27, 28, …, 34) denote these time–frequency features.

3.2.2. Feature Selection and Fusion

Not all the features obtained from time-domain, frequency-domain, and time–frequency analyses can effectively reflect the degradation state of bearings. To prevent low-performance features from adversely affecting the subsequent prediction accuracy, it is necessary to establish performance metrics for the features and perform feature selection. The performance of a bearing is an irreversible process that continuously degrades over time. Therefore, degradation features are required to exhibit strong temporal correlation and monotonicity. In addition, features need to be resistant to random noise; they should also exhibit strong robustness.
To evaluate the candidate features, each feature is decomposed into a stationary trend and a random residual using the moving-average method formulated in Equation (19). Let the time length of the feature parameter be k. The time series can be represented as T = [t_1, t_2, …, t_k], and the feature parameter sequence as X = [x(t_1), x(t_2), …, x(t_k)], where X_T represents the stationary trend of the feature parameter sequence and X_R represents its random component.
X = X T + X R
Let the correlation metric be denoted as Corr, the monotonicity metric as Mon, and the robustness metric as Rob. The calculation formulas for these three performance metrics are given as follows:
Correlation metric:
Corr(X, T) = |k Σ_{i=1}^{k} x(t_i) t_i − Σ_{i=1}^{k} x(t_i) Σ_{i=1}^{k} t_i| / √{[k Σ_{i=1}^{k} x(t_i)^2 − (Σ_{i=1}^{k} x(t_i))^2][k Σ_{i=1}^{k} t_i^2 − (Σ_{i=1}^{k} t_i)^2]}
Monotonicity metric:
Mon(X) = (1/(k−1)) |Σ_{i=1}^{k−1} sgn(x(t_{i+1}) − x(t_i))|
Robustness metric:
Rob(X) = (1/k) Σ_{i=1}^{k} exp(−|X_R(t_i) / X(t_i)|)
The correlation, monotonicity, and robustness metrics of each feature are calculated through Equations (20)–(22). These three evaluation metrics are then combined by a weighted approach, formulated as Equation (23), to obtain the degradation evaluation metric, T o , for the bearing features.
T_o = ω_1 Corr(X, T) + ω_2 Mon(X) + ω_3 Rob(X),  ω_i > 0,  Σ_{i=1}^{3} ω_i = 1
In the equation, ω_1, ω_2, and ω_3 are the weighting coefficients. According to Reference [21], the three weights are set to 0.2, 0.5, and 0.3, respectively. A higher value of the evaluation metric T_o indicates that the feature better represents bearing degradation. The evaluation metrics T_o are ranked in descending order, and features with T_o > 0.6 are selected to form the feature set for training the RUL prediction model.
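The feature evaluation pipeline of Equations (19)-(23) can be sketched as follows. This is a hedged NumPy illustration; the moving-average window length is an assumption, as the paper does not state it:

```python
import numpy as np

def feature_score(x, window=5, weights=(0.2, 0.5, 0.3)):
    """Weighted degradation score T_o of one feature sequence x (hedged sketch)."""
    k = len(x)
    t = np.arange(k, dtype=float)
    # Eq. (19): trend + residual decomposition via a moving average
    kernel = np.ones(window) / window
    trend = np.convolve(x, kernel, mode="same")
    resid = x - trend
    # Eq. (20): magnitude of the Pearson correlation with time
    corr = abs(np.corrcoef(x, t)[0, 1])
    # Eq. (21): net fraction of increasing steps (monotonicity)
    mon = abs(np.sum(np.sign(np.diff(x)))) / (k - 1)
    # Eq. (22): robustness, penalizing large relative residuals
    rob = np.mean(np.exp(-np.abs(resid / x)))
    w1, w2, w3 = weights
    return w1 * corr + w2 * mon + w3 * rob   # Eq. (23): weighted score T_o
```

A perfectly monotone, noise-free feature scores close to 1; features below the 0.6 threshold would be discarded.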
The selected feature set is subjected to dimensionality reduction through the ISOMAP algorithm. The number of neighbors for each data point is set to 10, and the target dimensionality is set to one.
The label column used as input to the network in this study adopts a piecewise labeling scheme. Using the one-dimensional HI obtained via ISOMAP dimensionality reduction as a reference, the first 20% of the HI values are considered as the normal operation stage of the bearing. The degradation starting point of the bearing is then determined based on the 3σ criterion, and a piecewise bearing degradation label is constructed accordingly.
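The piecewise labeling described above can be sketched as follows. This is a hedged illustration: the exact onset rule and the linear label scaling are our reading of the 3σ criterion, with the first 20% of the HI taken as the healthy baseline:

```python
import numpy as np

def piecewise_rul_labels(hi):
    """Piecewise RUL labels from a 1-D health indicator (hedged sketch):
    labels stay at 1 before the 3-sigma onset and fall linearly to 0 at failure."""
    k = len(hi)
    base = hi[: int(0.2 * k)]                 # first 20% = normal operation stage
    mu, sigma = base.mean(), base.std()
    outside = np.abs(hi - mu) > 3 * sigma     # 3-sigma criterion
    onset = int(np.argmax(outside)) if outside.any() else 0
    labels = np.ones(k)
    labels[onset:] = np.linspace(1.0, 0.0, k - onset)
    return labels, onset
```

A usage example: for a HI that is flat noise for the first half and then rises, the detected onset falls at the start of the rise and the label decays linearly from there.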

3.3. Bearing RUL Prediction Results

In the proposed xLSTM–Transformer fusion network, the xLSTM layer is composed of a stacked structure consisting of one mLSTM layer and one sLSTM layer. The detailed hyperparameter configurations are summarized in Table 3. Considering training efficiency and the length of the input data, the batch size is set to 32 for sequences shorter than 500 and increased to 64 for those exceeding 500. To ensure a fair comparison, the hidden layer dimensions were set within the range of 16 to 32, keeping the number of trainable parameters of all models at a comparable level.
Both the proposed method and the comparative methods are implemented in the PyTorch 2.7.1 framework. The training and testing environment is equipped with an AMD Ryzen 7 9700X CPU, an NVIDIA GeForce RTX 5070 GPU, and 32 GB of RAM.
To evaluate the performance of the RUL prediction models, the RMSE and R² are employed as evaluation metrics. A lower RMSE indicates better model performance, while a higher R² indicates better performance. The calculation formulas for these two metrics are given in Equations (24) and (25).
RMSE = √((1/k) Σ_{i=1}^{k} (y_i − p_i)^2)
R² = 1 − Σ_{i=1}^{k} (y_i − p_i)^2 / Σ_{i=1}^{k} (y_i − ȳ)^2
where y_i denotes the true value, p_i denotes the predicted value, ȳ denotes the mean of the true values, and k denotes the length of the bearing life sequence.
To further evaluate the reliability of the proposed method, a Score function, which is defined in the IEEE PHM 2012 Prognostic challenge, was employed [20]. Unlike symmetric metrics such as RMSE, this Score function imposes different penalty coefficients for early and late predictions, which aligns with industrial maintenance requirements where late predictions carry higher risks. A lower value of the Score indicates better model performance. The Score function is formulated as Equations (26)–(28).
Error_i = ŷ_i − y_i
s_i = exp(−Error_i / 13) − 1, if Error_i < 0;  exp(Error_i / 10) − 1, if Error_i ≥ 0
Score = Σ_{i=1}^{n} s_i
where y ^ i denotes the predicted value, y i denotes the true value, n denotes the total number of samples.
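The evaluation metrics can be implemented directly from Equations (24)-(28). This is a minimal NumPy sketch; the function names are ours:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Square Error, Eq. (24)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    # Coefficient of Determination, Eq. (25)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def phm_score(y_true, y_pred):
    # PHM2012 Score, Eqs. (26)-(28); signs chosen so both branches
    # yield non-negative penalties that grow with the error magnitude
    err = y_pred - y_true
    s = np.where(err < 0,
                 np.exp(-err / 13.0) - 1.0,   # early prediction, milder penalty
                 np.exp(err / 10.0) - 1.0)    # late prediction, heavier penalty
    return np.sum(s)
```

The divisor 10 on the late-prediction branch makes the penalty grow faster than the divisor 13 on the early branch, encoding the asymmetric risk described above.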

3.3.1. Case 1: Validation on the XJTU-SY Dataset

To validate the effectiveness and superiority of the proposed RUL prediction method, experiments are first conducted using the XJTU-SY dataset. The usage details of the dataset are presented in Table 1. To improve prediction accuracy and increase the dataset size while preventing overfitting due to the complexity of the proposed fusion network, the input feature set of the network includes not only the HI obtained via the ISOMAP method described above but also a time index column.
Six networks—LSTM, xLSTM, Transformer, LSTM–Transformer, BiLSTM–Transformer, and GRU–Transformer—are selected for comparative experiments. The prediction results are shown in Figure 8.
As shown in Figure 8, the RUL prediction curves of the seven neural networks can all reflect the degradation trend of the bearing to some extent, but noticeable differences exist in prediction accuracy, smoothness, and degradation responsiveness.
The LSTM prediction results are able to closely track the actual RUL changes during the early degradation stage, exhibiting smooth and stable fitting. However, during the mid-degradation stage, the prediction results decline too rapidly, and in the late degradation stage, the predictions exhibit a noticeable lag, failing to accurately reflect the actual RUL decrease. This indicates that LSTM has limitations in capturing nonlinear degradation features and long-term dependencies.
xLSTM demonstrates good smoothness and stability in capturing the overall trend, but it exhibits instances of both underestimation and overestimation of the RUL. Although xLSTM improves the gating mechanism and enhances feature memory capability, it tends to overestimate the RUL in the late degradation stage, and the predicted RUL still decreases more slowly than the actual decline.
The Transformer is capable of effectively learning global temporal dependency features, and its prediction curve aligns well with the overall trend of the actual RUL. However, due to the high sensitivity of the attention mechanism, the prediction curve exhibits an upward trend during the early degradation stage. The LSTM–Transformer combined network integrates the smoothness of LSTMs with the global feature extraction capability of the Transformer, but its performance improvement is limited due to the inherent constraints of LSTMs.
Compared with the aforementioned networks, the proposed xLSTM–Transformer network achieves the best prediction results. The network produces a smooth and stable prediction curve during the early degradation stage. It can accurately track the degradation onset and trend in the mid-degradation stage and sensitively capture the actual RUL decline rate in the late degradation stage. As a result, the prediction curve closely fits the actual RUL overall. This indicates that the synergistic effect of the xLSTM’s improved gating structure and the Transformer’s self-attention mechanism enables the combined network to achieve a good balance between smoothness and accuracy, thereby enhancing the reliability of RUL prediction.
To further evaluate the performance of the networks in RUL prediction, a quantitative comparison of the seven aforementioned networks is conducted. The evaluation metric results under Condition 1, Condition 2, and Condition 3 are presented in Table 4.
As the results in Table 4 demonstrate, the performance of the individual models remains inconsistent across conditions. For instance, xLSTM outperforms the Transformer in Conditions 1 and 2, whereas the Transformer slightly surpasses xLSTM in Condition 3. This reflects the inherent limitations of single-architecture models. In contrast, the hybrid architectures yield a significant improvement in prediction accuracy; in particular, the proposed method consistently outperforms all single models across the metrics. This demonstrates that the hybrid approach achieves a synergistic integration of global trends and local dynamics, effectively compensating for the shortcomings of the individual models.
More importantly, a comparison between the proposed method and the most closely related hybrid architecture, LSTM–Transformer, shows that the proposed method performs better. This superiority is attributed to the exponential gating and matrix memory structures of xLSTM, which provide a more robust nonlinear fitting capability and a broader feature representation space than the scalar memory of a standard LSTM. These architectural advantages allow a more complete retention of the deep representations extracted by the Multi-Head Attention mechanism. Furthermore, the proposed method achieves the best overall Score, indicating that it not only improves prediction accuracy but also effectively reduces the risk of late predictions, thereby offering greater reliability for industrial maintenance safety.
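To make the contrast with scalar memory concrete, the mLSTM variant of xLSTM stores a matrix cell state updated by an outer product of value and key (after Beck et al. [17]). The sketch below uses fixed scalar gate values and omits the stabilised exponential gating; it illustrates the idea, not the authors' implementation:

```python
import numpy as np

def mlstm_memory_step(C_prev, n_prev, k, v, q, f=0.9, i=1.0):
    """One mLSTM matrix-memory update: the d x d cell state C can hold
    far richer information than the scalar cell of a standard LSTM."""
    C = f * C_prev + i * np.outer(v, k)       # matrix cell state update
    n = f * n_prev + i * k                    # normaliser state update
    h = C @ q / max(abs(float(n @ q)), 1.0)   # normalised retrieval
    return C, n, h
```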

3.3.2. Case 2: Validation on the PHM2012 Dataset

To evaluate the generalizability of the proposed method and further validate its effectiveness, experiments are also conducted on the PHM2012 dataset. The usage details of the dataset are shown in Table 2. The experimental environment and procedures are the same as those described in the XJTU-SY experiments. The prediction results are shown in Figure 9.
From Figure 9, the overall prediction trends of the networks are consistent with those observed in the XJTU-SY experiments. LSTM still exhibits significant lag in the late degradation stage, indicating its limited capability in predicting complex nonlinear degradation processes. xLSTM shows certain improvements over LSTM, but issues of both underestimation and overestimation of the RUL still exist. The prediction results of the Transformer closely follow the overall trend of the actual RUL; however, fluctuations are still observed during the early degradation stage. The incorporation of LSTMs improves the prediction accuracy of the Transformer; however, lag is observed during the mid-degradation stage.
In contrast, the proposed xLSTM–Transformer network demonstrates good consistency with the actual RUL in both the stable and degradation stages. Overestimation or lag observed in the previously mentioned networks is mitigated, resulting in a prediction curve that fits the actual RUL more closely.
The quantitative analysis results of the comparative experiments under Condition 1, Condition 2, and Condition 3 are presented in Table 5.
As observed from Table 5, the proposed method consistently achieves the best prediction performance across different working conditions. Compared with the other networks, it exhibits lower errors and higher fitting accuracy.
Overall, based on the analyses of the experiments presented in the above sections, the proposed method maintains high prediction accuracy and trend consistency across different datasets and operating conditions. This demonstrates the strong applicability and robust cross-scenario adaptability of the proposed method for bearing RUL prediction.

3.3.3. Discussion

This study achieves high-precision RUL prediction on two distinct datasets, XJTU-SY and PHM2012, demonstrating the strong generalization capability of the proposed method. This success is mainly attributed to the ability of the hybrid xLSTM–Transformer architecture to capture both short-term local variations and long-term global trends.
Specifically, the bearing degradation process includes both irreversible long-term wear trends and local intense fluctuations. Compared to the scalar memory of standard LSTMs, the matrix memory introduced by xLSTM offers a larger feature storage capacity. This allows for the retention of richer fluctuation information. Combined with the exponential gating mechanism, the model effectively avoids the numerical compression issue in long-sequence inputs, thereby accurately capturing local degradation details even in highly dynamic datasets like PHM2012. Furthermore, the Transformer’s self-attention mechanism excels at modeling long-range dependencies, ensuring the prediction curve globally adheres to the bearing’s life decay trajectory, and correcting the potential local overfitting risks of the xLSTM. This complementary fusion of “local-global” features enables the model to exhibit strong representation capabilities across different operating conditions and degradation stages.
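The numerical compression point can be illustrated with the log-space stabilisation used by xLSTM's exponential gates (after Beck et al. [17]): a running maximum keeps the exponentials finite over long sequences. The function below is an illustrative sketch, not the authors' implementation:

```python
import math

def stabilized_gates(log_f, log_i, m_prev):
    """Stabilised exponential gating: gate pre-activations stay in log
    space, and the running maximum m_t prevents overflow of exp()."""
    m_t = max(log_f + m_prev, log_i)        # stabiliser state
    f = math.exp(log_f + m_prev - m_t)      # stabilised forget gate
    i = math.exp(log_i - m_t)               # stabilised input gate
    return f, i, m_t
```

Even with pre-activations in the hundreds, both gate values remain finite and bounded by 1.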
Despite the overall excellent performance, there remains room for improvement under specific complex conditions. Notably, in Condition 2 of the PHM2012 dataset, the goodness of fit (R2) for all models showed a marked decline. Although the proposed method achieved the best result in this group, a significant gap remains compared to Condition 1. This indicates that the robustness of current feature extraction and model structure still faces challenges when encountering extremely atypical degradation patterns. Additionally, while the proposed method improves accuracy, its complex stacked structure and Multi-Head Attention mechanism increase computational overhead.
To address these limitations, future work will focus on two main aspects: First, enhancing model robustness under extreme conditions. To overcome the performance degradation observed in Condition 2 of the PHM2012 dataset, future research aims to introduce data augmentation strategies, such as injecting random noise during training to simulate harsher operating environments. Additionally, transfer learning could be employed, utilizing models trained under standard conditions as a baseline and fine-tuning them with a small number of samples to adapt to extreme conditions. Second, research on model lightweighting. To meet the real-time requirements of industrial sites, the model needs to be optimized for efficiency. Future work will explore model pruning techniques to remove unimportant weights or neurons from the neural network. The goal is to appropriately reduce the parameter count and computational load without significantly compromising prediction accuracy.
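As a starting point for the lightweighting direction, global magnitude pruning can be sketched as below; the 50% sparsity level is illustrative, not a recommendation from the paper:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of a weight matrix."""
    flat = np.abs(weights).ravel()
    n_prune = int(sparsity * flat.size)
    if n_prune == 0:
        return weights.copy()
    # Threshold at the n_prune-th smallest magnitude.
    threshold = np.partition(flat, n_prune - 1)[n_prune - 1]
    mask = np.abs(weights) > threshold
    return weights * mask
```

In practice, pruning is usually followed by a short fine-tuning phase to recover any lost accuracy.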

4. Conclusions

This paper proposes a hybrid network combining the xLSTM and Transformer architectures to improve the prediction accuracy of bearing RUL. The network integrates the local temporal dependency modeling capability of the xLSTM with the global attention mechanism of the Transformer. By leveraging enhanced deep temporal modeling ability, it accurately characterizes the degradation process and enhances the network’s generalization capability.
To validate the effectiveness and stability of the proposed network, comparative experiments were conducted under different operating conditions on two distinct datasets, with the LSTM, xLSTM, Transformer, LSTM–Transformer, BiLSTM–Transformer, and GRU–Transformer networks selected as benchmarks. The results indicate that the proposed method accurately fits the degradation trend and achieves the best performance on the RMSE, R2, and Score metrics. Across both datasets, the proposed method maintains high prediction accuracy, demonstrating superior generalization performance and robustness.
In summary, the proposed xLSTM–Transformer network demonstrates excellent prediction stability, generalization capability, and practical value across different datasets. It shows strong potential for engineering applications, providing a novel approach for time-series-based remaining useful life prediction in equipment health management.

Author Contributions

Conceptualization, W.H. and M.X.; methodology, R.J. and M.X.; validation, R.J., Z.L. and H.L.; formal analysis, R.J.; investigation, R.J. and W.M.; writing—original draft preparation, R.J.; writing—review and editing, R.J., Z.L., H.L., W.M., W.H. and M.X.; supervision, W.H. and M.X.; funding acquisition, W.H. and M.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Guangxi Science and Technology Major Program (Grant No. AA24206039), the Central Government Guidance Fund for Local Science and Technology Development (Grant No. ZY24212060), the Guangxi Natural Science Foundation (Grant Nos. 2024GXNSFBA010142, 2025GXNSFAA069184) and Project for Enhancing Young and Middle-aged Teacher’s Research Basis Ability in Colleges of Guangxi (Grant No. 2024KY0013).

Data Availability Statement

Publicly available datasets were analyzed in this study. The XJTU-SY dataset can be found at http://biaowang.tech/xjtu-sy-bearing-datasets/, accessed on 24 February 2026, and the PHM2012 dataset is available from the IEEE PHM 2012 Prognostic Challenge website. The processed data and code generated during the study are available on request from the corresponding author.

Conflicts of Interest

Author Weizhong Mo was employed by the company Liuzhou Wuling Automobile Industry Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Peng, B.; Bi, Y.; Xue, B.; Zhang, M.; Wan, S. A Survey on Fault Diagnosis of Rolling Bearings. Algorithms 2022, 15, 347. [Google Scholar] [CrossRef]
  2. He, J.; Somogyi, C.; Strandt, A.; Demerdash, N.A.O. Diagnosis of Stator Winding Short-Circuit Faults in an Interior Permanent Magnet Synchronous Machine. In Proceedings of the 2014 IEEE Energy Conversion Congress and Exposition (ECCE), Pittsburgh, PA, USA, 14–18 September 2014; IEEE: New York, NY, USA, 2014. [Google Scholar]
  3. Ji, M.; Smith, A.; Soghrati, S. A Micromechanical Finite Element Model for Predicting the Fatigue Life of Heterogenous Adhesives. Comput. Mech. 2022, 69, 997–1020. [Google Scholar] [CrossRef]
  4. Wang, H.; Liu, X.; Wang, X.; Wang, Y. Numerical Method for Estimating Fatigue Crack Initiation Size Using Elastic–Plastic Fracture Mechanics Method. Appl. Math. Model. 2019, 73, 365–377. [Google Scholar] [CrossRef]
  5. Yang, B.; Liu, R.; Zio, E. Remaining Useful Life Prediction Based on a Double-Convolutional Neural Network Architecture. IEEE Trans. Ind. Electron. 2019, 66, 9521–9530. [Google Scholar] [CrossRef]
  6. Qiu, H.; Niu, Y.; Shang, J.; Gao, L.; Xu, D. A Piecewise Method for Bearing Remaining Useful Life Estimation Using Temporal Convolutional Networks. J. Manuf. Syst. 2023, 68, 227–241. [Google Scholar] [CrossRef]
  7. Guo, L.; Li, N.; Jia, F.; Lei, Y.; Lin, J. A Recurrent Neural Network Based Health Indicator for Remaining Useful Life Prediction of Bearings. Neurocomputing 2017, 240, 98–109. [Google Scholar] [CrossRef]
  8. Ma, M.; Mao, Z. Deep Convolution-Based LSTM Network for Remaining Useful Life Prediction. IEEE Trans. Ind. Inform. 2020, 17, 1658–1667. [Google Scholar] [CrossRef]
  9. Li, J.; Li, X.; He, D. A Directed Acyclic Graph Network Combined with CNN and LSTM for Remaining Useful Life Prediction. IEEE Access 2019, 7, 75464–75475. [Google Scholar] [CrossRef]
  10. Zhang, H.; Zhang, Q.; Shao, S.; Niu, T.; Yang, X. Attention-Based LSTM Network for Rotatory Machine Remaining Useful Life Prediction. IEEE Access 2020, 8, 132188–132199. [Google Scholar] [CrossRef]
  11. Zhou, J.; Qin, Y.; Chen, D.; Liu, F.; Qi, Q. Remaining Useful Life Prediction of Bearings by a New Reinforced Memory GRU Network. Adv. Eng. Inform. 2022, 53, 101682. [Google Scholar] [CrossRef]
  12. Jiang, L.; Zhang, X.; Cao, H.; Zhang, Y. A Transformer-Based Framework with Historical Data Fusion for RUL Prediction. Meas. Sci. Technol. 2025, 36, 106103. [Google Scholar] [CrossRef]
  13. Lei, W.; Dong, X.; Cui, F.; Huang, G. A Remaining Useful Life Prediction Method for Rolling Bearings Based on Hierarchical Clustering and Transformer–GRU. Appl. Sci. 2025, 15, 5369. [Google Scholar] [CrossRef]
  14. Ali, A.R.; Kamal, H. Hybrid HHO–WHO Optimized Transformer-GRU Model for Advanced Failure Prediction in Industrial Machinery and Engines. Sensors 2026, 26, 534. [Google Scholar] [CrossRef] [PubMed]
  15. Jin, C.; Li, B.; Yang, Y.; Yuan, X.; Tu, R.; Qiu, L.; Chen, X. Remaining Useful Life Prediction of Rolling Bearings Based on Empirical Mode Decomposition and Transformer Bi-LSTM Network. Appl. Sci. 2025, 15, 9529. [Google Scholar] [CrossRef]
  16. Tenenbaum, J.B.; Silva, V.D.; Langford, J.C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000, 290, 2319–2323. [Google Scholar] [CrossRef] [PubMed]
  17. Beck, M.; Pöppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended Long Short-Term Memory. In Advances in Neural Information Processing Systems 37, Proceedings of the 38th Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; Curran Associates, Inc.: New York, NY, USA, 2024; Volume 37, pp. 107547–107603. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30, Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2017; Volume 30. [Google Scholar]
  19. Wang, B.; Lei, Y.; Li, N.; Li, N. A Hybrid Prognostics Approach for Estimating Remaining Useful Life of Rolling Element Bearings. IEEE Trans. Reliab. 2020, 69, 401–412. [Google Scholar] [CrossRef]
  20. Nectoux, P.; Gouriveau, R.; Medjaher, K.; Ramasso, E.; Morello, B.; Zerhouni, N.; Varnier, C. PRONOSTIA: An Experimental Platform for Bearings Accelerated Life Test. In Proceedings of the IEEE International Conference on Prognostics and Health Management, Denver, CO, USA, 18–21 June 2012. [Google Scholar]
  21. Zhang, B.; Zhang, L.; Xu, J. Degradation Feature Selection for Remaining Useful Life Prediction of Rolling Element Bearings. Qual. Reliab. Eng. Int. 2015, 32, 547–554. [Google Scholar] [CrossRef]
Figure 1. The structure of the xLSTM.
Figure 2. The structure of the Transformer network.
Figure 3. The structure of the multi-head mechanism.
Figure 4. The structure of the proposed xLSTM–Transformer network.
Figure 5. The procedure of the proposed xLSTM–Transformer network.
Figure 6. The test rig of the XJTU-SY dataset. Adapted from Ref. [19].
Figure 7. The test rig of the PHM2012 dataset. Adapted from Ref. [20].
Figure 8. RUL prediction results of XJTU-SY dataset. (a) Result of Condition 1; (b) result of Condition 2; (c) result of Condition 3.
Figure 9. RUL prediction results of PHM2012 dataset. (a) Result of Condition 1; (b) result of Condition 2; (c) result of Condition 3.
Table 1. The settings of the train set and test set in XJTU-SY dataset.

Condition | Speed (rpm) | Load (kN) | Train Set | Test Set
1 | 2100 | 12 | Bearing1_1, Bearing1_2, Bearing1_4, Bearing1_5 | Bearing1_3
2 | 2250 | 11 | Bearing2_1, Bearing2_2, Bearing2_4, Bearing2_5 | Bearing2_3
3 | 2400 | 10 | Bearing3_1, Bearing3_2, Bearing3_4, Bearing3_5 | Bearing3_3
Table 2. The settings of the train set and test set in PHM2012 dataset.

Condition | Speed (rpm) | Load (kN) | Train Set | Test Set
1 | 2100 | 12 | Bearing1_1, Bearing1_2 | Bearing1_3
2 | 2250 | 11 | Bearing2_1, Bearing2_2 | Bearing2_3
3 | 2400 | 10 | Bearing3_1, Bearing3_2 | Bearing3_3
Table 3. Setting of hyperparameters.

Parameter | Value
Learning Rate | 0.001
Batch Size | 32 or 64
Training Epochs | 50
Optimizer | Adam
Sequence Length | 10
Hidden Dimensions | 16–32
Dropout Rate | 0.1
Loss Function | MSE
Table 4. Prediction errors under different conditions in XJTU-SY.

Network | Condition 1 (RMSE / R2 / Score) | Condition 2 (RMSE / R2 / Score) | Condition 3 (RMSE / R2 / Score)
LSTM | 0.0982 / 0.9121 / 0.8531 | 0.1284 / 0.8441 / 4.2450 | 0.0819 / 0.8053 / 1.1687
xLSTM | 0.0778 / 0.9449 / 0.8707 | 0.1123 / 0.8806 / 3.6941 | 0.0611 / 0.8915 / 1.6017
Transformer | 0.0885 / 0.9287 / 0.9666 | 0.1110 / 0.8833 / 4.5334 | 0.0742 / 0.8401 / 0.9482
LSTM–Transformer | 0.0666 / 0.9596 / 0.6788 | 0.0942 / 0.9160 / 3.2863 | 0.0574 / 0.9045 / 0.8114
BiLSTM–Transformer | 0.0651 / 0.9614 / 0.6835 | 0.0914 / 0.9205 / 2.7122 | 0.0615 / 0.8903 / 0.9686
GRU–Transformer | 0.0686 / 0.9572 / 0.6978 | 0.0888 / 0.9255 / 3.0362 | 0.0593 / 0.8979 / 1.0050
Proposed method | 0.0583 / 0.9691 / 0.5572 | 0.0784 / 0.9418 / 2.5777 | 0.0532 / 0.9179 / 0.8410
Table 5. Prediction errors under different conditions in PHM2012.

Network | Condition 1 (RMSE / R2 / Score) | Condition 2 (RMSE / R2 / Score) | Condition 3 (RMSE / R2 / Score)
LSTM | 0.1275 / 0.8502 / 21.9981 | 0.0725 / 0.5705 / 6.1899 | 0.1426 / 0.8154 / 4.4517
xLSTM | 0.1202 / 0.8670 / 22.4332 | 0.0693 / 0.6082 / 9.3849 | 0.1408 / 0.8202 / 4.7340
Transformer | 0.1138 / 0.8807 / 18.6070 | 0.0675 / 0.6281 / 4.5328 | 0.1376 / 0.8284 / 3.6868
LSTM–Transformer | 0.1007 / 0.9067 / 17.9940 | 0.0856 / 0.4021 / 8.0399 | 0.1335 / 0.8384 / 3.6223
BiLSTM–Transformer | 0.0961 / 0.9150 / 15.8698 | 0.0803 / 0.4738 / 8.2596 | 0.1247 / 0.8589 / 3.3644
GRU–Transformer | 0.0980 / 0.9117 / 16.3259 | 0.0652 / 0.6531 / 4.1951 | 0.1259 / 0.8563 / 3.3398
Proposed method | 0.0565 / 0.9706 / 10.1616 | 0.0651 / 0.6539 / 4.1367 | 0.1211 / 0.8671 / 3.4079