Article

Unsupervised Convolutional Transformer Autoencoder for Robust Health Indicator Construction and RUL Prediction in Rotating Machinery

by Amrit Dahal 1,2, Hong-Zhong Huang 1,2,*, Cheng-Geng Huang 1,2, Tudi Huang 1,2, Smaran Khanal 1,2 and Sajawal Gul Niazi 1,2
1 School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 Center for System Reliability and Safety, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(20), 10972; https://doi.org/10.3390/app152010972
Submission received: 18 September 2025 / Revised: 7 October 2025 / Accepted: 10 October 2025 / Published: 13 October 2025
(This article belongs to the Section Mechanical Engineering)

Abstract

Prognostics for rotating machinery, particularly bearings, encounter significant challenges in constructing reliable health indicators (HIs) that accurately reflect degradation trajectories, thereby enabling precise remaining useful life (RUL) predictions. This article proposes a novel integrated approach for predicting the RUL of bearings without manual feature engineering. Specifically, a sequential autoencoder integrating a convolutional neural network (CNN) and vision Transformer (Vi-T) is employed to capture the local spatial patterns and global temporal correlations of time-domain vibration signals. The Wasserstein distance is introduced to quantify the divergence between healthy and degraded signal embeddings, resulting in a robust HI metric. Subsequently, the derived HI is fed into a CNN-bidirectional long short-term memory-regressor with Monte Carlo dropout to provide RUL predictions and Bayesian uncertainty estimates. Experimental results from the Xi’an Jiao-Tong University bearing dataset demonstrate that the proposed method surpasses conventional techniques in HI construction and RUL prediction accuracy, demonstrating its efficacy for complex industrial systems with minimal data preprocessing.

1. Introduction

In parallel with rapid advances in the industrial Internet of Things, artificial intelligence algorithms, and computational capabilities, deep learning (DL)-based data-driven methodologies have become the predominant paradigm in prognostics and health management (PHM) for rotating machinery [1,2,3]. Bearings, as critical components in sectors such as power generation, aerospace, and manufacturing, directly affect system reliability and performance, so accurate RUL estimation is essential for facilitating predictive maintenance, reducing unplanned downtime, and ultimately extending the operational lifespan of machinery [4,5,6,7,8]. RUL prediction approaches fall into two groups: physics-based models [9,10], which demand detailed mechanical insight and precise simulation, and data-driven models [11,12,13], which use operational data to learn degradation patterns and adapt to complex systems without deep physical domain knowledge. Consequently, data-driven techniques are increasingly favored for handling the large, high-dimensional datasets common in modern predictive maintenance.
Constructing effective data-driven RUL models entails four pivotal stages: sensor data acquisition and preprocessing (e.g., denoising, normalization, dimensionality reduction); degradation feature extraction and health indicator (HI) formulation, which combines signal processing with domain expertise; first predicting time (FPT) detection; and RUL regression using machine learning [14]. The HI construction is critical because it determines how faithfully degradation is represented and thus influences prediction accuracy. In this regard, Dong et al. [15] applied principal component analysis to combine the features, which were then input into a weighted complex support vector machine model to predict the RUL of bearings. Similarly, Guo et al. [16] introduced a degradation indicator for predicting the RUL of bearings, utilizing empirical mode decomposition (EMD). Zhang et al. [17] proposed a method that combines multi-scale entropy features as a health indicator for accurate RUL prediction of rolling bearings. Zhao et al. [18] employed a method for RUL prediction of rolling bearings that focuses on modeling degraded features using the derivative method of the fitting curve for maximum power spectral density. Long et al. [19] introduced a method for RUL prediction of rolling bearings that emphasizes the creation of HI based on EMD, refined composite multi-scale attention entropy, and dispersion entropy. These studies primarily relied on signal processing techniques to extract features from raw signals in order to develop degradation indicators. These methods are advantageous due to their clear physical interpretation and ease of validation. However, signal processing-based feature extraction has its limitations. The techniques often depend on expert knowledge and experience, may lack generalizability across equipment types, and can miss implicit nonlinear features present in complex data.
Conversely, DL paradigms automate HI extraction from raw vibrations. For instance, Ref. [20] pioneered the use of recurrent neural networks to autonomously extract HIs from selected features, creating an early data-driven framework to characterize the deterioration of rolling bearings. Similarly, Ref. [21] utilized a hybrid model combining CNN and bidirectional gated recurrent units to derive HIs from raw vibrational signals for roller bearings. Li et al. [22] used a multi-channel fusion hierarchical vision Transformer with wavelet packet decomposition representations to model degraded features for RUL prediction. Guo et al. [23] proposed a hybrid approach that initially constructs a nonlinear HI using complete ensemble empirical mode decomposition with adaptive noise and kernel principal component analysis. They then extracted multi-domain features from vibration signals and applied a dual-channel Transformer network with convolutional attention modules to generate HIs for other bearings. Ping et al. [24] proposed a multi-scale efficient channel attention convolutional neural network and bidirectional gated recurrent unit model for RUL prediction of rolling bearings, extracting multi-scale temporal features from Gramian angular difference field images. Despite these advances, such predominantly supervised techniques require abundant labeled data, which limits their applicability in real-world industrial scenarios where data are mostly unlabeled.
To facilitate data-driven automatic construction of HIs, this study investigates unsupervised learning paradigms. Autoencoders provide a viable framework, enabling label-agnostic HI development through feature extraction and degradation trend modeling. Guo et al. [25] proposed an unsupervised HI construction method using a multi-scale convolutional autoencoder optimized by a genetic algorithm, effectively identifying machine degradation through feature learning and data similarity. Ma et al. [26] developed a self-attention convolutional autoencoder, demonstrating enhanced trendability and scale similarity on bearing and milling cutter datasets. Xu et al. [27] proposed an enhanced stacked autoencoder with an exponential weighted moving average for health indicator construction using unlabeled vibration signals, improving noise reduction and eliminating the need for manual feature selection. De et al. [28] developed an LSTM autoencoder with attention for health indicator construction in aircraft systems. Xu et al. [29] developed a multi-scale-multi-head attention mechanism with an automatic encoder–decoder model, extracted multi-scale features from raw vibration signals, and measured the Wasserstein distance between healthy and unhealthy features, which demonstrated superiority over other similarity-based methods and provided a more reliable HI. Qu et al. [30] proposed an unsupervised method for bearing health indicator construction using multi-scale convolution and long short-term memory combined with an autoencoder for feature extraction and support vector regression to build health indicators, improving degradation trend detection. Wang et al. [31] developed a denoising Transformer autoencoder for tool condition monitoring, enhancing feature extraction and resistance to noise.
Although autoencoders are extensively employed in existing studies for unsupervised feature learning, they exhibit shortcomings in processing time-series signals, particularly in capturing the temporal dependencies of encoded features. Additionally, conventional HI construction techniques frequently require manual selection and design of dominant features, rendering them inefficient and error-prone, especially with high-dimensional or nonlinear datasets. This reliance on manual feature engineering can lead to inaccurate similarity measurements and diminished HI fidelity. To address these challenges, this paper proposes an HCVT-WD for bearing HI construction and RUL prediction. The proposed method encompasses four key stages: (1) an unsupervised HCVT-WD autoencoder framework is developed to extract features from FFT-transformed raw vibration signals without manual feature engineering or labeling; (2) CNN and Vi-T are integrated to capture local spatial patterns and global temporal correlations; (3) the Wasserstein distance is used to quantify divergences between healthy and degraded signal embeddings for robust HI formulation; and (4) an HI-driven RUL prediction pipeline provides accurate forecasts with uncertainty quantification.
The structure of the paper is as follows: Section 2 focuses on the proposed framework, specifically highlighting the development of HIs using the proposed HCVT-WD model. Section 3 details the experimental setup, results, and analysis. Section 4 provides a summary of the key results and proposes possible directions for future studies.

2. Proposed Framework

The overall workflow of the proposed method is illustrated in Figure 1, and the HI construction process is detailed in Algorithm 1. In the initial phase, the HCVT-WD model is trained on a subset of the dataset, $\{x_i^{train}\}_{i=1}^{N_{train}}$, composed of non-overlapping windows extracted from vibration signals recorded during healthy operation. $N_{train}$ is the number of time instances, each corresponding to one time window of healthy-state data. The training process minimizes the reconstruction error of the network. Once trained, the encoded features from the HCVT-WD are extracted. In the next phase, the model is applied to the whole dataset $\{x_i^{test}\}_{i=1}^{N_{test}}$, which contains both healthy and unhealthy samples windowed in the same way as the training data. The Wasserstein distance (WD) between the combined encoded features of the healthy state and the encoded features of all other states is computed and used as the HI. Finally, the HI is modeled using a CNN-BiLSTM network with Monte Carlo (MC) simulations to predict the RUL and quantify uncertainty.
Algorithm 1 Train the HCVT-WD model and construct HI
  • Input: Frequency domain raw vibrational signals (healthy and unhealthy states): $x_i^{raw} \in \mathbb{R}^{L \times C}$ and initial parameters $\theta$.
  • Build and Initialize: HCVT-WD model.
  • Consider training data $\{x_i^{train}\}_{i=1}^{N_{train}}$ as healthy data and $\{x_i^{test}\}_{i=1}^{N_{test}}$ as whole data.
  • While $i \le Epochs$ do
  •    Decoded output $\hat{x}_i^{train}$ from model.
  •    Minimize loss between $x_i^{train}$ and $\hat{x}_i^{train}$.
  •    $\theta \leftarrow$ Update parameters of the model.
  • End while
  • Obtain encoded features $Z_{train}$ and $Z_{test}$ along with trained HCVT-WD model parameters $\theta^{*}$.
  • Calculate the Wasserstein distance between encoded $Z_{train}$ and $Z_{test}$ as HI.
  • $HI_j = WD(Z_{test,j}, Z_{train}) = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} \| z_j - z_i \|_1$
  • Detect the FPT using $\mu \pm 3\sigma$
  • Output: $HI_j$, FPT

2.1. HCVT Autoencoder Model

In this section, we employ a hybrid model consisting of 1D-CNN and Vi-T as the encoder. The encoder extracts deep features from the inputs using the combined strengths of CNN and Vi-T architectures, and these encoded features are then upsampled to the original input dimensions using a CNN-based decoder.

2.1.1. 1D-CNN Patch and Position Embedding

To process time series data effectively, the signal is first split into smaller segments. Patch embedding serves this purpose before applying the Vi-T to time series data, as it enables the Transformer model to process both one-dimensional and multi-dimensional time series data effectively.
The raw vibrational signals are initially segmented into non-overlapping segments. Denote each segment by $X_i^{raw} \in \mathbb{R}^{L \times C}$, where $L$ and $C$ are the length of the segment and the dimension of the input signals, respectively. Each segment is further partitioned into fixed-size patches. Each patch is represented as $x_i^{p} \in \mathbb{R}^{L_p \times C}$, where $L_p$ is the length of each patch, so that $L = L_p \times m$, where $m$ is the number of patches. A one-dimensional convolution is then applied to each patch, as shown in Equation (1). The size of the convolution kernel is identical to that of the patch $p$, and the stride is $L_p$, to preserve the temporal dynamics in the raw signals.
$$d_p^k = \sum_{i=0}^{L_p - 1} \sum_{j=0}^{C - 1} p(i,j) \times w_k(i,j) + b_k, \quad 1 \le k \le n,$$
where $w_k$ and $b_k$ represent the weight and bias of the $k$-th convolutional kernel, respectively. The variable $n$ denotes the number of convolutional kernels, as well as the dimension of the time-series patch embedding. The term $d_p$ corresponds to the embedding outcome of the patch. The embedding results for the patches in each sample $X^{raw}$ take the form given in Equation (2).
$$d_p = \left[ d_p^1, d_p^2, \ldots, d_p^n \right]$$
Since Vi-T does not inherently recognize the order of patches, positional encoding is incorporated into the patch embeddings to preserve spatial information. For position encoding, each embedded patch is encoded as follows:
$$d_i^{P_{encoded}} = d_i^{n} + d_i^{Pos}$$
where the positions $d_i^{Pos}$ for the $i$-th patch are defined as follows:
$$p_i^{2k} = \sin\left( i / 10000^{2k/d} \right), \qquad p_i^{2k+1} = \cos\left( i / 10000^{2k/d} \right)$$
where $p_i^{2k}$ and $p_i^{2k+1}$ represent the positional encodings for the $i$-th position within the sequence, and $d$ is the dimensionality of the feature vectors.
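As a rough illustration of the patch embedding and positional encoding described above, the sketch below embeds one segment with NumPy. A convolution whose kernel length and stride both equal $L_p$ is equivalent to reshaping the segment into non-overlapping patches and taking a dot product with each kernel, which is what `patch_embed` does. The function names, the sizes (`L = 512`, `L_p = 64`, `n = 16`), and the random weights are placeholders chosen for illustration, not the settings of the paper, and `d` is assumed to be even.

```python
import numpy as np

def patch_embed(x, weights, bias):
    """x: (L, C) segment; weights: (n, L_p, C); bias: (n,). Returns (m, n)."""
    n, L_p, C = weights.shape
    m = x.shape[0] // L_p
    patches = x[: m * L_p].reshape(m, L_p, C)   # non-overlapping patches
    # d_p^k = sum_{i,j} p(i,j) * w_k(i,j) + b_k for every patch p and kernel k
    return np.einsum("mlc,nlc->mn", patches, weights) + bias

def positional_encoding(m, d):
    """Sinusoidal table of shape (m, d): sin on even columns, cos on odd columns."""
    pos = np.arange(m)[:, None]
    k = np.arange(d // 2)[None, :]
    angle = pos / (10000.0 ** (2 * k / d))
    table = np.zeros((m, d))
    table[:, 0::2] = np.sin(angle)
    table[:, 1::2] = np.cos(angle)
    return table

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 1))        # placeholder segment: L = 512, C = 1
w = rng.normal(size=(16, 64, 1))     # placeholder kernels: n = 16, L_p = 64
b = np.zeros(16)
emb = patch_embed(x, w, b)           # (8, 16): m = 8 patches, n = 16 features
encoded = emb + positional_encoding(*emb.shape)
```

Because the stride equals the kernel length, no sample contributes to more than one patch, so the order of the rows in `emb` follows the temporal order of the patches.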

2.1.2. Vision Transformer Encoder

The patch embeddings, augmented with positional encodings, are subsequently processed through a multi-head self-attention (MHSA) module. This module is built upon the scaled dot-product attention mechanism, which facilitates interactions between all positions in the sequence. The MHSA architecture extends this principle by employing multiple attention heads in parallel. Each head has independent parameters, learns a distinct attention pattern, and captures a different set of semantic relationships within the data. A schematic of the MHSA mechanism is illustrated in Figure 2.
The $i$-th attention head is defined as follows:
$$head_i = \mathrm{Attention}\left( Q W_Q^i, K W_K^i, V W_V^i \right)$$
where $W_Q^i$, $W_K^i$, and $W_V^i \in \mathbb{R}^{d \times d_k}$ are learnable weight matrices and $Q$, $K$, and $V$ represent the query, key, and value, respectively. Scaled dot-product attention is computed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V$$
where $d_k$ denotes the dimensionality of the key vectors; dividing by $\sqrt{d_k}$ counteracts the issue of small gradients and produces a more stable attention distribution. The output of the MHSA mechanism is defined as follows:
$$\mathrm{Multihead}(Q, K, V) = \mathrm{concat}\left( H_1, H_2, H_3, \ldots, H_h \right) W_0$$
where $H_i$ is the output of the $i$-th head ($head_i$ above), $h$ is the number of heads, and $W_0$ represents the trainable output weight.
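The attention equations above can be sketched in a few lines of NumPy. This is a minimal illustration for self-attention, where the query, key, and value all come from the same patch sequence `X`; the shapes, the number of heads, and the random weights are assumptions made for the example and do not reflect the trained model.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the attention weights."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

def multi_head_self_attention(X, W_Q, W_K, W_V, W_0):
    """X: (m, d); W_Q, W_K, W_V: (h, d, d_k); W_0: (h * d_k, d)."""
    heads = [attention(X @ wq, X @ wk, X @ wv)[0]
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_0

rng = np.random.default_rng(0)
m, d, h = 8, 16, 4                   # placeholder sizes: patches, model width, heads
d_k = d // h
X = rng.normal(size=(m, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_k)) for _ in range(3))
W_0 = rng.normal(size=(h * d_k, d))
out = multi_head_self_attention(X, W_Q, W_K, W_V, W_0)
_, attn = attention(X @ W_Q[0], X @ W_K[0], X @ W_V[0])   # weights of the first head
```

Each row of `attn` sums to one, so every output position is a convex combination of the value vectors of all patches.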
The output of the MHSA, denoted by $X_{attention}$, is then passed through a feed-forward network (FFN) with residual connections and layer normalization:
$$X_{output}^{1D} = \mathrm{LayerNorm}\left( X_{attention} + \mathrm{FFN}(X_{attention}) \right)$$
The FFN transforms each element in the sequence independently. It comprises two fully connected (FC) layers with a nonlinear activation between them and is mathematically expressed as follows:
$$\mathrm{FFN}(x) = W_2 \cdot \mathrm{GeLU}(W_1 x + b_1) + b_2$$
where $W_1$ and $b_1$ are the weight and bias of the first linear layer, which projects the input into a higher-dimensional space, and $W_2$ and $b_2$ are the weight and bias of the second linear layer, which projects the transformed features back to the original dimensionality.
The final output of the Vi-T encoder retains the feature representations $X_{output}^{1D}$ for all time patches, effectively capturing the temporal dependencies learned within the sequence. These output features are then passed through FC layers with dropout regularization, performing downsampling to obtain the latent $z_{encoded}$, which are also called the encoded features and are subsequently used for HI creation.
The hybrid 1D-CNN and Vi-T encoder described above, together with the CNN-based decoder that upsamples the encoded features, is illustrated in Figure 2.

2.1.3. 1D Deconvolution

The decoding process begins by taking the final latent features and progressively reconstructing the original time series data. To achieve this, we first apply FC layers to expand the latent features, followed by a 1D deconvolution operation. The 1D deconvolution, also known as transposed convolution, allows us to upsample the latent representations to the original time series dimensions.
Mathematically, given the latent feature vector $z_{encoded} \in \mathbb{R}^{B \times l}$, where $B$ is the batch size and $l$ is the size of the latent space, we use FC layers to expand the latent vector back to a higher dimension that matches the size of the Vi-T encoder output.
$$z_{expanded} = FC_2\left( \mathrm{GeLU}\left( FC_1(z_{encoded}) \right) \right)$$
The expanded features are then reshaped and passed through the 1D deconvolution layers to reconstruct the original input signal:
$$\hat{X}_i^{raw} = \mathrm{ConvTranspose1D}\left( z_{expanded} \right)$$
During this reconstruction process, the HCVT-WD model is optimized using a composite loss function $\mathcal{L}$ that combines reconstruction accuracy with regularization:
$$\mathcal{L} = \frac{1}{2N} \sum_{i=1}^{N} \left\| X_i^{raw} - \hat{X}_i^{raw} \right\|^2 + \frac{\lambda}{2} \sum_{j=1}^{q} \left\| W_j \right\|^2$$
The first term calculates the mean squared error (MSE) between the original input signals X i r a w and their reconstructions X ^ i r a w , where N denotes the number of samples and i indexes individual samples. The second term applies regularization to the model trainable parameters W j (with q total parameter tensors) controlled by the coefficient λ . This regularization prevents overfitting by penalizing large weight values while maintaining model generalizability. The complete optimization procedure, implemented using the Adam optimizer, is detailed in Table 1.
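To make the two terms of the loss concrete, the following sketch evaluates them on toy arrays. The function name `composite_loss` and the numbers are illustrative assumptions; in a PyTorch training loop the penalty term is commonly supplied through the optimizer's `weight_decay` argument rather than computed by hand.

```python
import numpy as np

def composite_loss(x, x_hat, params, lam):
    """Reconstruction term 1/(2N) * sum ||x_i - x_hat_i||^2 plus (lam/2) * sum ||W_j||^2."""
    n = x.shape[0]                                   # N: number of samples
    recon = np.sum((x - x_hat) ** 2) / (2 * n)
    reg = 0.5 * lam * sum(np.sum(w ** 2) for w in params)
    return recon + reg

x = np.array([[1.0, 2.0], [3.0, 4.0]])               # two toy input samples
x_hat = np.array([[1.0, 1.0], [3.0, 2.0]])           # their toy reconstructions
weights = [np.array([1.0, -1.0])]                    # one toy parameter tensor (q = 1)
loss = composite_loss(x, x_hat, weights, lam=0.1)
```

Here the squared errors sum to 5, giving a reconstruction term of 5 / 4 = 1.25, and the penalty is 0.5 × 0.1 × 2 = 0.1, so `loss` is 1.35.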

2.2. HI Construction

The HI is calculated using WD between the distribution of encoded test features and the empirical distribution of the healthy training encoded features. The WD measures the minimum cost required to transform one probability distribution into another, quantifying how far the test data deviates from the healthy distribution in the embedded feature space. Due to its ability to effectively quantify distributional shifts, the Wasserstein distance proved to be a more robust metric for constructing an HI compared to other distance measures [32].
The empirical distribution of healthy training encoded features is defined as follows:
$$Z_{train}^{D} = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} \delta_{z_i}$$
where $z_i \in \mathbb{R}^{d}$ are encoded training features having $d$ dimensions, and $\delta_{z_i}$ denotes a Dirac delta centered at $z_i$. Similarly, for each test sample $z_j$, its distribution is defined as $Z_{test,j}^{D} = \delta_{z_j}$, and finally, the HI for the $j$-th test sample is computed as follows:
$$HI_j = WD\left( Z_{test,j}^{D}, Z_{train}^{D} \right) = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} \left\| z_j - z_i \right\|_1$$
where $\| \cdot \|_1$ is the L1 norm taken across all feature dimensions.
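Because each test distribution is a single Dirac mass, the only admissible transport plan moves all of its mass from $z_j$ to every healthy point $z_i$ in equal shares, so the Wasserstein distance with an L1 ground cost reduces to the mean L1 distance in the formula above. A minimal NumPy sketch of this computation, with toy two-dimensional features standing in for the encoded features, is:

```python
import numpy as np

def health_indicator(z_test, z_train):
    """HI_j = mean over i of ||z_j - z_i||_1; z_test: (N_test, d), z_train: (N_train, d)."""
    diffs = np.abs(z_test[:, None, :] - z_train[None, :, :])   # (N_test, N_train, d)
    return diffs.sum(axis=2).mean(axis=1)

z_train = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy healthy features
z_test = np.array([[0.0, 0.0], [2.0, 2.0]])    # toy test features
hi = health_indicator(z_test, z_train)
```

For the first test point the distances to the two healthy points are 0 and 2, giving an HI of 1; for the second they are 4 and 2, giving an HI of 3. The broadcasted array holds `N_test * N_train * d` values, so long recordings may need to be processed in chunks.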

2.3. FPT Detection

FPT detection based on the $3\sigma$ criterion involves constructing a monitoring interval for the healthy state by calculating the mean $\mu$ and standard deviation $\sigma$ of the initial healthy HI values [33]. A potential fault is indicated when the absolute deviation of the current HI value from the mean exceeds three standard deviations, as expressed below:
$$\left| HI_i - \mu \right| > 3\sigma$$
However, to minimize false alarms caused by random fluctuations, the FPT is not triggered immediately upon a single exceedance. Instead, it is confirmed only when three consecutive HI values exceed the threshold, ensuring a more reliable and robust detection of early fault conditions.
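The rule above can be sketched as a single pass over the HI sequence. In this illustration the number of leading HI values treated as healthy (`n_healthy`) and the choice to report the first index of the confirming run, rather than the last, are assumptions; the HI values are made up.

```python
from statistics import mean, pstdev

def detect_fpt(hi, n_healthy, consecutive=3):
    """Return the index where a run of `consecutive` exceedances starts, or None."""
    mu = mean(hi[:n_healthy])
    sigma = pstdev(hi[:n_healthy])
    run = 0
    for idx, value in enumerate(hi):
        # Extend the run on an exceedance, reset it otherwise
        run = run + 1 if abs(value - mu) > 3 * sigma else 0
        if run == consecutive:
            return idx - consecutive + 1
    return None

hi = [1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 5.0, 1.0, 4.0, 4.5, 5.0, 6.0]
fpt = detect_fpt(hi, n_healthy=6)
```

The isolated spike at index 6 is ignored because the next value returns to the healthy band, and the FPT is placed at index 8, where three consecutive exceedances begin.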

2.4. RUL Prediction and Uncertainty Quantification

2.4.1. RUL Prediction

The RUL prediction process for bearings begins with estimating the time interval from FPT detection to eventual failure. When the FPT is identified for a test bearing, its true RUL labels can be established based on the actual remaining operational lifespan. The prediction architecture employs a CNN-Bi-LSTM network [34] as illustrated in Figure 3 and the prediction process is detailed in Algorithm 2. This dual-stage approach enables comprehensive learning of degradation patterns by analyzing both forward and backward sequence relationships. The trained model processes post-FPT HI sequences to establish an accurate mapping between evolving degradation characteristics and remaining lifespan.
Algorithm 2 CNN-BiLSTM network training for prognosis and uncertainty quantification
  • Input: $X_{Train}, Y_{Train} \in \mathbb{R}^{N}$, and initial parameters $\eta$; $X_{Train}$ is the constructed $HI_j$ after FPT, and $Y_{Train}$ is the RUL labels assigned to each $HI_j$.
  • Build and Initialize: CNN-BiLSTM model.
  • While $i \le Epochs$ do
  •    Calculate loss between $Y_{Train}$ and predicted $\hat{Y}_{Train}$.
  •    $\eta^{*} \leftarrow$ Update parameters of the model.
  • End while
  • Obtain RUL via trained CNN-BiLSTM model parameters $\eta^{*}$.
  • Perform MC-Dropout at inference: run $S$ stochastic forward passes → compute $\mu$ as predicted RUL, and $\sigma$ as model uncertainty.
  • Output: RUL prediction with confidence intervals $\mu \pm 1.96\sigma$.
After the convolution operations in the CNN, the deep features are passed into the Bi-LSTM model. The LSTM network incorporates an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and a memory cell $C_t$. At each time step, the cell takes the input vector $x_t$ and the previous hidden state $h_{t-1}$ and produces the output vector $h_t$. The core principle for the LSTM cell is as follows:
$$f_t = \sigma\left( W_f \left[ h_{t-1}, x_t \right] + b_f \right)$$
$$i_t = \sigma\left( W_i \left[ h_{t-1}, x_t \right] + b_i \right)$$
$$o_t = \sigma\left( W_o \left[ h_{t-1}, x_t \right] + b_o \right)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
$$h_t = o_t \odot \tanh\left( C_t \right)$$
where $\tanh$ is the hyperbolic tangent activation function, and $\tilde{C}_t$ is an intermediate value obtained by applying the $\tanh$ function to both the input and previous information. The function $\sigma$ refers to the sigmoid function, $\odot$ denotes element-wise multiplication, and the terms $W$ and $b$ represent the weight matrices and biases, respectively.
The hidden states from both forward and backward directions are then concatenated and fed into FC layers to predict the RUL.
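The gate equations and the bidirectional concatenation can be illustrated with a small NumPy sketch. The candidate state is computed in the standard LSTM form $\tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$, which is an assumption here since the text describes it only in words; the hidden size, the random weights, and the input sequence are placeholders rather than the trained CNN-BiLSTM.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W[g]: (hidden, hidden + input), b[g]: (hidden,) for g in f, i, o, c."""
    hx = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ hx + b["f"])         # forget gate
    i_t = sigmoid(W["i"] @ hx + b["i"])         # input gate
    o_t = sigmoid(W["o"] @ hx + b["o"])         # output gate
    c_tilde = np.tanh(W["c"] @ hx + b["c"])     # candidate state (standard form, assumed)
    c_t = f_t * c_prev + i_t * c_tilde
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

def run_lstm(seq, W, b, hidden):
    """Run the cell over a sequence and return the last hidden state."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in seq:
        h, c = lstm_step(x_t, h, c, W, b)
    return h

rng = np.random.default_rng(0)
hidden, n_in = 4, 1
W_fwd = {g: rng.normal(size=(hidden, hidden + n_in)) for g in "fioc"}
W_bwd = {g: rng.normal(size=(hidden, hidden + n_in)) for g in "fioc"}
b0 = {g: np.zeros(hidden) for g in "fioc"}
seq = rng.normal(size=(10, n_in))               # stand-in for a post-FPT HI window
# Bidirectional: forward pass over seq, backward pass over the reversed seq
h_bi = np.concatenate([run_lstm(seq, W_fwd, b0, hidden),
                       run_lstm(seq[::-1], W_bwd, b0, hidden)])
```

The concatenated vector `h_bi` has twice the hidden size and plays the role of the feature that the FC layers map to an RUL value.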

2.4.2. Uncertainty Quantification with MC Dropout and KDE

Given an input sequence x , our network produces a point prediction y ^ . Point estimates, however, ignore epistemic and aleatoric sources of uncertainty that arise during equipment operation. To characterize predictive uncertainty without incurring the computational cost of full Bayesian neural networks, we adopt MC dropout as a tractable approximation to Bayesian inference and then recover a full predictive probability density via KDE.
Let $\omega$ denote all stochastic weights of the model and $D$ the training data. The Bayesian predictive distribution is expressed as follows:
$$p(y \mid x, D) = \int p(y \mid x, \omega) \, p(\omega \mid D) \, d\omega$$
Following the dropout-as-Bayesian-approximation view, we replace the posterior with a variational distribution induced by dropout masks applied at inference time. Concretely, for each dropout layer $i$, the effective weights are as follows:
$$W_i = V_i \, \mathrm{diag}\left( \left[ z_{i,j} \right]_{j=1}^{K_i} \right), \quad z_{i,j} \sim \mathrm{Bernoulli}(p_i)$$
where $V_i$ are the learned parameters, $p_i$ is the retain probability, and $K_i$ is the incoming dimensionality. During testing, dropout is kept active, and $S$ stochastic forward passes are performed to generate samples:
$$y^{(s)} \sim p\left( y \mid x, \omega^{(s)} \right), \quad s = 1, \ldots, S$$
where $\omega^{(s)} \sim q(\omega)$ is defined by the sampled dropout masks. From these samples, the predictive mean and variance are estimated as follows:
$$\hat{\mu}(x) = \frac{1}{S} \sum_{s=1}^{S} y^{(s)}, \quad \hat{\sigma}^2(x) = \frac{1}{S} \sum_{s=1}^{S} \left( y^{(s)} - \hat{\mu}(x) \right)^2$$
These summarize uncertainty for the interval prediction as follows:
$$\hat{\mu}(x) \pm z_{1-\alpha/2} \, \hat{\sigma}(x)$$
where $z_{1-\alpha/2}$ is the critical value from the standard normal distribution corresponding to the desired confidence level. To capture the full shape of uncertainty, the MC samples are converted into a continuous density with KDE. Let $\{ y^{(s)} \}_{s=1}^{S}$ be the MC predictions for input $x$. The conditional probability density $\hat{f}(y \mid x)$ is estimated as follows:
$$\hat{f}(y \mid x) = \frac{1}{S h} \sum_{s=1}^{S} K\left( \frac{y - y^{(s)}}{h} \right)$$
where $K(\cdot)$ is a Gaussian kernel and $h$ is the bandwidth. Selecting the appropriate bandwidth is crucial, as it greatly impacts the smoothness and accuracy of the estimation. A grid search and cross-validation are used to identify the optimal bandwidth, choosing the one that minimizes the mean integrated squared error.
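The post-processing of the MC samples can be sketched with the standard library alone. In this illustration the samples are drawn from a Gaussian as a stand-in for $S$ stochastic forward passes of the network with dropout left active, and the bandwidth is fixed by hand instead of being selected by grid search and cross-validation; both are simplifying assumptions.

```python
import math
import random

def mc_summary(samples, z=1.96):
    """Predictive mean, standard deviation (1/S normalization), and mean +/- z * sigma."""
    s = len(samples)
    mu = sum(samples) / s
    sigma = math.sqrt(sum((y - mu) ** 2 for y in samples) / s)
    return mu, sigma, (mu - z * sigma, mu + z * sigma)

def gaussian_kde(samples, h):
    """Return f_hat(y) = 1/(S h) * sum_s K((y - y_s) / h) with a standard normal kernel K."""
    s = len(samples)
    norm = s * h * math.sqrt(2 * math.pi)
    def density(y):
        return sum(math.exp(-0.5 * ((y - ys) / h) ** 2) for ys in samples) / norm
    return density

random.seed(0)
# Stand-in for S = 200 MC-dropout predictions of a normalized RUL near 0.6
samples = [random.gauss(0.6, 0.05) for _ in range(200)]
mu, sigma, (lower, upper) = mc_summary(samples)
pdf = gaussian_kde(samples, h=0.02)   # hand-picked bandwidth, not cross-validated
```

Calling `pdf(y)` on a grid of `y` values gives the predictive density curve, while `lower` and `upper` bound the 95% interval under the normal approximation.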

3. Experimental Validation and Analysis

3.1. Dataset Preparation

The proposed HI construction and RUL prediction method is validated on the XJTU-SY bearing dataset [35]. As shown in Figure 4, the test rig is designed to collect vibration signals under varying operating conditions. The experiment involved testing 15 LDK UER204 bearings until failure under controlled operating conditions, as shown in Table 1. Vibration data were recorded by two accelerometers operating at a frequency of 25.6 kHz, capturing both vertical and horizontal vibrations. The LDK UER204 bearings were chosen for their stable structural design and consistent performance, making them well-suited for repeatable accelerated degradation experiments and clear observation of the full degradation process. A 1.28 s sample was then recorded every 60 s during the testing. Table 2 further provides the description of the experimental details of the XJTU-SY dataset. In the XJTU-SY bearing dataset, defects on individual bearing components were generated through accelerated degradation tests. During these tests, bearings were continuously operated under heightened load and speed conditions to expedite the wear process and induce failure in a reduced timeframe. The degradation occurred naturally on the inner and outer raceways as well as the rolling elements due to prolonged mechanical stress and friction. The test was concluded when the vibration amplitude surpassed a specified failure threshold (20 g), at which point various types of defects—such as inner race wear, outer race wear, and outer race fracture—were identified.
For this study, only the horizontal vibration signals were considered, as they exhibited clearer trends in the degradation process.
To create the healthy dataset used for network training and validation, the Pauta criterion was employed, as in [36]. The initial healthy dataset was constructed from the first 1 min of operation, which was defined as the baseline normal stage under the assumption that the bearing was in a healthy state at the beginning of its life. The Pauta criterion $\mu \pm 3\sigma$ was then calculated from this set. Each new data point that falls within this range is added to the healthy dataset, and the criterion is updated; a point that falls outside is considered abnormal. The healthy dataset was finalized and locked once a predefined number of consecutive data points (e.g., 500) were classified as abnormal, marking the failure threshold. Of the healthy data from the normal state, 80% were randomly selected for training and the remaining 20% were set aside for validation. For example, for Bearing1-1, the total number of samples was 3936, with 1580 samples identified as healthy. Of these healthy samples, 1264 were randomly selected for training, and 316 were used as the validation set. The test set consisted of the run-to-failure samples, which were used to extract the encoded features from the proposed model.
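One way to read the adaptive selection procedure is sketched below. The text does not state which quantity each "data point" represents, so the sketch assumes one scalar per sample (for instance a window RMS); the function name, the toy values, and the small run length are likewise illustrative.

```python
from statistics import mean, pstdev

def select_healthy(values, n_init, max_abnormal_run):
    """Grow a healthy set with an adaptive mu +/- 3 sigma band.

    values: one scalar per sample (an assumption, e.g. a window RMS)
    n_init: number of leading values taken as healthy
    max_abnormal_run: consecutive out-of-band values that lock the set
    Returns the indices judged healthy.
    """
    healthy_idx = list(range(n_init))
    healthy = list(values[:n_init])
    run = 0
    for idx in range(n_init, len(values)):
        mu, sigma = mean(healthy), pstdev(healthy)   # band from the current healthy set
        if abs(values[idx] - mu) <= 3 * sigma:
            healthy.append(values[idx])
            healthy_idx.append(idx)
            run = 0
        else:
            run += 1
            if run >= max_abnormal_run:
                break                                # healthy set is locked
    return healthy_idx

values = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 3.0, 1.0, 4.0, 5.0, 6.0, 1.0]
idx = select_healthy(values, n_init=4, max_abnormal_run=3)
```

In this toy run the outlier at index 6 is skipped, index 7 is still accepted, and the set is locked after the three consecutive abnormal values at indices 8 to 10, so the final value is never examined.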
Following the determination of the normal dataset, the vibration data were segmented into samples of 1024 data points without overlapping. The FFT was then applied to each sample, and the shape of the data was changed to 512 by retaining only the first half of the FFT spectrum. For the CNN patch embedding, the input for the model was shaped into (128, 1, 512), where 128 represents the batch size, 1 represents the channel dimension, and 512 corresponds to the number of frequency components after FFT.
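The segmentation and FFT step can be sketched with NumPy as follows. A 1.28 s record sampled at 25.6 kHz contains 32,768 points, that is, 32 non-overlapping segments of 1024 points. The use of the magnitude spectrum and the synthetic 1 kHz tone are assumptions made for the example; the text only states that the FFT is applied and the first half of the spectrum is retained.

```python
import numpy as np

def to_spectra(signal, seg_len=1024):
    """Split a 1-D signal into non-overlapping segments and keep half of each FFT magnitude spectrum."""
    n_seg = len(signal) // seg_len
    segments = np.asarray(signal[: n_seg * seg_len]).reshape(n_seg, seg_len)
    spectrum = np.abs(np.fft.fft(segments, axis=1))   # magnitude spectrum (assumed)
    return spectrum[:, : seg_len // 2]                # 512 bins for seg_len = 1024

fs = 25_600                                # sampling rate from the dataset description
t = np.arange(32 * 1024) / fs              # one 1.28 s record
signal = np.sin(2 * np.pi * 1000 * t)      # synthetic 1 kHz tone, not real bearing data
spectra = to_spectra(signal)               # (32, 512)
batch = spectra[:, None, :]                # (batch, channel, frequency) = (32, 1, 512)
```

With a frequency resolution of 25,600 / 1024 = 25 Hz per bin, the 1 kHz tone peaks in bin 40 of every segment, which is a quick sanity check on the layout of the retained half-spectrum.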

3.2. Structural Parameters and Hyperparameters

The hyperparameters and structural parameters of the proposed model are summarized in Table 3 and Table 4, respectively. The optimal hyperparameters are determined based on the reconstruction loss during validation through the use of the grid search algorithm. The model was trained using AdamW as the optimizer for up to 400 epochs, with early stopping employed to avoid overfitting by halting training when the validation loss ceased to improve. The training and validation losses with the optimized parameters are shown in Figure 5. The computational work was done on a computer in PyTorch with an i5 12400F CPU, an NVIDIA RTX3060 GPU, and 16GB of RAM.

3.3. HI Construction and FPT Detection

The HCVT-WD model was used to construct the HI for all bearings under various operating conditions. These indicators display a clear trend over time, which is crucial for FPT detection and RUL prediction. To further enhance the clarity of these indicators, Gaussian smoothing was applied to the raw HI as shown in Figure 6, effectively filtering out noise while preserving the overall degradation trend. The smoothed curves provide a more stable and interpretable representation of bearing degradation, facilitating reliable FPT detection and RUL estimation. The FPT and corresponding HIs for the bearings are illustrated in Figure 7.
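A minimal sketch of Gaussian smoothing applied to an HI curve is given below. The kernel width `sigma`, the edge padding, and the synthetic HI are illustrative assumptions, since the smoothing parameters are not stated in the text.

```python
import numpy as np

def gaussian_smooth(hi, sigma=2.0):
    """Smooth a 1-D HI sequence with a normalized Gaussian kernel (edge values repeated as padding)."""
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()                            # weights sum to one
    padded = np.pad(np.asarray(hi, dtype=float), radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")  # same length as the input

rng = np.random.default_rng(0)
# Synthetic HI: a slowly accelerating trend plus measurement noise
raw_hi = np.linspace(0.0, 1.0, 200) ** 3 + rng.normal(scale=0.05, size=200)
smooth_hi = gaussian_smooth(raw_hi)
```

Because the kernel weights sum to one, a constant HI is left unchanged, while sample-to-sample fluctuations are averaged out and the slow degradation trend is retained.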

3.4. Evaluation Metrics of HIs

This study presents three quantitative metrics: monotonicity (Mo), trendability (Tr), and prognosability (Pr) to evaluate the effectiveness of HI construction, as defined in Equations (1)–(3):
$$Mo = 1 - \frac{6 \sum_{j=1}^{N} \left( P_j - j \right)^2}{N \left( N^2 - 1 \right)}$$
$$Tr = \frac{N \sum_{j=1}^{N} HI_j t_j - \sum_{j=1}^{N} HI_j \sum_{j=1}^{N} t_j}{\sqrt{\left[ N \sum_{j=1}^{N} HI_j^2 - \left( \sum_{j=1}^{N} HI_j \right)^2 \right] \left[ N \sum_{j=1}^{N} t_j^2 - \left( \sum_{j=1}^{N} t_j \right)^2 \right]}}$$
$$Pr = \exp\left( - \frac{\sigma_{HI_f}}{\left| \mu_{HI_i} - \mu_{HI_f} \right|} \right)$$
where $P_j$ represents the rank of the $j$-th HI within the HI sequence and $N$ denotes the number of samples. $HI_j$ is the value of the HI at the time $t_j$. $\mu_{HI_i}$ and $\mu_{HI_f}$ denote the means of the HI values at the initial and failure phases, and $\sigma_{HI_f}$ is the standard deviation of the HI values in the failure phase. All three evaluation metrics range between 0 and 1, with values closer to 1 indicating higher-quality health indicators that exhibit more consistent degradation trends, stronger linear relationships with time, and more predictable failure behavior.
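The three metrics can be computed as in the sketch below, which assumes a rank-correlation form for monotonicity, the Pearson correlation with time for trendability, and an exponential of the failure-phase spread over the initial-to-failure mean shift for prognosability. The toy HI values and the choice of which samples make up the initial and failure phases are hypothetical.

```python
import math
from statistics import mean, pstdev

def monotonicity(hi):
    """Rank-correlation form: 1 - 6 * sum((P_j - j)^2) / (N (N^2 - 1)); assumes no tied values."""
    n = len(hi)
    order = sorted(range(n), key=lambda j: hi[j])
    rank = [0] * n
    for r, j in enumerate(order, start=1):
        rank[j] = r                                  # P_j: rank of the j-th HI value
    d2 = sum((rank[j] - (j + 1)) ** 2 for j in range(n))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def trendability(hi, t):
    """Pearson correlation between the HI and time."""
    n = len(hi)
    s_h, s_t = sum(hi), sum(t)
    num = n * sum(h * x for h, x in zip(hi, t)) - s_h * s_t
    den = math.sqrt((n * sum(h * h for h in hi) - s_h ** 2)
                    * (n * sum(x * x for x in t) - s_t ** 2))
    return num / den

def prognosability(hi_initial, hi_failure):
    """exp(-std(failure phase) / |mean(initial phase) - mean(failure phase)|)."""
    return math.exp(-pstdev(hi_failure) / abs(mean(hi_initial) - mean(hi_failure)))

hi = [0.1, 0.2, 0.4, 0.7, 1.1, 1.6]                  # toy, strictly increasing HI
t = list(range(len(hi)))
mo = monotonicity(hi)
tr = trendability(hi, t)
pr = prognosability(hi[:2], hi[-2:])                 # hypothetical phase boundaries
```

For this strictly increasing toy sequence the ranks coincide with the time indices, so `mo` equals 1, while `tr` falls slightly below 1 because the growth is not linear in time.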
To validate the effectiveness of the proposed method, we compared its performance against three state-of-the-art HI construction methods: ensemble stacked autoencoder (ES-AE), deep CNN-auto-encoder–decoder (DCN-AE), and multi-scale CNN-auto-encoder–decoder (MSC-AE), using the aforementioned evaluation metrics. As shown in Figure 8 and Table 5, our method consistently outperforms the other approaches across all evaluation metrics.
While ES-AE is capable of capturing the overall degradation trend, it suffers from noise, which lowers both monotonicity and trendability. DCN-AE and MSC-AE demonstrate better performance by leveraging convolutional feature extraction and multi-scale representations, yet their HIs still exhibit noticeable fluctuations, particularly in the mid-life regions. In contrast, the proposed HCVT-WD achieves superior results through a dual-branch design: the CNN branch effectively filters out local noise from raw vibration signals, while the Vi-T branch captures long-range dependencies and global structural information. The integration of these two branches through skip fusion produces smoother HI curves with reduced oscillations, which explains the markedly higher monotonicity, trendability, and prognosability observed for HCVT-WD.

3.5. Visualization Analysis

In the visualization analysis, the healthy and degraded features are clearly separated in the uniform manifold approximation and projection (UMAP) embedding space, even when testing with the full set of healthy and degraded samples, as shown in Figure 9. This separation indicates that the proposed HCVT-WD model effectively segregates the encoded features. Notably, the degraded features follow a clear degradation trajectory in the embedding space, reflecting the progressive failure process, while the healthy features remain well clustered. This separation is attributed to the multi-head self-attention mechanism in the Transformer, which enhances discriminative feature learning. The attention maps in Figure 10 are consistent with this behavior: the learned attention weights for three randomly selected testing samples are sparse. This sparsity suggests that the proposed module captures dynamic and temporal dependencies within the raw input signals and focuses on the most informative features, which in turn supports RUL prediction.

3.6. RUL Estimation and Uncertainty Analysis

The CNN-Bi-LSTM model, integrated with a Bayesian neural network, was employed to map the HIs to the RUL. For each bearing, the FPT was identified first; the post-FPT HI values were then collected and assigned RUL labels. Each label expresses the remaining life as a percentage of the bearing's total post-FPT lifetime.
Samples were created by sliding a window of 10 consecutive time steps over the HI sequence, so that the model learns how the health of the bearing evolves over time. For each window, the RUL label was the remaining life at the final time step of the window. The resulting sequences were randomly split into training and validation sets, with 80% used for training and 20% for validation.
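The windowing and split described above can be sketched as follows. The linear RUL labels, the synthetic HI segment, and the random seed are assumptions for illustration; only the window length of 10 and the 80/20 split come from the text.

```python
import numpy as np

def make_windows(hi, rul, window=10):
    """Slide a fixed-length window over the post-FPT HI; label = RUL at window end."""
    hi, rul = np.asarray(hi, dtype=float), np.asarray(rul, dtype=float)
    n = hi.size - window + 1
    x = np.stack([hi[i:i + window] for i in range(n)])
    y = rul[window - 1:]                  # RUL at the last time step of each window
    return x, y

# Hypothetical post-FPT segment: 100 HI samples, RUL falling linearly from 1 to 0.
hi_post = np.linspace(0.2, 1.0, 100)
rul_post = np.linspace(1.0, 0.0, 100)
x, y = make_windows(hi_post, rul_post, window=10)

# Random 80/20 split of the windows (seed chosen only for reproducibility).
rng = np.random.default_rng(0)
idx = rng.permutation(len(x))
n_train = int(0.8 * len(x))
x_train, y_train = x[idx[:n_train]], y[idx[:n_train]]
x_val, y_val = x[idx[n_train:]], y[idx[n_train:]]
```

With 100 post-FPT samples and a window of 10, this yields 91 windows, of which 72 go to training and 19 to validation.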
During training, leave-one-out cross-validation was applied within each operating condition: in each fold, one bearing was held out as the test set, while the remaining bearings under the same operating condition were used to train the model. To evaluate the accuracy of the proposed method, two performance metrics, RMSE and MAE, were used. These metrics were calculated as follows:
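$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|$$

where $y_t$ and $\hat{y}_t$ represent the actual and predicted RUL values of the t-th sample, respectively, and $n$ is the total number of samples.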
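A short sketch of the two error metrics and of the leave-one-bearing-out folds is given below. The bearing identifiers follow Table 1; the helper names and the three-sample example are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted RUL."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted RUL."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def leave_one_out(bearings):
    """Yield (train_ids, test_id) pairs, holding out one bearing per fold."""
    for held_out in bearings:
        yield [b for b in bearings if b != held_out], held_out

# Folds for operating condition 1.
condition_1 = ["B1_1", "B1_2", "B1_3", "B1_4", "B1_5"]
folds = list(leave_one_out(condition_1))

# Tiny worked example: absolute errors are 0.1, 0.0 and 0.2.
y_true = [1.0, 0.5, 0.0]
y_pred = [0.9, 0.5, 0.2]
example_mae = mae(y_true, y_pred)     # (0.1 + 0.0 + 0.2) / 3
example_rmse = rmse(y_true, y_pred)   # sqrt((0.01 + 0.0 + 0.04) / 3)
```

Because RMSE squares the errors before averaging, the single 0.2 error dominates it, which is why RMSE is never smaller than MAE on the same predictions.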
The hyperparameters of the proposed model were tuned to a batch size of 32, a learning rate of 0.0001, and two LSTM hidden layers with 64 and 32 neurons. A dropout layer was added after the first hidden layer, and an additional dropout layer with a rate of 0.3 was applied after the final FC layer to implement MC dropout for uncertainty estimation. The prediction results of the proposed methodology under the different working conditions are shown in Figure 11. Additionally, separate CNN and Bi-LSTM prediction networks were constructed for comparison, and the results are presented in Table 6. The comparison shows that the CNN-Bi-LSTM model outperforms these alternatives in RUL prediction accuracy.
For uncertainty quantification, MC dropout with a rate of 0.3 was kept active during inference, and prediction uncertainty was quantified through statistically derived confidence intervals, as shown in Figure 12. Notably, all RUL predictions remain within the calculated 95% confidence bounds throughout the degradation process. KDE analysis revealed that the predicted RUL distributions are tightly clustered around the true values, with minimal skewness. This combination of accurate point estimation and uncertainty quantification is valuable for industrial condition-based maintenance, where understanding both the predicted RUL and its associated confidence level is critical for operational decision-making. The consistent improvement over conventional approaches supports the effectiveness of merging convolutional feature extraction with sequential pattern identification in this integrated prognostic framework.
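The idea behind MC dropout can be sketched in a few lines of NumPy: dropout stays active at inference, the same input is passed through the network many times, and the spread of the outputs gives an interval. The tiny random-weight network below is a stand-in, not the CNN-Bi-LSTM regressor of the paper, and the number of passes and the percentile-based 95% interval are assumptions.

```python
import numpy as np

def mc_dropout_predict(x, w1, b1, w2, b2, p=0.3, n_passes=200, seed=0):
    """Run stochastic forward passes with dropout left on at inference."""
    rng = np.random.default_rng(seed)
    h = np.maximum(x @ w1 + b1, 0.0)               # hidden ReLU activations
    samples = np.empty((n_passes, x.shape[0]))
    for k in range(n_passes):
        mask = rng.random(h.shape) >= p            # keep each unit with probability 1 - p
        h_drop = h * mask / (1.0 - p)              # inverted-dropout scaling
        samples[k] = (h_drop @ w2 + b2).ravel()
    mean = samples.mean(axis=0)
    lower, upper = np.percentile(samples, [2.5, 97.5], axis=0)
    return mean, lower, upper

# Hypothetical weights and inputs: five HI windows of length 10.
rng = np.random.default_rng(1)
w1, b1 = rng.normal(size=(10, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, 1)) / 32, np.zeros(1)
x = rng.normal(size=(5, 10))
mean, lower, upper = mc_dropout_predict(x, w1, b1, w2, b2, p=0.3)
```

Setting `p=0.0` makes every pass identical, so the interval collapses to a point; the width of the interval therefore reflects only the variability introduced by dropout.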

3.7. Ablation Studies on the HCVT-WD Model

This ablation study investigates the impact of CNN patch embedding, varying patch sizes, and skip connection fusion. As illustrated in Figure 13a, without CNN patch embedding the reconstruction loss, observed through both training and validation losses, remains high, fluctuates significantly, and converges slowly, which highlights the critical role of convolution-based patch embedding. Similarly, for a patch size of 16, the reconstruction loss curves exhibit smoother and more rapid convergence compared to patch sizes 8 and 32, as shown in Figure 13b. Furthermore, the introduction of skip connection fusion after the transformer block, aimed at enhancing feature reusability and preserving spatial information, results in a marked improvement in reconstruction loss, as shown in Figure 13c. The skip connection fusion enables the decoder to leverage both high-level features from the transformer blocks and low-level features from the encoder, leading to more stable and smoother convergence of the loss curves and improving the accuracy of the reconstructed input data.
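To make the role of the patch size concrete, the sketch below shows how a non-overlapping patch embedding turns a 512-sample input into a token sequence. For a single input channel, a Conv1d layer whose kernel size equals its stride is equivalent to reshaping the signal into patches and applying one shared linear projection, which is what the code does. The random weights are hypothetical; the input length of 512 and embedding dimension of 256 follow Table 4.

```python
import numpy as np

def patch_embed(x, weight, bias, patch_size=16):
    """Non-overlapping patch embedding (Conv1d with kernel = stride = patch_size)."""
    batch, length = x.shape
    n_patches = length // patch_size
    patches = x[:, :n_patches * patch_size].reshape(batch, n_patches, patch_size)
    return patches @ weight + bias                 # shape (B, n_patches, embed_dim)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))                  # batch of 4 input segments
shapes = {}
for p in (8, 16, 32):
    w = rng.standard_normal((p, 256)) * 0.02       # hypothetical projection weights
    shapes[p] = patch_embed(x, w, np.zeros(256), patch_size=p).shape
# shapes -> {8: (4, 64, 256), 16: (4, 32, 256), 32: (4, 16, 256)}
```

Smaller patches give the Transformer more, shorter tokens (64 for a patch size of 8), while larger patches give fewer, coarser tokens (16 for a patch size of 32); a patch size of 16 sits between the two with 32 tokens.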

3.8. Comparative Experiments with Other State-of-the-Art Methods for RUL Prediction

To further validate the usefulness of our developed method, a comparative analysis was conducted with four alternative models: the memory fusion network (CLSTMF) [39], self-adaptive graph convolutional networks with self-attention (SAGCN-SA) [40], Time Transformer convolutional LSTM (TT-ConvLSTM) [41], and TCN-Transformer [42]. The first three models were designed for feature extraction and direct RUL prediction, whereas the TCN-Transformer model adopted a two-stage degradation process, considering both HI construction and RUL mapping.
The best-case results differ across models. The CLSTMF model reached its lowest RMSE of 0.051 on B2-1, while the TT-ConvLSTM model reached its lowest RMSE and MAE of 0.072 and 0.052, respectively, on B1-5. The TCN-Transformer model performed relatively well, with its lowest RMSE of 0.0549 and an MAE of 0.0441 on B2-5. The proposed two-stage degradation model outperformed most of the alternatives, yielding its lowest RMSE of 0.0441 and MAE of 0.0321 on B1-2 and achieving the lowest RMSE on five of the eight test bearings. The results of the comparative experiments are summarized in Table 7, where the RMSE and MAE values for each model are provided. This emphasizes the effectiveness of our approach in predicting RUL through the two-stage degradation process, including both point prediction and interval prediction.

4. Conclusions

This study proposed the HCVT-WD framework for constructing HIs in an unsupervised manner and predicting the RUL of bearings. Raw vibration signals are processed by a sequential CNN–vision Transformer architecture, eliminating manual feature engineering while capturing local spatial patterns and long-range temporal dependencies. The HI is defined as the Wasserstein distance between encoded representations of healthy and degraded states, providing a quantitative measure of degradation severity. Experimental validation on the XJTU-SY bearing dataset shows that the proposed HI outperforms the compared state-of-the-art methods in monotonicity, trendability, and prognosability. For RUL estimation, the HI is fed into a CNN-BiLSTM regressor that models both temporal and nonlinear degradation dynamics. A Bayesian neural network, implemented through MC dropout, yields uncertainty-aware predictions and confidence intervals, which support risk-informed maintenance decisions for safety- and mission-critical applications. The HCVT-WD model requires minimal preprocessing and operates without full lifecycle data, making it adaptable to real-world industrial applications. Its capacity to yield robust HIs and quantify predictive uncertainty makes it a promising approach for PHM in complex systems.
Future research will prioritize the development of adaptable HIs across varying operational conditions, validated through extensive testing on heterogeneous datasets to broaden their applicability. Additionally, attention will be directed towards addressing the uncertainty present in both the measurement process and the model’s predictions. This includes the integration of measurement uncertainty as well as other sources of uncertainty, such as model and environmental factors, into the development of more robust and reliable predictive models. Furthermore, future work will focus on fault diagnosis and prognosis through HI construction in a dual-task learning framework, with an emphasis on incorporating zero-fault-shot learning techniques to improve the model’s adaptability to unseen fault types and operational conditions.

Author Contributions

Methodology, A.D., H.-Z.H. and C.-G.H.; Investigation, S.K., T.H. and S.G.N.; Writing—original draft, A.D.; Writing—review and editing, C.-G.H.; Funding acquisition, H.-Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the National Natural Science Foundation of China (Grant No. 52372349) and the Natural Science Foundation of Sichuan Province (Grant No. 23NSFSC0420).

Data Availability Statement

The study’s original contributions are outlined in the article, and any further questions can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Abbreviations and Nomenclatures
| Term | Abbreviation | Term | Abbreviation |
| --- | --- | --- | --- |
| Prognostics and health management | PHM | Monte Carlo | MC |
| Remaining useful life | RUL | Multi-head self-attention | MHSA |
| Health indicator | HI | Feedforward network | FFN |
| Empirical mode decomposition | EMD | Fully connected | FC |
| Convolutional neural networks | CNN | Kernel density estimation | KDE |
| Long short-term memory | LSTM | Root mean square error | RMSE |
| Wasserstein distance | WD | Mean absolute error | MAE |
| Hybrid convolutional vision Transformer with Wasserstein distance | HCVT-WD | Fast Fourier transform | FFT |
| Vision Transformer | Vi-T | Bidirectional long short-term memory | Bi-LSTM |

| Symbol | Description | Symbol | Description |
| --- | --- | --- | --- |
| $\{x_i^{train}\}_{i=1}^{N_{train}}$ | Train datasets | $X_i^{raw}$ | Raw signals |
| $\{x_i^{test}\}_{i=1}^{N_{test}}$ | Test datasets | $Z_{train}$ | Train encoded features |
| $Z_{test}$ | Test encoded features | $\sigma$ | Standard deviation |
| $\mu$ | Mean | | |

References

  1. Qi, F.; Huang, M. Joint Optimization of Maintenance and Spares Inventory Policy for a Series-Parallel System Considering Dependent Failure Processes. Reliab. Eng. Syst. Saf. 2024, 247, 110116.
  2. Pei, X.; Li, X.; Gao, L. A Novel Machinery RUL Prediction Method Based on Exponential Model and Cross-Domain Health Indicator Considering First-to-End Prediction Time. Mech. Syst. Signal Process. 2024, 209, 111122.
  3. Huang, C.-G.; Huang, H.-Z.; Li, Y.-F.; Peng, W. A Novel Deep Convolutional Neural Network-Bootstrap Integrated Method for RUL Prediction of Rolling Bearing. J. Manuf. Syst. 2021, 61, 757–772.
  4. Khanal, S.; Huang, H.-Z.; Huang, C.-G.; Dahal, A.; Huang, T.; Niazi, S.G. Domain-Specific Dual Network with Unsupervised Domain Adaptation for Transfer Fault Prognosis Across Machines Using Multiple Source Domains. IEEE Trans. Instrum. Meas. 2025, 74, 3527813.
  5. Zhu, R.; Peng, W.; Wang, D.; Huang, C.-G. Bayesian Transfer Learning with Active Querying for Intelligent Cross-Machine Fault Prognosis under Limited Data. Mech. Syst. Signal Process. 2023, 183, 109628.
  6. Huang, H.-Z.; Li, H.; Shi, Y.; Huang, T.; Yang, Z.; He, L.; Liu, Y.; Jiang, C.; Li, Y.-F.; Beer, M.; et al. Theory and Application of Possibility and Evidence in Reliability Analysis and Design Optimization. J. Reliab. Sci. Eng. 2025, 1, 015007.
  7. Huang, T.; Xiahou, T.; Mi, J.; Chen, H.; Huang, H.-Z.; Liu, Y. Merging Multi-Level Evidential Observations for Dynamic Reliability Assessment of Hierarchical Multi-State Systems: A Dynamic Bayesian Network Approach. Reliab. Eng. Syst. Saf. 2024, 249, 110225.
  8. Huang, T.; Zhang, Q.; Beer, M.; Liu, Y.; Huang, H.-Z. A Dynamic Reliability Assessment Method for Multi-State Manufacturing System by Merging Imprecise Observational Information. Reliab. Eng. Syst. Saf. 2025, 266, 111722.
  9. Si, X.-S.; Wang, W.; Hu, C.-H.; Zhou, D.-H.; Pecht, M.G. Remaining Useful Life Estimation Based on a Nonlinear Diffusion Degradation Process. IEEE Trans. Reliab. 2012, 61, 50–67.
  10. Lei, Y.; Li, N.; Gontarz, S.; Lin, J.; Radkowski, S.; Dybala, J. A Model-Based Method for Remaining Useful Life Prediction of Machinery. IEEE Trans. Reliab. 2016, 65, 1314–1326.
  11. Zhuang, J.; Chen, Y.; Zhao, X.; Jia, M.; Feng, K. A Graph-Embedded Subdomain Adaptation Approach for Remaining Useful Life Prediction of Industrial IoT Systems. IEEE Internet Things J. 2024, 11, 22903–22914.
  12. Zhu, R.; Chen, Y.; Peng, W.; Ye, Z.-S. Bayesian Deep-Learning for RUL Prediction: An Active Learning Perspective. Reliab. Eng. Syst. Saf. 2022, 228, 108758.
  13. Huang, C.-G.; Li, H.; Peng, W.; Tang, L.C.; Ye, Z.-S. Personalized Federated Transfer Learning for Cycle-Life Prediction of Lithium-Ion Batteries in Heterogeneous Clients With Data Privacy Protection. IEEE Internet Things J. 2024, 11, 36895–36906.
  14. Lei, Y.; Li, N.; Guo, L.; Li, N.; Yan, T.; Lin, J. Machinery Health Prognostics: A Systematic Review from Data Acquisition to RUL Prediction. Mech. Syst. Signal Process. 2018, 104, 799–834.
  15. Dong, S.; Sheng, J.; Liu, Z.; Zhong, L.; Wei, H. Bearing Remain Life Prediction Based on Weighted Complex SVM Models. J. Vibroeng. 2016, 18, 3636–3653.
  16. Guo, R.; Wang, Y.; Zhang, H.; Zhang, G. Remaining Useful Life Prediction for Rolling Bearings Using EMD-RISI-LSTM. IEEE Trans. Instrum. Meas. 2021, 70, 3509812.
  17. Zhang, T.; Wang, Q.; Shu, Y.; Xiao, W.; Ma, W. Remaining Useful Life Prediction for Rolling Bearings with a Novel Entropy-Based Health Indicator and Improved Particle Filter Algorithm. IEEE Access 2023, 11, 3062–3079.
  18. Zhao, H.; Liu, H.; Jin, Y.; Dang, X.; Deng, W. Feature Extraction for Data-Driven Remaining Useful Life Prediction of Rolling Bearings. IEEE Trans. Instrum. Meas. 2021, 70, 3511910.
  19. Long, Y.; Pang, Q.; Zhu, G.; Cheng, J.; Li, X. Remaining Useful Life Prediction of Rolling Bearings Based on Refined Composite Multi-Scale Attention Entropy and Dispersion Entropy. arXiv 2024, arXiv:2406.16967.
  20. Guo, L.; Li, N.; Jia, F.; Lei, Y.; Lin, J. A Recurrent Neural Network Based Health Indicator for Remaining Useful Life Prediction of Bearings. Neurocomputing 2017, 240, 98–109.
  21. Wang, Z.; Guo, J.; Wang, J.; Yang, Y.; Dai, L.; Huang, C.-G.; Wan, J.-L. A Deep Learning Based Health Indicator Construction and Fault Prognosis with Uncertainty Quantification for Rolling Bearings. Meas. Sci. Technol. 2023, 34, 105105.
  22. Li, Z.; Zhang, K.; Lai, X.; Zheng, Q.; Ding, G. A Remaining Useful Life Prediction Method for Rolling Bearing Based on Multi-Channel Fusion Hierarchical Vision Transformer. In Proceedings of the 2023 IEEE 12th Data Driven Control and Learning Systems Conference (DDCLS), Xiangtan, China, 12–14 May 2023; pp. 1025–1029.
  23. Guo, J.; Wang, Z.; Li, H.; Yang, Y.; Huang, C.-G.; Yazdi, M.; Kang, H.S. A Hybrid Prognosis Scheme for Rolling Bearings Based on a Novel Health Indicator and Nonlinear Wiener Process. Reliab. Eng. Syst. Saf. 2024, 245, 110014.
  24. Ma, P.; Li, G.; Zhang, H.; Wang, C.; Li, X. Prediction of Remaining Useful Life of Rolling Bearings Based on Multiscale Efficient Channel Attention CNN and Bidirectional GRU. IEEE Trans. Instrum. Meas. 2024, 73, 2508413.
  25. Guo, L.; Yu, Y.; Duan, A.; Gao, H.; Zhang, J. An Unsupervised Feature Learning Based Health Indicator Construction Method for Performance Assessment of Machines. Mech. Syst. Signal Process. 2022, 167, 108573.
  26. Ma, W.; Guo, L.; Gao, H.; Yu, Y.; Qian, M. A Health Indicator Construction Method Based on Self-Attention Convolutional Autoencoder for Rotating Machine Performance Assessment. Measurement 2022, 204, 112108.
  27. Xu, F.; Wang, L. Constructing a Health Indicator for Bearing Degradation Assessment via an Unsupervised and Enhanced Stacked Autoencoder. Adv. Eng. Inform. 2022, 53, 101708.
  28. De Pater, I.; Mitici, M. Developing Health Indicators and RUL Prognostics for Systems with Few Failure Instances and Varying Operating Conditions Using a LSTM Autoencoder. Eng. Appl. Artif. Intell. 2023, 117, 105582.
  29. Xu, Z.; Bashir, M.; Liu, Q.; Miao, Z.; Wang, X.; Wang, J.; Ekere, N.N. A Novel Health Indicator for Intelligent Prediction of Rolling Bearing Remaining Useful Life Based on Unsupervised Learning Model. Comput. Ind. Eng. 2023, 176, 108999.
  30. Qu, Y.; Fu, S.; Yong, M.; Tian, J.; Lv, Z.; Li, R. Health Indicator Construction and Remaining Useful Life Prediction Based on MSC-LSTM-AE Model for Working Bearings. IEEE Sens. J. 2025, 25, 15525–15535.
  31. Wang, H.; Wang, S.; Sun, W.; Xiang, J. Multi-Sensor Signal Fusion for Tool Wear Condition Monitoring Using Denoising Transformer Auto-Encoder Resnet. J. Manuf. Process. 2024, 124, 1054–1064.
  32. Ni, Q.; Ji, J.C.; Feng, K. Data-Driven Prognostic Scheme for Bearings Based on a Novel Health Indicator and Gated Recurrent Unit Network. IEEE Trans. Ind. Inform. 2023, 19, 1301–1311.
  33. Li, N.; Lei, Y.; Lin, J.; Ding, S. An Improved Exponential Model for Predicting Remaining Useful Life of Rolling Element Bearings. IEEE Trans. Ind. Electron. 2015, 62, 7762–7773.
  34. Kim, J.; Oh, S.; Kim, H.; Choi, W. Tutorial on Time Series Prediction Using 1D-CNN and BiLSTM: A Case Example of Peak Electricity Demand and System Marginal Price Prediction. Eng. Appl. Artif. Intell. 2023, 126, 106817.
  35. Lei, Y.; Tan, T.; Wang, B.; Li, N.; Yan, T.; Yang, J. XJTU-SY Rolling Element Bearing Accelerated Life Test Datasets: A Tutorial. J. Mech. Eng. 2019, 55, 1.
  36. Kaji, M.; Parvizian, J.; van de Venn, H.W. Constructing a Reliable Health Indicator for Bearings Using Convolutional Autoencoder and Continuous Wavelet Transform. Appl. Sci. 2020, 10, 8948.
  37. Lin, P.; Tao, J. A Novel Bearing Health Indicator Construction Method Based on Ensemble Stacked Autoencoder. In Proceedings of the 2019 IEEE International Conference on Prognostics and Health Management (ICPHM), San Francisco, CA, USA, 17–20 June 2019; pp. 1–9.
  38. Xu, F.; Huang, Z.; Yang, F.; Wang, D.; Tsui, K.L. Constructing a Health Indicator for Roller Bearings by Using a Stacked Auto-Encoder with an Exponential Function to Eliminate Concussion. Appl. Soft Comput. 2020, 89, 106119.
  39. Li, X.; Zhang, W.; Ding, Q. Deep Learning-Based Remaining Useful Life Estimation of Bearings Using Multi-Scale Feature Extraction. Reliab. Eng. Syst. Saf. 2019, 182, 208–218.
  40. Wei, Y.; Wu, D.; Terpenny, J. Remaining Useful Life Prediction Using Graph Convolutional Attention Networks with Temporal Convolution-Aware Nested Residual Connections. Reliab. Eng. Syst. Saf. 2024, 242, 109776.
  41. Niazi, S.G.; Huang, T.; Zhou, H.; Bai, S.; Huang, H.-Z. Multi-Scale Time Series Analysis Using TT-ConvLSTM Technique for Bearing Remaining Useful Life Prediction. Mech. Syst. Signal Process. 2024, 206, 110888.
  42. Cao, W.; Meng, Z.; Li, J.; Wu, J.; Fan, F. A Remaining Useful Life Prediction Method for Rolling Bearing Based on TCN-Transformer. IEEE Trans. Instrum. Meas. 2025, 74, 3501309.
Figure 1. The proposed framework.
Figure 2. The proposed HCVT-WD model.
Figure 3. CNN-BiLSTM model for HI to RUL mapping.
Figure 4. The experiment setup of the test rig of XJTU-SY.
Figure 5. Loss variation during the training process.
Figure 6. Raw and Gaussian-smoothed health indicators: (a) Bearing1-1. (b) Bearing2-2.
Figure 7. Health indicators and FPT of different bearings.
Figure 8. Comparison of different HI models: (a) Bearing1-2. (b) Bearing1-3. (c) Bearing2-2. (d) Bearing2-5.
Figure 9. Visualization analysis of test encoded features.
Figure 10. Attention maps captured by MHSA of Vi-T.
Figure 11. RUL prediction of bearings using different models: (a) Bearing1-2. (b) Bearing1-3. (c) Bearing1-5. (d) Bearing2-1. (e) Bearing2-2. (f) Bearing2-5. (g) Bearing3-3. (h) Bearing3-5.
Figure 12. RUL prediction and uncertainty quantification: (a) Bearing1-5. (b) Bearing2-1.
Figure 13. Training and validation losses for ablation studies: (a) CNN patch embedding. (b) Patch sizes. (c) Skip fusion.
Table 1. The operating conditions in XJTU-SY.

| Operating Condition | Speed | Load | Bearing | Lifetime (min) | Type of Faults |
| --- | --- | --- | --- | --- | --- |
| Condition 1 | 2100 rpm | 12 kN | B1_1 | 123 | Outer race |
| | | | B1_2 | 161 | Outer race |
| | | | B1_3 | 158 | Outer race |
| | | | B1_4 | 122 | Outer race |
| | | | B1_5 | 52 | Outer race and inner race |
| Condition 2 | 2250 rpm | 11 kN | B2_1 | 491 | Inner race |
| | | | B2_2 | 161 | Outer race |
| | | | B2_3 | 533 | Cage |
| | | | B2_4 | 42 | Outer race |
| | | | B2_5 | 339 | Outer race |
| Condition 3 | 2400 rpm | 10 kN | B3_1 | 2538 | Outer race |
| | | | B3_2 | 2496 | Inner race, ball, cage, and outer race |
| | | | B3_3 | 371 | Inner race |
| | | | B3_4 | 1515 | Inner race |
| | | | B3_5 | 114 | Outer race |
Table 2. Description of the experimental details of the XJTU-SY dataset.

| Bearing Specification | Value | Measurement Detail | Setting |
| --- | --- | --- | --- |
| Number of rolling elements | 8 | Location of the sensors | Positioned at 90° to each other |
| Rolling element diameter | 7.92 mm | Type of sensor | PCB 352C33 |
| Inner race diameter | 29.3 mm | Sampling interval | 1 min |
| Outer race diameter | 39.8 mm | Sampling frequency | 25,600 Hz |
| Mean diameter | 34.5 mm | Sampling duration | 1.28 s |
Table 3. Parameter settings of the HCVT-WD.

| Parameters | Batch Size | Heads | Vi-T Encoder Layers | Dmodel | MLP Ratio | Learning Rate | Optimizer |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Settings | 128 | 8 | 8 | 256 | 4 | 0.0003 | AdamW |
Table 4. Structure of the HCVT-WD.

| Section | Layer | Parameters/Configuration | Output Shape |
| --- | --- | --- | --- |
| Encoder | Input | dim = 512 | (B, 512) |
| | Patch Embedding (Conv1d) | Patch_size = 16, embed_dimension = 256, k = 16, s = 16 | (B, 32, 256) |
| | Positional Embedding | Fixed sinusoidal | (B, 32, 256) |
| | Transformer × 8 | Number of heads = 8, drop = 0.1 | (B, 64, 256) |
| | Global average pooling | - | (B, 256) |
| | FC1 + Drop | 256 → 128, drop = 0.5 | (B, 128) |
| | FC2 + Drop | 128 → 32, drop = 0.1 (latent) | (B, 32) |
| Decoder | FC1 → FC2 | 32 → 128 → 256 (ReLU) | (B, 256) |
| | Expand & Repeat | Match 32 patches | (B, 256, 32) |
| | Skip Fusion + Conv1d | Concat (512 → 256, k = 3, p = 1) | (B, 256, 32) |
| | ConvTranspose1d | 256 → 256, k = 16, s = 16 | (B, 256, 512) |
| | Conv1d (head) | 256 → 1, k = 1 | (B, 256, 64) |
| | Output | Squeeze | (B, 512) |
Table 5. Evaluation results of HI.

| Model | Monotonicity | Trendability | Prognosability |
| --- | --- | --- | --- |
| ES-AE [37] | 0.8415 | 0.7726 | 0.8657 |
| DCN-AE [38] | 0.9238 | 0.8188 | 0.8908 |
| MSC-AE [25] | 0.9319 | 0.8279 | 0.8898 |
| HCVT-WD (Proposed) | 0.9742 | 0.9376 | 0.9673 |
Table 6. Performance comparisons of different models.

| Model | Metric | B1-2 | B1-3 | B1-5 | B2-1 | B2-2 | B2-5 | B3-3 | B3-5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNN | MAE | 0.0491 | 0.0713 | 0.0502 | 0.0601 | 0.0906 | 0.0873 | 0.0859 | 0.1078 |
| | RMSE | 0.0616 | 0.0892 | 0.0605 | 0.0708 | 0.1147 | 0.1071 | 0.1066 | 0.1196 |
| Bi-LSTM | MAE | 0.0461 | 0.1075 | 0.0738 | 0.0659 | 0.0706 | 0.0709 | 0.0562 | 0.0843 |
| | RMSE | 0.0573 | 0.1184 | 0.0822 | 0.0812 | 0.0874 | 0.0876 | 0.0632 | 0.0942 |
| CNN-BiLSTM | MAE | 0.0321 | 0.0713 | 0.0391 | 0.0463 | 0.0652 | 0.0527 | 0.0374 | 0.0521 |
| | RMSE | 0.0441 | 0.0887 | 0.0477 | 0.0564 | 0.0758 | 0.0667 | 0.0459 | 0.0639 |
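Table 7. Performance comparison of different models.

| Test Bearing | CLSTMF RMSE | SAGCN-SA RMSE | TT-ConvLSTM RMSE | TT-ConvLSTM MAE | TCN-Transformer RMSE | TCN-Transformer MAE | Proposed RMSE | Proposed MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| B1-2 | 0.064 | 0.079 | 0.185 | 0.155 | 0.0621 | 0.0512 | 0.0441 | 0.0321 |
| B1-3 | 0.181 | 0.123 | 0.101 | 0.062 | 0.0748 | 0.0625 | 0.0887 | 0.0713 |
| B1-5 | 0.181 | 0.188 | 0.072 | 0.052 | 0.0572 | 0.0430 | 0.0477 | 0.0391 |
| B2-1 | 0.051 | 0.181 | 0.094 | 0.079 | 0.0673 | 0.0549 | 0.0564 | 0.0463 |
| B2-2 | 0.156 | 0.244 | 0.099 | 0.081 | 0.0784 | 0.0621 | 0.0752 | 0.0654 |
| B2-5 | 0.124 | 0.145 | 0.101 | 0.086 | 0.0549 | 0.0441 | 0.0667 | 0.0527 |
| B3-3 | 0.156 | - | 0.143 | 0.127 | 0.0632 | 0.0519 | 0.0459 | 0.0374 |
| B3-5 | 0.144 | - | 0.199 | 0.161 | 0.0733 | 0.0614 | 0.0678 | 0.0521 |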
