1. Introduction
In parallel with the burgeoning advancements in the industrial Internet of Things, artificial intelligence algorithms, and computational capabilities, DL-based data-driven methodologies have emerged as predominant paradigms in PHM for rotating machinery [1,2,3]. Bearings, as critical components in sectors such as power generation, aerospace, and manufacturing, directly affect system reliability and performance, so accurate RUL estimation is essential for facilitating predictive maintenance, reducing unplanned downtime, and ultimately extending the operational lifespan of machinery [4,5,6,7,8]. RUL prediction approaches are commonly divided into physics-based models [9,10], which demand intricate mechanical insight and precise simulation, and data-driven models [11,12,13], which harness operational data to discern degradation patterns and adapt to multifaceted systems without deep physical domain knowledge. Consequently, data-driven techniques are increasingly favored for handling the large, high-dimensional datasets common in modern predictive maintenance.
Constructing efficacious data-driven RUL models entails four pivotal stages: sensor data acquisition and preprocessing (e.g., denoising, normalization, dimensionality reduction); degradation feature extraction and health indicator (HI) formulation, which combines signal processing with domain expertise; FPT detection; and RUL regression using machine learning [14]. HI construction is critical because it determines how faithfully degradation is represented and thus influences prediction accuracy. In this regard, Dong et al. [15] applied principal component analysis to combine features, which were then input into a weighted complex support vector machine model to predict the RUL of bearings. Similarly, Guo et al. [16] introduced a degradation indicator for predicting the RUL of bearings based on EMD. Zhang et al. [17] combined multi-scale entropy features into a health indicator for accurate RUL prediction of rolling bearings. Zhao et al. [18] modeled degraded features using the derivative of a curve fitted to the maximum power spectral density. Long et al. [19] constructed an HI based on EMD, refined composite multi-scale attention entropy, and dispersion entropy. These studies primarily relied on signal processing techniques to extract features from raw signals and develop degradation indicators. Such methods offer clear physical interpretation and are easy to validate. However, signal processing-based feature extraction has its limitations: the techniques often depend on expert knowledge and experience, may lack generalizability across equipment types, and can miss implicit nonlinear features present in complex data.
Conversely, DL paradigms automate HI extraction from raw vibrations. For instance, Ref. [20] pioneered the application of recurrent neural networks to autonomously extract HIs from selected features, creating an early data-driven framework for characterizing the deterioration of rolling bearings. Similarly, Ref. [21] utilized a hybrid model combining a CNN and bidirectional gated recurrent units to derive HIs from raw vibration signals for roller bearings. Li et al. [22] used a multi-channel fusion hierarchical vision Transformer with wavelet packet decomposition representations to model degraded features for RUL prediction. Guo et al. [23] proposed a hybrid approach that first constructs a nonlinear HI using complete ensemble empirical mode decomposition with adaptive noise and kernel principal component analysis, then extracts multi-domain features from vibration signals and applies a dual-channel Transformer network with convolutional attention modules to generate HIs for other bearings. Ping et al. [24] proposed a multi-scale efficient channel attention convolutional neural network and bidirectional gated recurrent unit model for RUL prediction of rolling bearings, extracting multi-scale temporal features from Gramian angular difference field images. Notwithstanding these advancements, such predominantly supervised techniques necessitate abundant labeled data, impeding their applicability in the prevalent unlabeled scenarios of real-world industry.
To facilitate data-driven automatic construction of HIs, this study investigates unsupervised learning paradigms. Autoencoders provide a viable framework, enabling label-agnostic HI development through feature extraction and degradation trend modeling. Guo et al. [25] proposed an unsupervised HI construction method using a multi-scale convolutional autoencoder optimized by a genetic algorithm, effectively identifying machine degradation through feature learning and data similarity. Ma et al. [26] developed a self-attention convolutional autoencoder, demonstrating enhanced trendability and scale similarity on bearing and milling cutter datasets. Xu et al. [27] proposed an enhanced stacked autoencoder with an exponentially weighted moving average for health indicator construction from unlabeled vibration signals, improving noise reduction and eliminating the need for manual feature selection. De et al. [28] developed an LSTM autoencoder with attention for health indicator construction in aircraft systems. Xu et al. [29] developed a multi-scale, multi-head attention mechanism with an automatic encoder–decoder model, extracted multi-scale features from raw vibration signals, and measured the Wasserstein distance between healthy and unhealthy features, demonstrating superiority over other similarity-based methods and providing a more reliable HI. Qu et al. [30] proposed an unsupervised method for bearing health indicator construction using multi-scale convolution and long short-term memory combined with an autoencoder for feature extraction and support vector regression to build health indicators, improving degradation trend detection. Wang et al. [31] developed a denoising Transformer autoencoder for tool condition monitoring, enhancing feature extraction and resistance to noise.
Although autoencoders are extensively employed in existing studies for unsupervised feature learning, they exhibit shortcomings in processing time-series signals, particularly in capturing the temporal dependencies of encoded features. Additionally, conventional HI construction techniques frequently necessitate manual selection and design of dominant features, rendering them inefficient and error-prone, especially with high-dimensional or nonlinear datasets. This reliance on manual feature engineering can lead to inaccurate similarity measurements and diminished HI fidelity. To address these challenges, this paper proposes an HCVT-WD for bearing HI construction and RUL prediction. The proposed method encompasses four key stages: (1) an unsupervised HCVT-WD autoencoder framework is developed to extract features from FFT-transformed raw vibration signals without manual feature engineering or labeling; (2) a CNN and Vi-T are integrated to capture local spatial patterns and global temporal correlations; (3) the Wasserstein distance is used to quantify divergences between healthy and degraded feature distributions for robust HI formulation; and (4) an HI-driven RUL prediction pipeline provides accurate forecasts with uncertainty quantification.
The structure of the paper is as follows: Section 2 presents the proposed framework, specifically highlighting the development of HIs using the proposed HCVT-WD model. Section 3 details the experimental setup, results, and analysis. Section 4 summarizes the key results and proposes possible directions for future studies.
2. Proposed Framework
The overall workflow of the proposed method is illustrated in Figure 1, with a detailed clarification of the HI construction process provided in Algorithm 1. In the initial phase, the HCVT-WD model is trained using a subset of the dataset, $X_h = \{x_i\}_{i=1}^{N_h}$, composed of non-overlapping windows extracted from vibration signals recorded during healthy operation. Here, $N_h$ refers to the number of time instances, with each instance corresponding to a specific time window from the healthy-state data. The training process minimizes the reconstruction error of the network. Once trained, the encoded features from the HCVT-WD are extracted. In the next phase, the model is applied to the whole dataset, $X$, which consists of both healthy- and unhealthy-state data samples, prepared in the same way as the training dataset. The WD between the combined encoded features of the healthy state and the encoded features of all other states is computed and used as the HI. Finally, the HI is modeled using a CNN-BiLSTM network with MC simulations to predict the RUL and quantify uncertainty.
Algorithm 1 Train the HCVT-WD model and construct the HI
Input: frequency-domain raw vibration signals (healthy and unhealthy states) $X_h$ and $X$, and initial parameters $\theta$.
Build and initialize: the HCVT-WD model; treat $X_h$ as the healthy training data and $X$ as the whole dataset.
While not converged do
  Compute the decoded output $\hat{X}_h$ from the model.
  Minimize the reconstruction loss between $X_h$ and $\hat{X}_h$.
  Update the parameters $\theta$ of the model.
End while
Obtain the encoded features $Z_h$ and $Z$ along with the trained HCVT-WD model parameters $\theta^{*}$.
Calculate the Wasserstein distance between the encoded $Z$ and $Z_h$ as the HI.
Detect the FPT using the $3\sigma$ criterion.
Output: HI, FPT
2.1. HCVT Autoencoder Model
In this section, we employ a hybrid model consisting of a 1D-CNN and Vi-T as the encoder. The encoder extracts deep features from the inputs using the combined strengths of the CNN and Vi-T architectures, and these encoded features are then upsampled to the original input dimensions by a CNN-based decoder.
2.1.1. 1D-CNN Patch and Position Embedding
To process time series data effectively, the signal is first split into smaller segments. Patch embedding serves this purpose before applying the Vi-T to time series data, as it enables the Transformer model to process both one-dimensional and multi-dimensional time series data effectively.
The raw vibration signals are initially segmented into non-overlapping segments. Denote each segment by $x \in \mathbb{R}^{L \times D}$, where $L$ and $D$ are the length of the segments and the dimension of the input signals, respectively. Each segment is further partitioned into fixed-size patches, each represented as $p_j \in \mathbb{R}^{P \times D}$, where $P$ is the length of each patch; it follows that $N = L/P$, where $N$ is the number of patches. A one-dimensional convolution operation is then applied to each patch, as shown in Equation (1). The size of the convolution kernel is identical to the patch length $P$, and the stride is also $P$, to preserve the temporal dynamics in the raw signals:

$e_{j,k} = W_k * p_j + b_k, \quad k = 1, 2, \ldots, K,$ (1)

where $W_k$ and $b_k$ represent the weight and bias of the k-th convolutional kernel, respectively, and $K$ denotes the number of convolutional kernels as well as the dimension of the time-series patch embedding. The term $e_{j,k}$ corresponds to the embedding outcome of the j-th patch. The embedding results for the entire set of patches in each sample are collected in Equation (2):

$E = [e_1, e_2, \ldots, e_N] \in \mathbb{R}^{N \times K}.$ (2)
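To make the embedding concrete, the following PyTorch sketch implements a 1D convolutional patch embedding whose kernel size and stride both equal the patch length, so each output position corresponds to exactly one non-overlapping patch. The embedding dimension of 64 is an illustrative assumption; the patch length of 16 and the (128, 1, 512) input shape follow the settings reported later in the experiments.

import torch
import torch.nn as nn

class PatchEmbed1D(nn.Module):
    def __init__(self, in_channels=1, embed_dim=64, patch_len=16):
        super().__init__()
        # kernel_size == stride == patch_len, so each output step maps to
        # one non-overlapping patch, as in Equation (1).
        self.proj = nn.Conv1d(in_channels, embed_dim,
                              kernel_size=patch_len, stride=patch_len)

    def forward(self, x):            # x: (batch, 1, L), e.g., (128, 1, 512)
        e = self.proj(x)             # (batch, K, N) with N = L // patch_len
        return e.transpose(1, 2)     # (batch, N, K): one embedding per patch

tokens = PatchEmbed1D()(torch.randn(128, 1, 512))   # -> (128, 32, 64)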
Since the Vi-T does not inherently recognize the order of patches, positional encoding is incorporated into the patch embeddings to preserve spatial information. Each embedded patch is encoded as follows:

$z_i = e_i + PE_i,$

where the positional encodings for the i-th patch are defined as follows:

$PE_{(i,\,2m)} = \sin\!\left(i / 10000^{2m/d}\right), \quad PE_{(i,\,2m+1)} = \cos\!\left(i / 10000^{2m/d}\right),$

where $PE_{(i,2m)}$ and $PE_{(i,2m+1)}$ represent the positional encodings at the even and odd feature indices for the i-th position within the sequence, and $d$ is the dimensionality of the feature vectors.
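A minimal implementation of this sinusoidal encoding might look as follows; the resulting table is simply added element-wise to the patch embeddings, and $d$ is assumed even, matching the embedding dimension $K$ above.

import torch

def sinusoidal_position_encoding(num_patches: int, d: int) -> torch.Tensor:
    # Build the (num_patches, d) sine/cosine table described above.
    pos = torch.arange(num_patches, dtype=torch.float32).unsqueeze(1)      # (N, 1)
    div = torch.pow(10000.0, torch.arange(0, d, 2, dtype=torch.float32) / d)
    pe = torch.zeros(num_patches, d)
    pe[:, 0::2] = torch.sin(pos / div)   # even feature indices
    pe[:, 1::2] = torch.cos(pos / div)   # odd feature indices
    return pe                            # used as z_i = e_i + PE_i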
2.1.2. Vision Transformer Encoder
The patch embeddings, augmented with positional encodings, are subsequently processed through an MHSA module. This module is built upon the scaled dot-product attention mechanism, which facilitates interactions between all positions in the sequence. The MHSA architecture extends this principle by employing multiple attention heads in parallel; each head has independent parameters, learns distinct attention patterns, and captures a diverse set of semantic relationships within the data. A schematic of the MHSA mechanism is illustrated in Figure 2.
The i-th attention head is defined as follows:

$\text{head}_i = \text{Attention}\!\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right),$

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are learnable weight matrices and the vectors $Q$, $K$, and $V$ represent the query, key, and value, respectively. The scaled dot-product attention is computed as follows:

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V,$

where $d_k$ denotes the dimensionality of the key vectors; this scaling counteracts the issue of small gradients and produces a more stable attention distribution. The output of the MHSA mechanism is defined as follows:

$\text{MHSA}(Q, K, V) = \text{Concat}\!\left(\text{head}_1, \ldots, \text{head}_h\right) W^{O},$

where $W^{O}$ represents the trainable output projection weight.
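The scaled dot-product at the core of each head reduces to a few lines of PyTorch, as sketched below under the assumed tensor layout noted in the comments; in practice, the full multi-head block with the output projection $W^{O}$ is available directly as torch.nn.MultiheadAttention(embed_dim, num_heads).

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, heads, N, d_k); dividing by sqrt(d_k) stabilizes softmax.
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, heads, N, N)
    return torch.softmax(scores, dim=-1) @ V            # weighted sum of values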
The output of the MHSA, denoted by $Z$, is then passed through an FFN with residual connections and layer normalization:

$Z' = \text{LayerNorm}\!\left(Z + \text{MHSA}(Z)\right), \quad Z'' = \text{LayerNorm}\!\left(Z' + \text{FFN}(Z')\right).$

The FFN transforms each element in the sequence independently. It comprises two fully connected layers with a nonlinear activation between them and is mathematically expressed as follows:

$\text{FFN}(z) = \max\!\left(0, z W_1 + b_1\right) W_2 + b_2,$

where $W_1$ and $b_1$ are the weight and bias of the first linear layer, which projects the input into a higher-dimensional space, and $W_2$ and $b_2$ are the weight and bias of the second linear layer, which projects the transformed features back to the original dimensionality.
The final output of the Vi-T encoder retains the feature representations for all time patches, effectively capturing the temporal dependencies learned within the sequence. These output features are then passed through FC layers with dropout regularization, performing downsampling to obtain the latent features $z$, which are also called the encoded features and are subsequently used for HI creation.
2.1.3. 1D Deconvolution
The decoding process begins by taking the final latent features and progressively reconstructing the original time series data. To achieve this, we first apply FC layers to expand the latent features, followed by a 1D deconvolution operation. The 1D deconvolution, also known as transposed convolution, allows us to upsample the latent representations to the original time series dimensions.
Mathematically, given the latent feature vector $z \in \mathbb{R}^{B \times d_z}$, where $B$ is the batch size and $d_z$ is the size of the latent space, we use FC layers to expand the latent vector back to a higher dimension that matches the size of the Vi-T encoder output. The expanded features are then reshaped and passed through the 1D deconvolution layers to reconstruct the original input signal:

$\hat{x} = \text{Deconv1D}\!\left(\text{Reshape}\!\left(\text{FC}(z)\right)\right).$
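As an illustration, a decoder of this form can be sketched in PyTorch as below. The layer widths, kernel sizes, and strides are assumptions chosen so that a 32-dimensional latent vector is expanded back to a 512-point signal; they are not the paper's exact configuration.

import torch.nn as nn

decoder = nn.Sequential(
    nn.Linear(32, 64 * 32),            # FC expansion: (B, 32) -> (B, 2048)
    nn.Unflatten(1, (64, 32)),         # reshape to (B, 64 channels, 32 steps)
    nn.ConvTranspose1d(64, 32, kernel_size=4, stride=4),   # -> (B, 32, 128)
    nn.ReLU(),
    nn.ConvTranspose1d(32, 1, kernel_size=4, stride=4),    # -> (B, 1, 512)
)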
During this reconstruction process, the HCVT-WD model is optimized using a composite loss function $\mathcal{L}$ that combines reconstruction accuracy with regularization:

$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left\| x_i - \hat{x}_i \right\|^{2} + \lambda \sum_{l=1}^{L} \left\| \theta_l \right\|_2^{2}.$

The first term calculates the mean squared error (MSE) between the original input signals $x_i$ and their reconstructions $\hat{x}_i$, where $N$ denotes the number of samples and $i$ indexes individual samples. The second term applies L2 regularization to the model's trainable parameters $\theta_l$ (with $L$ total parameter tensors), controlled by the coefficient $\lambda$. This regularization prevents overfitting by penalizing large weight values while maintaining model generalizability. The complete optimization procedure, implemented using the Adam optimizer, is detailed in Table 1.
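A direct transcription of this loss might look as follows; the coefficient value is an assumed placeholder, and in practice the same regularization effect is often obtained by setting weight_decay in the Adam/AdamW optimizer instead of summing the penalty explicitly.

import torch

def composite_loss(x, x_hat, model, lam=1e-4):
    # MSE reconstruction term plus L2 penalty on all trainable parameters;
    # lam is an assumed value for the regularization coefficient.
    mse = torch.mean((x - x_hat) ** 2)
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return mse + lam * l2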
2.2. HI Construction
The HI is calculated using the WD between the distribution of encoded test features and the empirical distribution of the healthy training encoded features. The WD measures the minimum cost required to transform one probability distribution into another, quantifying how far the test data deviate from the healthy distribution in the embedded feature space. Owing to its ability to effectively quantify distributional shifts, the Wasserstein distance proved to be a more robust metric for constructing an HI than other distance measures [32].
The empirical distribution of the healthy training encoded features is defined as follows:

$\mu_h = \frac{1}{N_h} \sum_{i=1}^{N_h} \delta_{z_i^{h}},$

where $z_i^{h} \in \mathbb{R}^{d}$ are the encoded training features of dimension $d$, and $\delta_{z_i^{h}}$ denotes a Dirac delta centered at $z_i^{h}$. Similarly, for each test sample $z_j$, its distribution is defined as $\mu_j = \delta_{z_j}$, and finally, the HI for the j-th test sample is computed as follows:

$\text{HI}_j = W\!\left(\tilde{\mu}_j, \tilde{\mu}_h\right),$

where the tilde denotes L1 normalization across all feature dimensions.
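As a rough sketch, the per-sample HI can be computed with SciPy's one-dimensional Wasserstein distance by treating the L1-normalized feature values as samples of a distribution; pooling the healthy features by flattening, and using absolute values for the normalization, are simplifying assumptions rather than the paper's exact procedure.

import numpy as np
from scipy.stats import wasserstein_distance

def health_indicator(z_test, Z_healthy):
    # z_test: (d,) encoded test features; Z_healthy: (N_h, d) healthy features.
    z = np.abs(z_test) / np.abs(z_test).sum()                       # L1 normalization
    zh = np.abs(Z_healthy) / np.abs(Z_healthy).sum(axis=1, keepdims=True)
    return wasserstein_distance(z, zh.ravel())                      # 1-D WD as HI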
2.3. FPT Detection
FPT detection based on the $3\sigma$ criterion involves constructing a monitoring interval for the healthy state by calculating the mean $\mu$ and standard deviation $\sigma$ of the initial healthy HI values [33]. A potential fault is indicated when the absolute deviation of the current HI value from the mean exceeds three standard deviations, as expressed below:

$\left| \text{HI}_t - \mu \right| > 3\sigma.$
However, to minimize false alarms caused by random fluctuations, the FPT is not triggered immediately upon a single exceedance. Instead, it is confirmed only when three consecutive HI values exceed the threshold, ensuring a more reliable and robust detection of early fault conditions.
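A compact implementation of this confirmation rule is sketched below; the length of the initial healthy segment (n_healthy) is an assumed parameter.

import numpy as np

def detect_fpt(hi, n_healthy=100, consecutive=3):
    # 3-sigma rule: flag the first index where |HI - mu| > 3*sigma holds
    # for `consecutive` successive samples, to suppress false alarms.
    mu, sigma = hi[:n_healthy].mean(), hi[:n_healthy].std()
    run = 0
    for t, v in enumerate(hi):
        run = run + 1 if abs(v - mu) > 3 * sigma else 0
        if run == consecutive:
            return t - consecutive + 1   # first point of the exceedance run
    return None                          # no FPT detected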
2.4. RUL Prediction and Uncertainty Quantification
2.4.1. RUL Prediction
The RUL prediction process for bearings begins with estimating the time interval from FPT detection to eventual failure. When the FPT is identified for a test bearing, its true RUL labels can be established based on the actual remaining operational lifespan. The prediction architecture employs a CNN-Bi-LSTM network [34], as illustrated in Figure 3, and the prediction process is detailed in Algorithm 2. This dual-stage approach enables comprehensive learning of degradation patterns by analyzing both forward and backward sequence relationships. The trained model processes post-FPT HI sequences to establish an accurate mapping between evolving degradation characteristics and remaining lifespan.
Algorithm 2 CNN-BiLSTM network training for prognosis and uncertainty quantification
Input: the HI sequence constructed after the FPT, the RUL labels assigned to each HI value, and initial parameters $\phi$.
Build and initialize: the CNN-BiLSTM model.
While not converged do
  Calculate the loss between the true RUL labels and the predicted RUL.
  Update the parameters $\phi$ of the model.
End while
Obtain the RUL via the trained CNN-BiLSTM model parameters $\phi^{*}$.
Perform MC dropout at inference: run $S$ stochastic forward passes, then compute $\mu$ as the predicted RUL and $\sigma$ as the model uncertainty.
Output: RUL prediction with confidence intervals $[\mu - z_{\alpha/2}\sigma, \ \mu + z_{\alpha/2}\sigma]$.
After the convolution operations in the CNN, the deep features are passed into the Bi-LSTM model. The LSTM network incorporates an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and a memory cell $c_t$. The hidden layer uses the input vector $x_t$ and output vector $h_t$ to implement the activation function and weight updates. The core principle of the LSTM cell is as follows:

$f_t = \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right),$
$i_t = \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right),$
$\tilde{c}_t = \tanh\!\left(W_c [h_{t-1}, x_t] + b_c\right),$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$
$o_t = \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right),$
$h_t = o_t \odot \tanh(c_t),$

where $\tanh$ is the hyperbolic tangent activation function, $\tilde{c}_t$ is an intermediate value obtained by applying the $\tanh$ function to both the input and previous information, $\sigma$ refers to the sigmoid function, and the terms $W$ and $b$ represent the weight matrices and biases, respectively.
The hidden states from both forward and backward directions are then concatenated and fed into FC layers to predict the RUL.
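For illustration, a CNN-Bi-LSTM regressor of this form is sketched below in PyTorch. The convolution width and dropout placement are assumptions; the LSTM sizes (64 and 32 units), the 0.3 dropout rate, and the 10-step HI input windows follow the tuned values reported in Section 3.6.

import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.ReLU())
        self.lstm1 = nn.LSTM(16, 64, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(0.3)
        self.lstm2 = nn.LSTM(128, 32, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Dropout(0.3), nn.Linear(64, 1))

    def forward(self, x):                 # x: (batch, 10, 1) HI windows
        f = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # (batch, 10, 16)
        h, _ = self.lstm1(f)              # forward+backward states: (.., 128)
        h, _ = self.lstm2(self.drop(h))   # (batch, 10, 64)
        return self.head(h[:, -1])        # RUL from the last time step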
2.4.2. Uncertainty Quantification with MC Dropout and KDE
Given an input sequence $x$, our network produces a point prediction $\hat{y}$. Point estimates, however, ignore epistemic and aleatoric sources of uncertainty that arise during equipment operation. To characterize predictive uncertainty without incurring the computational cost of full Bayesian neural networks, we adopt MC dropout as a tractable approximation to Bayesian inference and then recover a full predictive probability density via KDE.
Let $\omega$ denote all stochastic weights of the model and $D$ the training data. The Bayesian predictive distribution is expressed as follows:

$p(y \mid x, D) = \int p(y \mid x, \omega)\, p(\omega \mid D)\, d\omega.$

Following the dropout-as-Bayesian-approximation view, we replace the posterior with a variational distribution induced by dropout masks applied at inference time. Concretely, for each dropout layer $l$, the effective weights are as follows:

$\widetilde{W}_l = W_l \,\text{diag}(m_l), \quad m_l \sim \text{Bernoulli}(p_l)^{d_l},$

where $W_l$ are the learned parameters, $p_l$ is the retain probability, and $d_l$ is the incoming dimensionality. During testing, dropout is kept active, and $S$ stochastic forward passes are performed to generate samples:

$\hat{y}_s = f_{\hat{\omega}_s}(x), \quad s = 1, \ldots, S,$

where $\hat{\omega}_s$ is defined by the sampled dropout masks. From these samples, the predictive mean and variance are estimated as follows:

$\mu = \frac{1}{S} \sum_{s=1}^{S} \hat{y}_s, \quad \sigma^{2} = \frac{1}{S} \sum_{s=1}^{S} \left(\hat{y}_s - \mu\right)^{2}.$

These summarize uncertainty for the interval prediction as follows:

$\left[\mu - z_{\alpha/2}\,\sigma, \ \mu + z_{\alpha/2}\,\sigma\right],$

where $z_{\alpha/2}$ is the critical value from the standard normal distribution corresponding to the desired confidence level. To capture the full shape of the uncertainty, the MC samples are converted into a continuous density with KDE. Let $\{\hat{y}_s\}_{s=1}^{S}$ be the MC predictions for input $x$. The conditional probability density $p(y \mid x)$ is estimated as follows:

$\hat{p}(y \mid x) = \frac{1}{S h} \sum_{s=1}^{S} K\!\left(\frac{y - \hat{y}_s}{h}\right),$

where $K(\cdot)$ is a Gaussian kernel and $h$ is the bandwidth. Selecting the appropriate bandwidth is crucial, as it greatly impacts the smoothness and accuracy of the estimation. A grid search with cross-validation is used to identify the optimal bandwidth, choosing the one that minimizes the mean integrated squared error.
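The full inference procedure, MC dropout followed by KDE over the sampled predictions, can be sketched as follows. The number of passes and the 95% critical value are assumed settings, and SciPy's gaussian_kde (with its tunable bw_method) stands in for the cross-validated bandwidth selection described above.

import numpy as np
import torch
from scipy.stats import gaussian_kde

def mc_dropout_predict(model, x, n_passes=100, z_crit=1.96):
    model.train()                      # keeps dropout layers stochastic at test time
    with torch.no_grad():
        preds = np.array([model(x).item() for _ in range(n_passes)])
    mu, sigma = preds.mean(), preds.std()
    interval = (mu - z_crit * sigma, mu + z_crit * sigma)   # 95% interval
    density = gaussian_kde(preds)      # KDE over the S MC samples
    return mu, sigma, interval, density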
3. Experimental Validation and Analysis
3.1. Dataset Preparation
The proposed HI construction and RUL prediction method is validated on the XJTU-SY bearing dataset [35]. As shown in Figure 4, the test rig is designed to collect vibration signals under varying operating conditions. The experiment involved testing 15 LDK UER204 bearings until failure under controlled operating conditions, as shown in Table 1. Vibration data were recorded by two accelerometers sampling at 25.6 kHz, capturing both vertical and horizontal vibrations. The LDK UER204 bearings were chosen for their stable structural design and consistent performance, making them well suited for repeatable accelerated degradation experiments and clear observation of the full degradation process. A 1.28 s sample was recorded every 60 s during testing. Table 2 further describes the experimental details of the XJTU-SY dataset. In the XJTU-SY bearing dataset, defects on individual bearing components were generated through accelerated degradation tests. During these tests, bearings were continuously operated under heightened load and speed conditions to expedite the wear process and induce failure in a reduced timeframe. The degradation occurred naturally on the inner and outer raceways as well as the rolling elements due to prolonged mechanical stress and friction. The test was concluded when the vibration amplitude surpassed a specified failure threshold (20 g), at which point various types of defects, such as inner race wear, outer race wear, and outer race fracture, were identified.
For this study, only the horizontal vibration signals were considered, as they exhibited clearer trends in the degradation process.
To create the healthy dataset for network training and validation, the Pauta criterion was employed, as in [36]. The initial healthy dataset was constructed from the first 1 min of operation, which was defined as the baseline normal stage under the assumption that the bearing was in a healthy state at the beginning of its life. The Pauta interval $[\mu - 3\sigma, \ \mu + 3\sigma]$ was then calculated from this set. Each new data point falling within this range is added to the healthy dataset, and the criterion is updated; a point falling outside is considered abnormal. The healthy dataset was finalized and locked once a predefined number of consecutive data points (e.g., 500) were classified as abnormal, marking the failure threshold. Then, 80% of the healthy data from the normal state were randomly selected for training, while the remaining 20% were set aside for validation. For example, for Bearing1-1, the total number of samples was 3936, of which 1580 were identified as healthy; 1264 of these healthy samples were randomly selected for training, and 316 were used for validation. The test set consisted of the run-to-failure samples, which were used to extract the encoded features from the proposed model.
Following the determination of the normal dataset, the vibration data were segmented into samples of 1024 data points without overlapping. The FFT was then applied to each sample, and the shape of the data was changed to 512 by retaining only the first half of the FFT spectrum. For the CNN patch embedding, the input for the model was shaped into (128, 1, 512), where 128 represents the batch size, 1 represents the channel dimension, and 512 corresponds to the number of frequency components after FFT.
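The preprocessing described above reduces to a few lines; taking the magnitude of the FFT is an assumption, since the text specifies only that the first half of the spectrum is retained.

import numpy as np

def preprocess(signal, seg_len=1024):
    # Segment into non-overlapping 1024-point windows, apply the FFT,
    # and keep the first half of the spectrum (512 bins per sample).
    n_seg = len(signal) // seg_len
    segs = signal[: n_seg * seg_len].reshape(n_seg, seg_len)
    spectra = np.abs(np.fft.fft(segs, axis=1))[:, : seg_len // 2]
    return spectra[:, None, :]           # shape (n_seg, 1, 512) for the model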
3.2. Structural Parameters and Hyperparameters
The hyperparameters and structural parameters of the proposed model are summarized in Table 3 and Table 4, respectively. The optimal hyperparameters were determined from the reconstruction loss during validation via grid search. The model was trained using the AdamW optimizer for up to 400 epochs, with early stopping employed to avoid overfitting by halting training when the validation loss ceased to improve. The training and validation losses with the optimized parameters are shown in Figure 5. The computational work was performed in PyTorch on a computer with an i5 12400F CPU, an NVIDIA RTX 3060 GPU, and 16 GB of RAM.
3.3. HI Construction and FPT Detection
The HCVT-WD model was used to construct the HI for all bearings under various operating conditions. These indicators display a clear trend over time, which is crucial for FPT detection and RUL prediction. To further enhance their clarity, Gaussian smoothing was applied to the raw HI, as shown in Figure 6, effectively filtering out noise while preserving the overall degradation trend. The smoothed curves provide a more stable and interpretable representation of bearing degradation, facilitating reliable FPT detection and RUL estimation. The FPT and the corresponding HIs for the bearings are illustrated in Figure 7.
3.4. Evaluation Metrics of HIs
This study uses three quantitative metrics, monotonicity (Mo), trendability (Tr), and prognosability (Pr), to evaluate the effectiveness of HI construction, defined as follows:

$\text{Mo} = \frac{1}{K-1}\left| \sum_{j} \mathbb{1}\!\left[\text{HI}(t_{j+1}) > \text{HI}(t_j)\right] - \sum_{j} \mathbb{1}\!\left[\text{HI}(t_{j+1}) < \text{HI}(t_j)\right] \right|,$

$\text{Tr} = \left| 1 - \frac{6 \sum_{j=1}^{K} \left(r_j - j\right)^{2}}{K\left(K^{2} - 1\right)} \right|,$

$\text{Pr} = \exp\!\left( -\frac{\sigma_K}{\left| \mu_K - \mu_1 \right|} \right),$

where $r_j$ represents the rank of the j-th HI within the HI sequence, $K$ denotes the number of samples, $\text{HI}(t_j)$ is the value of the HI at time $t_j$, $\mu_1$ and $\mu_K$ denote the means of the HI values at the initial and failure phases, and $\sigma_K$ is the standard deviation of the HI values in the failure phase. All three evaluation metrics range between 0 and 1, with values closer to 1 indicating higher-quality health indicators that exhibit more consistent degradation trends, stronger correlation with time, and more predictable failure behavior.
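Under common definitions consistent with the symbols above, the three metrics can be computed as sketched below; the number of edge samples used to represent the initial and failure phases is an assumed parameter.

import numpy as np
from scipy.stats import spearmanr

def monotonicity(hi):
    d = np.diff(hi)                      # successive HI differences
    return abs((d > 0).sum() - (d < 0).sum()) / (len(hi) - 1)

def trendability(hi):
    rho, _ = spearmanr(hi, np.arange(len(hi)))   # rank correlation with time
    return abs(rho)

def prognosability(hi, n_edge=10):
    mu1, muK = hi[:n_edge].mean(), hi[-n_edge:].mean()
    return float(np.exp(-hi[-n_edge:].std() / abs(muK - mu1)))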
To validate the effectiveness of the proposed method, we compared its performance against three state-of-the-art HI construction methods: the ensemble stacked autoencoder (ES-AE), the deep CNN auto-encoder–decoder (DCN-AE), and the multi-scale CNN auto-encoder–decoder (MSC-AE), using the aforementioned evaluation metrics. As shown in Figure 8 and Table 5, our method consistently outperforms the other approaches across all evaluation metrics.
While ES-AE is capable of capturing the overall degradation trend, it suffers from noise, which lowers both monotonicity and trendability. DCN-AE and MSC-AE demonstrate better performance by leveraging convolutional feature extraction and multi-scale representations, yet their HIs still exhibit noticeable fluctuations, particularly in the mid-life regions. In contrast, the proposed HCVT-WD achieves superior results through a dual-branch design: the CNN branch effectively filters out local noise from raw vibration signals, while the Vi-T branch captures long-range dependencies and global structural information. The integration of these two branches through skip fusion produces smoother HI curves with reduced oscillations, which explains the markedly higher monotonicity, trendability, and prognosability observed for HCVT-WD.
3.5. Visualization Analysis
In the visualization analysis, the healthy and degraded features are clearly separated in the uniform manifold approximation and projection embedding space, even when testing with the full set of healthy and degraded samples, as shown in Figure 9. This separation indicates that the proposed HCVT-WD model effectively segregates the encoded features. Notably, the unhealthy features follow a clear degradation trajectory in the embedding space, reflecting the progressive failure process, while the healthy features remain well clustered. This separation is attributed to the multi-head self-attention mechanism in the Transformer, which enhances discriminative feature learning. The attention maps further validate this behavior, as illustrated in Figure 10, where the learned attention weights for three randomly selected testing samples exhibit sparsity in the attention matrices. This sparsity confirms that the proposed module successfully captures dynamic and temporal dependencies within the raw input signals, thereby improving the model's life prediction capability by focusing on the most informative features.
3.6. RUL Estimation and Uncertainty Analysis
The CNN-Bi-LSTM model, integrated with a Bayesian neural network, was employed to map the HIs to the RUL. To predict the RUL for each bearing, the FPT was initially identified. After determining the FPT, post-FPT HI values were collected, and RUL labels were assigned. These labels indicate the remaining life as a percentage of the total remaining time after the FPT for each bearing.
The samples were created by looking at sequences of 10 consecutive time steps of the HIs. Each sequence consisted of a sliding window of 10 time steps, and the model used these sequences to learn how the health of the bearing evolves over time. Once the sequences were created, the data were randomly split into training and validation sets, with 80% used for training and 20% used for validation. For each sequence, the RUL label was assigned based on the remaining life at the end of the sequence.
During the training phase, leave-one-out cross-validation was used for each operating condition: one bearing was held out as the test set, while the others under the same operating condition were used to train the model. To evaluate the proposed method's accuracy, two performance metrics, RMSE and MAE, were used, calculated as follows:

$\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^{2}}, \quad \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|,$

where $y_i$ and $\hat{y}_i$ represent the actual and predicted RUL values, respectively, and $N$ is the total number of samples.
The hyperparameters of the proposed model were carefully tuned, with the optimized values being a batch size of 32, a learning rate of 0.0001, and two LSTM hidden layers with 64 and 32 neurons. A dropout layer was added after the first hidden layer, and an additional dropout of 0.3 was applied after the final FC layer to implement MC dropout for uncertainty estimation. The prediction results of the proposed methodology across different working conditions are shown in Figure 11. Additionally, separate CNN and Bi-LSTM prediction networks were constructed for comparison, and the results are presented in Table 6. The performance comparison demonstrates that the CNN-Bi-LSTM model outperforms these alternatives, achieving superior accuracy in RUL prediction.
For uncertainty quantification, an MC dropout rate of 0.3 was applied during inference. Prediction uncertainty was quantified through statistically derived confidence intervals, as shown in Figure 12. Notably, all RUL predictions remain within the calculated 95% confidence bounds throughout the degradation process. KDE analysis revealed that the predicted RUL distributions are tightly clustered around the true values, with minimal skewness. This synergy of accurate point estimation and rigorous uncertainty quantification represents a significant advancement for industrial condition-based maintenance, where understanding both the predicted RUL and its associated confidence level is critical for operational decision-making. The consistent outperformance of conventional approaches validates the efficiency of merging convolutional feature extraction with sequential pattern identification in this integrated prognostic framework.
3.7. Ablation Studies on the HCVT-WD Model
This ablation study investigates the impact of CNN patch embedding, varying patch sizes, and skip connection fusion. As illustrated in Figure 13a, the reconstruction loss, observed through both training and validation losses, remains high, fluctuates significantly, and converges slowly in the absence of CNN patch embedding, highlighting the critical role of the convolution-based patch embedding. Similarly, for a patch size of 16, the reconstruction loss curves exhibit smoother and more rapid convergence than for patch sizes of 8 and 32, as shown in Figure 13b. Furthermore, the introduction of skip connection fusion after the Transformer block, aimed at enhancing feature reusability and preserving spatial information, results in a marked improvement in reconstruction loss, as shown in Figure 13c. The skip connection fusion enables the decoder to leverage both high-level features from the Transformer blocks and low-level features from the encoder, leading to more stable and smoother convergence of the loss curves and improving the accuracy of the reconstructed input data.
3.8. Comparative Experiments with Other State-of-the-Art Methods for RUL Prediction
To further validate the usefulness of the developed method, a comparative analysis was conducted with four alternative models: the memory fusion network (CLSTMF) [39], self-adaptive graph convolutional networks with self-attention (SAGCN-SA) [40], the Time Transformer convolutional LSTM (TT-ConvLSTM) [41], and the TCN-Transformer [42]. The first three models were designed for feature extraction and direct RUL prediction, whereas the TCN-Transformer model adopted a two-stage degradation process, considering both HI construction and RUL mapping. The CLSTMF model achieved its lowest RMSE of 0.051, while the TT-ConvLSTM model attained its lowest RMSE and MAE values of 0.072 and 0.052, respectively. The TCN-Transformer model performed relatively well, with a lowest RMSE of 0.0549 and MAE of 0.0441. In contrast, our proposed two-stage degradation model consistently outperformed these alternatives, yielding the lowest RMSE of 0.0441 and MAE of 0.0321 and demonstrating superior predictive accuracy across all metrics. The results of the comparative experiments are summarized in Table 7, where the RMSE and MAE values for each model are provided. This emphasizes the effectiveness of our approach in accurately predicting RUL through the two-stage degradation process, covering both point prediction and interval prediction.
4. Conclusions
This study proposed the HCVT-WD framework for constructing HIs in an unsupervised manner and predicting the RUL of bearings. Raw vibration signals are processed by a sequential CNN–vision Transformer architecture, eliminating manual feature engineering while capturing local spatial patterns and long-range temporal dependencies. The HI is defined as the Wasserstein distance between encoded representations of healthy and degraded states, providing a precise metric of degradation severity. Experimental validation on bearing datasets shows that the proposed HI outperforms state-of-the-art methods in monotonicity, trendability, and prognosability. For RUL estimation, the HI is seamlessly integrated into a CNN-BiLSTM regressor that models both temporal and nonlinear degradation dynamics. A Bayesian neural network approximation yields uncertainty-aware predictions and confidence intervals, which support risk-informed maintenance decisions for safety- and mission-critical applications. The HCVT-WD model requires minimal preprocessing and operates without full lifecycle data, making it adaptable to real-world industrial applications. Its capacity to yield robust HIs and quantify predictive uncertainty establishes it as a potent paradigm for PHM in intricate systems.
Future research will prioritize the development of adaptable HIs across varying operational conditions, validated through extensive testing on heterogeneous datasets to broaden their applicability. Additionally, attention will be directed towards addressing the uncertainty present in both the measurement process and the model’s predictions. This includes the integration of measurement uncertainty as well as other sources of uncertainty, such as model and environmental factors, into the development of more robust and reliable predictive models. Furthermore, future work will focus on fault diagnosis and prognosis through HI construction in a dual-task learning framework, with an emphasis on incorporating zero-fault-shot learning techniques to improve the model’s adaptability to unseen fault types and operational conditions.