1. Introduction
Bearings are critical to rotating machinery, as they support shafts and reduce friction between moving parts [1]. Bearing faults can result in catastrophic equipment failures and serious safety risks [2]. Therefore, accurately predicting their remaining useful life (RUL) is crucial for industrial systems. RUL prediction methods can be classified into model-based methods and data-driven methods [3]. Model-based methods use physical or statistical modeling to capture the failure mechanisms of bearings [4]. However, these methods depend heavily on their underlying assumptions, leading to significant prediction deviations when actual conditions differ from those assumptions. Additionally, the modeling process can be complex and time-consuming, requiring expert domain knowledge. In contrast, data-driven methods predict RUL by analyzing historical data, eliminating the need for such expertise.
Among data-driven methods, deep learning has been widely adopted due to its superior capability to handle nonlinear data. Widely used deep neural networks (DNNs), including the convolutional neural network (CNN) [5], long short-term memory (LSTM) [6], gated recurrent unit (GRU) [7], temporal convolutional network (TCN) [8,9], and sample convolution and interaction network (SCINet) [10], have been applied to RUL prediction for effective feature extraction. In addition, combining signal processing techniques, such as frequency spectrum analysis [11], signal decomposition [12], and wavelet transforms [13], with DNNs has achieved success, although such methods require signal processing expertise.
Although these DNNs have achieved success, varying working conditions in industrial environments often lead to discrepancies in feature distributions between source and target domain data, making it difficult to achieve accurate RUL prediction using only DNNs [14,15].
To solve this problem, researchers have proposed domain adaptation (DA), which minimizes feature distribution discrepancies between two domains by learning transferable features. Traditional DA-based RUL prediction models feed the entire source and target domain data into a DNN to learn features, as shown in Figure 1a. These learned features are then mapped into a high-dimensional space, where domain-invariant features are extracted through discrepancy metrics, including maximum mean discrepancy (MMD) [8] and multi-kernel maximum mean discrepancy (MK-MMD) [15]. Although DA-based models achieve RUL prediction across different domains, they have three major limitations in improving prediction accuracy.
(1) Insufficient extraction of degradation information: Most existing methods feed the entire source and target domain data into a DNN, as shown in Figure 1a, causing the DNN to learn features at the same scale repeatedly. This process ignores the multi-scale information in the time series data, failing to thoroughly capture important details and features related to bearing degradation. For example, in a temperature prediction task, short-term temperature fluctuations may be related to weather changes and hourly effects, while long-term trends may be related to seasonal changes and annual effects [16]. Such multi-scale information is crucial for accurately representing degradation in industrial systems.
(2) Ignoring local-scale domain alignment: Features learned from both source and target domains are aligned in a high-dimensional space using a discrepancy metric, as shown in Figure 1a. This process aims to align the feature distributions of the source and target domains globally, as shown in Figure 2a. Since bearing degradation occurs in different stages, the feature distribution should account for multiple subdomains [17]. However, most methods only align the feature distribution on a global scale, ignoring the discrepancies among subdomains at local scales.
(3) Lack of temporal weights: Not all time steps contribute equally during DA. Time steps closer to failure typically provide more information related to bearing degradation. For example, a bearing may perform well for the first 90% of its life but degrade and fail in the final 10%. In such cases, the later data should be assigned higher weights during DA, as they are more indicative of impending failure, while the earlier data are less critical.
To address these three limitations, this article proposes a novel RUL prediction model, called the health indicator-weighted subdomain alignment network (HIWSAN), which consists of three core components: an Encoder, a health indicator (HI) generator, and a Predictor. First, the Encoder captures fine-grained feature representations from raw data at multiple scales, as illustrated in Figure 1b. Next, the HI generator constructs HIs and uses them to divide subdomains. Finally, the HIs are treated as temporal weights and integrated into the Predictor to achieve subdomain alignment and RUL prediction.
The contributions of this research can be highlighted as follows:
HIWSAN captures feature representations that reflect bearing degradation patterns. These representations support a range of prognostics tasks, including but not limited to HI construction and RUL prediction.
HIWSAN achieves precise subdomain adaptation (SDA) by minimizing feature distribution discrepancies among subdomains, enhancing RUL prediction accuracy.
HIWSAN leverages normalized HIs as temporal weights, enabling SDA to focus more attention on the alignment of degradation features.
The paper is organized as follows: Section 2 presents related work on multi-scale feature learning and DA for RUL prediction. Section 3 details the implementation of the proposed HIWSAN. Section 4 and Section 5 describe two case studies using HIWSAN to construct HIs and predict RUL, evaluating the model with standard metrics. Section 6 discusses the potential limitations of HIWSAN. Finally, Section 7 concludes the paper.
3. The Proposed Method
As illustrated in Figure 3, the proposed HIWSAN is a two-stage model. In stage 1, the Encoder is pre-trained to learn feature representations that contain multi-scale information from raw data. In stage 2, both source and target domain data are fed into the trained Encoder to learn feature representations. These representations are then split into two branches: one branch is input to the HI generator for temporal weight calculation and subdomain division, while the other branch is fed into the Predictor for SDA and RUL prediction. This section details the implementation of HIWSAN. All frequently used symbols in this section are listed in Table 1.
3.1. Stage 1: Pre-Training the Encoder
The Encoder is pre-trained with an unsupervised learning network based on contrastive learning, which encodes raw data into representations containing multi-scale information. In this research, the Encoder aims to learn a function $f_\theta$ that transforms full-life bearing data $x$ into a feature representation $z$ that effectively captures degradation patterns, where the symbols are as defined in Table 1. $f_\theta$ is implemented by the proposed Encoder, and its training involves four steps.
Step 1: Random sampling draws data segments from the raw data, ensuring that the Encoder receives data of varying lengths to capture multi-scale features. For any full-life bearing data $x$, a data segment is randomly cropped; it is then shifted backward by a random number of time steps to obtain one segment and shifted forward by a random number of time steps to obtain another, as illustrated in Figure 3. The backward-shifted segment is regarded as sample 1 and the forward-shifted segment is regarded as sample 2, so the two samples overlap in time.
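A minimal NumPy sketch of this sampling scheme follows; the segment length and shift range (seg_len, max_shift) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_overlapping_pair(x, seg_len=512, max_shift=128, rng=None):
    """Crop a random segment from full-life data x (shape [T, F]) and return
    two shifted, overlapping views of it. seg_len and max_shift are
    hypothetical parameters; assumes T > seg_len + 2 * max_shift."""
    if rng is None:
        rng = np.random.default_rng()
    T = x.shape[0]
    start = int(rng.integers(max_shift, T - seg_len - max_shift))  # base crop
    back = int(rng.integers(1, max_shift))   # random backward shift
    fwd = int(rng.integers(1, max_shift))    # random forward shift
    sample1 = x[start - back : start + seg_len - back]   # shifted backward
    sample2 = x[start + fwd : start + seg_len + fwd]     # shifted forward
    overlap = (start + fwd, start + seg_len - back)      # shared time span
    return sample1, sample2, overlap
```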
Step 2: Multi-scale feature learning is achieved by the proposed Encoder. It consists of a fully connected layer and seven stacked residual dilated convolution blocks. Each block contains two 1D convolution layers for feature extraction, two GeLU activation functions, and a residual connection to avoid gradient vanishing or explosion. Note that the dilation parameter of the dilated convolutions in the $l$-th block is $2^{l}$, while the two convolution layers within a block share the same dilation parameter and kernel size. After a series of convolution and GeLU operations, the feature representations of sample 1 and sample 2 are generated.
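A PyTorch sketch of one such residual dilated convolution block is shown below; the channel and kernel sizes are illustrative assumptions (the paper's exact architectural parameters are given in Table 5).

```python
import torch
import torch.nn as nn

class ResidualDilatedBlock(nn.Module):
    """One residual block with two dilated 1D convolutions and GeLU, as in
    Step 2. Channels are held constant here for simplicity; the paper's
    Encoder grows the feature count from 64 to 320 (Table 5)."""
    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation   # keep the time length fixed
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=pad, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=pad, dilation=dilation)
        self.act = nn.GELU()

    def forward(self, x):                # x: [batch, channels, time]
        out = self.act(self.conv1(x))
        out = self.act(self.conv2(out))
        return out + x                   # residual connection

# Seven stacked blocks with dilation 2^l in the l-th block (l = 0..6)
encoder_blocks = nn.Sequential(
    *[ResidualDilatedBlock(channels=64, kernel_size=3, dilation=2 ** l)
      for l in range(7)])
```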
Step 3: Positive–negative pair construction aims to compare the similarity and dissimilarity between sample pairs. As illustrated in Figure 4, the overlapping parts of the two feature representations are selected and marked as $r$ and $r'$, respectively. Feature representations at the same time step are treated as positive pairs, such as $r_c$ and $r'_c$ at time step $c$, and $r_d$ and $r'_d$ at time step $d$. Feature representations at different time steps are treated as negative pairs, such as $r_c$ and $r'_d$, or $r_d$ and $r'_c$.
Step 4: The contrastive loss guides the Encoder's parameter updates by encouraging high similarity in positive pairs and low similarity in negative pairs. The contrastive loss of bearings at time step $t$ can be formulated as follows:
$$\ell(t) = -\log \frac{\exp\left(r_t \cdot r'_t\right)}{\sum_{t' \in \Omega}\left(\exp\left(r_t \cdot r'_{t'}\right) + \mathbb{1}_{[t \neq t']}\exp\left(r_t \cdot r_{t'}\right)\right)} \tag{1}$$
where $\Omega$ is the set of time steps within the temporal overlap of $r$ and $r'$, and $\mathbb{1}_{[\cdot]}$ is the indicator function.
To reduce the interference of outliers, the contrastive loss is designed with a hierarchical structure, as shown in Figure 4. After the first-level contrastive loss is calculated, max-pooling is applied along the time axis of the feature representation, and the contrastive loss of the next level is then calculated. The number of time steps in the feature representation is halved at each level until it is compressed to 1, and the average loss over all levels is taken as the final loss. The process of pre-training the Encoder is summarized in Algorithm 1.
Algorithm 1 Pre-training the Encoder
1: procedure Pre-training(X)
2:   for x in X do
3:     // Random sampling:
4:     randomly crop x;
5:     sample two overlapping subsamples x1 and x2;
6:     // Feature learning:
7:     r1, r2 ← Encoder(x1), Encoder(x2);
8:     r, r′ ← cropped overlap between r1 and r2;
9:     // Calculate contrastive loss:
10:    L ← HierLoss(r, r′)
11:    Update model parameters using L
12:  end for
13: end procedure
14: function HierLoss(r, r′)
15:   L ← ContrastiveLoss(r, r′);
16:   n ← 1;
17:   while time_length(r) > 1 do
18:     // Maxpool1d operates along the time axis:
19:     r, r′ ← maxpool1d(r, r′, kernel_size = 2);
20:     L ← L + ContrastiveLoss(r, r′);
21:     n ← n + 1;
22:   end while
23:   L ← L / n;
24:   return L
25: end function
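As a concrete companion to Algorithm 1, the following PyTorch sketch implements the hierarchical loss loop, assuming the per-level loss takes the dot-product contrastive form of Equation (1); the function and variable names are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(r, rp):
    """Dot-product contrastive loss over one level (assumed form of Eq. (1)).
    r, rp: [T, D] overlapping representations of samples 1 and 2."""
    sim = r @ rp.t()                       # [T, T] cross-view similarities
    labels = torch.arange(r.size(0))       # positive pairs on the diagonal
    return F.cross_entropy(sim, labels)

def hierarchical_loss(r, rp):
    """Average the contrastive loss over levels, halving the time axis with
    max-pooling until one time step remains (as in Algorithm 1)."""
    total, levels = contrastive_loss(r, rp), 1
    while r.size(0) > 1:
        # max-pool along the time axis: [T, D] -> [T // 2, D]
        r = F.max_pool1d(r.t().unsqueeze(0), kernel_size=2).squeeze(0).t()
        rp = F.max_pool1d(rp.t().unsqueeze(0), kernel_size=2).squeeze(0).t()
        total = total + contrastive_loss(r, rp)
        levels += 1
    return total / levels
```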
3.2. Stage 2: RUL Prediction
The RUL prediction is implemented through the HI generator, the Predictor, and the trained Encoder. The HI generator takes the encoded feature representations $z^s$ and $z^t$ to construct $HI^s$ and $HI^t$, and to divide the source and target domains into subdomains. The Predictor treats $HI^s$ and $HI^t$ as temporal weightings and integrates them into the LMMD module to align the prediction results between the source and target subdomains. The overall training process consists of five steps:
Step 1: Feature representations are obtained from the pre-trained Encoder. Source domain data $x^s$ and target domain data $x^t$ are input into the pre-trained Encoder to generate the source feature representation $z^s$ and target feature representation $z^t$.
Step 2: Health indicator construction aims to reflect the bearing degradation trend. The HI is constructed by calculating the WD between the feature representations at each time step and those at the initial time step. The WD measures the distance between two probability distributions $u$ and $v$, and its calculation formula is as follows:
$$W(u, v) = \inf_{\pi \in \Pi(u, v)} \mathbb{E}_{(x, y) \sim \pi}\left[\,\|x - y\|\,\right] \tag{2}$$
where $\Pi(u, v)$ represents all joint distributions whose marginals are $u$ and $v$, and $(x, y) \sim \pi$ means sampling a pair from a distribution $\pi$.
After the distance $d_i$ between each time step $i$ and the first time step is calculated, min–max normalization to the range $[0, 1]$ is applied to convert the $d_i$ into the HI. The HI at step $i$ is expressed as follows:
$$HI_i = \frac{d_i - d_{\min}}{d_{\max} - d_{\min}} \tag{3}$$
where $d_{\min} = \min_j d_j$ and $d_{\max} = \max_j d_j$.
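A compact sketch of this HI construction is given below, using SciPy's one-dimensional Wasserstein distance as a stand-in for the distance computation (an assumption; the paper does not specify the WD solver).

```python
import numpy as np
from scipy.stats import wasserstein_distance

def construct_hi(z):
    """z: [T, D] feature representations over T time steps. Compare each
    step's feature values to the initial step via the 1-D WD, then min-max
    normalize the distances into an HI in [0, 1] (Eqs. (2) and (3))."""
    d = np.array([wasserstein_distance(z[i], z[0]) for i in range(len(z))])
    return (d - d.min()) / (d.max() - d.min() + 1e-12)
```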
Step 3: Subdomain division is performed via K-means clustering [39,40], which clusters the feature representations into $c$ groups by minimizing internal variance. In this research, $c$ is set to 2, dividing the source domain $D^s$ and the target domain $D^t$ into healthy subdomains ($D^s_h$, $D^t_h$) and degradation subdomains ($D^s_d$, $D^t_d$).
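For illustration, the subdomain split can be performed with scikit-learn's KMeans; clustering on the HI values rather than the raw representations is our simplifying assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

def divide_subdomains(hi):
    """Split time steps into healthy/degradation subdomains (c = 2).
    hi: [T] normalized health indicator values."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(hi.reshape(-1, 1))
    # Make the cluster with the larger mean HI the degradation subdomain
    if hi[labels == 0].mean() > hi[labels == 1].mean():
        labels = 1 - labels
    return labels  # 0 = healthy, 1 = degradation
```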
Step 4: Subdomain adaptation is achieved by the Predictor, which includes two fully connected layers and an LMMD module. First, the fully connected layers map the source feature representation $z^s$ and target feature representation $z^t$ into RUL prediction results $\hat{y}^s$ and $\hat{y}^t$. Next, LMMD maps $\hat{y}^s$ and $\hat{y}^t$ into a high-dimensional feature space, where multiple Gaussian kernel functions quantify the differences between distributions. The LMMD between $\hat{y}^s$ and $\hat{y}^t$ is calculated as follows:
$$\hat{d}_{\mathcal{H}}\left(\hat{y}^s, \hat{y}^t\right) = \frac{1}{C}\sum_{c=1}^{C}\left\|\sum_{i=1}^{n_s^c} w_i^{sc}\,\phi\left(\hat{y}_i^s\right) - \sum_{j=1}^{n_t^c} w_j^{tc}\,\phi\left(\hat{y}_j^t\right)\right\|_{\mathcal{H}}^2 \tag{4}$$
where $c$ indexes the healthy and degradation subdomains; $w_i^{sc}$ and $w_j^{tc}$ denote the temporal weights for the source and target domains in category $c$; $n_s^c$ and $n_t^c$ denote the number of samples in the source and target domains for category $c$; $\mathcal{H}$ denotes the reproducing kernel Hilbert space; and the feature map $\phi$ is induced by a Gaussian kernel function.
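The following PyTorch sketch computes an LMMD term of this form with a multi-bandwidth Gaussian kernel; the bandwidth schedule and the use of HI values as the weights $w$ are our assumptions.

```python
import torch

def gaussian_kernel(a, b, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Sum of Gaussian kernels between the rows of a and b (assumed bandwidths)."""
    d2 = torch.cdist(a, b) ** 2
    return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)

def lmmd(ys, yt, ws, wt):
    """LMMD term of Eq. (4) for one subdomain c.
    ys, yt: [ns, 1] and [nt, 1] source/target predictions in subdomain c;
    ws, wt: [ns] and [nt] temporal weights (here: HI values)."""
    ws = (ws / (ws.sum() + 1e-12)).unsqueeze(1)   # normalize weights, [ns, 1]
    wt = (wt / (wt.sum() + 1e-12)).unsqueeze(1)   # [nt, 1]
    k_ss = gaussian_kernel(ys, ys)
    k_tt = gaussian_kernel(yt, yt)
    k_st = gaussian_kernel(ys, yt)
    # || sum_i w_i phi(ys_i) - sum_j w_j phi(yt_j) ||^2 expanded via kernels
    return (ws.t() @ k_ss @ ws + wt.t() @ k_tt @ wt
            - 2.0 * ws.t() @ k_st @ wt).squeeze()
```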
Step 5: Model parameter optimization involves two objectives: (1) the RUL prediction loss between the ground truth and the predicted RUL, and (2) the domain discrepancy loss between the source and target domain prediction values.
For the first optimization objective, the prediction error $\mathcal{L}_{\mathrm{MSE}}$ is defined using the mean square error (MSE), a common loss function for regression tasks. The MSE is formulated as follows:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n_s}\sum_{i=1}^{n_s}\left(\hat{y}_i^s - y_i^s\right)^2 \tag{5}$$
where $\hat{y}_i^s$ and $y_i^s$ denote the RUL prediction values and the ground truth of the source domain, and $n_s$ is the number of samples in the source domain.
For the second optimization objective, the domain discrepancy loss $\mathcal{L}_{\mathrm{LMMD}}$ is defined using the LMMD in Equation (4). The total loss of the proposed HIWSAN is defined as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \mu\,\mathcal{L}_{\mathrm{LMMD}} \tag{6}$$
where $\mu$ denotes the tradeoff parameter; the temporal weights $w^c$ associated with category $c$ enter $\mathcal{L}_{\mathrm{LMMD}}$ through Equation (4), and $c$ is set to 2 in this research.
Once the loss function $\mathcal{L}$ is defined, the optimal parameters of the HIWSAN can be searched. The calculation steps are summarized in Algorithm 2.
Algorithm 2 RUL Prediction
1: procedure Main(raw data)
2:   for (x^s, x^t) in raw data do
3:     // Feature learning:
4:     z^s, z^t ← trained Encoder(x^s, x^t);
5:     // Subdomain division:
6:     HI^s, HI^t ← HI generator(z^s, z^t);
7:     subdomains ← K-means(HI^s, HI^t);
8:     // RUL prediction:
9:     ŷ^s, ŷ^t ← Predictor(z^s, z^t);
10:    // Calculate loss:
11:    L_MSE ← MSELoss(ŷ^s, y^s);
12:    L_LMMD ← LMMDLoss(ŷ^s, ŷ^t, HI^s, HI^t);
13:    L ← L_MSE + μ · L_LMMD;
14:    Update model parameters using L
15:  end for
16: end procedure
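A condensed PyTorch training step mirroring Algorithm 2 is sketched below, under the assumptions above; construct_hi, divide_subdomains, and lmmd refer to the earlier sketches, and all other names are ours.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, predictor, x_s, y_s, x_t, mu, optimizer):
    """One Stage-2 update combining the MSE loss with the HI-weighted LMMD
    loss (a sketch; encoder/predictor are stand-in modules)."""
    z_s, z_t = encoder(x_s), encoder(x_t)            # [T, D] representations
    yhat_s, yhat_t = predictor(z_s), predictor(z_t)  # [T, 1] RUL predictions

    hi_s = torch.tensor(construct_hi(z_s.detach().numpy()), dtype=torch.float32)
    hi_t = torch.tensor(construct_hi(z_t.detach().numpy()), dtype=torch.float32)
    lab_s = divide_subdomains(hi_s.numpy())   # 0 = healthy, 1 = degradation
    lab_t = divide_subdomains(hi_t.numpy())

    loss = F.mse_loss(yhat_s, y_s)
    for c in (0, 1):                          # the two subdomains
        ms = torch.from_numpy(lab_s == c)
        mt = torch.from_numpy(lab_t == c)
        loss = loss + mu * lmmd(yhat_s[ms], yhat_t[mt], hi_s[ms], hi_t[mt])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```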
4. Case Study 1: HI Construction
The aim of Case Study 1 is to validate the effectiveness of HIs constructed from the learned feature representations, ensuring they are monotonic, robust, and strongly correlated with actual bearing degradation.
Two open-source bearing datasets, namely XJTU-SY [41] and PRONOSTIA (IEEE PHM 2012) [42], are used in this case study and the next (see Section 5), because the bearing prognostics literature has standardized on these datasets. Using these public datasets facilitates reproducibility and fair comparisons of different prognostic methods.
Case Study 1 conducts ablation experiments and comparison experiments on the XJTU-SY dataset to compare different HI construction methods. Subsequently, the generalization performance of the proposed HI construction method is validated based on the PRONOSTIA dataset.
4.1. Data Description
The XJTU-SY bearing dataset comprises run-to-failure data of 15 bearings under three different conditions, as detailed in Table 2. The PRONOSTIA bearing dataset contains run-to-failure data of 17 bearings under three different conditions, as detailed in Table 3. Their sampling schemes are described in Figure 5.
4.2. Model Design
The Adam optimizer is selected, and the learning rate schedule adopts StepLR. The initial learning rate is 0.001, with a decay factor of 0.1. During training, the learning rate is adjusted every 50 epochs, and the maximum number of epochs is 200. The batch size is set to 1, meaning that the data for all time steps are input at once. The hyperparameter settings used for pre-training are summarized in Table 4. The fully connected layer first reduces the number of features of the raw data to 64. After passing through seven dilated convolution blocks, the number of features increases to 320. Taking Bearing 1_2 in the XJTU-SY dataset as an example, Table 5 shows the architectural parameters of the proposed Encoder.
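In PyTorch terms, this training configuration corresponds to roughly the following sketch; the encoder module, input shape, and loss computation are placeholders, not the paper's implementation.

```python
import torch

encoder = torch.nn.Linear(32, 64)   # placeholder module for the proposed Encoder

# Adam with StepLR: lr starts at 1e-3 and decays by a factor of 0.1 every 50 epochs
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

for epoch in range(200):            # maximum number of epochs
    x = torch.randn(1, 2048, 32)    # batch size 1: all time steps at once
    loss = encoder(x).pow(2).mean() # stand-in for the contrastive loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                # adjust the learning rate every 50 epochs
```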
The proposed Encoder contains about 2.6 million trainable parameters, with an estimated total size of 33.95 MB, which is lightweight given the high dimensionality of the input data. By leveraging dilated convolutions, the Encoder is able to efficiently capture multi-scale temporal features with fewer operations than traditional convolutional networks. When running on an Intel Core i5-12500H processor, the model’s average inference time per sample is 11.32 milliseconds, achieving real-time or near-real-time processing without GPU acceleration. The combination of compact model size and fast inference time shows the potential of our approach for practical applications in industrial settings.
4.3. Evaluation Metrics for HI Construction
Monotonicity, correlation, robustness, and a comprehensive metric are used to quantitatively evaluate the performance of HIs. Polynomial fitting is first applied to decompose the HI into an average trend and a random part:
$$HI(t) = HI_T(t) + HI_R(t) \tag{7}$$
where $HI(t)$ is the HI at time $t$, with $HI_T(t)$ and $HI_R(t)$ representing the average trend and the random part.
Monotonicity measures the mean absolute difference between the number of positive differentials and the number of negative differentials [43]:
$$\mathrm{Mon} = \frac{\left|\,\#\{dHI > 0\} - \#\{dHI < 0\}\,\right|}{N - 1} \tag{8}$$
where $N$ is the total number of HI values, $dHI = HI(t + 1) - HI(t)$, and $\#\{\cdot\}$ counts the differentials satisfying the condition. As an asset without maintenance can only degrade monotonically over time, a higher monotonicity value (closer to 1) is associated with better health indication.
Correlation, also known as trendability [44], evaluates the degree of correlation between the HI and the bearing degradation status:
$$\mathrm{Corr} = \frac{\left|\sum_{t=1}^{N}\left(HI(t) - \overline{HI}\right)\left(t - \bar{t}\right)\right|}{\sqrt{\sum_{t=1}^{N}\left(HI(t) - \overline{HI}\right)^2\,\sum_{t=1}^{N}\left(t - \bar{t}\right)^2}} \tag{9}$$
where $\overline{HI} = \frac{1}{N}\sum_{t=1}^{N} HI(t)$, and $\bar{t} = \frac{1}{N}\sum_{t=1}^{N} t$. A higher correlation score indicates a stronger correlation with the state of bearing degradation.
Robustness measures the ability of the HI to resist random fluctuations [45]:
$$\mathrm{Rob} = \frac{1}{N}\sum_{t=1}^{N}\exp\left(-\left|\frac{HI_R(t)}{HI(t)}\right|\right) \tag{10}$$
Comprehensive metric (CM), also called the hybrid metric, is a linear combination of the preceding metrics for assessing the overall ability of an HI [21]:
$$\mathrm{CM} = w_1\,\mathrm{Mon} + w_2\,\mathrm{Corr} + w_3\,\mathrm{Rob} \tag{11}$$
The choice of weights above follows Chen et al. [21], but we acknowledge that different choices have been reported in the literature; for example, see [45]. There is currently no consensus on how Mon, Corr, and Rob should be weighted to form the comprehensive metric, but as the results in Section 4.5 show, the Mon metric is the most challenging and should be given the highest weight, as in Chen et al. [21] and Zhang et al. [45].
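A NumPy sketch of these four metrics is given below; the polynomial degree for the trend fit and the CM weights (w1, w2, w3) are illustrative placeholders, since the paper's exact values follow [21].

```python
import numpy as np

def hi_metrics(hi, w=(0.5, 0.3, 0.2), trend_deg=5):
    """Compute Mon, Corr, Rob, and CM for an HI series. The CM weights and
    the polynomial degree of the trend fit are illustrative placeholders."""
    t = np.arange(len(hi))
    trend = np.polyval(np.polyfit(t, hi, trend_deg), t)   # average trend HI_T
    residual = hi - trend                                 # random part HI_R
    diffs = np.diff(hi)
    mon = abs((diffs > 0).sum() - (diffs < 0).sum()) / (len(hi) - 1)
    corr = abs(np.corrcoef(hi, t)[0, 1])                  # |correlation| with time
    rob = np.mean(np.exp(-np.abs(residual / (hi + 1e-12))))
    cm = w[0] * mon + w[1] * corr + w[2] * rob
    return mon, corr, rob, cm
```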
4.4. Ablation Experiments of the Proposed Encoder
To assess the proposed HI construction method, its performance is tested under condition 1 of the XJTU-SY dataset. Bearing 1_1 is used for training; Bearing 1_2, Bearing 1_3, and Bearing 1_5 are used for testing; and Bearing 1_4 is excluded due to sudden failure. The proposed HI construction method consists of three key modules: (1) random sampling (RS), (2) residual dilated convolution (RDC), and (3) hierarchical contrastive (HC) loss. The models for the ablation experiment are structured as follows:
w/o RS: Model A omits the RS module and inputs all time-step data into the Encoder in each epoch.
w/o RDC: Model B replaces 7 RDC blocks with 14 stacked normal 1-D convolution layers, keeping the kernel size and number of layers unchanged.
w/o HC: Model C omits the HC loss, and only calculates one-level contrastive loss.
Model D is the proposed Encoder.
All experiments are repeated 10 times to minimize randomness. The ablation experiment results for the four models in terms of Mon, Corr, Rob, and CM can be seen in Table 6 and Figure 6.
According to
Figure 6, the CM value of Model D is higher than that of the other models, indicating that HIs constructed from the proposed method most effectively reveal the bearing degradation trends. The CM value of Model A ranks last because omitting the RS module prevents the network from learning multi-scale features. Since bearing degradation includes both long-term trends and short-term patterns, omitting the RS module means the model learns at a single scale repeatedly, resulting in the loss of degradation information. The CM value of Model C surpasses only that of Model A, because the HC module assists the network in mitigating the impact of outliers by averaging the similarities between positive-negative pairs. Without the HC module, the contrastive loss is only calculated once, even with a few time steps sampled. Model B ranks second in terms of CM, suggesting that replacing normal 1-D convolution with the RDC module effectively improves the performance of the HIs.
4.5. Comparison with Related HI Construction Methods
HI construction typically involves two key steps: (1) dimensionality reduction; (2) similarity measurement. As similarity measurement generally involves simple distance computations between the current and initial states, the comparison focuses on different dimensionality reduction strategies. Two classic methods (principal component analysis (PCA)-HI and isometric mapping (ISOMAP)-HI) and three Encoder-based methods (auto-encoder (AE)-HI, MCAN-HI [22], and MSMHA-HI [46]) are chosen.
Condition 2 of the XJTU-SY dataset is used in this set of comparison experiments, where Bearing 2_1 is used to train the network; Bearing 2_2, Bearing 2_3, and Bearing 2_5 are used for testing; and Bearing 2_4 is discarded due to too few time steps. All experiments are repeated 10 times to minimize randomness.
Table 7 shows the values of HI metrics associated with test bearings.
According to
Table 7, the proposed-HI achieves the highest CM values on three bearings, indicating that feature representations learned from the Encoder are more effective for constructing HIs than raw data. This outcome aligns with the fact that learned representations capture more information than raw data. The CM values of MSMHA-HI and MCAN-HI rank second and third on three bearings, respectively. Both methods extract multi-scale coded features from raw data, further highlighting the importance of multi-scale features in HI construction. The CM values of AE-HI on three bearings rank fourth because an AE loses multi-scale degradation information during feature extraction. The CM values of PCA-HI and ISOMAP-HI rank last on three bearings, confirming the advantage of network-based methods in HI construction. In addition, the CM values for ISOMAP-HI are higher than those for PCA-HI, suggesting that the manifold learning method is more suitable for constructing HIs than the linear dimensionality reduction algorithm, since the degradation trend of bearings is not linear.
4.6. Validation of Model Generalization Performance
To assess the generalization ability of the proposed-HI, the PRONOSTIA dataset is used to construct HIs and divide their subdomains. For validation, the model parameters remain unchanged.
Figure 7 shows time-domain curves, HIs and their subdomains for Bearing 1_3, Bearing 2_2, and Bearing 3_3. The proposed-HI is also compared with MCAN-HI [
22] and MSMHA-HI [
46]. As shown in
Table 8, the average results of the proposed-HI outperform the others, demonstrating its superior generalization capability.
6. Discussion
This section provides details left out in the preceding sections and addresses potential concerns that may arise.
Degradation scaling: To address the limitation of traditional linear normalization in representing nonlinear degradation processes, a sigmoid scaling method and an exponential weighting method were introduced.
The sigmoid scaling method leverages the S-shaped curve of the sigmoid function to nonlinearly map the HI, thereby better capturing the non-uniform degradation trends in bearings. The sigmoid-based HI is computed as follows:
$$HI_i = \frac{1}{1 + \exp\left(-k\left(d_i - d_0\right)\right)} \tag{12}$$
where $d_i$ denotes the Wasserstein distance at time step $i$, $d_0$ is a parameter controlling the position of the inflection point (set to the mean of all distance values in this research), and $k$ is a parameter controlling the steepness of the sigmoid curve. Larger $k$ values produce a steeper curve, accentuating the nonlinear scaling.
The exponential weighting method highlights the accelerated degradation in the later stages of bearing health. The exponential-based HI is calculated as follows:
$$HI_i = \frac{\exp\left(\gamma\, d_i / d_{\max}\right) - 1}{\exp(\gamma) - 1} \tag{13}$$
where $d_i$ is the Wasserstein distance at time step $i$, $d_{\max}$ is the maximum distance value, and $\gamma$ controls the strength of the exponential weighting. As $\gamma$ increases, the normalization accentuates late-stage degradation more.
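A brief NumPy sketch of the three normalization options follows, under the reconstructed formulas above; the default values of k and gamma are placeholders, and the exact functional forms of the sigmoid and exponential variants are our reading of the text.

```python
import numpy as np

def minmax_hi(d):
    """Linear min-max normalization of WD values d to [0, 1] (Eq. (3))."""
    return (d - d.min()) / (d.max() - d.min() + 1e-12)

def sigmoid_hi(d, k=5.0):
    """Sigmoid scaling; inflection point d0 at the mean distance (Eq. (12))."""
    d0 = d.mean()
    return 1.0 / (1.0 + np.exp(-k * (d - d0)))

def exponential_hi(d, gamma=3.0):
    """Exponential weighting accentuating late-stage degradation (Eq. (13))."""
    s = d / (d.max() + 1e-12)
    return (np.exp(gamma * s) - 1.0) / (np.exp(gamma) - 1.0)
```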
Figure 12 shows the HIs constructed by the three normalization methods. Although the sigmoid scaling and exponential weighting methods provide semantically reasonable representations of nonlinear degradation, both involve hyperparameter selection ($k$, $d_0$, and $\gamma$), which can significantly influence the shape and sensitivity of the resulting HI curves. For instance, the sigmoid method performs well when appropriate parameters are selected, as shown in Figure 12b, but poor parameter selection in the exponential weighting method can lead to catastrophic results, as shown in Figure 12c. To avoid the subjectivity and potential instability associated with manual tuning, the min–max normalization method offers a more stable and robust alternative, as it does not rely on any hyperparameter.
Strong noise: This is a concern as it can mask useful signal features, hampering the extraction of accurate and reliable features from raw data. To assess the impact of strong noise, a set of experiments on the HI curves and HI metrics (see Section 4.3) was conducted, where zero-mean Gaussian noise was added to the raw data of Bearing 1_3 in the XJTU-SY dataset. The standard deviation of the noise was varied from 0.5 to 2 to represent four noise levels. The HI constructed using the raw data was then compared to the HIs constructed using the noisy data in terms of the associated HI curves (see Figure 13) and HI metrics (see Table 15).
Figure 13 shows that as the noise level increases, the HI curves become less monotonic, i.e., less indicative of bearing health. Table 15 shows that as the noise level increases, the values of the HI metrics decrease. It is not surprising that the proposed HI generator shows deteriorated performance in the presence of strong noise, which is consistent with most deep learning models [51]. This is why denoising is a vital step in data preprocessing. Fortunately, well-established signal processing techniques such as the Kalman filter can be used to filter out additive Gaussian noise.
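For reference, the noise injection used in this experiment can be reproduced as follows; the intermediate noise levels are our assumption, as the text only states the range 0.5 to 2 and the count of four levels.

```python
import numpy as np

def add_gaussian_noise(x, std, rng=None):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    if rng is None:
        rng = np.random.default_rng()
    return x + rng.normal(0.0, std, size=x.shape)

# Four noise levels with std from 0.5 to 2 (intermediate values assumed)
noise_levels = [0.5, 1.0, 1.5, 2.0]
```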
Model interpretability: This refers to the intrinsic properties of a deep model measuring the degree to which its inference result is predictable or understandable to human beings [52]. The understandability of a model depends on the human, and as such, model interpretability is often visualized in a human-friendly manner through an interpretation algorithm, rather than summarized as a number. For example, the popular interpretation algorithm Grad-CAM [53] provides visual explanations of a CNN, in the form of a map highlighting important regions of an image associated with the predicted class, based on gradient information flowing into the final convolutional layer of the CNN. Visualizations like those provided by Grad-CAM are relevant for computer vision applications, more so after the discovery of the vulnerability of deep neural networks to adversarial attacks [51].
For prognostics, the risk of adversarial attacks is small, as test data are gathered in a controlled environment. It is less important for human users to visualize how a model summarizes condition-monitoring data into a human-friendly number representing the health status of the object than whether this number is representative. Furthermore, while an interpretable model may more readily earn a user's trust than a model that is not, the former does not necessarily score higher on performance metrics (e.g., accuracy) than the latter [54]. Nevertheless, the construction of an HI can be thought of as a pursuit of model interpretability, as it summarizes raw data into a human-interpretable HI, while the HI metrics (see Section 4.3) can be thought of as interpretation algorithms evaluating the trustworthiness of the HI.
Model deployability: In many industrial applications, such as condition monitoring or anomaly detection in rotating machinery, predictions must be made in real time or near-real time. Our model contains about 2.63 million trainable parameters, taking up about 33.95 MB of storage and supporting fast inference. Preliminary tests show that the model achieves an average inference latency of only 11.32 milliseconds per sample on an Intel Core i5-12500H processor, which is far lower than the real-time processing requirement of 50 milliseconds and meets the deployment criteria for many industrial applications.
7. Conclusions
This paper proposes a novel remaining useful life (RUL) prediction model called health indicator-weighted subdomain alignment network (HIWSAN), which comprises an Encoder, a health indicator (HI) generator, and a Predictor. The results of our ablation experiments, comparative experiments, and validation experiments, presented through two case studies, provide concrete evidence that (1) HIWSAN effectively encodes raw data into feature representations that reflect degradation patterns; (2) the generated HIs exhibit superior monotonicity, correlation, and robustness compared to existing methods; and (3) the proposed HI-weighted subdomain adaptation mechanism achieves high RUL prediction accuracy, with an average MAE of 0.0989 and RMSE of 0.1189 on the XJTU-SY and PRONOSTIA datasets, outperforming state-of-the-art models.
In future work, we plan to construct a dedicated experimental platform to collect bearing vibration data under various operating conditions, spanning the entire lifecycle from healthy to faulty states. This will help assess the generalizability of the proposed method beyond publicly available datasets. Moreover, to facilitate real-world deployment, we aim to improve the model’s inference speed, ensure compatibility with edge devices, and explore feedback-based retraining strategies to support continuous learning in dynamic industrial environments.