Research on the Remaining Life Prediction Method of Rolling Bearings Based on Multi-Feature Fusion

Abstract: Rolling bearings are one of the most important and indispensable components of a mechanical system, and an accurate prediction of their remaining life is essential to ensuring the reliable operation of a mechanical system. In order to effectively utilize the large amount of data collected simultaneously by multiple sensors during equipment monitoring and to solve the problem that global feature information cannot be fully extracted during the feature extraction process, this research presents a technique for forecasting the remaining lifespan of rolling bearings by integrating multiple features. Firstly, a parallel multi-branch feature learning network is constructed using TCN, LSTM, and Transformer


Introduction
With the continuous development of the industrial level, the metallurgical industry plays an important role in China's drive toward comprehensive industrialization and is of great significance to the country's industrial and economic development [1]. In the metallurgical beneficiation process, the grinding stage of the mechanical equipment is crucial [2]. However, because metallurgical equipment works in harsh environments all year round, the longer it is in service, the harder it is to avoid wear, aging, and failure of its rolling bearings [3][4][5]. Such wear and failure not only reduce production efficiency but, in serious cases, may also cause accidents that result in casualties and property losses. Therefore, predicting the remaining useful life (RUL) of rolling bearings is meaningful and extremely important work [6].
In the past few years, many scholars have devoted themselves to research on bearing RUL prediction methods. These methods can be broadly divided into two groups: model-driven and data-driven approaches. Model-driven methods predict bearing RUL by constructing a physical or mathematical model that describes the bearing degradation process; they mainly include Kalman filtering [7], particle filtering [8], the Wiener process [9], the Gamma process [10], and Weibull distribution [11] methods. Constructing such a model requires not only parameters of the actual engineering system obtained from a series of measurements but also a great deal of a priori knowledge. While model-based approaches are useful for predicting the general trend of mechanical degradation, it is difficult to simulate the degradation trend accurately with simple physical or mathematical models in practical industrial applications, especially for complex mechanical equipment. With the swift advancement of intelligent sensing and machine learning technologies, a substantial quantity of condition monitoring data is now gathered in industrial production, which has led to the rapid growth and increased viability of the data-driven approach.
Currently, machine learning and deep learning are the main research directions for data-driven methods [12]. Traditional machine learning approaches to unstructured data, such as text, images, or audio, often require feature engineering to transform the raw data into interpretable, representative features. This process requires domain expertise and experience to select and design appropriate feature extraction methods. Deep learning, by contrast, learns representative features directly from raw data through multi-layer neural networks. Compared with traditional methods, deep learning models can autonomously extract higher-level abstract features from unstructured data, eliminating the need for manual feature creation and selection. Therefore, with sufficient data, deep learning prediction methods are more effective than traditional machine learning and are now widely used for RUL forecasting [13].
Since Hinton et al. [14] proposed deep learning theory, the Recurrent Neural Network (RNN), the Convolutional Neural Network (CNN), and their derivative networks have been widely applied in the field of lifetime prediction. For example, Guo et al. [15] introduced an RNN-based health indicator (RNN-HI) to forecast the RUL of bearings. Catelani et al. [16] combined an RNN-based estimation method with a filter-based state-space estimation technique to improve the accuracy and precision of RUL prediction for lithium-ion batteries. However, RNNs may suffer from gradient explosion during model training, so researchers have made many improvements to the RNN structure and obtained many improved models. Among them, the Long Short-Term Memory (LSTM) model, introduced by Hochreiter and Schmidhuber [17], is the most widely recognized. Miao et al. [18] built a dual-task deep LSTM model that simultaneously learns to assess aero-engine degradation and to predict its RUL. Although LSTMs can capture temporal dependencies in time-series data both before and after a given point, their intricate chain structure results in longer training times and less sensitivity to very long time series. To make prediction models more effective, some researchers have applied CNNs to RUL prediction. Ren et al. [19] developed a model based on a deep CNN and LSTM that mines deeper information from limited data to accurately predict the remaining lifetime of lithium-ion batteries. Although CNNs are remarkably good at extracting locally relevant features, they lack sensitivity to temporal information and tend to ignore the correlation between earlier and later points in the data. To tackle this issue, Bai et al. [20] proposed the Temporal Convolutional Network (TCN), which integrates causal convolution with dilated convolution. This network has temporal feature extraction capabilities similar to those of an RNN and has shown promising results in time-series modeling. Wang et al. [21] developed a TCN that incorporates a soft threshold and an attention mechanism to capture important features and accurately forecast the RUL of mechanical devices. However, the convolution operation used in TCN extracts local information, so it falls short in considering global information and does not adequately account for relationships that span long periods in the sequence. To take global correlations into account, some researchers have used the Transformer to predict bearing RUL. Mo et al. [22] used the Transformer encoder as the backbone of their model to capture short-term and long-term dependencies and global feature representations in the time series, and used these to predict the remaining lifespan.
In addition, as smart manufacturing systems have improved in recent years, modern industries have deployed ever larger numbers of signal sensors [23]. According to Li et al. [24], using many sensors in the industrial production process yields a significant volume of data, which in turn can make the health monitoring system for industrial equipment more dependable. Information derived from multiple sensors is therefore more valuable to study than single-sensor data. However, how to utilize multi-sensor data effectively and fuse its feature information is still an open question.
To summarize, while deep learning techniques have shown promising results in predicting the RUL of bearings, some concerns remain unresolved: (1) Most current research focuses on single-sensor data and pays insufficient attention to the efficient integration and utilization of data from several sensors. Meanwhile, when parallel networks are used for feature extraction, the same network structure is often adopted in every branch, which fails to exploit the advantages of different networks. (2) In most research on parallel attention structures, each branch of the network uses the multi-head self-attention mechanism to adjust the internal relationships within the data. However, using the same feature extraction method to fuse the outputs of every branch may mask important features while retaining redundant ones, ultimately degrading the overall performance of the network.
To address these issues, this research presents a method for predicting the RUL of rolling bearings based on multi-feature fusion. Firstly, the multi-sensor data are normalized, and the sensor data are combined channel by channel to fuse the data from several sensors. Secondly, a parallel multi-branch feature learning network is constructed using TCN, LSTM, and Transformer, where TCN extracts the long time-series features of the data, LSTM captures the time-correlated features in the sequence data, and Transformer extracts the global feature representation of the bearing data. Meanwhile, a parallel multi-scale attention mechanism that captures both local and global dependencies is designed to further capture the global and local contextual information of the sequence and to perform adaptive weighted fusion of the output features of the three feature extractors. Next, the shallow features obtained by the parallel feature extractor are residually connected to the deeper features produced by the attention mechanism, so that the information in both the earlier and the later features is used more efficiently. Finally, the fused features are passed through the fully connected layer to output the RUL prediction. By fusing multiple features multiple times and capturing factors that may affect remaining life, the predictive model describes the operating condition of ball mill rolling bearings more comprehensively, which improves the accuracy of the RUL forecast. The contributions of this paper can be summarized as follows: (1) By fusing information from multiple sensors, more comprehensive, accurate, and reliable information is obtained, and extracting features with TCN, LSTM, and Transformer in parallel exploits their respective advantages to improve the efficiency of the prediction model and the precision of the forecast. (2) A parallel multi-scale attention mechanism is designed. By fusing features in both the time and frequency domains, it captures comprehensive and specific contextual information from sequential data together with local and global dependencies, giving a more complete representation of the data. (3) A multi-feature fusion model for predicting the RUL of rolling bearings is proposed, which enhances valuable information while reducing redundant information and thus fuses multiple features effectively. In the experimental validation, it achieves better prediction results than current prediction methods.
The subsequent sections of the paper are structured in the following manner: Section 2 provides a concise overview of the relevant background information; Section 3 provides a comprehensive explanation of the proposed methodology; Section 4 demonstrates the efficacy of the proposed method by analyzing two bearing datasets; and Section 5 provides the conclusion.

Temporal Convolutional Network
Dilated causal convolution combines one-dimensional full convolution, causal convolution, and dilated convolution, as shown in Figure 1. In Figure 1, the output sequence is obtained by applying three layers of one-dimensional dilated causal convolution with a kernel size of 3 to the input sequence, and it has the same length as the input sequence. The dilation factor, denoted as $d \in \mathbb{N}^*$, is often set to 2 in convolution calculations. The receptive field $v$ is determined by the size of the convolution kernel, the number of convolutional layers, and the dilation factor, and is calculated as follows:
$$v = 1 + (k-1)\,\frac{b^{l}-1}{b-1} \tag{1}$$
where $k$, $l$, and $b$ represent the size of the convolution kernel, the number of convolutional layers in the network, and the base of the dilation factor, respectively; $b$ is usually set to 2.
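The relationship between kernel size, depth, and receptive field can be checked numerically. The following is a minimal sketch, assuming one convolution per layer and a dilation of b^j at layer j (a common TCN stacking rule, not confirmed by the text); the function name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_size: int, num_layers: int, base: int = 2) -> int:
    """Receptive field of a stack of dilated causal convolutions.

    Assumption: layer j (j = 0 .. num_layers - 1) uses dilation base**j and
    contains a single convolution, so it widens the field by
    (kernel_size - 1) * base**j time steps.
    """
    widening = sum((kernel_size - 1) * base ** j for j in range(num_layers))
    return 1 + widening


# Kernel size 3 and three layers, the setting described for Figure 1.
print(receptive_field(kernel_size=3, num_layers=3, base=2))
```

Under this assumption, adding a layer roughly doubles the covered history when the base is 2, which is why a few layers are enough to span long vibration windows.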
For a TCN, we are given a one-dimensional input sequence $x \in \mathbb{R}^n$ and a convolution kernel $f: \{0, 1, \cdots, k-1\} \to \mathbb{R}$. The dilated causal convolution at position $s$ of the sequence is computed as follows:
$$F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i)\, x_{s - d \cdot i} \tag{2}$$
where $x_{s - d \cdot i}$ is the $(s - d \cdot i)$-th element of the sequence in the preceding layer, that is, the element reached by stepping back $i$ dilated positions from $s$; $f(i)$ is the kernel function that maps an index $i$ to a real weight; and the remaining parameters retain their former meaning. Furthermore, TCN uses residual blocks to deepen the network, which alleviates gradient problems during training, and connecting multiple residual blocks together further enlarges the model's receptive field. The structure of each residual block is depicted in Figure 2.
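To make the dilated causal convolution concrete, here is a minimal pure-Python sketch of a single layer. It assumes left zero-padding, so that elements with a negative index contribute nothing; the function name `dilated_causal_conv` and the sample values are illustrative only.

```python
def dilated_causal_conv(x, f, d):
    """Compute F(s) = sum_i f[i] * x[s - d*i] for every position s.

    Elements with s - d*i < 0 are treated as zero (left zero-padding),
    so the output has the same length as the input and position s never
    sees values from the future.
    """
    k = len(f)
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i in range(k):
            j = s - d * i
            if j >= 0:
                acc += f[i] * x[j]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
f = [1.0, 1.0, 1.0]          # kernel of size k = 3
y = dilated_causal_conv(x, f, d=2)
print(y)
```

Because each output depends only on the current and earlier inputs, changing the last input value leaves all earlier outputs unchanged, which is the causality property the text relies on.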

Long Short-Term Memory
RNN is a specialized model for processing time-series data that introduces the concept of "time," so that information from a past period can be remembered. LSTM is a special structure proposed to overcome the gradient explosion or vanishing problem of RNN. It can retain information over an extended duration and can extract information not only from a single data point but also from an entire data series [17]. LSTM mainly consists of three gates, namely the forget gate, the input gate, and the output gate, and its structure is shown in Figure 3. The core concepts of LSTM are the cell state, which corresponds to the path of information transmission, and the gate structure, which adds and removes information under the control of the Sigmoid activation function. The input gate, denoted as $i_t$, regulates how much information from the candidate state $\hat{C}_t$ should be stored at the present time. The forget gate, denoted as $f_t$, regulates how much information should be discarded from the internal state $C_{t-1}$ of the previous time step. The output gate, denoted as $o_t$, regulates how much information is transmitted from the internal state $C_t$ to the external state $h_t$ at the present time. The gates and state updates are computed as follows:
$$\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1}\right) && (3)\\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1}\right) && (4)\\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1}\right) && (5)\\
\hat{C}_t &= \tanh\left(W_c x_t + U_c h_{t-1}\right) && (6)\\
C_t &= f_t \otimes C_{t-1} + i_t \otimes \hat{C}_t && (7)\\
h_t &= o_t \otimes \tanh\left(C_t\right) && (8)
\end{aligned}$$
where $x_t$ represents the input feature, $C_t$ is the memory unit, $C_{t-1}$ is the memory unit of the previous moment, $\hat{C}_t$ is the candidate memory unit of the current state, $h_t$ denotes the external state, and $h_{t-1}$ denotes the external state of the preceding moment. $W_i$, $W_f$, $W_o$, and $W_c$ denote the input weight matrices of the input gate, the forget gate, the output gate, and the candidate unit, respectively. $U_i$, $U_f$, $U_o$, and $U_c$ denote the recurrent weight matrices of each gating unit, respectively. The activation function $\sigma$ is the Sigmoid, while the activation function $\tanh$ is the hyperbolic tangent. The symbol "$\otimes$" represents the element-wise product.
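The gate interactions described above can be traced with a single LSTM time step. This is a minimal sketch that uses only the symbols defined in the text (input weights W, recurrent weights U); bias terms are omitted as a simplifying assumption, and the helper names `matvec` and `lstm_step` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM step; W[g] and U[g] hold the input and recurrent weight
    matrices for g in 'i' (input), 'f' (forget), 'o' (output), 'c' (candidate).
    Biases are left out (assumption made for brevity)."""
    def pre(g):
        return [a + b for a, b in zip(matvec(W[g], x_t), matvec(U[g], h_prev))]

    i_t = [sigmoid(z) for z in pre("i")]
    f_t = [sigmoid(z) for z in pre("f")]
    o_t = [sigmoid(z) for z in pre("o")]
    c_hat = [math.tanh(z) for z in pre("c")]
    # Element-wise products: keep part of the old state, add part of the candidate.
    c_t = [f * c + i * ch for f, c, i, ch in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hidden size 1, input size 2, all weights zero: every gate equals 0.5.
W = {g: [[0.0, 0.0]] for g in "ifoc"}
U = {g: [[0.0]] for g in "ifoc"}
h_t, c_t = lstm_step([0.3, -0.1], [0.0], [2.0], W, U)
print(h_t, c_t)
```

With zero weights the forget gate is 0.5 and the candidate is 0, so the cell state is simply halved; this makes the role of each gate easy to verify by hand.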

Encoder of Transformer
The Transformer encoder comprises two primary components: the multi-head attention mechanism and the feed-forward network. Each component is followed by a residual connection and a layer normalization module. The structure of the Transformer encoder is depicted in Figure 4.
Appl. Sci. 2024, 14, 1294
denote the dimensions of the hidden layer. Given the query sequence $Q$, the key sequence $K$, and the value sequence $V$, the attention value is obtained as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{10}$$
where $\sqrt{d_k}$ denotes the scaling factor, whose main purpose is to prevent the gradient from vanishing during backpropagation.
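As an illustration of how the attention value is formed from queries, keys, and values, the following minimal sketch implements scaled dot-product attention on plain Python lists. The function names and the toy matrices are illustrative; a real model would use a tensor library and learned projections.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors; K and V must have the same number of rows.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
# A zero query scores every key equally, so the output is the mean of the value rows.
print(attention([[0.0, 0.0]], K, V))
```

Dividing by the square root of d_k keeps the scores in a range where the softmax is not saturated, which is the gradient-related motivation given in the text.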
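The multi-head attention mechanism obtains feature vectors from different representation subspaces and then concatenates and linearly transforms the results obtained from the multiple attention heads:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right) \tag{11}$$
where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O} \in \mathbb{R}^{d_o \times d_h}$ are weight matrices. The input vectors $E_s$, $E_a$, and $E_{sa}$ are passed through the multi-head attention of Equation (11) and then through the residual connection and normalization layer. Using $E_s$ as an example:
$$H = \mathrm{LayerNorm}\left(E_s + \mathrm{MultiHead}(E_s, E_s, E_s)\right) \tag{12}$$
The feed-forward network transforms the hidden layer representation produced by the multi-head attention mechanism. For an input sequence $H$, it is computed as follows:
$$\mathrm{FFN}(H) = \max\left(0, H W_1 + b_1\right) W_2 + b_2 \tag{13}$$
where $W_1$ and $W_2$ are weight matrices and $b_1$ and $b_2$ are bias values. Finally, the hidden layer representations $H_s$, $H_a$, and $H_{sa}$ are obtained through the residual connection and normalization layer. Taking the contextual hidden layer representation $H_s$ as an example, the computation is as follows:
$$H_s = \mathrm{LayerNorm}\left(H + \mathrm{FFN}(H)\right) \tag{14}$$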

Prediction of RUL Based on the Multi-Feature Fusion Method
The precise RUL of a bearing in real industrial production processes is uncertain at any particular point. Therefore, analyzing the historical operating data of existing bearings is crucial for industrial safety, as it allows the lifespan of other bearings to be projected accurately. To this end, a RUL prediction approach based on multi-feature fusion is proposed. It consists of four main parts: multi-sensor data fusion processing, construction of the parallel TCN-LSTM-Transformer feature extractor, design of the parallel multi-scale attention mechanism, and offline model training together with online bearing RUL prediction. The structure of the lifetime prediction network is illustrated in Figure 5. Firstly, the multi-sensor data are normalized, and the sensor data are combined channel by channel to fuse the data from several sensors. Secondly, a parallel multi-branch feature learning network is constructed using TCN, LSTM, and Transformer, where TCN extracts the local features of the data, LSTM captures the long-term dependencies present in the sequential data, and Transformer extracts the global feature representation of the bearing data. Meanwhile, a parallel multi-scale attention mechanism that captures both local and global dependencies is designed to further capture the global and local contextual information of the sequence and to perform adaptive weighted fusion of the output features of the three feature extractors. Next, the shallow features obtained by the parallel feature extractor are residually connected to the deeper features produced by the attention process, so that both the preceding and the subsequent feature information is used more efficiently. Finally, the fused features are passed through the fully connected layer to output the RUL prediction.

Fusion of Multisensor Data
Data processing is a critical step in life prediction, and it should rely on expert knowledge as little as possible. In contrast to a single sensor, multiple sensors can gather vibration data from various positions on the bearing, so that feature information from different viewpoints and locations can be obtained. By fusing these diverse characteristics, the condition of the bearing system can be described more comprehensively, providing a richer set of input features. Therefore, this paper employs a multi-sensor information fusion strategy to increase the amount of feature information in the model inputs. The fusion process is illustrated in Figure 6.
Assume that there are $M$ rolling bearings with performance degradation data, that the data of $C$ sensors are gathered for each bearing, and that every sensor has the same sampling period. The $t$th data sample collected by the $c$th sensor is represented as $X_t^c \in \mathbb{R}^{H \times 1}$, where $H$ is the length of each sample, and each sensor's data can be seen as a separate data channel. To lessen the impact of variations in the data distribution across bearings, each sensor's raw data are normalized, where $x_{t,i}^c$ is the $i$th value of the original data sample and $\tilde{x}_{t,i}^c$ is the $i$th value of the normalized data sample $\tilde{X}_t^c$. Each sensor's normalized data are then concatenated along the channel dimension to obtain the multichannel fusion data $x_t^m$, which can be expressed as:
$$x_t^m = \left[\tilde{X}_t^1, \tilde{X}_t^2, \ldots, \tilde{X}_t^C\right] \tag{16}$$
where $x_t^m \in \mathbb{R}^{H \times C}$ is the $t$th sample of the $m$th bearing after multichannel fusion, and all of the samples of the $m$th bearing are denoted by $\{x_t^m\}_{t=1}^{N}$, where $N$ is the total number of samples collected over time. Finally, the label of the $t$th sample of the $m$th bearing is denoted as $y_t^m$, and the labels of all the samples of the $m$th bearing are represented as $\{y_t^m\}_{t=1}^{N}$. Based on these samples and labels, the multi-sensor fusion data of the $m$th bearing can be expressed as $\{x_t^m, y_t^m\}_{t=1}^{N}$. For $M$ bearings, the channel-fused multi-sensor data of each bearing are used as the training set of a deep learning model, and the trained model is then used to predict the RUL of other bearings.
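The channel-wise fusion step can be sketched as follows. The normalization formula is not recoverable from this text, so the sketch assumes per-sample min-max scaling purely for illustration; the function names and the two example sensor channels (`horizontal`, `vertical`) are hypothetical.

```python
def min_max_normalize(sample):
    """Scale one sensor sample to [0, 1].

    Assumption: min-max scaling stands in for the paper's normalization,
    whose exact form is not given here.
    """
    lo, hi = min(sample), max(sample)
    if hi == lo:
        return [0.0 for _ in sample]
    return [(v - lo) / (hi - lo) for v in sample]


def fuse_channels(sensor_samples):
    """Turn C samples of length H (one per sensor) into an H x C fused sample."""
    normalized = [min_max_normalize(s) for s in sensor_samples]
    # zip(*...) transposes the C x H list so that each row holds one time index.
    return [list(row) for row in zip(*normalized)]


horizontal = [0.0, 2.0, 4.0]     # hypothetical sensor 1, H = 3
vertical = [10.0, 20.0, 15.0]    # hypothetical sensor 2, H = 3
x_t = fuse_channels([horizontal, vertical])
print(x_t)                       # H rows, C columns
```

Keeping each sensor as its own column preserves the per-sensor information, so the downstream feature extractors can weight the channels differently.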

Parallel TCN-LSTM-Transformer Feature Extractor
To bring the prediction model closer to the bearing's real operating behavior, this paper constructs a parallel multi-branch feature learning network using TCN, LSTM, and Transformer. The feature extraction model needs to handle multi-dimensional inputs and to extract both spatial and temporal information. The three models therefore process the data in parallel, and fusing their features yields a rich representation that exploits the advantages of each model. TCN performs feature extraction at different time steps through convolutional layers, which can capture both local and global temporal dependencies; LSTM builds long-term dependencies in the sequence through recurrent cells and is therefore suited to capturing long-range temporal features; and Transformer lets the features in a sequence interact globally through a self-attention mechanism, which captures global contextual relationships. With the parallel TCN-LSTM-Transformer network, features at multiple levels are extracted at the same time, resulting in a richer and more comprehensive representation. The model schematic of this parallel feature extractor is shown in Figure 7.

Parallel Multi-Scale Attention Mechanisms
In most research on parallel attention structures, every branch network uses the same multi-head self-attention mechanism to adjust the internal relationships within the data. If the same feature extraction method is used to fuse the outputs of every branch network, important features are not highlighted and redundant features are retained, which inevitably degrades the performance of the whole network. To tackle this issue, this research proposes a novel parallel multi-scale attention method. The principle of the attention mechanism is shown in Figure 8.
FNet [25] replaces the self-attention layer with a Fourier sublayer, which speeds up the Transformer encoder with only a small loss of accuracy; however, FNet may lose some positional information when processing sequence data, which affects the precision of the RUL forecast. Inspired by FNet, this paper therefore proposes a parallel multi-scale attention mechanism. To increase speed while limiting the loss of accuracy, the attention mechanism is combined with the FFT of the input features, because the Fourier transform is faster to compute than convolution and attention; this not only accelerates model training but also improves feature extraction capability and stability. The mechanism uses self-attention and FNet to encode sequences in parallel at several scales, so that sequence representations are learned at a wider range of scales. It consists of three primary components: self-attention, which captures global characteristics in the temporal domain; an FNet module, which captures local deep features in the frequency domain; and a position-wise feed-forward network, which captures positional characteristics. The module takes the output of the preceding layer as input and produces a fused output representation from these three components, where "Attention" refers to the self-attention mechanism, "FNet" denotes the FNet module, and "Pointwise" denotes the position-wise feed-forward network. The specifics of FNet are given below, and the FNet structure is shown in Figure 8. Given a sequence $\{x_n\}$, $n \in [0, N-1]$, the discrete Fourier transform (DFT) is defined as follows:
$$X(a) = \sum_{n=0}^{N-1} x_n\, e^{-j \frac{2\pi}{N} a n}, \quad 0 \le a \le N-1 \tag{18}$$
where $X(a)$ denotes the data after the DFT. From Equation (18), the time complexity of the DFT is $O(N^2)$. The FFT is a fast algorithm for the discrete Fourier transform that obtains the result recursively through butterfly operations, which reduces the number of DFT multiplications and lowers the time complexity to $O(N \log N)$.
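The complexity difference between the direct DFT and the FFT can be seen in a short sketch. The code below is an illustrative radix-2 Cooley-Tukey implementation, assuming the sequence length is a power of two; it is not the FNet module itself, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct DFT: N output values, each a sum over N inputs, so O(N^2)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * a * n / N) for n in range(N))
            for a in range(N)]


def fft(x):
    """Recursive radix-2 FFT (butterfly operations), O(N log N).

    Assumption: len(x) is a power of two.
    """
    N = len(x)
    if N == 1:
        return [complex(x[0])]
    if N % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])   # transform of the even-indexed samples
    odd = fft(x[1::2])    # transform of the odd-indexed samples
    out = [0j] * N
    for a in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * a / N) * odd[a]
        out[a] = even[a] + t            # butterfly: upper half
        out[a + N // 2] = even[a] - t   # butterfly: lower half
    return out


signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
max_diff = max(abs(u - v) for u, v in zip(dft(signal), fft(signal)))
print(max_diff)  # the two transforms agree up to floating-point rounding
```

Both functions compute the same transform; the FFT reaches it by splitting the sequence into even and odd halves at each level, which is where the logarithmic factor comes from.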

RUL Forecast
Figure 5 illustrates the procedure for model training and RUL prediction. The bearing vibration data gathered from multiple sensors are first fused into multi-sensor fusion data, which are then divided into training-bearing data and test-bearing data to construct the prediction network.
In the offline modeling process, the training data are input into the parallel feature extractor and the network is trained over multiple iterations. In each iteration, the loss function is computed and backpropagated to adjust the model parameters; training ends once the number of completed iterations reaches the preset total. In the online prediction process, the trained prediction network makes real-time predictions from the input test data. Assessment indicators are then used to validate the model's predictive performance, and the forecast results are presented graphically.
In this study, the bearings' complete life cycle data are used to train the network. The advantages are as follows: (1) Complete life cycle data encompass all of the failures a bearing may experience, so examining them allows more exhaustive monitoring of the bearing's condition. (2) The interval between the initial damage and the ultimate failure of a bearing is often exceptionally brief, and damage can spread rapidly, particularly when the bearing is on the verge of total failure. Predicting the RUL before the onset of deterioration may therefore provide enough time to schedule equipment repair. (3) Li et al. [26] used the concept of first prediction time (FPT) to divide life cycle data into two distinct stages: healthy and deteriorated. However, accurately establishing the FPT is challenging and may require considerable manual effort. This research circumvents the issue by training the network on data from the whole lifespan.
The assessment metrics are the mean absolute error (MAE) and the root mean square error (RMSE), defined as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Rul_i^{act} - Rul_i^{pre} \right| \tag{19}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Rul_i^{act} - Rul_i^{pre} \right)^2} \tag{20}$$

where $Rul_i^{act}$ represents the actual value of the remaining life, $Rul_i^{pre}$ represents the predicted value of the remaining life, and $n$ represents the overall duration of the prediction, which is equivalent to the number of samples.

In addition, early and late RUL predictions have different impacts on the machine during the actual operation of machinery and equipment. Therefore, predictions made in the early stages of the machine's life cycle are given a lower weight, while predictions made in the later stages are given a higher weight, so that possible malfunctions and failures are captured more accurately. The scoring function is taken from reference [21]. It differs from the scoring function of PHM 2012 [27] in that it takes into account the early stages of the machine life cycle, the later stages, and the whole life cycle. It is specified by Equations (21) and (22), which are piecewise in the prediction error $er_i$ (with a branch for $er_i \le 0$) and use exponential terms of the form $\exp(\ln(0.6) \cdot (\cdot))$. In these equations, the weights of the early and late bearing phases are denoted by $\alpha$ and $\beta$, respectively, and the early-stage proportion is denoted by $m$. Here, $\alpha$ is set to 0.35 and $\beta$ to 0.65, i.e., the late stage of the life cycle is treated as more significant than the early stage, and the higher the score, the more accurate the forecast.
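The two error metrics can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code; the function names and the sample RUL values are assumptions made for the example, and the scoring function is left out because its exact form is given only in reference [21].

```python
import math


def mae(actual, predicted):
    """Mean absolute error between actual and predicted RUL values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error between actual and predicted RUL values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical normalized RUL values, for illustration only.
actual_rul = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]
predicted_rul = [0.9, 0.85, 0.5, 0.45, 0.1, 0.05]

mae_value = mae(actual_rul, predicted_rul)
rmse_value = rmse(actual_rul, predicted_rul)
```

Because RMSE squares each error before averaging, it is never smaller than MAE and penalizes a few large errors more heavily, which is why the two metrics are usually reported together.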

Experimental Verification
In this section, we use the PHM 2012 Challenge dataset provided by the Institute of Electrical and Electronics Engineers (IEEE) and the bearing degradation dataset provided by Xi'an Jiaotong University to verify the prediction capability of the proposed method. The experiments are implemented in the Torch deep learning framework on a computer equipped with an i5-13400F CPU, an NVIDIA GeForce RTX 3060 Ti GPU, and 16 GB of RAM.

Parameter Configuration of Multi-Feature Fusion Networks
In both cases, the hyperparameters of the multi-feature fusion network are the same, as detailed in Table 1. The hyperparameters were selected by cross-validation on numerous training datasets, taking the prediction accuracy into account.

The bearing dataset for the IEEE PHM 2012 Data Challenge was acquired on the PRONOSTIA test platform, which is shown in Figure 9. Vibration acceleration data covering the whole life cycle of the rolling bearings, from normal operation to failure, were obtained under various operating conditions through accelerated life degradation tests; a test was halted when the amplitude of the vibration signal exceeded 20 g. The vibration signals were recorded in the horizontal and vertical directions. The data were sampled at 25.6 kHz, with one recording taken every 10 s; each recording lasts 0.1 s and therefore contains 2560 data points. Table 2 lists the operating conditions of the dataset.
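The record length and the idea of channel-wise fusion can be checked with a short sketch. The sampling figures come from the text; the function `fuse_channels`, the `[channel][time step]` layout, and the synthetic signals are assumptions for illustration and may differ from the fusion shown in Figure 6.

```python
SAMPLING_RATE_HZ = 25_600  # 25.6 kHz, as stated for the PHM 2012 data
RECORD_DURATION_S = 0.1    # duration of one recording

# Number of points in one recording of one sensor.
samples_per_record = round(SAMPLING_RATE_HZ * RECORD_DURATION_S)


def fuse_channels(horizontal, vertical):
    """Stack two single-sensor records into one two-channel sample.

    Assumed layout: fused[channel][time_step], with channel 0 horizontal
    and channel 1 vertical.
    """
    if len(horizontal) != len(vertical):
        raise ValueError("both sensor records must have the same length")
    return [list(horizontal), list(vertical)]


# Synthetic stand-ins for one horizontal and one vertical recording.
horizontal = [0.01 * k for k in range(samples_per_record)]
vertical = [-0.02 * k for k in range(samples_per_record)]

fused = fuse_channels(horizontal, vertical)
```

Keeping each sensor as its own channel preserves the time alignment between the two directions, so a downstream network can learn from both signals at every time step.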

Analysis of Projected Results
To comprehensively evaluate the performance of our prediction network, we carefully selected bearing vibration data from two different operating conditions, namely Condition 1 and Condition 2. Under these two conditions, we conducted five test experiments on each bearing to ensure the reliability of the data and the stability of the experimental results.Notably, in each experiment, we used a brand-new, untrained network.This approach was taken to ensure the consistency and repeatability of our experimental results.
When testing a specific bearing data set, we used the data from all other bearings to train the network.For example, if bearing A1-1 was chosen as the test set, then bearings A1-2, A1-3, A1-4, A1-5, A1-6, and A1-7 were used as the training set.This strategy ensured that each bearing data set could be used both for training and testing, maximizing the utilization of data and enhancing the effectiveness of the experiments.
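The train/test rotation described above amounts to a leave-one-bearing-out split. The sketch below follows the A1-1 example from the text and assumes the rotation is applied within one operating condition; the function name and the bearing list are illustrative.

```python
def leave_one_bearing_out(bearing_ids):
    """Yield (test_bearing, training_bearings) pairs, one per bearing."""
    for test_id in bearing_ids:
        train_ids = [b for b in bearing_ids if b != test_id]
        yield test_id, train_ids


# Bearings of operating condition 1 in the PHM 2012 dataset (A1-1 ... A1-7).
condition_1 = [f"A1-{k}" for k in range(1, 8)]

splits = list(leave_one_bearing_out(condition_1))
first_test, first_train = splits[0]
```

Each bearing serves as the test set exactly once, and a fresh network would be trained for every split so that no test data leak into training.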
To demonstrate the effectiveness of our method, we selected four different approaches and compared their RUL prediction results with our multi-feature fusion-based prediction results. The comparison results are shown in Table 3. In this experiment, a total of 14 bearings were examined. As Table 3 illustrates, for most of the tested bearings, the MAE and RMSE of the proposed approach are lower than those of the comparison methods, and as Figure 10 illustrates, the proposed method has the lowest average RMSE and MAE and the highest score. The proposed method achieves higher scores than the comparison methods, with two exceptions: for test bearing A1-1 its score is lower than that of TCN-RSCB, and for test bearing A2-3 its score equals that of TCN-RSCB.

These findings indicate that the proposed method has the best prediction performance, for several reasons. First, the multi-sensor data fusion method effectively fuses the multi-sensor data, and the multi-feature fusion network uses a parallel TCN-LSTM-Transformer feature extractor, in which TCN extracts the long-term sequence features of the data, LSTM captures the time-dependent features in the sequence data, and Transformer extracts the global feature representation of the bearing data. Meanwhile, the parallel multi-scale attention mechanism captures both local and global dependencies, and thus the global and local contextual information of the sequence, to achieve adaptive weighted fusion of the output features of the three feature extractors. Second, the shallow features obtained from the parallel feature extractor are residually connected to the deep features produced by the attention mechanism, which improves the efficiency with which earlier and later feature information is used. Finally, the RUL is output by the fully connected layer.

Among the comparison methods, CNNs suffer from insufficient modeling power for long-term dependence, a fixed receptive field size, a large number of parameters, and translational invariance, leading to poor RUL prediction on time series data. Although TCN-SA, TCN-RSA, and TCN-RSCB introduce attention mechanisms and residual connections to improve on traditional TCN models, they have drawbacks: they do not consider global information, do not mine the information deeply enough, and do not use the earlier and later feature information efficiently. In conclusion, the multi-feature fusion network achieves better prediction results because its design enhances the all-round learning of sequence features, its attention mechanism mines deep features, and its residual connections improve the efficiency with which earlier and later features are used.

The prediction results for bearings A1-4, A1-5, A2-4, and A2-6 are shown in Figure 11. Although the prediction curves of the proposed method exhibit some local oscillations, the method accurately estimates the condition of the bearings in the later stages of their life and therefore has a good capacity to predict the RUL of the bearings. Figure 11 demonstrates that the proposed multi-feature fusion network successfully captures the bearing degradation information across various operating conditions and accurately predicts the RUL.

Ablation Experiment
To evaluate the efficacy of each novel aspect of the proposed methodology, we removed or altered one component at a time. Data for the ablation experiments were taken from the PHM 2012 dataset, with A1-4 as the test bearing. Method 1 uses vertical vibration data from a single vertical sensor as input, whereas the proposed method uses multi-sensor fusion data. Method 2 uses TCN blocks for network feature extraction, whereas the proposed method uses TCN-LSTM-Transformer. Method 3 uses the traditional self-attention mechanism for feature fusion, whereas the proposed method uses a parallel multi-scale attention mechanism to achieve adaptive weighted fusion of the output features from the three feature extractors. The prediction results are shown in Table 4; the MAE and RMSE of the proposed approach are markedly better than those of the comparison methods.

The second dataset is the XJTU-SY rolling bearing accelerated life test dataset [30]. The test rig used to collect it is shown in Figure 12. The platform performs accelerated degradation tests on bearings by generating radial force with a hydraulic loading system, and the speed of the AC motor is set by its speed controller. The vibration signals were collected in the horizontal and vertical directions at a sampling frequency of 25.6 kHz, with a sampling interval of 1 min and a sampling time of 1.28 s per recording.
Figure 13 shows photographs of bearing inner ring wear, cage fracture, outer ring wear, and outer ring cracking. These failure modes lead to different degradation patterns under the various operating conditions. Two operating conditions were selected for the XJTU-SY test, as shown in Table 5. In the first, the motor speed is 2100 r/min and the load is 12 kN; in the second, the motor speed is 2250 r/min and the load is 11 kN. Each condition contains five bearings.
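As a quick worked check of the sampling figures quoted for this dataset, the number of points in one recording per direction follows directly from the sampling frequency and the sampling time (the symbol $N_{\text{rec}}$ is introduced here only for illustration):

$$N_{\text{rec}} = 25.6\ \text{kHz} \times 1.28\ \text{s} = 25\,600\ \text{s}^{-1} \times 1.28\ \text{s} = 32\,768 = 2^{15}$$

Each recording is thus considerably longer than the 2560-point recordings of the PHM 2012 dataset, while recordings are taken less often (every 1 min instead of every 10 s).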

Analysis of Projected Results
To assess the prediction network's generalization capacity as fully as possible, bearing vibration data under working conditions 1 and 2 were selected. When RUL prediction is performed on the XJTU-SY dataset, the network parameters, MAE, RMSE, and scoring function of the multi-feature fusion network are identical to those of Case 1. In addition, the multi-feature fusion prediction results are compared with those of the four approaches used in Case 1; Table 6 displays the results. In this experiment, a total of 10 bearings were tested. Table 6 shows that the proposed method has lower MAE and RMSE values than the comparison methods for the majority of the tested bearings. Figure 14 shows that the proposed method achieves the lowest average MAE and average RMSE and the highest score, which suggests superior performance in predicting the RUL of bearings. The scores of the proposed method surpass those of the comparison methods, with three exceptions: for test bearing B1-1 its score is lower than that of TCN-SA, for B2-5 it is lower than that of TCN-RSCB, and for B1-5 it equals that of TCN-SA. Furthermore, because bearings experience a diverse range of failures as they degrade, it is very challenging for any prediction method to achieve the best RUL forecast for every individual bearing. Figure 15 displays the RUL prediction results for bearings B1-1, B1-2, B2-2, and B2-5. Despite local oscillations in the curves of the proposed method, the state of the bearings in the later stages of their service life is predicted accurately. Overall, the proposed method demonstrates a commendable capacity to predict the RUL of rolling bearings.

Ablation Experiment
To assess the generalization ability of the proposed method, the ablation experiments follow the same design as for the PHM 2012 dataset, with B1-2 as the test bearing. Table 7 displays the prediction results; the MAE and RMSE of the proposed method are notably better than those of the comparison methods.

Conclusions
To improve real-time monitoring capability in industrial processes and to identify the state of health of rolling bearings during operation accurately and efficiently, a multi-feature fusion-based RUL prediction method for rolling bearings is put forward. The following conclusions are drawn from the experimental findings on the PHM 2012 bearing degradation dataset and the XJTU-SY bearing accelerated life test dataset:
(1) A multi-sensor data fusion method has been developed that combines data from multiple sensors based on their channels. This approach not only fuses multi-sensor data effectively but also compensates for the limitations of standard prediction methods that rely on data from a single sensor.
(2) To address the incomplete extraction of global feature information during feature extraction, this paper uses TCN, LSTM, and Transformer to construct a parallel multi-branch feature learning network, designs a parallel multi-scale attention mechanism to capture both local and global dependencies, and realizes the adaptive weighted fusion of the output features from the three types of feature extractors. The shallow features obtained by the parallel feature extractor are then residually connected with the deeper features through the attention mechanism, which improves the efficiency with which earlier and later feature information is used and allows more comprehensive feature information to be learned.
(3) The validity and generalization ability of the proposed method were verified on the PHM 2012 bearing degradation dataset and the XJTU-SY bearing accelerated life test dataset. The experimental findings demonstrate that the proposed method can precisely forecast the RUL of a variety of bearings, and the comparative experiments demonstrate that it has a lower prediction error.

Figure 1. Illustration of the dilated causal convolution operation.

Figure 2. Schematic diagram of the residual block in TCN.

Figure 4. Structure of the Transformer encoder.

Multi-head attention is a stacking operation performed by multiple attention mechanisms to obtain the attention weight of each word in a sentence. The inputs to the attention mechanism are three matrices of equal dimensions, $Q = [q_1, q_2, \cdots, q_n]$,

Figure 6. Fusion of data from multiple sensors via channel fusion.

Figure 7. The diagram of the hybrid neural network.

Figure 8. The schematic diagram of the parallel multi-scale attention mechanism.

Figure 11. Visualization of RUL prediction results for test bearings in the PHM 2012 dataset.

Figure 15. Visualization of RUL prediction results for test bearings in the XJTU-SY dataset.
$C_{t-1}$ represents the memory unit of the previous moment, $\hat{C}_t$ represents the candidate memory unit, $h_t$ denotes the external state, and the $W$ terms (the last being $W_c$) denote the input weight vectors of the input gate, the forget gate, the output gate, and the candidate unit, respectively.

Table 1. Parameters of the multi-feature fusion network model.

Case Study 1: Predicting the RUL of Bearings Using the PHM 2012 Dataset

Introduction to the Dataset

Table 3. Prediction results of several algorithms on the PHM 2012 dataset.

Table 4. Results of ablation experiments with different comparison methods on the PHM 2012 dataset.

Table 5. Operating conditions for the XJTU-SY bearing dataset.

Table 6. Prediction results of different methods on the XJTU-SY dataset.

Appl. Sci. 2024, 14

Table 7. Results of ablation experiments with different comparison methods on the XJTU-SY dataset.