A Method for Predicting the Remaining Life of Rolling Bearings Based on Multi-Scale Feature Extraction and Attention Mechanism

Abstract: To address the difficulty of identifying the start point of the degradation stage and the inadequate extraction of degradation features in current rolling bearing remaining life prediction methods, a rolling bearing remaining life prediction method based on multi-scale feature extraction and an attention mechanism is proposed. Firstly, this paper takes the normalized bearing vibration signal as input and adopts a quadratic function as the RUL prediction label, which avoids identifying the degradation stage start point. Secondly, the spatial and temporal features of the bearing vibration signal are extracted using a dilated convolutional neural network and an LSTM network, respectively, and the channel attention mechanism is used to assign weights to each degradation feature so that multi-scale information is used effectively. Finally, the mapping of bearing degradation features to remaining life labels is achieved through a fully connected layer for the RUL prediction of bearings. The proposed method is validated using the PHM 2012 Challenge bearing dataset, and the experimental results show that the predictive performance of the proposed method is superior to that of other RUL prediction methods.


Introduction
As a key component of mechanical equipment, rolling bearings carry loads and transfer kinetic energy and are known as the "joints of industrial equipment". However, rolling bearings often operate under high loads for long periods, which leads to a variety of failures [1]. Once a rolling bearing fails, it can cause not only economic losses but also safety accidents. Some statistics indicate that bearing failures in machinery and equipment account for 30% to 40% of all failures [2]. Therefore, accurate prediction of the remaining useful life (RUL) of rolling bearings is essential for reducing equipment maintenance costs and ensuring the reliable operation of the equipment.
At present, prediction methods for the RUL of rolling bearings can be divided into two main types [3]: RUL methods based on mechanistic modeling [4], and data-driven RUL methods [5]. The RUL method based on mechanistic modeling is based on the failure mechanism of the equipment [6]. However, in practical engineering applications, the performance degradation mechanism of bearings is more complex, and it is difficult to establish an accurate mechanistic model. The data-driven RUL prediction method can extract the degradation characteristics of the equipment from a large amount of monitoring data and build the corresponding RUL prediction model. Therefore, data-driven RUL methods are more suitable for complex mechanical systems. The data-driven RUL approach consists of two key steps [7]: firstly, the construction of health indicators that can represent the trend of bearing degradation, and secondly, the establishment of an effective RUL prediction model.
Traditional lifespan prediction methods mainly use signal analysis methods to construct health indicators. For example, in [8], the peak and root mean square (RMS) values of wavelet coefficients are fed into a recurrent neural network (RNN) to predict the remaining life of the bearing. In [9], health indicators were constructed by extracting the time and frequency domain features of the bearing signals, and the extracted health indicators were input into a deep autoencoder (DAE), which effectively predicted the RUL of the bearings. Although these methods of constructing health metrics can infer correlations and causal relationships hidden in the data, this requires the manual extraction of bearing features and relies on empirical knowledge [10], which lacks adaptiveness. To avoid the above, we can use the method of deep learning (DL) to directly learn the mechanical degradation features from the original data.
In recent years, DL theory has been extensively applied in the fields of data mining, image processing, and target recognition [11][12][13]. Deep learning-based RUL prediction abandons the manual feature extraction of traditional RUL methods and instead builds a deep neural network to obtain multi-level degradation features from the original time series. Convolutional neural networks (CNNs) have a good ability to extract degradation features from equipment and are widely used in the field of health monitoring and management of mechanical equipment. In the literature [14], the degradation features of bearings were learned by a CNN; then, these features were constructed into health indicators by non-linear mapping. The literature [15] formed a convolutional autoencoder structure by fusing CNN models and autoencoders to better extract the degradation features of electric valves. However, an ordinary CNN struggles to extract the degradation information of the device in a complex environment. As the number of layers in the network increases, model degradation occurs during training. At the same time, the elements of the convolution kernel of an ordinary CNN are adjacent to each other, so the receptive field is fixed. To acquire a wider receptive field and extract more feature information, the convolution kernel size must increase, which also increases the number of model parameters.
To address the above issues, some scholars have proposed the dilated convolution operation [16,17]. Bearing vibration signals are time series data, and RNNs have been used to handle time series information with good results. In [18], the health indicators of the device are fed into an RNN to predict the RUL of the device. However, RNNs suffer from the vanishing gradient problem when processing long sequences [19]. To overcome this problem, some scholars have introduced Long Short-Term Memory (LSTM) networks with gating units. LSTM can learn long-term dependencies and effectively handle long-sequence data. The literature [20] combines CNN and LSTM to predict the remaining lifetime of rolling bearings. The attention mechanism was first applied to machine translation and is now applied extensively in the handling of various time series [21]. By calculating the attention probabilities of different features, the attention mechanism assigns different weights to different features in the model, reinforces the more important features, and suppresses relatively unimportant ones, which helps to improve the prediction performance of the model. In [22], a recurrent neural network based on an attention mechanism is proposed to predict the remaining life of a bearing.
The above methods have produced good results when predicting RUL for bearings; however, they all perform only single-scale feature extraction, which will inevitably result in the omission of certain important information. Moreover, the above methods do not consider the differences in the contribution of various features to the RUL prediction task, which will introduce adverse effects to the prediction results. In this paper, we propose a rolling bearing remaining life prediction method based on multi-scale feature extraction and an attention mechanism to extract temporal and spatial features from the normalized bearing vibration signals. The method then employs an attention mechanism to achieve a reasonable allocation of attention resources to the model and to enhance the influence of key information on bearing RUL prediction. The mapping of bearing degradation features to remaining life labels is realized through a fully connected layer to achieve the RUL prediction of bearings. The effectiveness of the proposed method in this paper is validated on the PHM2012 bearing dataset.
The rest of the paper is organized as follows: in Section 2, the network structure of the bearing RUL prediction method is constructed, and a flow chart of the bearing RUL prediction method is given. In Section 3, the experimental data are firstly pre-processed, followed by the construction of quadratic labels, and finally, the experimental results of the proposed method and the comparison tests are given. Section 4 concludes the whole paper.

Convolutional Neural Networks
CNN, as an important branch of deep learning, is extensively used in fault diagnosis [23] and the lifetime prediction of mechanical equipment [24]. CNN comprises an input layer, convolutional layer, pooling layer, fully connected layer, and output layer. Figure 1 shows the basic structure of CNN. The functions of each layer are as follows.
(1) Input layer: utilized mainly for data entry.
(2) Convolutional layer: It has the advantages of local area connectivity and weight sharing. The convolution layer is composed of a group of convolution kernels, which are the main tools for feature extraction. The specific operations are shown below.

In the convolution operation, $W$ denotes the convolution kernel size, $G_m^{l(n')}$ denotes the $n'$-th weight of the $m$-th convolution kernel of the $l$-th layer, and $x^{l(r^n)}$ denotes the $n$-th local receptive field of layer $l$.
(3) Pooling layer: In the pooling operation, $p^{l(m,n)}$ represents the output value of the pooling layer, $a^{l(m,t)}$ represents the activation value, and $H$ denotes the width of the pooling domain.
(4) Fully connected layer: It maps the feature space extracted from the data after convolution and pooling to the sample space. The specific operations are shown below.
$$h^l = \sigma^l\left(W^l v^{l-1} + b^l\right)$$
where $h^l$ denotes the output features of the $l$-th hidden layer, $\sigma^l$ is the activation function of the $l$-th layer, $W^l$ denotes the connection weight between neurons in layer $l$ and neurons in layer $l-1$, $v^{l-1}$ is the output vector of layer $l-1$, and $b^l$ is the offset.
(5) Output layer: mainly used to output the final prediction results.

Dilated Convolution
The elements of an ordinary convolution kernel are adjacent to each other, so the receptive field it covers is fixed. To enlarge the receptive field and capture more feature information, the only option is to increase the size of the convolution kernel, which also increases the number of model parameters. To overcome this problem, dilated convolution has been proposed [25,26]. This operation inserts gaps between the elements of the convolution kernel, controlled by a dilation rate, without increasing the number of kernel parameters. The comparison of conventional convolution and dilated convolution is shown in Figure 2. As can be observed from Figure 2, dilated convolution obtains a larger receptive field while keeping the number of kernel parameters unchanged, so it has been used in many fields.
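The effect of the dilation rate can be illustrated with a minimal one-dimensional sketch. The function names `dilated_conv1d` and `receptive_field`, the toy signal, and the kernel values are illustrative assumptions rather than part of the proposed network; the sketch only shows that the same three weights cover a wider span of the input when the dilation rate grows.

```python
def receptive_field(kernel_size, dilation):
    """Number of input points spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D dilated convolution (cross-correlation form used in CNN layers)."""
    span = receptive_field(len(kernel), dilation)
    n_out = len(signal) - span + 1
    if n_out <= 0:
        raise ValueError("signal is shorter than the receptive field")
    # Each output sums kernel weights applied to inputs spaced `dilation` apart.
    return [
        sum(w * signal[start + k * dilation] for k, w in enumerate(kernel))
        for start in range(n_out)
    ]


signal = [1, 2, 3, 4, 5, 6, 7, 8]   # toy input (assumption)
kernel = [1, 0, -1]                 # three weights in both cases

ordinary = dilated_conv1d(signal, kernel, dilation=1)  # spans 3 input points
dilated = dilated_conv1d(signal, kernel, dilation=2)   # spans 5 input points
print(receptive_field(3, 1), ordinary)
print(receptive_field(3, 2), dilated)
```

With a dilation rate of 2 the kernel still has three parameters, but each output depends on inputs spread over five positions, which is the trade-off the paragraph describes.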

Figure 1. Basic structure of CNN.



LSTM Networks
LSTM networks take into account the connection between outputs and inputs in a time series and have been applied extensively in the health management prediction of mechanical equipment [27,28]. Figure 3 shows the structure of an LSTM network. The LSTM network updates the network state mainly through the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$. The cell state $c_t$ and the output state $h_t$ in the LSTM network are obtained by updating the cell state $c_{t-1}$ and the output state $h_{t-1}$ of the previous moment. The specific update process is as follows.
$$f_t = \sigma\left(\omega_f \cdot [h_{t-1}, x_t] + b_f\right)$$
$$i_t = \sigma\left(\omega_i \cdot [h_{t-1}, x_t] + b_i\right)$$
$$\tilde{c}_t = \tanh\left(\omega_c \cdot [h_{t-1}, x_t] + b_c\right)$$
$$c_t = f_t * c_{t-1} + i_t * \tilde{c}_t$$
$$o_t = \sigma\left(\omega_o \cdot [h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t * \tanh(c_t)$$
where $\tilde{c}_t$ denotes the candidate state, $x_t$ denotes the input time series signal, $h_t$ denotes the output updated by the network at time $t$, and the sigmoid and tanh functions are denoted by $\sigma$ and $\tanh$, respectively. $\omega_i$, $\omega_o$, $\omega_f$, and $\omega_c$ denote the weight matrices of the input gate, output gate, forget gate, and cell state, respectively; $b_i$, $b_o$, $b_f$, and $b_c$ denote the offsets of the input gate, output gate, forget gate, and cell state, respectively. "$*$" denotes element-wise multiplication of two matrices of the same order, and "$\cdot$" denotes the ordinary matrix product.
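The gate updates described above can be sketched as a single LSTM time step in NumPy. This is a minimal illustration that assumes the standard LSTM formulation with one weight matrix per gate acting on the concatenation of the previous output and the current input; the function name `lstm_step`, the dictionary layout of the weights, and the random toy data are assumptions made for the example, not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w, b):
    """One LSTM update; w and b are dicts keyed by 'f', 'i', 'c', 'o'."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(w["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(w["i"] @ z + b["i"])       # input gate
    c_tilde = np.tanh(w["c"] @ z + b["c"])   # candidate state
    c_t = f_t * c_prev + i_t * c_tilde       # element-wise cell state update
    o_t = sigmoid(w["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # output state
    return h_t, c_t


rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3                        # toy sizes (assumption)
w = {k: rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in)) for k in "fico"}
b = {k: np.zeros(n_hidden) for k in "fico"}

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_in)):       # run over a 5-step toy sequence
    h, c = lstm_step(x_t, h, c, w, b)
print(h.shape, c.shape)
```

Because the cell state is updated additively through the forget and input gates, information can be carried across many time steps, which is why LSTM handles long sequences better than a plain RNN.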


Attentional Mechanisms
Similar to the human visual mechanism, the attention mechanism gives more attention to key information that is beneficial to the task and less attention to unimportant information, thus enabling the extraction of effective features [29]. The attention mechanism is not a specific model but an idea, and therefore, it can be combined with many network models. The current mainstream attention mechanisms can be divided into the following three types: channel attention, spatial attention, and self-attention. The channel attention mechanism aims to automatically obtain the importance of each feature channel by means of network learning, and then assigns different weight coefficients to each channel to reinforce the important features and suppress the unimportant ones [30]. The core idea of the channel attention mechanism is to help the network focus on the information related to the current input, assign different weights to different features, and multiply the input vector by the weights to achieve the importance assignment. The implementation process of the channel attention mechanism can be divided into two parts: the generation of attention weights and the assignment of weights. This is shown in the following equation.
where $X$ is the input vector, $h(\cdot)$ is the attention mechanism network, $Z_1$ is the output vector, $A$ is the attention weight, and $Z$ is the feature vector of the input vector $X$.
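The two parts of channel attention, weight generation and weight assignment, can be sketched in NumPy as follows. The sketch assumes a squeeze-and-excitation style design (global average pooling, a small two-layer bottleneck, and a sigmoid), which is one common way to realize channel attention and is not confirmed to be the exact structure used in this paper; `channel_attention`, `w1`, `w2`, and the toy sizes are hypothetical names and values.

```python
import numpy as np


def channel_attention(features, w1, w2):
    """features: array of shape (channels, length).

    Returns the re-weighted features and the per-channel attention weights.
    """
    squeezed = features.mean(axis=1)                 # global average pooling per channel
    hidden = np.maximum(0.0, w1 @ squeezed)          # ReLU bottleneck layer
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gives weights in (0, 1)
    weighted = features * weights[:, None]           # scale every channel by its weight
    return weighted, weights


rng = np.random.default_rng(0)
channels, length, reduced = 8, 16, 2                 # toy sizes (assumption)
z = rng.normal(size=(channels, length))              # feature map Z
w1 = rng.normal(size=(reduced, channels))            # untrained weights for illustration
w2 = rng.normal(size=(channels, reduced))

z1, a = channel_attention(z, w1, w2)                 # Z_1 and attention weights A
print(a.round(3))
```

In a trained network `w1` and `w2` would be learned jointly with the rest of the model, so that channels carrying useful degradation information receive weights closer to 1.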

Network Model Construction
The network model of the rolling bearing remaining life prediction method based on multi-scale feature extraction and attention mechanism proposed in this paper is shown in Figure 4. Firstly, in order to extract more comprehensive bearing degradation features from the original data, this paper uses dilated convolution and a long short-term memory (LSTM) network to extract the spatial and temporal features of the bearing signal, respectively. Dilated convolution has a large receptive field and does not increase the number of trainable parameters of the network, while LSTM has a good ability to extract temporal features. Next, global average pooling (GAP) is used to structurally regularize the network to prevent overfitting and to give each channel an actual category meaning. Then, the channel attention mechanism is used to implement adaptive weight assignment for bearing degradation features. Finally, a fully connected layer is used to implement the mapping of bearing degradation features to remaining life labels.

Prediction Process of Bearing RUL Based on Multi-Scale Feature Extraction and Attention Mechanism
Figure 5 shows the flow chart of the bearing RUL prediction method based on multi-scale feature extraction and attention mechanism designed in this paper; the specific steps are as follows:
Step 1: Obtain the bearing vibration signal and normalize the original signal.
Step 2: Construct the quadratic degradation labels corresponding to the bearing vibration data and divide the normalized bearing vibration signal into the training set, test set, and validation set.
Step 3: Input the training set bearing vibration data into the dilated CNN and LSTM network for adaptive extraction of spatial and temporal features; adjust the network parameters (including the learning rate, the number of iterations, and the size of the convolution kernel).
Step 4: Assign weights to the bearing degradation features extracted by the multi-scale feature extraction module through the channel attention mechanism.
Step 5: Use a fully connected layer to map the bearing degradation features to the remaining life labels for the RUL prediction of bearings.
Step 6: Use the validation set to verify the model training effect and fine-tune the model parameters according to the validation results.
Step 7: Use the test set to test the performance of the trained model, calculate the model evaluation metrics, output the results, and end the process.


Test Data
The bearing vibration data used to validate the proposed method are obtained from the PHM 2012 bearing dataset of the PRONOSTIA platform. The platform provides realistic bearing degradation data that can be used to validate various algorithms for bearing health assessment, remaining life prediction, and fault diagnosis. The PRONOSTIA experimental platform is shown in Figure 6. The platform allows the bearing to rotate at high speed and is fitted with two DYTRAN type 3035B high-frequency accelerometers to collect the bearing signals in the horizontal and vertical directions. A vibration sample is recorded every 10 s; each sample lasts 0.1 s at a sampling frequency of 25.6 kHz, so that 2560 data points are recorded per sample. At the start of bearing rotation, all bearings are healthy and free of defects. The bearings undergo accelerated degradation during rotation, and once the amplitude of the bearing signal exceeds 20 g, the bearing is considered damaged and the experiment ends.
Figure 5. Flow chart of the RUL prediction method in this paper.

The PHM 2012 bearing dataset was collected under three different operating conditions. The specific information on the bearings under these three operating conditions is shown in Table 1. The article selects the bearing vibration data collected under operating condition 1 for experimental verification. Although the PHM 2012 bearing dataset contains vibration data in both the horizontal and vertical directions, vibration signals in the horizontal direction have been reported to provide more useful information than those in the vertical direction [31]. Therefore, only monitoring data collected in the horizontal direction are used in the article. The time domain signals of bearings 1-1 and 1-3 are shown in Figure 7a,b. From Figure 7, it can be seen that the amplitude of the bearing vibration signal changes significantly with time and the signal tends to diverge, which is beneficial to the extraction of health feature information with a degradation trend.


Data Preprocessing
To avoid the impact of inconsistent feature scales on prediction accuracy, the article uses the min-max normalization method to normalize the bearing signals. The min-max normalization is calculated as follows.
$$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
where $x$ is the original bearing life signal, $x_{min}$ is the minimum value in the original bearing life signal, $x_{max}$ is the maximum value in the original bearing life signal, and $x_{new}$ is the normalized bearing life signal.
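As a small worked example, min-max normalization can be written in a few lines of Python. The function name `min_max_normalize` and the sample values are illustrative assumptions; the guard against a constant signal is an addition for robustness and is not discussed in the paper.

```python
def min_max_normalize(x):
    """Scale a sequence to [0, 1] using its own minimum and maximum."""
    x_min, x_max = min(x), max(x)
    if x_max == x_min:
        raise ValueError("constant signal: min-max normalization is undefined")
    return [(v - x_min) / (x_max - x_min) for v in x]


sample = [-2.0, 0.0, 1.0, 6.0]      # toy vibration amplitudes (assumption)
normalized = min_max_normalize(sample)
print(normalized)                   # [0.0, 0.25, 0.375, 1.0]
```

The smallest value maps to 0 and the largest to 1, so signals with different amplitude ranges become comparable at the network input.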

Construction of Data Labels
After obtaining the raw vibration data of the bearings, they need to be divided into the training set, test set, and validation set. However, since the raw data do not have corresponding labels, degradation labels corresponding to the vibration data need to be constructed. At present, the commonly used degradation labels mainly include linear degradation labels and segmental degradation labels, as shown in Figure 8a,b, respectively. The linear degradation label does not need to identify the degradation start point, but it treats the normal phase data as data that also need to be predicted, which greatly increases the training time and is not conducive to network training. The segmental degradation label is trained only for the degradation phase, which reduces the prediction time and also improves the prediction accuracy, but it needs to identify the start point of the degradation phase, which increases the labor cost.
Electronics 2022, 11, 3616
To address the above issues, the article uses a quadratic function as the degradation label of the bearing, as shown in Figure 9. As can be observed from Figure 9, the label is more in line with the degradation trend of the bearing. In the early stage of bearing degradation, the degradation effect is not obvious and the trend is relatively gentle, while in the late stage the bearing degrades rapidly. The label takes into account the entire degradation trend of the bearing and does not require the identification of the start of the degradation phase. The formula for the quadratic degradation label is as follows.
where $y_i$ is the remaining life of the bearing at moment $t_i$, $T$ is the time of complete bearing failure, and $t_i$ is the sampling time.
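To make the difference between label shapes concrete, the sketch below builds a linear label and one possible quadratic label for a run of equally spaced samples. The specific form $y_i = 1 - (t_i/T)^2$ is an assumption chosen to match the description above (gentle decline early, rapid decline late, reaching 0 at failure); it is not taken from the paper's equation, and the function names are hypothetical.

```python
def linear_rul_labels(n_samples):
    """Linear label: falls at a constant rate from 1 to 0."""
    if n_samples < 2:
        raise ValueError("need at least two samples")
    T = n_samples - 1  # failure time in units of the sampling interval
    return [1.0 - i / T for i in range(n_samples)]


def quadratic_rul_labels(n_samples):
    """Assumed quadratic label y_i = 1 - (t_i / T)^2 (illustrative form only)."""
    if n_samples < 2:
        raise ValueError("need at least two samples")
    T = n_samples - 1
    return [1.0 - (i / T) ** 2 for i in range(n_samples)]


lin = linear_rul_labels(5)
quad = quadratic_rul_labels(5)
print(lin)    # [1.0, 0.75, 0.5, 0.25, 0.0]
print(quad)   # [1.0, 0.9375, 0.75, 0.4375, 0.0]
```

Under this assumed form the label drops by only 0.0625 over the first interval but by 0.4375 over the last one, so no explicit degradation start point has to be chosen.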


Evaluation Indicators
To quantitatively evaluate the prediction effect, this paper uses the root mean square error (RMSE) and the mean absolute error (MAE) between the predicted and true RUL values as evaluation indicators. The smaller these two indicators, the smaller the difference between the predicted and true values, and the higher the prediction accuracy. RMSE and MAE are calculated as follows.

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}$$

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|$$

where $y_i$ denotes the true remaining life of the rolling bearing, $\hat{y}_i$ denotes the predicted remaining life of the rolling bearing, and $m$ is the number of samples.
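The two indicators can be computed directly from paired true and predicted RUL values. The following is a small standard-library sketch; the function names are illustrative and the example values are synthetic, not results from the paper.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error over m paired samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / m)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error over m paired samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / m


# Synthetic example: true labels versus slightly off predictions.
true_rul = [1.0, 0.75, 0.0]
pred_rul = [0.9, 0.80, 0.1]
print(rmse(true_rul, pred_rul), mae(true_rul, pred_rul))
```

Because RMSE squares the errors before averaging, it penalizes occasional large deviations more heavily than MAE, which is why the two are usually reported together.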

Test Results
To verify the effectiveness of the proposed method, experiments are carried out on the data from working condition 1 of the PHM 2012 bearing dataset: bearing 1-1 is used as the training set, bearing 1-2 as the validation set, and the other bearings under working condition 1 as the test set. The network structure of the proposed RUL prediction method is shown in Table 2. The hyperparameters of the network model are determined through repeated experiments. Specifically, the batch size is set to 64, the number of iterations to 50, and the learning rate to 0.001, and Adam is selected as the optimizer. Since the proposed method is a supervised regression method, the mean square error (MSE) is selected as the loss function, calculated as follows.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

where $n$ is the number of samples, $y_i$ denotes the real life of the bearing, and $\hat{y}_i$ denotes the predicted life of the bearing.

The hyperparameters of the network play an important role in training, so a reasonable selection can improve the overall RUL prediction effect. First, the batch size is set to 64 according to the device configuration. Second, to verify whether the network has converged, the training loss curve is visualized in Figure 10. Figure 10 shows that the loss dropped below 0.005 after 10 iterations; training has therefore converged well within the budget, and setting the number of iterations to 50 is reasonable.

The main parameters that affect the prediction performance of the network model are the convolutional kernel size and the learning rate. The convolutional kernel extracts the bearing degradation features: too large a kernel loses local information, while too small a kernel cannot capture global features.
Therefore, convolutional kernel sizes of 3, 5, 7, 9, and 11 are tried in turn while the other parameters are kept unchanged. Similarly, the learning rate is a key parameter of the optimizer: too small a learning rate greatly increases the training time, while too large a learning rate causes the training process to fluctuate strongly, which is not conducive to model convergence. Learning rates of 0.05, 0.01, 0.005, 0.001, and 0.0005 are therefore tried as alternatives, again with the remaining parameters kept constant. The evaluation metric is the average RMSE over the five test bearings. Figure 11 shows the search process for the learning rate and the convolutional kernel size, where "Lr" denotes the learning rate and "Ks" denotes the convolutional kernel size. The average RMSE is smallest when the convolutional kernel size is 5, so the kernel size is set to 5. Likewise, the prediction effect is best when the learning rate is 0.001, so the learning rate is set to 0.001.
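The search procedure described here, varying one hyperparameter at a time while holding the others fixed, can be sketched as follows. This is an illustrative outline and not the authors' code: `sweep_one_at_a_time`, `BASELINE`, and `placeholder_score` are hypothetical names, the baseline values are assumed, and `placeholder_score` is a synthetic stand-in for "train the network and return the average RMSE over the five test bearings". It is shaped to have its minimum at the values reported in the text purely so the sketch runs, and it carries no experimental meaning.

```python
from typing import Callable, Dict, Sequence


def sweep_one_at_a_time(
    evaluate: Callable[[Dict[str, float]], float],
    baseline: Dict[str, float],
    candidates: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Vary one hyperparameter at a time, keeping the others at baseline.

    `evaluate` returns the score to minimize (here: average test RMSE).
    Returns the best candidate value found for each hyperparameter.
    """
    best = {}
    for name, values in candidates.items():
        scores = {}
        for value in values:
            config = dict(baseline)  # copy so the baseline is never mutated
            config[name] = value
            scores[value] = evaluate(config)
        best[name] = min(scores, key=scores.get)
    return best


# Candidate values listed in the text.
CANDIDATES = {
    "kernel_size": [3, 5, 7, 9, 11],
    "learning_rate": [0.05, 0.01, 0.005, 0.001, 0.0005],
}
# Hypothetical starting configuration (an assumption, not from the paper).
BASELINE = {"kernel_size": 3, "learning_rate": 0.01}


def placeholder_score(config: Dict[str, float]) -> float:
    """Synthetic stand-in for training plus evaluation; NOT real results."""
    return abs(config["kernel_size"] - 5) + abs(config["learning_rate"] - 0.001)


print(sweep_one_at_a_time(placeholder_score, BASELINE, CANDIDATES))
```

A one-at-a-time sweep needs only 10 training runs for the two candidate lists above, against 25 for a full grid, at the cost of ignoring any interaction between kernel size and learning rate.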
Figure 11. Learning rate and convolutional kernel size search process.
The RUL prediction results of the proposed method on training bearing 1−1 are shown in Figure 12a. As can be observed from Figure 12a, the method fits the training set well, indicating that the model has learned the degradation information contained in the training set. Moreover, with the new labels, the prediction is good in both the early and late stages, which also validates the effectiveness of the proposed method. The RUL prediction results on test bearing 1−3 are shown in Figure 12b. As can be observed from Figure 12b, the proposed method fits the degradation trend of the bearing very well, shows good monotonicity and prediction accuracy, and closely matches the true remaining life at the final moment before failure.

Comparison Test
To verify the effectiveness and superiority of the proposed method, the residual network (ResNet), CNN-LSTM, and the temporal convolutional network (TCN) are selected for comparison tests. ResNet uses residual connections, which ease the training degradation that arises as network depth increases. The CNN-LSTM model can extract both spatial and temporal features and is widely used in RUL prediction. The TCN model can capture long-range dependencies and achieves good results in time series prediction. The training parameters of the comparison methods are kept consistent with those of the proposed model. The RUL prediction results of each method on test bearing 1−3 are shown in Figure 13, from which it can be seen that the curve predicted by the proposed method fits the real bearing life curve best. This indicates that the overall prediction effect of the proposed method is better than that of the comparison methods.

RMSE and MAE are used to evaluate the prediction performance of each method, and the values for each prediction method are shown in Table 3. Table 3 shows that the proposed RUL prediction method achieves the best values of both indicators on all five test bearings, which further reflects its effectiveness and superiority. This is because the proposed method not only adopts dilated convolution with a wider receptive field, but also uses the attention mechanism to weight features by importance, which suppresses the interference of useless features and enhances the utilization of effective ones. In summary, the proposed method predicts better than the compared methods. In addition, Table 3 shows that although the proposed method outperforms the other models on test bearings 1−5, 1−6, and 1−7, its results on these bearings are all far inferior to its result on bearing 1−3.
To analyze the cause, the root mean square (RMS) indicator of the raw signal of each test bearing is extracted, as shown in Figure 14. As can be observed from Figure 14, the RMS of the vibration signals of bearings 1−5, 1−6, and 1−7 changes steeply, so these bearings belong to the sudden failure type, whereas the RMS of the vibration signal of bearing 1−3 changes more gradually, so bearing 1−3 belongs to the gradual failure type. This difference in failure type explains the large difference in prediction effect. In future work, transfer learning can be introduced into the proposed method to reduce the difference in data distribution between bearings with different failure modes, thus improving the prediction accuracy.
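The RMS indicator used in this analysis is simple to compute: each recorded vibration snapshot is reduced to a single value, and the sequence of values over the bearing's life gives the trend that is plotted. The sketch below assumes each snapshot is available as a list of acceleration samples; the function names and the sample values are illustrative, not taken from the dataset.

```python
import math
from typing import List, Sequence


def rms(snapshot: Sequence[float]) -> float:
    """Root mean square of one vibration snapshot."""
    if len(snapshot) == 0:
        raise ValueError("snapshot must not be empty")
    return math.sqrt(sum(x * x for x in snapshot) / len(snapshot))


def rms_trend(snapshots: Sequence[Sequence[float]]) -> List[float]:
    """RMS of every snapshot in recording order: one value per time step."""
    return [rms(s) for s in snapshots]


# Synthetic snapshots with growing amplitude, standing in for a degrading bearing.
snapshots = [[1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0], [4.0, -4.0, 4.0, -4.0]]
print(rms_trend(snapshots))
```

A trend that stays flat and then jumps within a few snapshots corresponds to the sudden failure type described above, while a steady rise corresponds to the gradual type.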

Conclusions
In this paper, a method for predicting the remaining life of rolling bearings based on multi-scale feature extraction and an attention mechanism is proposed. Firstly, the normalized vibration signal of the bearing is used as the network input, so that features are extracted directly from the original data, reducing the loss of degradation features. Secondly, quadratic function labels are constructed for the dataset, which avoids identifying the starting point of the bearing degradation stage. Thirdly, the spatial and temporal features of the bearing vibration signals are extracted using a dilated convolutional neural network and a long short-term memory (LSTM) network, respectively. Finally, a channel attention mechanism assigns importance weights to the extracted degradation features, and a fully connected layer maps the bearing degradation features to the remaining life labels. The effectiveness and superiority of the proposed method are verified on the PHM 2012 bearing dataset, and the tests show that it achieves better prediction results than the compared methods.
Author Contributions: Conceptualization, C.J. and M.X.; methodology, X.L.; software, X.L.; validation, X.L., M.X. and C.L.; formal analysis, C.L.; investigation, Y.L.; resources, C.J.; data curation, Q.W. and X.L.; writing (original draft preparation), X.L.; writing (review and editing), C.J. and M.X.; funding acquisition, C.J., Y.L. and C.L. All authors have read and agreed to the published version of the manuscript.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study. Written informed consent has been obtained from the patient(s) to publish this paper.