Rolling Bearing Health Indicator Extraction and RUL Prediction Based on Multi-Scale Convolutional Autoencoder

Abstract: Rolling bearings are some of the most crucial components in rotating machinery systems. Rolling bearing failure may cause substantial economic losses and even endanger operator lives. Therefore, the accurate remaining useful life (RUL) prediction of rolling bearings is of tremendous research importance. Health indicator (HI) construction is the critical step in the data-driven RUL prediction approach. However, existing HI construction methods often require extraction of time-frequency domain features using prior knowledge while artificially determining the failure threshold and do not make full use of sensor information. To address the above issues, this paper proposes an end-to-end HI construction method called a multi-scale convolutional autoencoder (MSCAE) and uses LSTM neural networks for RUL prediction. MSCAE consists of three convolutional autoencoders with different convolutional kernel sizes in parallel, which can fully exploit the global and local information of the vibration signals. First, the raw vibration data and labels are input into MSCAE, and then, MSCAE is trained by minimizing the composite loss function. After that, the vibration data of the test bearings are fed into the trained MSCAE to extract HI. Finally, RUL prediction is performed using the LSTM neural network. The superiority of the HI extracted by MSCAE was verified using the PHM2012 challenge dataset. Compared to state-of-the-art HI construction methods, RUL prediction using MSCAE-extracted HI has the highest prediction accuracy.


Introduction
Rolling bearings are the joints of rotating machinery and play an essential role in industrial production, intelligent manufacturing, and transportation [1,2]. Bearing mounting methods, lubricant quality, operating environment, load, and speed can all affect bearing vibration and life [3][4][5]. Once a rolling bearing fails, the machinery stops working, causing substantial economic loss or even threatening the operator's life [6]. Therefore, timely analysis of bearing working conditions and prediction of the remaining useful life (RUL) are of great research importance [7]. Prognostics and Health Management (PHM) refers to the use of sensor monitoring data to predict, monitor, and manage the health status of a system through models and algorithms [8][9][10][11][12]. RUL prediction is a crucial technique in PHM, and the RUL is defined as the time interval between the current moment and the moment when a system or internal component fails [10,11].
Currently, RUL prediction methods are mainly divided into model-based and data-driven methods [13]. The model-based method uses a mathematical approach to construct a physical model of the degradation trend of mechanical components [14]. However, modern mechanical systems have dramatically increased in complexity, have highly coupled internal components, and often operate in severe environments with heavy loads, variable operating conditions, and multiple noise levels [15]. Therefore, building accurate degradation models is a challenging task, limiting the development of model-based methods. The data-driven method uses machine learning algorithms to construct a mapping between the monitoring data from the sensors and the RUL [16]. Deep learning is currently one of the most active research directions in machine learning. With its powerful data processing and feature extraction capabilities, deep learning is widely used in computer vision, natural language processing, and fault diagnosis [17], and more and more researchers are applying it to RUL prediction. Wang et al. [18] proposed a multi-scale convolutional attention network. The multi-sensor data are first fused, and then feature extraction is performed using a multi-scale convolution module with a self-attention mechanism. Finally, regression analysis is performed on the high-level representation using a dynamic dense layer. The validity of the model was verified using milling tool life data. Ma et al. [19] proposed a convolutional long short-term memory model for bearing RUL prediction. Unlike the traditional method that directly connects the convolutional neural network (CNN) and LSTM, this method performs convolutional operations on all the state transitions in the LSTM. Cao et al. [20] proposed a temporal convolution prediction framework with a residual self-attention mechanism. The marginal spectrum of the vibration signal is first extracted and used as the input of the temporal convolution network, and the residual self-attention mechanism is then introduced to achieve end-to-end remaining lifetime prediction. The validity of the proposed method was verified with the PHM2012 challenge dataset and the XJTU-SY dataset. Yao et al. [21] proposed a prediction method that combines a one-dimensional convolutional neural network with simple recurrent units.
The extraction of health indicators (HI) is a crucial step in the RUL prediction process. HI mainly reflect the degradation of mechanical equipment. Therefore, the quality of HI will directly affect the RUL prediction accuracy. Qiu et al. [22] used self-organizing mapping (SOM) to fuse the extracted features to construct the HI of rolling bearings. Qin et al. [23] used the root mean square (RMS) error as the HI of a rolling bearing and predicted the future HI sequence using a gated dual attention unit neural network to determine the magnitude of the rolling bearing RUL. Chen et al. [24] proposed a simple HI construction scheme that uses the ratio of the current moment's RUL value to the initial RUL as the HI, reducing the need for expensive prior knowledge. Five bandpass energy values of the spectrum are used as features. The mapping relationship between the features and HI is constructed directly using a circular autoencoder (AE) based on the attention mechanism. Finally, the value of the RUL is calculated using linear regression. Guo et al. [25] used CNN to extract the bearings' HI and used outlier region correction techniques to detect and correct the outlier points in the obtained HI.
There are three main problems with the above methods. (1) In the HI construction process, it is often necessary to analyze the vibration signals in the time-frequency domain and extract the features in the time domain, frequency domain or time-frequency domain as the input to the model. However, different combinations of features may produce different results. In the literature [26,27], 14 specially designed features and 14 commonly used features were used for HI construction, respectively, and the trend, monotonicity, and scale similarity of the two HI differed. This approach requires a large amount of expert prior knowledge and does not allow automatic HI extraction. (2) The temporal information of the vibration signals is not fully utilized, and usually, only a single size time window is used for feature extraction of the vibration signals, which cannot utilize the local and the global information comprehensively. The bearing vibration signal is a vector containing time information, and the lack of adequate consideration of the time scale can affect the effectiveness of HI. (3) Usually, when using HI for RUL prediction, the failure threshold of HI needs to be determined. In the literature [23], the RMS maximum value of 1.903 at bearing failure was chosen as the failure threshold. In the literature [28], the failure threshold for HI was determined to be 0.17 by calculating the average of the last five local minimum HI points of the tested motor. These methods often carry a particular element of subjectivity and conjecture, leading to errors in RUL predictions.
To solve the problems mentioned above, a novel deep learning framework, called a multi-scale convolutional autoencoder (MSCAE), is proposed in this paper for the automatic extraction of rolling bearing HI. MSCAE is obtained by fusing multiple convolutional autoencoders with different convolutional kernel sizes. Convolutional autoencoders with small convolutional kernels can extract locally degraded features. In contrast, convolutional autoencoders with large convolutional kernels can extract features that reflect the degradation trend of the whole sequence. Combining both advantages solves the problem of underutilization of temporal information. Meanwhile, constructing labels using a quadratic function and training MSCAE with a composite loss function realize end-to-end, automatic HI extraction. After obtaining the HI from MSCAE, the LSTM is used for bearing RUL prediction. Unlike other methods, the HI obtained by the proposed method does not require the additional step of failure threshold determination, and the failure threshold is directly set to 0. The main contributions of this paper are listed below.
(1) Compared with traditional HI construction methods, the proposed MSCAE can effectively use local and global temporal information to obtain a more robust HI and, at the same time, realize automatic HI extraction.
(2) The HI obtained by MSCAE is compared with single-scale convolutional autoencoders with different convolutional kernel sizes, and the superiority of the proposed method is verified by a comprehensive evaluation index consisting of monotonicity, correlation and robustness.
(3) The HI obtained by MSCAE does not require artificial determination of the failure threshold, which is directly set to 0, and can be directly used for RUL prediction.
The remaining parts of this paper are organized as follows. Section 2 introduces the relevant theoretical knowledge covered in this paper. Section 3 presents the proposed HI construction method. Section 4 conducts relevant experiments and comparisons to verify the validity of the proposed method. Section 5 concludes this paper.

AE
AE was first proposed in the literature [29] as an unsupervised learning method and is often used for feature dimensionality reduction. AE consists of two parts: the encoder and the decoder. The encoder reduces the dimensionality of the input signals and extracts the high-level representations. The decoder takes the output results of the encoder as the input and reconstructs the input signals. AE uses a back-propagation algorithm to update the internal parameters to minimize the reconstruction error. The structure of a typical AE is shown in Figure 1.
Assume that the input of the encoder is $x = [x_1, x_2, \ldots, x_l]$, where $l$ is the length of the input data. Then, the output $s$ of the encoder can be expressed as:
$$s = f_e(W_e x + b_e)$$
where $W_e$ and $b_e$ denote the weight and bias, respectively, and $f_e$ denotes the encoder activation function. The decoder output $\hat{x} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_l]$ can be expressed as:
$$\hat{x} = f_d(W_d s + b_d)$$
where $W_d$ and $b_d$ denote the weight and bias, respectively, and $f_d$ denotes the decoder activation function. The autoencoder optimizes its internal parameters by minimizing the reconstruction error $L_{AE}$ between the input $x$ and its reconstruction $\hat{x}$.
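The encoder and decoder mappings described above can be illustrated with a small forward pass. This is a minimal sketch under stated assumptions, not the configuration used in this paper: the weights are arbitrary toy values, $f_e$ is taken to be a sigmoid, $f_d$ is taken to be linear, and the reconstruction error is assumed to be the mean squared error.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(weights, bias, vec, activation):
    # Computes activation(W @ vec + b) with plain Python lists.
    return [activation(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

def autoencoder_forward(x, W_e, b_e, W_d, b_d):
    s = dense(W_e, b_e, x, sigmoid)           # encoder: s = f_e(W_e x + b_e)
    x_hat = dense(W_d, b_d, s, lambda z: z)   # decoder: x_hat = f_d(W_d s + b_d), linear f_d (assumed)
    return s, x_hat

def reconstruction_error(x, x_hat):
    # Mean squared error between input and reconstruction (assumed form of L_AE).
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

# Toy example: input of length l = 4 compressed to a code of length 2.
x = [0.5, -0.2, 0.1, 0.9]
W_e = [[0.1, 0.2, -0.1, 0.3], [-0.2, 0.1, 0.4, 0.0]]
b_e = [0.0, 0.1]
W_d = [[0.5, -0.3], [0.2, 0.1], [-0.4, 0.6], [0.3, 0.3]]
b_d = [0.0, 0.0, 0.0, 0.0]

s, x_hat = autoencoder_forward(x, W_e, b_e, W_d, b_d)
print(len(s), len(x_hat), reconstruction_error(x, x_hat))
```

Training would adjust `W_e`, `b_e`, `W_d` and `b_d` by back-propagation so that the printed error decreases; that step is omitted here.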

LSTM Neural Network
Data-driven methods for predicting the bearing RUL are usually based on monitoring data collected by sensors, such as temperature and vibration. These data have a strong time dependence. The recurrent neural network (RNN) takes sequence data as input and performs its operations along the direction in which the sequence evolves. Therefore, the RNN can model time series and is well suited to RUL prediction problems. However, the traditional RNN suffers from the long-term dependency problem: when training on long sequences, vanishing or exploding gradients may occur, resulting in ineffective training of the model. To solve this problem, Hochreiter and Schmidhuber proposed the LSTM model [30]. The structure of LSTM is shown in Figure 2. There are two information pathways in LSTM: one stores long-term memory, and the other processes short-term information and adds valid information to the first pathway. In addition, three gating units, the input gate, forget gate and output gate, are added to the LSTM to control the information flow of input data in the LSTM unit. In its standard form, the LSTM can be expressed as:
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$
where $x_t$ represents the input at moment $t$; $h_t$ and $c_t$ represent the hidden state and the cell state at moment $t$, respectively; $f_t$, $i_t$ and $o_t$ are the outputs of the forget gate, input gate and output gate, respectively; $\tilde{c}_t$ is the candidate cell state; $W$ and $b$ denote the weights and biases of the corresponding gates; $\sigma$ is the Sigmoid function; and $\odot$ denotes element-wise multiplication.
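To make the role of the three gates concrete, the following sketch runs the standard LSTM recurrence for a single hidden unit with scalar input. It is an illustration only: the parameter layout (`params` mapping a gate name to a `(w_h, w_x, b)` tuple) and the toy values are assumptions, not the settings used in this paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a single-unit LSTM; params maps gate name -> (w_h, w_x, b)."""
    def pre_activation(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre_activation("f"))        # forget gate
    i_t = sigmoid(pre_activation("i"))        # input gate
    c_tilde = math.tanh(pre_activation("c"))  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # long-term memory pathway
    o_t = sigmoid(pre_activation("o"))        # output gate
    h_t = o_t * math.tanh(c_t)                # short-term (hidden) pathway
    return h_t, c_t

# Toy parameters, identical for every gate (hypothetical values).
params = {gate: (0.5, 0.5, 0.0) for gate in "fico"}

h, c = 0.0, 0.0
for x_t in [1.0, 0.8, 0.6]:   # a short decreasing sequence
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

Because `c_t` is updated additively and scaled by the forget gate rather than repeatedly passed through a squashing function, gradients along the cell-state pathway decay more slowly than in a plain RNN.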

MSCAE
Based on the AE introduced in Section 2.1, the structure of the convolutional autoencoder is obtained by replacing the matrix operations in it with convolutional operations. Usually, features are extracted using convolutional kernels of the same size in the convolutional autoencoder. The literature [18] verified the effectiveness of the multi-scale convolutional network. Compared to traditional CNN, the use of parallel convolutional paths with different convolutional kernel sizes enables multi-scale learning, allowing the model to extract features from different time scales, ensuring the integrity of the representations and making full use of both global and local information. In this paper, we adopt the idea of multi-scale convolution to construct MSCAE to extract a more robust HI based on the full utilization of global and local degradation information, and the automatic extraction of HI can be realized. The structure of the proposed MSCAE is shown in Figure 3.
As shown in Figure 3, the proposed MSCAE uses three convolutional paths for feature extraction of the input data. It is worth noting that the three convolutional paths use one-dimensional convolution with different convolutional kernel sizes, aiming to make full use of the global and local information of the input data for more effective feature extraction.
In the encoding stage, three pathways perform parallel convolution and pooling operations on the input data. Suppose $E^{i}_{n,m}$ denotes the $n$-th channel of the convolutional layer data in the $m$-th encode block of the $i$-th ($i = 1, 2, 3$) pathway and $N^{e}_{m}$ is the number of convolutional layer channels in the $m$-th encode block. Then, the one-dimensional convolution operation can be expressed as:
$$Z^{i,e}_{k,m+1} = f_r\left(\sum_{n=1}^{N^{e}_{m}} w^{i,e}_{k,n,m} * E^{i}_{n,m} + b^{i,e}_{m}\right)$$
where $*$ is the one-dimensional convolution operation, $w^{i,e}_{k,n,m}$ denotes the weight of the $k$-th convolutional kernel of the convolutional layer in the $m$-th encode block, $b^{i,e}_{m}$ is the bias, $f_r$ is the ReLU activation function, and $Z^{i,e}_{k,m+1}$ denotes the $k$-th channel of the convolution result in the $m$-th encode block of the $i$-th pathway.
After the convolution operation, the convolution result is downsampled using maximum pooling to reduce the size of the data. The pooling result $I^{i}_{k,m+1}$ for the $k$-th channel of the pooling layer in the $m$-th encode block of the $i$-th pathway can be expressed as:
$$I^{i}_{k,m+1} = P\left(Z^{i,e}_{k,m+1},\, p_m,\, s_m\right)$$
where $P$ denotes the maximum pooling operation, $p_m$ is the size of the pooling window in the $m$-th encode block, and $s_m$ is the stride. After features have been extracted from the input data by $L$ encode blocks, the Flatten layer turns the high-level representations into one-dimensional data, which are input to the fully connected block for HI extraction. It is worth noting that the fully connected block of the proposed MSCAE uses only one fully connected layer in each of its encoding and decoding parts, and the number of hidden neurons is the same in both. The extracted HI $\hat{y}$ is obtained by applying the encoding-part weight $W^{f}_{e}$ and bias $b^{f}_{e}$ to the flattened representation, followed by the Sigmoid activation function $\sigma$.
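The parallel convolution and pooling described for the encoding stage can be sketched as follows. This is a simplified single-channel illustration under stated assumptions: each pathway uses a fixed averaging kernel instead of learned weights, zero padding keeps the output the same length as the input, and the convolution is implemented as cross-correlation, as is common in deep learning libraries. The kernel sizes 3, 7 and 11 and the pooling size 8 follow the values reported later in this paper.

```python
def conv1d_same(signal, kernel):
    # Zero-padded 1D cross-correlation that keeps the input length (odd kernel sizes).
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def relu(values):
    return [max(0.0, v) for v in values]

def max_pool(values, size, stride):
    # Maximum over non-overlapping (when stride == size) windows.
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, stride)]

def multi_scale_features(signal, kernel_sizes=(3, 7, 11), pool=8):
    features = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k   # hypothetical fixed kernel; a trained model learns these weights
        features.append(max_pool(relu(conv1d_same(signal, kernel)), pool, pool))
    return features

# A constant toy signal of 64 samples gives three feature maps of length 64 / 8 = 8.
signal = [1.0] * 64
features = multi_scale_features(signal)
print([len(f) for f in features])
```

Small kernels respond to short local patterns, while larger kernels average over a longer time span, which is the motivation for combining the three pathways.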
In the decoding stage, there are also three parallel paths. The decoding part of the fully connected block performs a dimensional expansion operation on the extracted HI, and the result can be expressed as:
$$X_d = f_r\left(W^{f}_{d}\, \hat{y} + b^{f}_{d}\right)$$
where $X_d$ denotes the output of the decoding part of the fully connected block, $W^{f}_{d}$ and $b^{f}_{d}$ denote the weight and bias of the decoding part of the fully connected block, respectively, and $f_r$ is the ReLU activation function. After $X_d$ is obtained, its dimensions are changed by the Reshape layer so that its shape is the same as that of the input to the Flatten layer. $L$ decoding blocks, i.e., three parallel upsampling and convolution pathways, then reconstruct the input from the output of the Reshape layer. Suppose $D^{i}_{n,m}$ denotes the $n$-th channel of the input to the upsampling layer in the $m$-th decoding block of the $i$-th ($i = 1, 2, 3$) pathway. Then, the operation of the upsampling layer can be expressed as:
$$Z^{i,d}_{n,m} = U\left(D^{i}_{n,m},\, u_m,\, l_m\right)$$
where $Z^{i,d}_{n,m}$ is the result of the upsampling layer, $U$ denotes the upsampling operation, and $u_m$ and $l_m$ represent the upsampling window size and stride of the $m$-th decoding block, respectively. In a decoding block, the convolution layer is connected after the upsampling layer, and the convolution result of the $m$-th decoding block can be expressed as:
$$D^{i}_{k,m+1} = f_r\left(\sum_{n=1}^{N^{d}_{m}} w^{i,d}_{k,n,m} * Z^{i,d}_{n,m} + b^{i,d}_{m}\right)$$
where $D^{i}_{k,m+1}$ is the $k$-th channel of the output of the $m$-th decoding block, $N^{d}_{m}$ denotes the total number of channels of the input data of the $m$-th decoding block, $w^{i,d}_{k,n,m}$ denotes the weight of the $k$-th convolutional kernel in the $m$-th decoding block, $b^{i,d}_{m}$ is the bias, and $f_r$ is the ReLU activation function. The reconstruction of the input data is obtained after processing by $L$ decoding blocks.
Assume that the $i$-th input of MSCAE is $x_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,l}]$ and that the corresponding real RUL label is $y_i$, $i = 1, 2, \ldots, B$, where $l$ is the length of the input data and $B$ is the total life of the bearing. Then, the HI extracted by MSCAE is $\hat{y}_i$, and the reconstruction of the input data is $\hat{x}_i = [\hat{x}_{i,1}, \hat{x}_{i,2}, \ldots, \hat{x}_{i,l}]$. In this paper, a composite loss function is used to evaluate the HI extraction capability of MSCAE. The composite loss consists of two parts: the error between the input data $x_i$ and the reconstructed data $\hat{x}_i$, and the error between the HI $\hat{y}_i$ and the true RUL label $y_i$. In the composite loss function, $\theta$ denotes the internal parameters of MSCAE and $\nu$ is a scaling factor that adjusts the weight between the two errors.
Compared with the traditional AE, the composite loss function of MSCAE combines supervised learning with unsupervised learning for model training, which fully utilizes the degradation information of the bearings and enhances the HI extraction capability of the model. In the training phase, the vibration data and labels of the training bearings are used to optimize the internal parameters $\theta$ with the back-propagation algorithm so as to minimize the composite loss. In the testing phase, the vibration data of the testing bearing serve as the input, and the HI is obtained using the encoder part of the trained MSCAE.
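The two-part structure of the composite loss can be sketched as below. The exact expression is an assumption made for illustration: both errors are taken to be mean squared errors, and the scaling factor `nu` is assumed to multiply the reconstruction term. Only the existence of the two terms and of a scaling factor between them comes from the text.

```python
def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def composite_loss(x_batch, x_hat_batch, y, y_hat, nu):
    """Supervised label error plus nu times the unsupervised reconstruction error.

    The placement of nu and the use of mean squared error are assumptions.
    """
    reconstruction = sum(mse(x, x_hat)
                         for x, x_hat in zip(x_batch, x_hat_batch)) / len(x_batch)
    label = mse(y, y_hat)
    return label + nu * reconstruction

# Two toy samples of length 2 with their reconstructions, labels and extracted HI.
x_batch = [[1.0, 2.0], [0.0, 0.0]]
x_hat_batch = [[1.0, 1.0], [0.0, 2.0]]
y = [1.0, 0.5]
y_hat = [0.5, 0.5]

loss = composite_loss(x_batch, x_hat_batch, y, y_hat, nu=0.6)
print(loss)
```

With `nu` set to zero the model would be trained as a purely supervised regressor; larger values give more weight to reproducing the raw vibration signal.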

Construction Method of Degradation Labels
After obtaining the vibration data of the bearings, it is necessary to divide them into training and testing sets. Since the obtained original data do not have corresponding labels, it is necessary to construct the degradation labels corresponding to the vibration data. There are mainly two conventional methods for constructing degradation labels: linear degradation and segmental smoothing, and Figure 4 shows the differences between the two methods.
The degradation label construction equation of Figure 4a can be expressed as:
$$y_i = 1 - \frac{t_i}{B}$$
where $B$ denotes the total life of the bearing, $t_i$ denotes the time the bearing has been in operation, and $y_i$ denotes the current degradation level of the bearing. The degradation label construction equation of Figure 4b can be expressed as:
$$y_i = \begin{cases} 1, & t_i \le t_h \\ \dfrac{B - t_i}{B - t_h}, & t_i > t_h \end{cases}$$
where $t_h$ indicates the degradation starting threshold. When $t_i$ is less than or equal to $t_h$, the bearing is considered not to have started to degrade, and the degradation label is always 1. When $t_i$ exceeds $t_h$, the label shows a linear degradation trend.
These two methods of describing bearing degradation trends have their own limitations compared to the true degradation of the bearing. The true bearing degradation is not linear; rather, the bearing degrades faster and faster as its operating time increases. Therefore, the linear degradation method does not match this operating characteristic. For the segmented smoothing method, the threshold $t_h$ at the degradation onset moment is usually determined artificially. However, $t_h$ usually differs between operating conditions, and different bearings have different $t_h$ even under the same operating conditions, which limits the application of the segmented smoothing method. The literature [31] proposed a method for constructing the degradation trend based on a quadratic function of the operating time, as shown in Figure 4c. The quadratic function-based method overcomes the shortcomings of the above two methods, and its labels better match the real degradation of the bearings. As the operating time increases, the degradation of the bearings gradually accelerates, which appears in Figure 4c as a gradually steepening curve. Moreover, it is a convenient and effective method that does not require artificially specifying the degradation threshold, so this paper adopts the quadratic function method to construct the degradation labels of bearings.
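The three labelling schemes can be compared with a short sketch. The linear and piecewise forms follow directly from the description above. The quadratic form `1 - (t / B) ** 2` is an assumption: the text only states that the label is a quadratic function that starts at 1, reaches the failure threshold 0 at the end of life, and degrades faster over time.

```python
def linear_label(t, B):
    # Label decreases uniformly from 1 at t = 0 to 0 at t = B.
    return 1.0 - t / B

def piecewise_label(t, B, t_h):
    # Label stays at 1 until the degradation onset t_h, then decreases linearly to 0.
    return 1.0 if t <= t_h else (B - t) / (B - t_h)

def quadratic_label(t, B):
    # Assumed quadratic form: slow initial decrease, accelerating towards failure.
    return 1.0 - (t / B) ** 2

B = 100      # total life in sampling steps (toy value)
t_h = 40     # hypothetical degradation onset for the piecewise scheme
for t in (0, 25, 50, 75, 100):
    print(t, linear_label(t, B), piecewise_label(t, B, t_h), quadratic_label(t, B))
```

With these toy values, the quadratic label loses 0.25 over the first half of life and 0.75 over the second half, which reflects the accelerating degradation without requiring a threshold such as `t_h`.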

Overall Framework Flow
The overall framework flow of MSCAE-based HI extraction and RUL prediction proposed in this paper is shown in Figure 5. The framework is divided into two main phases, offline training and online testing. There are three main steps in the offline training phase, and the process is as follows.

1. The vibration data of the training bearing from operation to failure are obtained, which are defined as $X = [x_1, x_2, \ldots, x_B]$, where $B$ is the total life of the bearing.

2. The degradation labels for each moment of the vibration data are calculated using Equation (19), which gives the labels $y = [y_1, y_2, \ldots, y_B]$.

3. The proposed MSCAE model is trained. First, the vibration data $X$ of the training bearing are input to the model, and the output of the MSCAE encoding part is the extracted HI, which is denoted as $\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_B]$. After that, the HI is used as the input to the decoding part of MSCAE to obtain the reconstructed data $\hat{X} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_B]$ of the vibration data $X$. Finally, the composite loss function is calculated using Equation (16), and the internal parameters $\theta$ of the model are updated using the back-propagation algorithm.

After the MSCAE is trained offline, the framework moves to the online testing phase.

4. The vibration data of the testing bearing from operation to failure are obtained, which are defined as $X = [x_1, x_2, \ldots, x_T]$, where $T$ is the total operating time of the testing bearing.

5. The vibration signal $X$ of the testing bearing is input to the encoder of the MSCAE trained in step 3 to obtain the HI $H = [h_1, h_2, \ldots, h_T]$ of the testing bearing.

6. After obtaining the HI of the testing bearing, the first $N$ points of $H$ are taken to construct the training matrix $V$ for the LSTM model, where $M$ is the number of neurons in the LSTM output layer. Each time the LSTM predicts new HI values, the matrix $V$ is updated with them, so that HI vectors can be predicted continuously and $V$ can be updated continuously. At the $k$-th iteration, the vector used is $\hat{\nu}_k = [h_k, \ldots, h_N, \hat{h}_{N+1}, \ldots, \hat{h}_{k+N-M-1}]$, which combines known HI values with previously predicted values (marked with a hat). When $\hat{h}_{k+N-M-1}$ is less than the threshold 0, the prediction stops, and the RUL of the bearing is obtained as $(k - M) \times T_s$, where $T_s$ is the sampling interval of the vibration sensor. If the prediction result is not less than the threshold, the iterative prediction continues until the predicted value is less than the threshold and the corresponding RUL is obtained.
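The iterative prediction in step 6 can be sketched as a loop that feeds each predicted HI value back into the input window until the failure threshold is crossed. This is a simplified illustration: `predictor` is a hypothetical stand-in for the trained LSTM, a single value is predicted per step, and the RUL is counted as the number of prediction steps times the sampling interval, which is a simpler bookkeeping than the $(k - M) \times T_s$ indexing in the text.

```python
from typing import Callable, List, Optional, Sequence

def predict_rul(known_hi: Sequence[float],
                predictor: Callable[[List[float]], float],
                window: int,
                sampling_interval: float,
                threshold: float = 0.0,
                max_steps: int = 10_000) -> Optional[float]:
    """Roll the predictor forward until the predicted HI drops below the threshold."""
    history = list(known_hi)
    for step in range(1, max_steps + 1):
        next_hi = predictor(history[-window:])   # window mixes known and predicted values
        history.append(next_hi)
        if next_hi < threshold:
            return step * sampling_interval
    return None   # the prediction never reached the threshold

# Toy stand-in for the LSTM: the HI keeps decreasing by a fixed amount per step.
def toy_predictor(window_values: List[float]) -> float:
    return window_values[-1] - 0.125

rul = predict_rul([0.5, 0.375, 0.25], toy_predictor, window=2, sampling_interval=10.0)
print(rul)
```

Returning `None` when the threshold is never reached makes the non-convergent case explicit instead of looping indefinitely.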

Data Introduction
The experimental data in this paper were obtained from the PHM Challenge [32] organized by the Institute of Electrical and Electronics Engineers in 2012, and the data were collected on the PRONOSTIA experimental bench, as shown in Figure 6. The experimental bench consists of three main parts: the rotating part, the degradation generation part and the sensor part. The rotating part consists of an asynchronous motor with a gearbox and two shafts to provide the working environment for the test bearing. The degradation generation part applies a radial load to the testing bearing to simulate actual operation under load and accelerate the degradation, so that the whole process from bearing operation to failure is completed in a short time. The sensor part consists of three sensors: two vibration sensors positioned 90° apart, measuring the horizontal and vertical vibrations, respectively, and a temperature sensor, measuring the temperature of the bearing during operation.
The PHM2012 challenge data give the full cycle life data of the bearings from operation to failure for three operating conditions. The three operating conditions are 1800 rpm and 4000 N; 1650 rpm and 4200 N; and 1500 rpm and 5000 N. There are 7, 7, 3 test bearings for each of the three operating conditions. The PRONOSTIA test bench has a sampling interval of 10 s, a sampling duration of 0.1 s, and a sampling frequency of 25.6 kHz, indicating that every 10 s, the sensor can collect 2560 data points.
In order to verify the effectiveness of the MSCAE proposed in this paper, the data set needs to be divided into training data and testing data. The division method in this paper is shown in Table 1. As can be seen from Table 1, all bearings under three operating conditions are used in this paper to validate the proposed method. The training set contains 5, 6, 2 bearings, and the testing set contains 2, 1, 1 bearings.

Evaluation Metrics
To verify the effectiveness of the proposed MSCAE to extract HI in this paper, three evaluation metrics, monotonicity, correlation and robustness, were used to quantify the performance of HI [31]. It is worth noting that the range of all three metrics is within [0, 1].
• Monotonicity: It assesses the tendency of the HI to increase or decrease monotonically as the running time increases. The stronger the monotonicity of the HI, the closer the metric is to 1. It is computed from dH, the first-order difference between two consecutive HI values, and T, the number of HI values (also the number of sensor samples).
• Correlation: It measures the correlation between the HI and the running time. The more correlated the two are, the closer the value of correlation is to 1, and vice versa.
• Robustness: It measures the ability of the HI to resist outlier interference; the stronger this ability, the closer the robustness is to 1, and vice versa. The extracted HI can be seen as a superposition of the average trend and noise, whereby $H$ can be expressed as:
$$H(t_i) = H_T(t_i) + H_R(t_i)$$
where $H_T(t_i)$ denotes the average trend of the HI at the moment $t_i$, and $H_R(t_i)$ denotes the noise disturbance of the HI at the moment $t_i$. The robustness is then calculated from these two components.
In order to comprehensively evaluate the extracted HI, a Composite Indicator (CI) combining the above three indicators is used.
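The three metrics can be sketched in code using forms that are common in the prognostics literature. These forms are assumptions rather than the exact formulas of this paper: monotonicity as the normalized difference between the counts of positive and negative steps, correlation as the absolute Pearson coefficient between the HI and time, and robustness as the mean of exp(-|H_R / H|). The moving-average trend and the weights passed to `composite_indicator` are also hypothetical choices.

```python
import math

def monotonicity(hi):
    diffs = [b - a for a, b in zip(hi, hi[1:])]
    positive = sum(d > 0 for d in diffs)
    negative = sum(d < 0 for d in diffs)
    return abs(positive - negative) / (len(hi) - 1)

def correlation(hi, times):
    n = len(hi)
    mean_h, mean_t = sum(hi) / n, sum(times) / n
    cov = sum((h - mean_h) * (t - mean_t) for h, t in zip(hi, times))
    var_h = sum((h - mean_h) ** 2 for h in hi)
    var_t = sum((t - mean_t) ** 2 for t in times)
    return abs(cov) / math.sqrt(var_h * var_t)

def moving_average(hi, window):
    # Simple centered moving average used as the average trend H_T (assumed smoother).
    half = window // 2
    trend = []
    for i in range(len(hi)):
        segment = hi[max(0, i - half): i + half + 1]
        trend.append(sum(segment) / len(segment))
    return trend

def robustness(hi, trend):
    # Residual H_R = H - H_T; requires nonzero HI values.
    return sum(math.exp(-abs((h - t) / h)) for h, t in zip(hi, trend)) / len(hi)

def composite_indicator(mon, corr, rob, weights):
    # Weighted sum of the three metrics; the weights are hypothetical.
    w_mon, w_corr, w_rob = weights
    return w_mon * mon + w_corr * corr + w_rob * rob

hi = [1.0, 0.8, 0.6, 0.4, 0.2]
times = [0, 1, 2, 3, 4]
mon = monotonicity(hi)
corr = correlation(hi, times)
rob = robustness(hi, moving_average(hi, 3))
print(mon, corr, rob, composite_indicator(mon, corr, rob, (1 / 3, 1 / 3, 1 / 3)))
```

All three functions return values in [0, 1], matching the range stated for the metrics, so any weights that sum to 1 keep the composite indicator in the same range.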

The Validity of MSCAE
The specific structure of the MSCAE proposed in this paper is shown in Figure 7. As can be seen from Figure 7, MSCAE consists of three encoding blocks, three decoding blocks and one fully connected block. The number of convolutional kernels in the three encoding blocks is 8, 16 and 4, the size of the convolutional kernels in the three pathways is 3 × 1, 7 × 1 and 11 × 1, respectively, and the size and stride of maximum pooling are both 8. The number of convolutional kernels in the three decoding blocks is 16, 8 and 1, the size of the convolutional kernels in the three pathways is the same as in the encoder, and the size and stride of upsampling are both 8. The Sigmoid activation function is used in the fully connected layer that outputs the HI. No activation function is used when the reconstructed data are obtained. All the remaining layers use the ReLU activation function, and each layer uses a BatchNormalization (BN) layer to improve the model generalization. The detailed hyperparameter settings of the model are shown in Table 2.
Comparative experiments are conducted to choose the scale factor ν of the composite loss function. Five values, 0.2, 0.4, 0.6, 0.8 and 1.0, are selected for comparison and validation. Ten experiments are conducted on the four test bearings for each of these five values, and the resulting box plots are shown in Figure 8. From Figure 8, it can be seen that the best choices of ν for the three operating conditions are 0.6, 0.6 and 0.4, respectively.
After training the MSCAE using the training bearing data, the MSCAE was tested using the testing bearings, and the HI extracted for the four testing bearings is shown in Figure 9. In Figure 9, green indicates that the value of HI is closer to 1, and blue indicates that it is closer to 0. The red curve is the HI trend curve fitted using polynomials.
Four bearings in Figure 9 have a clear downward trend in HI, and by the time sampling stops, the value of HI is close to 0. In order to verify the superiority of MSCAE, this paper sets up a comparison between MSCAE and three convolutional autoencoders (CAE) with constant convolutional kernel sizes of 3 × 1, 7 × 1 and 11 × 1, respectively. To ensure the completeness of the experiments, the structure of the CAE is the same as the structure of a pathway in MSCAE. Moreover, when the model is trained, the hyperparameters are set the same as MSCAE. The obtained results are shown in Figure 10. From Figure 10, it can be seen that MSCAE exceeds the other three models in CI for the tested bearings under each operating condition. It shows that MSCAE can combine the advantages of different convolutional kernel sizes to identify different time scale information and make full use of the local and global information of the original vibration signal to extract a more effective HI. To further verify the superiority of MSCAE, four models were used to extract the HI of Bearing2_6 bearing for analysis, as shown in Figure 11. When the convolution kernel size is set to 3, the final HI trend increases, which is against the degradation law of the real bearing. Moreover, when the convolution kernel size is 7 and 11, the final HI trend decreases to 0. However, both undergo abrupt changes when the bearing is damaged, i.e., the HI value directly decreases to 0. The abrupt change is most obvious when the convolution kernel size is 7, and this phenomenon affects the prediction of RUL. Therefore, it is desired to obtain an HI that can satisfy the real degradation trend of the bearing, and the degradation process of the HI is relatively flat, which is conducive to the continuous prediction of RUL. It can be found that the HI extracted by MSCAE proposed in this paper can meet the above requirements.
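The effect of the encoder configuration reported in this section on the data size can be checked with a few lines of arithmetic. The sketch assumes that each model input is one 0.1 s acquisition sampled at 25.6 kHz and that the convolutions use padding that preserves the sequence length, so that only the pooling layers change it. Neither assumption is stated explicitly in the text.

```python
def encoder_output_shape(input_length, channels, pool):
    """Return (channels, length) after the encode blocks.

    Assumes length-preserving convolutions, so each block divides the length by the pool size.
    """
    length = input_length
    for _ in channels:
        length //= pool
    return channels[-1], length

samples = round(25_600 * 0.1)   # samples per acquisition: 25.6 kHz for 0.1 s
n_channels, length = encoder_output_shape(samples, channels=(8, 16, 4), pool=8)
flattened_per_pathway = n_channels * length

print(samples, (n_channels, length), flattened_per_pathway)
```

Under these assumptions, three pooling stages of size 8 reduce each acquisition by a factor of 512 before the fully connected block compresses the result to a single HI value.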

RUL Prediction
The HI extracted from the test bearings by MSCAE shows a clear degradation trend over time, so it can be treated as a time series. LSTM is therefore used to perform the RUL prediction of the bearings. For Bearing1_1, the last 100 HI points are assumed to be unknown and the first 2703 HI points are known. The known 2703 HI points are used as training data for the LSTM, which then predicts the unknown HI points. When the value predicted by the LSTM falls below the threshold of 0, the bearing is considered to be damaged at that moment, and the time interval between that moment and the starting moment of the prediction is the predicted RUL. Similarly, for Bearing1_3, the last 100 HI points are assumed to be unknown, and the training data for the LSTM are the first 2275 HI points. The parameters of the LSTM are configured as shown in Table 3. The numbers of neurons in the input, hidden, and output layers of the LSTM are 360, 29 and 1, respectively. The internal parameters of the LSTM are optimized using the Adam optimizer, and the learning rate is set to 0.07.

Table 3. Hyperparameter configuration of LSTM.
Parameters                                   Value
Number of neurons in the input layer         360
Number of neurons in the hidden layer        29
Number of neurons in the output layer        1
Learning rate lr                             0.07
Optimizer                                    Adam

To demonstrate the superiority of the proposed HI extraction method for RUL prediction, two state-of-the-art deep learning-based HI construction methods are used for comparison. One is the recurrent convolutional neural network (RCNN) proposed in [33], and the other is the CNN proposed in [25]. The RCNN is an end-to-end HI extraction framework consisting of a convolutional neural network with residual structure serially connected to an LSTM, with the specific structural parameters described in [33]. The CNN-based HI construction method proposed in [25] has two steps: two convolutional layers first perform feature extraction, and a fully connected layer then constructs the HI. Both methods, like the one in this paper, use the original vibration signal as the model input without a manual feature extraction step. All three methods are trained using the same training data, and the RUL prediction ability of the extracted HI is verified on Bearing1_1 and Bearing1_3. For convenience, the HI extracted by the proposed MSCAE, RCNN and CNN are denoted MSCAE-HI, CRNN-HI and CNN-HI, respectively.
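The threshold-crossing procedure used for RUL prediction in this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `make_windows`, `predict_rul` and `linear_extrapolation` are hypothetical names, the trained LSTM is replaced by a simple stand-in predictor, the window length of 360 is taken from the input-layer size in Table 3, and the 10 s interval between HI points is an assumption about the sampling schedule.

```python
def make_windows(hi, window=360):
    """Split an HI series into (input window, next value) training pairs."""
    return [
        (hi[i:i + window], hi[i + window])
        for i in range(len(hi) - window)
    ]


def predict_rul(known_hi, predict_next, window=360, threshold=0.0,
                step_seconds=10.0, max_steps=1000):
    """Forecast HI recursively until it falls below the failure threshold.

    predict_next is any callable mapping a window of HI values to the next
    value (a trained LSTM in the paper, a stand-in here). Returns the RUL in
    seconds, or None if the threshold is not reached within max_steps.
    """
    history = list(known_hi)
    for step in range(1, max_steps + 1):
        next_hi = predict_next(history[-window:])
        history.append(next_hi)  # feed the prediction back as input
        if next_hi < threshold:
            return step * step_seconds
    return None


def linear_extrapolation(window_values):
    """Hypothetical stand-in predictor: continue the last observed slope."""
    return 2 * window_values[-1] - window_values[-2]


# Toy HI falling linearly from 1.0 by 1/64 per step; the last known value is 0.5.
known = [1.0 - i / 64 for i in range(33)]
pairs = make_windows(known, window=5)
rul = predict_rul(known, linear_extrapolation, window=5)
```

Because the failure threshold is fixed at 0 for MSCAE-HI, the only free choices in this sketch are the predictor and the window length. The `None` return value covers the case in which the forecast never reaches the threshold.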
The RUL prediction results of the HI extracted by the three models are shown in Figures 12-14. Figure 13 shows that RUL prediction using CRNN-HI requires a manually set threshold, and the choice of threshold directly affects the RUL prediction results. If the threshold is set too high, the prediction stops early. If it is set too low, the predicted RUL is larger than the true value, or the predicted HI may never reach the threshold. In this paper, the failure threshold of CRNN-HI is set to 0.2. In Figure 13a, the HI of Bearing1_1 extracted using the CRNN method shows an increasing trend at the end of degradation, which deviates from the actual degradation trend and can affect the prediction ability of the LSTM. In Figure 13b, there are a few outliers in the HI of Bearing1_3, and some HI points are already below the threshold when the bearing first starts running. Figure 14 shows a significant divergence of CNN-HI at the late degradation stage of the bearing. As a result, the predicted HI tends to reach the failure threshold too early when CNN-HI is used for RUL prediction, so the predicted RUL is often smaller than the true RUL. In Figure 12, the MSCAE-HI constructed in this paper has less fluctuation and a smoother degradation trend, which effectively reflects the real degradation trend of the bearing. With MSCAE-HI, the predicted RUL is closer to the true value, and the predicted HI is more consistent with the true degradation trend. Based on the above analysis, the MSCAE-HI proposed in this paper is more suitable for the RUL prediction of bearings.
To further validate the superiority of MSCAE-HI, five metrics for evaluating RUL prediction accuracy were used to compare the prediction performance of the three methods. These five evaluation metrics are the score function, mean absolute error (MAE), normalized root mean square error (NRMSE), root mean square error (RMSE) and mean absolute percentage error (MAPE), which are defined in [23]. Each of the three methods was used to conduct five prediction experiments for Bearing1_1 and Bearing1_3, and the final evaluation results are shown in Table 4. In Table 4, the MAE, NRMSE, RMSE and MAPE of CNN-HI are the largest among the three methods. This can be explained by the pronounced fluctuation of CNN-HI at the late degradation stage: the HI predicted by the LSTM easily falls below the failure threshold, resulting in a large deviation of the predicted RUL from the true RUL. RUL prediction using MSCAE-HI obtained the maximum score values and the minimum MAE, NRMSE, RMSE, and MAPE values on both Bearing1_1 and Bearing1_3, which once again confirms the superiority of the MSCAE-based HI extraction method proposed in this paper.
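For reference, the five evaluation metrics can be computed as in the Python sketch below. The exact definitions used in this paper are those of [23] and are not reproduced here, so the forms below are assumptions: the score follows the asymmetric scoring function commonly associated with the IEEE PHM 2012 challenge, NRMSE is normalized by the mean true RUL (other normalizations exist), and MAPE is expressed in percent. The function names are illustrative.

```python
import math


def score(true_rul, pred_rul):
    """Mean asymmetric accuracy score; overestimated RUL is penalized more heavily."""
    values = []
    for actual, predicted in zip(true_rul, pred_rul):
        error_pct = 100.0 * (actual - predicted) / actual
        if error_pct <= 0:  # predicted RUL >= true RUL (late prediction)
            values.append(math.exp(-math.log(0.5) * error_pct / 5.0))
        else:               # predicted RUL < true RUL (early prediction)
            values.append(math.exp(math.log(0.5) * error_pct / 20.0))
    return sum(values) / len(values)


def mae(true_rul, pred_rul):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(true_rul, pred_rul)) / len(true_rul)


def rmse(true_rul, pred_rul):
    """Root mean square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(true_rul, pred_rul)) / len(true_rul)
    )


def nrmse(true_rul, pred_rul):
    """RMSE divided by the mean true RUL (assumed normalization)."""
    return rmse(true_rul, pred_rul) / (sum(true_rul) / len(true_rul))


def mape(true_rul, pred_rul):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(true_rul, pred_rul)
    ) / len(true_rul)


# Toy example: one early and one late prediction with the same absolute error.
true_rul = [1000.0, 1000.0]
pred_rul = [900.0, 1100.0]
results = {
    "score": score(true_rul, pred_rul),
    "MAE": mae(true_rul, pred_rul),
    "RMSE": rmse(true_rul, pred_rul),
    "NRMSE": nrmse(true_rul, pred_rul),
    "MAPE": mape(true_rul, pred_rul),
}
```

A higher score and lower MAE, RMSE, NRMSE and MAPE indicate better prediction. In the toy example, the late prediction contributes less to the score than the early one even though both have the same absolute error.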

Conclusions
In this paper, a novel framework for HI extraction, called MSCAE, is proposed. It overcomes two disadvantages of traditional methods: the need to manually extract time-frequency domain indicators as features and the need to set failure thresholds by experience in RUL prediction. It relies solely on the raw sensor vibration signal to extract HI and does not require additional determination of the failure threshold. MSCAE uses convolutional kernels of different sizes to effectively exploit the global and local information of vibration signals, enhancing the HI extraction capability. A quadratic function-based label is first constructed for the original vibration data, after which the model is trained using the training data, and the internal parameters are optimized using a composite loss function. Then, HI is extracted from the test bearing data to verify the validity of MSCAE. Finally, the RUL prediction is performed using LSTM. On the PHM2012 dataset, the HI extraction capability of MSCAE is verified to be superior to that of single-scale CAE models. Furthermore, MSCAE is compared with two state-of-the-art HI construction methods, CRNN and CNN, and the RUL prediction performance is judged using five evaluation metrics. The comparison results confirm the superiority of the proposed MSCAE-extracted HI for RUL prediction. In this paper, HI extraction and RUL prediction for rolling bearings achieved excellent results; however, the generalization capability for other mechanical components such as gears and engines needs further validation. Future work will apply MSCAE to HI extraction and RUL prediction of other mechanical components.