Remaining Useful Life Prediction Based on Adaptive Shrinkage Processing and Temporal Convolutional Network

The remaining useful life (RUL) prediction is important for improving the safety, supportability, maintainability and reliability of modern industrial equipment. Traditional data-driven rolling bearing RUL prediction methods require a substantial amount of prior knowledge to extract degradation features. Many recurrent neural networks (RNNs) have been applied to RUL prediction, but their long-term dependence problem and inability to remember long-term historical information can result in low prediction accuracy. To address this limitation, this paper proposes an RUL prediction method based on adaptive shrinkage processing and a temporal convolutional network (TCN). In the proposed method, instead of performing feature extraction to preprocess the original data, the multi-channel data are used directly as the input of the prediction network. In addition, an adaptive shrinkage processing sub-network is designed to allocate the parameters of the soft-thresholding function adaptively, reducing the amount of noise-related information while retaining useful features. Therefore, compared with existing RUL prediction methods, the proposed method can describe the RUL more accurately based on the original historical data. In experiments on the PHM2012 rolling bearing data set and the XJTU-SY data set, and in comparison with different methods, the predicted mean absolute error (MAE) is reduced by up to 52% and the root mean square error (RMSE) by up to 64%. The experimental results show that the proposed adaptive shrinkage processing method, combined with the TCN model, can predict the RUL accurately and has high application value.


Introduction
With the rapid development of computing methods and information technology, modern production systems have become more complex [1]. In recent years, prognostics and health management (PHM) has become a common and effective way to improve the availability, safety, supportability, maintainability and reliability of modern industrial equipment and to reduce life cycle costs, and it has thus received widespread attention from both academia and industry. Among the processes involved in the PHM framework, remaining useful life (RUL) prediction represents a key task [2] and forms the basis for management decision-making. The purpose of RUL prediction is to estimate how long a system or component will continue to operate normally, to warn of impending failures and to help prevent industrial accidents to a considerable extent. Therefore, the construction of an accurate and efficient RUL prediction model is essential to realizing these tasks in the industrial field.
In the past decade, RUL prediction technology has made great progress, mainly comprising model-based methods, data-driven methods and hybrid methods [3,4]. Among them, model-based methods usually need to establish a failure degradation model for the research object and generally do not generalize well. Due to the complexity of working conditions, the complexity of mechanical equipment and the diversity of degradation mechanisms, obtaining failure models is difficult, and the prediction effect is hard to guarantee [5]. Data-driven methods explore the relationship between the data collected by sensors and the remaining life through machine learning and statistical methods [6]. Traditional data-driven methods (such as support vector machines [7] and neural networks [8]) have achieved some results in remaining life prediction. However, with the growing complexity and integration of mechanical equipment, the collected sensor data are becoming ever larger, and it is difficult to extract the characteristic relationships contained therein, so the accuracy of the resulting remaining life predictions is limited.
Deep learning has strong nonlinear mapping and feature extraction abilities, and it is increasingly used in the fields of RUL prediction and health monitoring [9]. In RUL prediction, the recurrent neural network (RNN) and its improved variants have been widely used. For example, Senanayake et al. [10] used an autoencoder and an RNN to predict bearing RUL. Luo et al. [11] used a BiLSTM model to predict the degradation trend of roller bearing performance and verified the effectiveness and robustness of the proposed method through experiments. Zhang et al. [12] proposed a novel bidirectional gated recurrent unit with a temporal self-attention mechanism (BiGRU-TSAM) to predict RUL. Zhang et al. [13] proposed a dual-task network based on a bidirectional gated recurrent unit (Bi-GRU) and a multi-gate mixture of experts (MMoE), which can simultaneously evaluate the health status and predict the RUL of mechanical equipment. These methods solve the problem of unpredictable RUL under specific conditions. However, although RNNs and their variants can capture latent temporal patterns through their cyclic recursive structure, they are difficult to design and train due to their complex internal structure. In addition, the problems of gradient explosion and gradient vanishing often lead to low RNN training accuracy [14]. The emergence of convolutional neural networks (CNNs) means that time series prediction methods are no longer limited to RNNs [15]. The CNN has the advantage of parallel computing, and when the receptive field increases, the network model can obtain more historical information; it has therefore also been widely used and has achieved very good results. For example, Ge et al. [16] proposed a short-term traffic speed prediction method based on a graph attention convolution network and obtained good prediction results. Li et al. [17] proposed a CNN-based RUL prediction method trained with a cycle-consistent learning scheme to align the data of different entities at similar degradation levels. Lin et al. [18] proposed a trend attention fully convolutional network (TaFCN) to further improve the prediction performance. However, when a CNN processes long time series, it often needs a deeper structure to obtain a sufficient receptive field, which reduces training efficiency.
Temporal convolutional networks (TCNs) are a recent improvement of the CNN structure that extracts historical information using dilated causal convolution (DCC). A DCC usually requires fewer layers than a classical CNN to capture the same receptive field; therefore, TCNs have a better time series prediction ability than CNNs [19]. In addition, a TCN has no recurrent connections, which makes it computationally more efficient to train than an RNN. Recent studies have pointed out the potential of TCNs in prediction; for instance, Sun et al. [20] used a TCN to predict the RUL of rotating machinery, and Gan et al. [21] successfully used a TCN to predict the wind speed range of wind turbines. However, the vibration signals collected from sensors contain noise. In RUL prediction, a TCN is often affected by noise when extracting degradation features, which prevents it from accurately capturing degradation features from historical data and leads to low prediction accuracy. Moreover, due to changes in the working environment and load, the noise intensity varies with time. How to adaptively suppress redundant information such as noise is therefore particularly important for RUL prediction.
To solve these problems, this paper proposes an RUL prediction method based on adaptive shrinkage processing and a temporal convolutional network (AS-TCN). Firstly, the vibration signals monitored by multiple channels are used directly as the input of the prediction network, without prior knowledge for feature extraction. Secondly, in the AS-TCN, residual connections and dilated causal convolution are used to extract long-term historical information, and an adaptive shrinkage processing sub-network is introduced to adaptively eliminate different noises. Finally, on the PHM2012 and XJTU-SY bearing data sets, the proposed method is compared with three state-of-the-art methods; the average MAE is reduced by up to 52% and the average RMSE by up to 64%, which verifies the effectiveness of the proposed method.
The main contributions of this paper can be summarized as follows: (1) A new framework of RUL prediction based on AS-TCN is proposed. It directly uses multi-channel monitored data as the network input, without prior knowledge for feature extraction, and effectively captures the key degradation information of bearings, thus realizing an end-to-end prediction process.
(2) Using a TCN to build the main network enables the model to remember a large amount of complete historical information, avoids the shortcomings of long-term dependence and improves the accuracy of RUL prediction.
(3) An adaptive shrinkage processing sub-network is added to the TCN block. The sub-network adaptively adjusts the threshold of the soft-thresholding function to minimize noise-related redundant information while retaining the features that best reflect the degradation information.
The rest of this article is organized as follows. Section 2 briefly introduces the theoretical background of the TCN and the adaptive shrinkage mechanism. Section 3 describes the internal structure and implementation process of the AS-TCN. Section 4 verifies the RUL prediction performance of the proposed method on rolling bearings. Finally, Section 5 concludes the paper and presents future work directions.

Dilated Causal Convolution (DCC)
Causal convolution was first used in the WaveNet model [22] to learn from the input audio data before time τ in order to predict the output at time (τ + 1). The output at time τ can be obtained only from the input data up to time τ; namely, the prediction at time τ cannot rely on any future time step. Causal convolution adopts unilateral zero padding of the input data so that the input size is consistent with the output size, thus preventing future information from leaking into the past [23].
Since there is no recurrent connection in causal convolution, time series data can be input in parallel, so causal convolution trains faster than RNNs, particularly on large-sample time series [24]. However, when dealing with long sequences, causal convolution requires a deeper network structure or a large convolution kernel to enlarge the receptive field of the neurons. For this reason, TCNs combine causal convolution with the DCC technology. The dilated causal convolution is obtained by introducing dilated convolution into causal convolution so as to increase the receptive field, which can be expressed as follows:

F(s) = (x ∗_d f)(s) = Σ_{i=0}^{κ−1} f(i) · x_{s−d·i},  (1)

where κ is the size of the convolution kernel, d is the dilation factor, ∗_d(·) represents the dilated convolution calculation and the subscript s − d·i indicates the past direction. The DCC architecture with a kernel size of κ = 3 is presented in Figure 1. In the first hidden layer, the dilation factor is one, so no neuron is skipped; in the second hidden layer, with a dilation factor of two, one neuron is skipped between the selected neurons, and with a dilation factor of four, three neurons are skipped. Each layer of the TCN is a residual block, and the dilation factor of the convolutional neurons in each residual block increases at a rate of 2^(n−1) from shallow to deep, ensuring a good ability to memorize historical information.

Figure 1. The DCC architecture.
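Equation (1) and the receptive-field growth described above can be sketched directly; the following is a minimal NumPy illustration (the function names are ours, not the paper's), including the receptive-field count 1 + (κ − 1)·Σd for stacked layers with dilations 1, 2, 4:

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """y[s] = sum_i f[i] * x[s - d*i], with implicit zero padding on the
    left so that no future sample leaks into the past (causality)."""
    y = np.zeros(len(x))
    for s in range(len(x)):
        for i in range(len(f)):
            j = s - d * i
            if j >= 0:  # left zero-padding: out-of-range taps contribute zero
                y[s] += f[i] * x[j]
    return y

def receptive_field(kernel_size, dilations):
    # stacked DCC layers cover 1 + (kernel_size - 1) * sum(dilations) past steps
    return 1 + (kernel_size - 1) * sum(dilations)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(dilated_causal_conv(x, np.ones(3), d=2))  # [1. 2. 4. 6. 9.]
print(receptive_field(3, [1, 2, 4]))            # 15
```

For κ = 3 and dilations 1, 2, 4, three layers already see 15 past time steps, which is the efficiency gain over an undilated causal stack.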


Residual Module
The TCN uses residual learning to simplify deep network training. He et al. [25] proposed the residual module for the first time and achieved promising results in speech recognition [26] and image processing [27]. The core idea is to introduce a skip connection that bypasses one or more layers. Generally, residual learning does not use the stacked nonlinear layers directly to fit the desired mapping H(X); instead, the stacked nonlinear layers fit the residual mapping F(X), and the originally required mapping H(X) is recast as F(X) + X. In the residual connection, an identity skip connection that bypasses the residual layers is introduced. By establishing a cross-layer connection between two layers that are apart, reuse of the output feature map of a convolution layer is enhanced, which improves network performance. At the same time, the problem of gradient vanishing or gradient explosion caused by a large number of layers can be effectively avoided. The residual mapping is as follows:

H(X) = F(X) + X.  (2)

Batch normalization (BN) [28] is usually required for deep network training. BN uses the mean and standard deviation of mini-batches to adjust the intermediate output of a network, which improves the stability of the intermediate output and mitigates overfitting. The activation function and BN are usually added after the convolution operation in the conventional CNN structure. Many studies on the original residual networks have analyzed how placing the activation function and BN at different locations affects network performance [29]. The results show that the fully pre-activated structure is superior to the other structures in reducing overfitting and improving the generalization ability of the network. Therefore, this study achieves full pre-activation of the residual connection by adding the activation function and BN before the dilated causal convolution.
The improved residual structure is presented in Figure 2.
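The fully pre-activated ordering described above (BN and activation placed before each convolution, then the identity skip added) can be sketched as follows; this is a simplified illustration with a BN that has no learned scale/shift and with the convolutions passed in as placeholder callables:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # simplified BN: normalize over the batch axis, no learned scale/shift
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def pre_activated_residual_block(x, conv1, conv2):
    # full pre-activation: BN -> activation -> convolution, twice,
    # then add the identity skip connection: H(x) = F(x) + x
    h = conv1(relu(batch_norm(x)))
    f = conv2(relu(batch_norm(h)))
    return f + x  # identity skip connection

# with F(x) forced to zero, the block reduces to the identity mapping
x = np.ones((6, 3))
out = pre_activated_residual_block(x, lambda z: z, lambda z: np.zeros_like(z))
```

The identity path guarantees that, even in the worst case, a block can pass its input through unchanged, which is what makes very deep stacks trainable.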

Adaptive Shrinkage Processing
Shrinkage processing refers to soft thresholding, a function that shrinks the input data toward zero, retaining large negative or positive features and setting the features close to zero to zero. It has been proven that, in this way, useful information can be well preserved while noise-related features are eliminated [30]. The soft-thresholding function is given by Equation (3), where y and x are the output and input features, respectively:

y = x − τ for x > τ;  y = 0 for −τ ≤ x ≤ τ;  y = x + τ for x < −τ.  (3)

Figure 3a shows two soft-thresholding functions with different thresholds. The boundary of the soft-thresholding function is controlled by a threshold τ, and the output in the interval [−τ, τ] is set to zero. The soft-thresholding function can adjust the threshold value τ to shrink, as shown in Figure 4. Meanwhile, the derivative of the soft-thresholding function is defined by Equation (4) and presented in Figure 3b:

∂y/∂x = 1 for x > τ or x < −τ;  ∂y/∂x = 0 for −τ ≤ x ≤ τ.  (4)

Since the derivative is either zero or one, the soft-thresholding function is beneficial in preventing gradient vanishing and exploding.
To detect the degradation information of equipment comprehensively, this study uses the operation-to-failure data collected by different sensors as the network training dataset. However, due to environmental impact, changes in operating conditions, performance degradation and other factors, the noise level also changes over time. Therefore, when the signal samples are converted into feature maps through the stacked layers, the threshold of each feature map must be customized. To this end, an adaptive shrinkage training subnet is constructed in the TCN framework. This subnet adaptively adjusts the threshold τ during training through optimization, with the aim of minimizing the deviation between the ground truth and the model output.
The working mechanism of the adaptive shrinkage processing is illustrated in Figure 5. Suppose that the input tensor α has M rows and N columns, i.e., N feature maps. The absolute value of the input tensor is calculated element by element, and the average of each column is then computed in the global average pooling layer; the result, a vector β with one row and N entries, is processed by the fully connected (FC) layer and the BN layer in turn. The activation function of the last FC layer is set to the sigmoid function so that its output γ lies in the range [0, 1]. By multiplying β and γ element-wise, each feature map obtains its own threshold, yielding the threshold vector [τ1, τ2, . . . , τn] as the result of adaptive shrinkage. Finally, the input features and the soft-threshold vector realize shrinkage processing via Equation (3).
Figure 5. The block diagram of the adaptive shrinkage processing mechanism.
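Equation (3) and the channel-wise threshold computation can be sketched as follows; the fc callable here is a hypothetical stand-in for the FC + BN sub-network, not the paper's exact layers:

```python
import numpy as np

def soft_threshold(x, tau):
    # Eq. (3): zero inside [-tau, tau], shrink everything else toward zero
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def adaptive_thresholds(features, fc):
    """features: (M, N) map with N channels; fc: stand-in for the FC+BN layers.
    Returns one threshold per channel: tau_n = beta_n * sigmoid(fc(beta))_n."""
    beta = np.abs(features).mean(axis=0)      # global average pooling of |x|
    gamma = 1.0 / (1.0 + np.exp(-fc(beta)))   # sigmoid keeps gamma in (0, 1)
    return beta * gamma                       # element-wise product

print(soft_threshold(np.array([-2.0, -0.5, 0.5, 2.0]), 1.0))
```

Because τ is derived from the mean absolute value of each channel, a noisier channel automatically receives a larger threshold, which is the adaptive behavior the subnet is trained to refine.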

AS-TCN Block
The AS-TCN block consists of DCC and an adaptive shrinkage sub-network. After a series of operations, the input data, which can be viewed as a feature map containing both redundant information and degradation features, are fed into the TCN block, and the adaptive shrinkage sub-network obtains a different threshold for each feature. Next, the soft-thresholding function eliminates the redundant information while retaining the degradation features. Moreover, the TCN block uses an identity path to reduce the difficulty of model training. The internal details of the DCC sub-network stacked with custom layers are presented in Figure 6. The LeakyReLU activation function is used at the cost of gradient sparsity, which makes the module more robust in optimization [31].

Figure 6. The AS-TCN block structure.
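Putting the pieces together, the data flow of the block can be sketched as below; dcc and thresholds_fn are placeholder callables standing in for the sub-networks described above, not the paper's exact layers:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0.0, x, alpha * x)

def as_tcn_block(x, dcc, thresholds_fn):
    # DCC features -> adaptive per-channel thresholds -> soft thresholding,
    # then the identity path is added to ease model training
    f = leaky_relu(dcc(x))
    tau = thresholds_fn(f)                          # one threshold per channel
    shrunk = np.sign(f) * np.maximum(np.abs(f) - tau, 0.0)
    return shrunk + x                               # identity path

# with an identity DCC and zero thresholds, the block reduces to leaky_relu(x) + x
out = as_tcn_block(np.array([1.0, -2.0, 3.0]),
                   dcc=lambda z: z, thresholds_fn=lambda z: 0.0)
```

The identity path means the shrinkage operates on the residual branch only, so aggressive thresholding cannot erase the raw input signal.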

RUL Prediction Process
The RUL prediction process is illustrated in Figure 7, where it can be seen that data collection is performed first, and the sensor signals from different channels are collected. Next, the signals are standardized, and the preprocessed data are divided into test and training sets. Then, the training set is used to train the network through a predefined number of iterations. During model training, the back-propagation method is used; the Adam optimizer is employed to reduce the loss function value (loss), and the optimal structural parameters are determined. After 200 iterations, the trained network model is obtained. The trained network model is then used to perform RUL prediction on the test set. The simulation results are expressed as the fit between the predicted RUL curve of the test set and the true-value curve, and the mean absolute error (MAE) and root mean square error (RMSE) are used as indicators to evaluate the prediction performance.
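The two evaluation indicators used above can be computed as:

```python
import numpy as np

def mae(y_true, y_pred):
    # mean absolute error
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    # root mean square error
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))   # 0.666...
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # 1.154...
```

Note that the RMSE penalizes large single-point errors more heavily than the MAE, which is why both are reported together.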


AS-TCN Prediction Model
The structure of the RUL prediction network based on the AS-TCN designed in this paper is shown in Figure 8. The sensor data collected from the two channels, namely the vibration signals in the x and y directions, are used as the network input. The two-dimensional input tensor passes through a one-dimensional convolution layer, a max-pooling layer, a dropout layer, three stacked TCN modules, a global average pooling block and a fully connected layer.
Figure 8. The AS-TCN network prediction model.
Network parameters include trainable parameters (such as the weights and biases of the convolution kernels) and non-trainable hyperparameters (such as the number and size of the convolution kernels). The hyperparameters need to be set in advance, and their influence on both the prediction performance and the training time must be considered comprehensively. After repeated experiments and comparisons, the final hyperparameter settings are shown in Table 1. The Adam algorithm is selected as the optimizer during training, with a learning rate of 0.001 and a total of 200 training epochs.

Dataset Introduction
The proposed prognostic method was validated using two accelerated rolling bearing degradation test datasets. The bearing operation-to-failure dataset [32] released by the IEEE PHM2012 Data Challenge was measured by the PRO-NOSTIA test rig, as shown in Figure 9. The data were collected by two acceleration sensors in the horizontal and vertical directions, separately; the sampling frequency was 25.6 kHz, and the data were recorded every 10 s for 0.1 s, so the vibration data for each sampling included 2560 points. For safety reasons, the experiment was stopped when the amplitude of the vibration data exceeded 20 g. The measured bearing failure time was defined as the time when the amplitude was greater than 20 g. The proposed prediction model was verified by using the operation-tofailure data under conditions one and two, as shown in Table 2. The life cycle data of bearings 1-1 and 2-1 are shown in Figure 10.
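As a quick sanity check of the sampling scheme described above (25.6 kHz sampled for 0.1 s every 10 s), the number of points per recording follows directly:

```python
fs = 25_600         # sampling frequency in Hz
record_len = 0.1    # seconds recorded per snapshot (one snapshot every 10 s)
points_per_sample = round(fs * record_len)
print(points_per_sample)  # 2560
```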
The server configuration used in the laboratory was as follows: the processor was an Intel® Xeon E5-2696 v2 @ 2.5 GHz, the memory was 128 GB and the GPU was an Nvidia® GeForce 3070 Ti (8 GB); the operating system was Microsoft® Windows 10 (64-bit); and the programming language was Python™ 3.9 based on the TensorFlow-GPU 2.6.0 deep learning framework.

Data Preprocessing
Different operating settings may lead to different sensor value ranges, and the obtained data represent different physical characteristics. Therefore, in order to eliminate the impact of data irregularities on the prediction effect, data normalization is carried out before the model is trained and tested, and the data are scaled to the range [0, 1] using the minimum and maximum values of the dataset, which also improves the calculation speed of the model. The calculation formula is as follows:

$$\hat{X}_{i,j}(t) = \frac{X_{i,j}(t) - X_j^{\min}}{X_j^{\max} - X_j^{\min}}$$

where $\hat{X}_{i,j}(t)$ is the value obtained from the normalization of $X_{i,j}(t)$, $X_{i,j}(t)$ is the $j$th sensor value at the $i$th data point at time $t$, and $X_j^{\min}$ and $X_j^{\max}$ are respectively the minimum and maximum values of the data collected by the $j$th sensor. After the normalization of the PHM2012 dataset, the original data are segmented according to the sampling points, and the input data of the network model are shown in Table 3.
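A minimal numpy sketch of this per-sensor min-max normalization (each column uses its own minimum and maximum):

```python
import numpy as np

def minmax_normalize(X):
    """Scale each sensor channel (column) of X to [0, 1] using that
    channel's own minimum and maximum values."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

X = np.array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 30.0]])
Xn = minmax_normalize(X)   # each column becomes [0.0, 0.5, 1.0]
```

In practice the minima and maxima computed on the training set would typically be reused for the test set; that detail is omitted here for brevity.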

Evaluation Indicators
To better extract the degradation features with the network model, the horizontal and vertical vibration signals were used as the network input data, and each 0.1 s recording was divided into one sample (i.e., the input data were shaped as $X \in \mathbb{R}^{2560 \times 2}$). The percentage of remaining useful life during degradation was calculated by:

$$y_t = \frac{T - t}{T}$$

where $T$ is the total number of time steps and $y_t$ is the real RUL at time step $t$. The performance of the prediction results can be evaluated by a variety of indicators. To evaluate the prediction effect of the AS-TCN model, this paper used the mean absolute error (MAE) and the root mean square error (RMSE), which were respectively calculated by:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}$$

where $\hat{y}_i$ represents the prediction result at time $i$, $y_i$ represents the real value at time $i$ and $n$ is the number of samples in the test set. The smaller the absolute prediction error, the lower the MAE and RMSE values.
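The two evaluation indicators can be computed directly; a small sketch with illustrative RUL percentages (not values from the paper):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_pred - y_true)))

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Illustrative RUL-percentage labels and predictions.
y_true = np.array([1.0, 0.8, 0.6, 0.4])
y_pred = np.array([0.9, 0.8, 0.5, 0.5])
print(mae(y_true, y_pred))    # 0.075
print(rmse(y_true, y_pred))   # ~0.0866
```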

Results Analysis
An ablation study was conducted to verify the effectiveness of the sub-networks of the AS-TCN model. To examine the effects of the structural parameters of the proposed model on the overall performance, the key components of the AS-TCN were either replaced or deleted, while the other parameters were kept unchanged. To further illustrate the advantages of the proposed technique, other advanced deep-learning-based RUL prediction methods were added for comparison. Finally, five methods were designed as follows: (1) CNN. A conventional CNN model with neither the DCC nor the adaptive shrinkage processing mechanism.
(2) AS-CNN. In this model, compared to the AS-TCN, only the TCN was replaced with the traditional CNN, while the rest of the network parameters were kept the same.
(3) TCN. The AS sub-network was removed from the AS-TCN, while the other structural parameters remained unchanged. This model was used to evaluate the effectiveness of the AS sub-network.
(4) DSCN [33]. A deep separable convolutional network for RUL prediction that removes the need for manual feature selection and learns the machine degradation state directly from the monitoring data.
(5) CNN-BiGRU [34]. A state-of-the-art method proposed by Shang et al. [34].
Cross-validation was used to evaluate in depth the performance of the AS-TCN and the other methods in RUL prediction. The 14 datasets listed in Table 2 were divided into two groups according to the two operating conditions. In the first group (i.e., the seven datasets under operating condition one), six datasets were used for model training, and the remaining dataset was used as the test set to generate the corresponding RUL prediction. For example, in the first cross-validation fold, the six datasets other than the bearing 1-1 dataset were used as the training set, and the bearing 1-1 dataset was used as the test set to generate the RUL prediction of bearing 1-1, and so on.
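The scheme just described is leave-one-bearing-out cross-validation within each operating condition; a plain-Python sketch (the bearing names are illustrative labels):

```python
def leave_one_bearing_out(bearings):
    """Yield (train, test) splits: each bearing is held out in turn as
    the test set while the remaining bearings form the training set."""
    for i, test_bearing in enumerate(bearings):
        train = bearings[:i] + bearings[i + 1:]
        yield train, test_bearing

# The seven bearings of operating condition one.
condition_1 = [f"Bearing1-{k}" for k in range(1, 8)]
splits = list(leave_one_bearing_out(condition_1))
# 7 folds; the first trains on Bearing1-2..1-7 and tests on Bearing1-1.
```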
The experimental results are shown in Tables 4 and 5. Table 4 compares the MAE and RMSE of the six different models. It can be seen that the proposed AS-TCN outperforms the other methods in most cases. The comparison with the first three models designed for the ablation experiment verifies that the AS sub-network can improve the accuracy of RUL prediction and that the TCN has a good ability to capture long time series. Table 5 shows the mean values of the MAE and RMSE of the different methods: compared with the DSCN [33], the MAE and RMSE decreased by 52% and 64%, respectively, and the RMSE decreased by 28% compared with the CNN-BiGRU [34]. The smaller the MAE and RMSE values, the more accurate the prediction results. These results show that the AS-TCN is superior to the other techniques in bearing RUL prediction.
To represent the MAE and RMSE of the different methods more intuitively, the results in Table 4 are plotted in Figure 11. The mean values of the MAE and RMSE of the AS-TCN in Figure 11 are smaller than those of the other methods; for the MAE, in all 14 experiments except test bearings 2-1 and 2-7, the AS-TCN was significantly lower than the other methods. Figure 12 shows the RUL prediction curves of bearings 1-3 and 2-2. Although the predicted RUL results do not coincide exactly with the actual RUL defined by linear degradation, the proposed technique can capture the degradation trend of the bearings. Additionally, the predicted RUL results lie within the 95% prediction interval of the fitting line of the real RUL. In conclusion, the AS-TCN method can effectively predict the RUL.

Dataset Introduction
The XJTU-SY dataset was provided by Xi'an Jiaotong University and the Changxing Sumyoung Technology Company [35]. The test bench is shown in Figure 13. The tests were performed at two rotational speeds using LDK UER204 rolling bearings. The proposed prognostic model was validated using the operation-to-failure data under different operating conditions, as shown in Table 6. The run-to-failure vibration acceleration data were acquired by the accelerometer on the bearing housing; each sampling lasted 1.28 s at a sampling rate of 25.6 kHz, giving 32,768 data points, and the data were collected every 1 min. In this experiment, the horizontal and vertical vibration data were used for the prognosis. For safety, when the vibration amplitude exceeded 20 g, the accelerated bearing degradation test was stopped, and the corresponding time was regarded as the failure time of the bearing. A photo of a failed bearing is displayed in Figure 14. The horizontal and vertical vibration signals under two different operating conditions are presented in Figure 15. Similarly to the PHM2012 dataset, the horizontal and vertical vibration signals were used as input, as shown in Table 7: the data corresponding to each period of 1.28 s were divided into one sample, so the input data were shaped as $X \in \mathbb{R}^{32768 \times 2}$, and the RUL percentage label was obtained by Equation (6).
The results in Tables 8 and 9 are expressed more intuitively in Figure 16. The mean values of the MAE and RMSE of the AS-TCN in Figure 16 are smaller than those of the other methods; for the MAE and RMSE, in all 10 experiments except test bearing 2-1, the AS-TCN was significantly lower than the other methods. Therefore, the AS-TCN has clear advantages in RUL prediction. Figure 17 shows the RUL prediction curves of test bearings 1-1 and 2-5. The predicted RUL results lie within the 95% prediction interval of the fitting line of the real RUL, which proves that the AS-TCN can capture the degradation trend of the bearings well. In addition, the AS-TCN method was verified on two different datasets, which shows that the model has a good generalization ability.

Conclusions
Among the processes involved in the prognosis and health management (PHM) framework, the prediction of the remaining useful life (RUL) is a key task and forms the basis for the decision-making of management activities. Therefore, this paper proposes an AS-TCN-based model to predict the RUL of rolling bearings. In addition, ablation experiments and comparisons with state-of-the-art models were conducted, and two experimental cases were provided to verify the superiority of the proposed prediction method over the other methods.
Based on the results, the following conclusions can be drawn: (1) Adding a DCC hierarchy with residual blocks to the TCN module enables the network to capture longer time-series records.
(2) The adaptive shrinkage sub-network adopts an adaptive mechanism to learn the parameters of the soft-thresholding function so as to reduce the noise-related information, retain the degradation features and extract the health status information of the equipment.
(3) The validity of the proposed AS-TCN-based RUL prediction method is verified on rolling bearing test-bench data, and the proposed method is compared with several different methods. The experimental results show that the proposed AS-TCN has an excellent RUL prediction ability and thus an important reference value for practical remaining life prediction.
In a future study, we will consider the degradation characteristics of the same bearing under different working conditions, realize the RUL prediction of bearings across working conditions and adaptively divide the health and degradation stages to achieve more accurate RUL prediction.

Data Availability Statement:
The data used to support this study are available at https://biaowang. tech/xjtu-sy-bearing-datasets/ (accessed on 27 January 2021).