Remaining Useful Life Prediction Model for Rolling Bearings Based on MFPE–MACNN

To resolve the problem of redundant information in rolling bearing degradation characteristics and the difficulty that convolutional deep learning models face in learning feature information from complex time series, a remaining useful life prediction model based on multiscale fusion permutation entropy (MFPE) and a multiscale convolutional attention neural network (MACNN) is proposed. The original rolling bearing signal was decomposed by resonance sparse decomposition into high-resonance and low-resonance components, and the multiscale permutation entropy of the low-resonance component was calculated. The locally linear embedding algorithm was then used for dimensionality reduction to remove redundant information. A multiscale convolution module was constructed to learn feature information at different time scales, and an attention module was used to fuse the feature information and input it into the remaining useful life prediction module for evaluation. An appropriate network structure and parameter configuration were determined, and the multiscale convolutional attention neural network was designed to form the remaining useful life prediction model. The results show that, compared with other models, the method is effective and superior in representing degradation feature information and in improving remaining useful life prediction accuracy.


Introduction
With the rapid development of the Industrial Internet of Things, the explosive growth of monitoring data brings new opportunities and challenges for predictions of the remaining useful life of rolling bearings. The data-driven remaining useful life prediction method can learn the degradation characteristics of rolling bearings from the massive monitoring data and build a corresponding remaining useful life prediction model. Therefore, it has received increasing attention in research surrounding remaining useful life prediction [1].
Data-driven methods for remaining useful life prediction typically involve three steps: degradation feature construction, degradation trend learning, and remaining useful life estimation [2]. In the task of rolling bearing remaining useful life prediction, the trend of degradation over time needs to be evaluated well, so increasingly time-sensitive features need to be extracted. Degradation feature construction uses a priori knowledge of rolling bearing performance to extract sensitive degradation features from the monitoring data. At the current stage, rolling bearing vibration signal feature-extraction methods mainly extract time domain and frequency domain waveform characteristics that reflect the bearings' decline in performance and thus support the evaluation of the health status of rolling bearings. The proposed method can improve the sensitivity of features to the decline trend of remaining useful life and predict the remaining life of rolling bearings in advance, thus improving the prediction accuracy of the model and providing an effective technical means for the predictive maintenance of machines.
The main contents are as follows: Section 2 presents a multiscale fusion permutation entropy feature-extraction method; Section 3 presents a MACNN remaining useful life prediction model; Section 4 presents our experimental validation; Section 5 presents our conclusions.

Multiscale Fusion Permutation Entropy Feature Extraction
The MFPE-based bearing vibration signal feature-extraction method constructs a high-dimensional, entropy-valued feature matrix by calculating the multiscale permutation entropy values of the low-resonance components of rolling bearing signals. It fully reflects the complexity and instability of the signals from multiple dimensions. The locally linear embedding (LLE) algorithm further removes redundant information. The overall method compensates for the incomplete reflection of the local trend of rolling bearing life decline by features extracted at a single scale and can better improve the prediction accuracy of the remaining useful life of rolling bearings.

Resonance Sparse Decomposition Method
The resonance sparse decomposition method can analyze the resonance properties of a signal. A wavelet basis function library was constructed by the tunable quality factor wavelet transform. The signal was sparsely represented by the wavelet basis function library according to the morphological component analysis method, and the quality factor Q was used as the criterion to separate the different components of the signal from each other. A larger quality factor Q indicates a narrower signal bandwidth and a signal in the form of high-resonance periodic vibration; a smaller quality factor Q indicates a wider bandwidth and a signal in the form of low-resonance transient shocks.
The high-resonance component corresponds to the sustained oscillation in the signal, that is, the regular vibration generated when the bearing runs smoothly. The low-resonance component corresponds to the periodic shock component in the signal, that is, the shocks generated when the bearing develops faults. The low-resonance component can therefore adequately reflect the fault-induced characteristic information in the signal.
The specific calculation steps of the resonance sparse decomposition algorithm are as follows.
(1) Assume that the input signal is X = X1 + X2. Set the resonance sparse decomposition parameters, namely the quality factors Q1 and Q2, the redundancy factors r1 and r2, and the decomposition levels J1 and J2, according to the signal characteristics, and construct the wavelet basis function libraries S1 and S2. (2) Select appropriate weighting coefficients λ1 and λ2 according to the signal-to-noise ratio index so that the different components in the signal can be separated effectively. Set the optimization target as shown in Equation (1).
where ω1 and ω2 are the matching coefficients of the wavelet bases S1 and S2. (3) The best-matching coefficients ω1* and ω2* are obtained by solving the optimization problem of Equation (1), and the high-resonance component X1 and the low-resonance component X2 are obtained by combining ω1* and ω2* with the wavelet basis function libraries.
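As an illustration, the ℓ1-regularized objective in step (3) can be minimized with a simple iterative shrinkage-thresholding (ISTA) loop. The sketch below is a minimal stand-in, not the paper's implementation: generic orthonormal matrices play the role of the TQWT wavelet basis libraries S1 and S2, and the function names are illustrative.

```python
import numpy as np

def soft_threshold(w, t):
    # Elementwise soft-thresholding: the proximal operator of the l1 norm.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def resonance_sparse_decomposition(x, S1, S2, lam1=0.1, lam2=0.1, n_iter=500):
    # ISTA on min ||x - S1 w1 - S2 w2||^2 + lam1 ||w1||_1 + lam2 ||w2||_1,
    # i.e., the form of the optimization target in Equation (1).
    w1 = np.zeros(S1.shape[1])
    w2 = np.zeros(S2.shape[1])
    L = np.linalg.norm(np.hstack([S1, S2]), 2) ** 2  # step-size bound (spectral norm^2)
    for _ in range(n_iter):
        r = x - S1 @ w1 - S2 @ w2                    # current residual
        w1 = soft_threshold(w1 + (S1.T @ r) / L, lam1 / L)
        w2 = soft_threshold(w2 + (S2.T @ r) / L, lam2 / L)
    return S1 @ w1, S2 @ w2  # X1 (high-resonance), X2 (low-resonance)
```

With proper TQWT bases at two quality factors, the two returned components separate the oscillatory and transient parts of the vibration signal as described above.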

Multiscale Permutation Entropy
Multiscale permutation entropy avoids the limitation of permutation entropy, which evaluates the temporal characteristic information of a signal at only a single scale, by coarse-graining the input signal. The coarse-graining treatment at τ = 2 is shown in Figure 1. The specific calculation steps for multiscale permutation entropy are as follows.

(1) Coarse-grain the input sequence to obtain the coarse-grained sequence, as shown in Equation (2): yj(τ) = (1/τ) ∑ xi, where the sum runs over i = (j − 1)τ + 1 to jτ; τ is the scale factor; yj(τ) is the coarse-grained sequence.
(2) The coarse-grained sequence y (τ) phase space is reconstructed to obtain the multiscale sequence, as shown in Equation (3).
where Yi(τ) is the multiscale sequence; m is the embedding dimension; s is the time delay.
(3) Arrange the multiscale time series Y i (τ) in ascending order and record the index θ j = {j 1 ,j 2 ,⋯,j m } of each short time series after the ascending order. There are m! permutations of each short time series. Count the number of occurrences of each permutation N l and calculate the frequency of each permutation, as shown in Equation (4).
(4) The multiscale permutation entropy is obtained by calculating the permutation entropy of the multiscale time series, as shown in Equation (5) [29].
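The coarse-graining and ordinal-pattern counting in the steps above can be sketched as follows. This is a minimal illustration, not the paper's code; the scale factors, embedding dimension m, and delay s are illustrative defaults, and the entropy is normalized by ln(m!).

```python
import numpy as np
from math import factorial

def coarse_grain(x, tau):
    # Non-overlapping window averages of length tau (the coarse-graining step, Eq. (2)).
    n = len(x) // tau
    return x[:n * tau].reshape(n, tau).mean(axis=1)

def permutation_entropy(y, m=3, s=1):
    # Count the ordinal patterns of the embedded vectors (Equations (3)-(5)).
    n = len(y) - (m - 1) * s
    counts = {}
    for i in range(n):
        pattern = tuple(np.argsort(y[i:i + (m - 1) * s + 1:s]))
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values())) / n
    # Normalize by ln(m!) so the entropy lies in [0, 1].
    return float(-np.sum(p * np.log(p)) / np.log(factorial(m)))

def multiscale_permutation_entropy(x, scales=(1, 2, 3, 4), m=3, s=1):
    # One entropy value per scale factor tau.
    return [permutation_entropy(coarse_grain(x, tau), m, s) for tau in scales]
```

A monotone sequence produces a single ordinal pattern (entropy 0), while white noise approaches the maximum of 1, which is what makes the measure sensitive to degradation-induced changes in signal complexity.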

Multiscale Fusion Permutation Entropy
The multiscale permutation entropy reconstructs the signal by coarse-graining and phase space reconstruction operations. It can obtain the feature information of the signal on different time scales, alleviating the problem of incomplete feature information at a single scale and improving the accuracy of remaining useful life prediction. However, because a sliding window slicing method is used to construct the short time series matrix, the signal segments partially overlap. Although this operation can enrich the feature information in the signal, it can also cause redundancy of features that are insensitive to the decline trend of the remaining useful life of the rolling bearing. In turn, this causes feature redundancy in the high-dimensional multiscale permutation entropy feature matrix. Therefore, dimensionality reduction is needed to retain the primary feature information in the high-dimensional feature matrix.
The specific steps of the multiscale fusion permutation entropy feature-extraction method are as follows: (1) The input data comprise a multiscale permutation entropy matrix MPEM = [E1, E2, · · · , Ez], which contains z-dimensional multiscale permutation entropy vectors, and the objective is to reduce the multiscale permutation entropy matrix to d dimensions. The k nearest neighbors of an entropy value ei (ei ∈ Ei) are found according to the Euclidean distance. The linear relationship between the entropy value ei and its k nearest neighbors is then established. The loss function is shown in Equation (6).
where K(i) denotes the k nearest neighbor samples of the entropy value ei; ωij is the linear weight coefficient, which is generally normalized to satisfy the condition shown in Equation (7). For samples that are not among the k nearest neighbors of the entropy value ei, the weight coefficient is set to 0, and the weight coefficients are thereby extended to the dimensionality of the whole dataset.
(2) Calculate the covariance matrix Z i in the space of k nearest neighbor samples, as shown in Equation (8), and find the corresponding vector of weight coefficients W i , as shown in Equation (9).
where 1k is a k-dimensional vector of ones. (3) The weight coefficient vectors Wi are assembled into the weight coefficient matrix W, from which the conditioned matrix M is calculated as shown in Equation (10).
where I is the identity matrix; the constraint (1/s) ∑s i=1 yi yiT = I ensures that the entropy values retain the original feature information as much as possible after dimensionality reduction; yi is the fusion entropy value obtained after dimensionality reduction. (4) Compute the first d + 1 eigenvalues of the conditioned matrix M and the corresponding eigenvectors y1, y2, · · · , yd+1. (5) The matrix consisting of the second eigenvector y2 to the (d + 1)-st eigenvector is the multiscale fusion permutation entropy matrix MFPE = [y2, y3, · · · , yd+1] obtained by dimensionality reduction.
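Steps (1)-(5) amount to standard locally linear embedding. A minimal NumPy sketch is given below; the small regularization term added to the local covariance is an implementation choice for numerical stability, not part of the algorithm described above.

```python
import numpy as np

def lle(X, k=5, d=2, reg=1e-3):
    # Minimal locally linear embedding over the rows of X.
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        # (1) k nearest neighbors by Euclidean distance (excluding the point itself).
        nbrs = np.argsort(np.linalg.norm(X - X[i], axis=1))[1:k + 1]
        # (2) Local covariance and reconstruction weights (Equations (8)-(9)).
        Z = X[nbrs] - X[i]
        C = Z @ Z.T
        C += np.eye(k) * reg * np.trace(C)  # regularization for stability (assumption)
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()            # weights sum to 1 (Equation (7))
    # (3) Conditioned matrix M = (I - W)^T (I - W) (Equation (10)).
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    # (4)-(5) Bottom eigenvectors of M, discarding the constant one.
    _, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]
```

Applied to the high-dimensional entropy matrix, the d retained coordinates form the fused MFPE features.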

MACNN Remaining Useful Life Prediction Model
The MACNN remaining useful life prediction model consists of a multiscale convolutional learning module and a remaining useful life prediction module. In the MACNN model, the multiscale fusion permutation entropy feature matrix was used as the input. The detailed data were automatically learned and detected by constructing the multiscale convolutional learning module. The primary information for determining the remaining useful life was fused and highlighted by a self-attention mechanism and input into the module for remaining useful life prediction.

Multiscale Convolution Module
A convolutional neural network is a feed-forward neural network, and the main structure of a convolutional neural network is shown in Figure 2.

(1) Convolutional layer: Through feature extraction in the convolutional layer, a convolutional neural network can capture the deep features of the interconnections between the input data. In the convolutional layer, multiple convolution kernels are applied and are updated as the model trains. The output feature matrix of the convolution layer is obtained by performing dot product operations between the convolution kernels and the corresponding elements of the feature matrix they cover. Each output feature matrix is calculated from multiple input feature matrices of the previous convolutional layer. The output value a_j^l of the j-th cell of convolution layer l is shown in Equation (11), and the convolution calculation is shown in Figure 3 [30].
where b_j^l is the bias and k is the convolution kernel; the parameters are updated by feedback after each round of model training. There are two problems in the convolution calculation process.
(1) The output feature matrix is smaller after each convolution computation than the input feature matrix. When the input feature matrix is small, or multiple consecutive convolution calculations are executed, the amount of information in the output feature matrix becomes minimal, resulting in the loss of useful information and harming the reliability of subsequent tasks. (2) The edge features of the input feature matrix take part in fewer calculations, which means that the edge information in the input feature matrix contributes less to the final output result and is therefore partially lost.
To solve these two problems, the input feature matrix is usually padded; the main padding operations are valid padding and same padding. Valid padding convolves the input feature matrix with the convolution kernel directly and is used when the input feature matrix is large and needs to be reduced. Same padding pads the input with 0 so that the output feature matrix retains the original size. The output feature matrices after valid and same padding are shown in Figure 4.
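The effect of the two padding schemes can be seen in a short sketch (written in the cross-correlation form used by CNNs): valid padding shrinks the output, while same padding preserves the input length by zero-padding.

```python
import numpy as np

def conv_valid(x, k):
    # Valid padding: the output shrinks to len(x) - len(k) + 1.
    return np.array([x[i:i + len(k)] @ k for i in range(len(x) - len(k) + 1)])

def conv_same(x, k):
    # Same padding: zero-pad so the output keeps the input length.
    pad = len(k) // 2
    xp = np.pad(x, (pad, len(k) - 1 - pad))
    return np.array([xp[i:i + len(k)] @ k for i in range(len(x))])
```

For x = [0, 1, 2, 3, 4] and an all-ones kernel of length 3, valid padding yields 3 outputs while same padding yields 5, with smaller values at the zero-padded edges.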
(2) ReLU layer: It is essential to add an activation function after the convolutional layer to enhance the nonlinear expression ability of the network and make the learned features more distinguishable. In recent years, the rectified linear unit (ReLU), the most widely used activation unit, has been applied to CNNs to accelerate convergence. Combined with the backpropagation learning method to adjust parameters, the ReLU makes shallow weights more trainable [31]. The ReLU function is calculated as shown in Equation (12), and the function image is shown in Figure 5.
The ReLU activation function has the following advantages: (1) Less computation: because the ReLU function does not involve complex operations, it saves considerable computation time and improves the efficiency of the overall network model. (2) Prevention of gradient decay: when the result of an activation function is small, training parameters are updated to a lesser extent or not at all; in contrast, the ReLU function has a derivative of 1 over its active interval, which avoids this phenomenon. (3) Mitigation of overfitting: as shown in Figure 5, when a feature value obtained after the calculation is less than zero, the ReLU activation function sets it to zero. Although this may cause some information loss, it also increases the sparsity of the model, restrains its learning capacity, and enhances its generalization ability.
The ReLU activation function performs poorly for data with many negative values in the input features. In remaining useful life prediction for rolling bearings, the input data used are all positive, and the output target values are all greater than or equal to zero. Consequently, if the initialization weight parameters of the model are controlled to be greater than zero, this shortcoming of the ReLU activation function can be avoided, and the computational efficiency and accuracy of the model can be improved.
(3) Pooling layer: The pooling layer and the convolutional layer form the feature-extraction module. The pooling layer can reduce the redundancy of the feature matrix and alleviate the overfitting phenomenon. The activation value a_j^l in pooling layer l is calculated as shown in Equation (13):

a_j^l = f(b_j^l + β_j^l · down(a_j^(l−1), M^l)) (13)

where b_j^l is the bias; β_j^l is the multiplicative coefficient; M^l is the pooling window size; down(·) denotes the pooling function; the commonly used pooling functions are illustrated in Figure 6.
(4) Flatten layer: The flatten layer converts the feature matrix output from the feature-extraction module into a one-dimensional feature vector so that the features meet the input dimension requirements of the subsequent fully connected layers.
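A minimal sketch of the down(·) operation in Equation (13) as max pooling over a one-dimensional feature map, with window size p and step length s:

```python
import numpy as np

def max_pool_1d(a, p=2, s=2):
    # Slide a window of size p with step length s and keep each window's maximum.
    n_out = (len(a) - p) // s + 1
    return np.array([a[i * s:i * s + p].max() for i in range(n_out)])
```

For example, pooling [1, 3, 2, 5, 4, 6] with p = s = 2 keeps the maxima 3, 5, 6, halving the feature length while retaining the dominant activations.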
(5) Fully connected layer: In a convolutional neural network, after feature-extraction operations such as convolution and pooling, the output feature matrix is converted into a one-dimensional feature vector by the flatten layer, which is input to the fully connected layer for classification or prediction tasks. The fully connected layer in a convolutional neural network is the same as a multilayer perceptron. The fully connected layer discovers the local information contained in features. The structure of the fully connected layer is shown in Figure 7 and is calculated as shown in Equation (14).
where ω is the weight between each hidden layer, b is the bias, and f(·) is the activation function.
In the task of predicting the remaining useful life of bearings, the input is a one-dimensional feature vector, so a one-dimensional convolutional neural network model is used as the base model for remaining useful life prediction. The one-dimensional convolution process is illustrated in Figure 8. Convolution kernels of sizes (1, 4) and (1, 3) are used to convolve the input sequence under the same step-length condition. The input sequence is an ascending sequence with fluctuations in the middle. From the convolution results, the feature sequences calculated by convolution kernels of different scales reflect the feature trends of the input sequence differently. The feature sequence extracted by the kernel of size (1, 4) reflects the increasing trend of the input sequence well but does not reflect the fluctuation trend, while the feature sequence extracted by the kernel of size (1, 3) reflects both the rising and fluctuating trends of the input features but reflects neither strongly.
Convolutional neural networks often do not reflect the feature information well when feature extraction is performed on the input features at a single scale. Therefore, a multiscale convolution module is proposed for feature learning, consisting of four convolutional modules with different convolution kernel sizes in parallel. Each convolutional module consists of three convolutional layers, two ReLU activation layers, one BN layer, and one pooling layer [32], as shown in Figure 9. With the multiscale convolution module, the resolution of the features can be improved, which improves the remaining useful life prediction accuracy.

Suppose x^(l−1) ∈ R^(H×1×C) and k^l ∈ R^(F×1×C×N) denote the input vector and the learnable convolution kernels, respectively, where H denotes the input vector length, C denotes the number of input channels, 1 × F represents the size of the convolution kernel, and N represents the number of convolution kernels. Then, the n-th feature vector of the l-th convolutional layer is shown in Equations (15) and (16).
where σ(·) denotes the ReLU activation function, u_n^l denotes the output of the convolutional layer, * denotes the convolution computation, k_n^l denotes the n-th convolution kernel of the l-th convolutional layer, and b_n^l denotes the bias. In the multiscale convolution module, the pooling layer is set after the third convolution layer. The main learned feature information is obtained by the maximum pooling operation after the convolution layers, as shown in Equation (17):

y_n^l = Maxpooling(y_n^(l−1), p, s) (17)
where y_n^(l−1) is the output of the n-th feature map, Maxpooling(·) denotes the maximum pooling function, p denotes the pooling window size, and s denotes the step length.
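A toy version of the parallel multiscale branches in Equations (15) and (16) can be sketched as follows. The kernels here are randomly initialized for illustration only; in the actual model they are learned during training, and each branch would also include BN and pooling layers.

```python
import numpy as np

def conv1d_same_relu(x, k, b=0.0):
    # 'Same'-padded 1-D convolution plus bias, followed by ReLU (Equations (15)-(16)).
    pad = len(k) // 2
    xp = np.pad(x, (pad, len(k) - 1 - pad))
    u = np.array([xp[i:i + len(k)] @ k for i in range(len(x))]) + b
    return np.maximum(u, 0.0)  # sigma(.) = ReLU

def multiscale_features(x, kernel_sizes=(1, 3, 5, 7), rng=None):
    # One branch per kernel size, each applied in parallel to the same input.
    if rng is None:
        rng = np.random.default_rng(0)
    return [conv1d_same_relu(x, rng.standard_normal(f)) for f in kernel_sizes]
```

Each branch returns a feature sequence of the same length as the input, so the four branches can later be fused by the attention module.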

Attentional Mechanisms
The attention mechanism is an algorithm inspired by the human visual attention mechanism, which assigns different attention weights to each feature, thus allowing the model to focus more on more critical features, as shown in Figure 10.

The commonly used weight calculation methods in the attention mechanism are the additive, dot product, scaled dot product, and bilinear calculations. For the scaled dot product, the weights are computed as shown in Equation (18).
s(K_i, Q) = (K_i · Q)/√d,  α_i = exp(s(K_i, Q)) / Σ_j exp(s(K_j, Q)) (18)
where Q is the state of the last time step when the model performs time series prediction; K is the state of each time step when the model performs time series prediction; s(K_i, Q) is the attention weight calculation mechanism that calculates the correlation between Q and K; d is the dimensionality of the data in the time step; α_i is the estimated attention weight, which indicates the importance of the time step to the overall time-series feature expression; and V is the same as K, the state of each time step when the model performs time series prediction.
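A minimal numpy sketch of the scaled dot-product variant of Equation (18); the sequence length and dimensionality below are arbitrary placeholders, not values from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation (18): s(K_i, Q) = K_i . Q / sqrt(d), alpha = softmax(s),
    and the attended output is the alpha-weighted sum of V."""
    d = Q.shape[-1]
    scores = K @ Q / np.sqrt(d)           # correlation of each time step with the last step
    alpha = np.exp(scores - scores.max()) # numerically stable softmax
    alpha /= alpha.sum()
    return alpha, alpha @ V

T, d = 8, 16                              # illustrative sequence length and dimensionality
rng = np.random.default_rng(0)
K = rng.normal(size=(T, d))               # state of each time step
V = K                                     # as in the text, V is the same as K
Q = K[-1]                                 # state of the last time step
alpha, context = scaled_dot_product_attention(Q, K, V)
```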

Remaining Useful Life Prediction Module
The remaining useful life prediction module consists of the attention module and the fully connected neural network. The attention module is constructed to effectively fuse the feature information learned by the multiscale convolutional module and highlight the part of it that is relevant to the remaining useful life. As shown in Figure 11, features extracted by the multiscale convolution module are used as the input of the remaining useful life prediction module, assuming that z l n ∈ R I×1×J denotes the input feature vector; α l ∈ R I×1×J indicates the attention weight, where I is the length of the feature vector; J = N × C indicates the number of feature vectors. The attention module features are fused, as shown in Equation (19).
where ⊗ denotes the corresponding element multiplication operation in the matrix, z l ∈ R I×1×J is the fused eigenvector, φ(·) indicates the attention weight calculation function, and the scaled dot product calculation function is used in this paper.
Finally, the fused feature vectors are input to the fully connected neural network for remaining useful life prediction. The fully connected neural network in this paper contains two hidden layers with 64 and 128 nodes, respectively. The fully connected neural network prediction is calculated in Equation (20).
where ω_n^m denotes the weight of the n-th node of the m-th hidden layer, h_n^m is the output of the n-th node of the m-th hidden layer, f(·) represents the activation function after the hidden layer, and y_p is the final predicted output.
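Equation (20) amounts to a forward pass through the two hidden layers. The following numpy sketch assumes an arbitrary 256-dimensional fused feature vector, ReLU as f(·), and small random weights; none of these are taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fc_predict(z, params):
    """Equation (20): forward pass through hidden layers of 64 and 128 nodes,
    then a single output node giving the predicted remaining useful life y_p."""
    h = z
    for W, b in params[:-1]:
        h = relu(h @ W + b)               # h_n^m = f(sum over weighted inputs + bias)
    W_out, b_out = params[-1]
    return (h @ W_out + b_out).item()     # y_p, the final predicted output

rng = np.random.default_rng(0)
dims = [256, 64, 128, 1]                  # hypothetical input dim -> 64 -> 128 -> output
params = [(rng.normal(scale=0.05, size=(i, o)), np.zeros(o))
          for i, o in zip(dims[:-1], dims[1:])]
y_p = fc_predict(rng.normal(size=256), params)
```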


Model Parameters and Structure
The MACNN remaining useful life prediction model of rolling bearings is built on a multiscale feature-extraction module with an attention mechanism. The overall model first uses a convolution kernel of size (1, 1) to extract the shallow features of the input data. Then, four convolution modules are used to extract the deep features at different scales. Because of the large number of parameters in the overall model, to prevent the model from overfitting, a residual connection is used to splice the shallow features with the extracted deep features. The spliced multiscale features are input into the attention fusion layer to obtain the fused attention feature vector, which is input to the fully connected layer to obtain the final prediction results. The specific parameters of the overall model are shown in Table 1, and the model structure is shown in Figure 12.
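The residual splice of shallow (1 × 1 convolution) and multiscale deep features can be illustrated with a toy numpy example. The kernel sizes 3/5/7/9 for the four modules are hypothetical stand-ins; the actual parameters are those in Table 1.

```python
import numpy as np

def same_conv(x, k):
    """1-D 'same' convolution so that features at every scale keep the input length."""
    pad = len(k) // 2
    return np.convolve(np.pad(x, pad), k, mode="valid")[:len(x)]

rng = np.random.default_rng(1)
x = rng.normal(size=128)                       # one input channel (illustrative)
shallow = same_conv(x, rng.normal(size=1))     # (1, 1) kernel: shallow features
scales = [3, 5, 7, 9]                          # hypothetical kernel sizes of the four modules
deep = [same_conv(shallow, rng.normal(size=f) * 0.1) for f in scales]
# residual splice: stitch shallow features together with the multiscale deep features
fused_input = np.stack([shallow] + deep, axis=-1)  # (128, 5) -> attention fusion layer
```

The splice keeps the shallow features in the fused tensor so gradients have a short path around the deep modules, which is the overfitting countermeasure the text describes.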

Overall Methodology Flow
The rolling bearing remaining useful life prediction model using the MFPE-MACNN adequately reflects the complexity and instability of the movement from multiple dimensions. The overall method makes up for a defect: the features extracted at a single scale do not fully reflect the local trend of decline of the life of rolling bearings. It can improve the accuracy of the remaining useful life prediction for rolling bearings. Based on the construction of the multiscale fusion permutation entropy with low-resonance components as features for assessing the bearing life decline trend, the multiscale feature-extraction module and the attention mechanism are added to the one-dimensional convolutional neural network to enhance the learning ability of the model for multiscale features. A multiscale attentional convolutional neural network rolling bearing remaining useful life prediction model is built. The overall method flow chart is shown in Figure 13, and the specific steps are as follows.

Step 1: The resonance sparse decomposition of the input movement sequence X_N = {x_1, x_2, · · · , x_N} yields the high-resonance component and the low-resonance component.
Step 2: The multiscale permutation entropy values of the short time series are calculated for the low-resonance component to obtain the entropy matrix MPEM = [E_1, E_2, · · · , E_z].
Step 3: Find the k nearest neighbors of each entropy value, and calculate the covariance matrix Z_i and the corresponding weight coefficient vector W_i in the sample space of the k nearest neighbors.
Step 4: Construct the weight coefficient vector W i into the weight coefficient matrix W, and use it to calculate the conditioned matrix M.
Step 5: Calculate the first d + 1 eigenvalues of the conditioned matrix M and calculate the eigenvector y 1 , y 2 , · · · , y d+1 corresponding to these d + 1 eigenvalues.
Step 6: The matrix consisting of the second eigenvector y_2 to the (d + 1)-th eigenvector is the multiscale fusion permutation entropy matrix MFPE = [y_2, y_3, · · · , y_{d+1}] obtained by dimensionality reduction.
Step 7: Determine the size of multiple convolutional kernels, select the loss function, select the activation function, and determine the number of layers of the multiscale convolutional kernel for the remaining useful life prediction model of the multiscale convolutional neural network.
Step 8: Incorporate the attention mechanism into the remaining useful life prediction model of the multiscale convolutional neural network to form the remaining useful life prediction model of the multiscale convolutional attention neural network.
Step 9: The extracted feature matrix MFPE = y 2 , y 3 , · · · , y d+1 of the training set is input to the remaining useful life prediction model of the multiscale convolutional attention neural network to obtain the output error, and the error is backpropagated to update the prediction model parameters.
Step 10: After the parameters of the prediction model are updated to reach the optimal requirements, the test set is input to the MACNN prediction model to complete the prediction of the remaining useful life of the rolling bearing.
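The permutation entropy of Step 2 (with the usual coarse-graining for multiple scales) can be sketched as follows. The embedding dimension m = 3 and delay τ = 1 are common defaults, not values taken from the paper, and the white-noise input is a placeholder for the low-resonance component.

```python
import numpy as np
from math import factorial

def permutation_entropy(x, m=3, tau=1):
    """Normalized permutation entropy of series x with embedding dimension m and delay tau:
    count ordinal patterns of each embedded window and take the Shannon entropy."""
    n = len(x) - (m - 1) * tau
    counts = {}
    for i in range(n):
        pattern = tuple(np.argsort(x[i:i + (m - 1) * tau + 1:tau]))
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values())) / n
    return float(-(p * np.log(p)).sum() / np.log(factorial(m)))  # normalized to [0, 1]

def coarse_grain(x, scale):
    """Coarse-graining for multiscale entropy: non-overlapping means of length `scale`."""
    n = len(x) // scale
    return x[:n * scale].reshape(n, scale).mean(axis=1)

rng = np.random.default_rng(2)
noise = rng.normal(size=2048)
mpe = [permutation_entropy(coarse_grain(noise, s)) for s in range(1, 6)]  # one entropy vector E
```

A monotone ramp has a single ordinal pattern and thus entropy 0, while white noise approaches the maximum of 1; degradation shifts a bearing signal's entropy between these regimes.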

Simulation Experiment Validation
To verify the effectiveness of the proposed feature enhancement and feature-extraction method, a feature extraction simulation experiment was set up. The simulation movement was composed of the superimposed shock movement and the modulated movement. The sub-constructions of the shock movement and the modulated movement are shown in Equations (21) [33] and (22). To simulate the failure of the bearing under operating conditions in order to find the best method for extracting the bearing vibration signal features, the sampling frequency was set to 8192 Hz, and the number of sampling points was set to 4096.
y = y_0 e^{−2πg f_n t_0} sin(π f_n √(1 − g²)(t_0 − KT)) (21)
x = (1 + cos(2πf_r t)) cos(2πf_z t) (22)
where y_0 is the displacement constant, set to 5; g is the damping coefficient, set to 0.5; f_n is the intrinsic frequency, set to 1000 Hz; t_0 is the single-cycle sampling interval; K is the number of repetitions of the shock movement; T is the repetition period, set to 0.025 s; f_r is the amplitude modulation frequency, set to 70 Hz; f_z is the carrier frequency, set to 560 Hz. The time and frequency domain diagrams of the shock movements are shown in Figures 14 and 15, respectively.
The sparse decomposition high-resonance quality factor, Q_1, was set to 3; the redundancy, r_1, was set to 3; the number of decomposition layers, J_1, was set to 27; the low-resonance quality factor, Q_2, and the redundancy, r_2, were set to 3; the number of decomposition layers, J_2, was set to 7 [34]. The high-resonance component and the low-resonance component retrieved after decomposition are depicted in Figures 18 and 19, respectively. As shown in Figure 18, the high-resonance component is mainly the periodic oscillation component in the simulated signal. As shown in Figure 19, the low-resonance component principally contains the shock features in the simulated signal.
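With these parameter settings, the simulated signal of Equations (21) and (22) can be generated as below. Treating t_0 − KT as the time since the most recent impact (i.e. t mod T) is an assumption about the intended reconstruction of Equation (21).

```python
import numpy as np

fs, N = 8192, 4096                         # sampling frequency (Hz) and number of points
t = np.arange(N) / fs
y0, g, fn, T = 5.0, 0.5, 1000.0, 0.025     # displacement, damping, natural frequency, period
fr, fz = 70.0, 560.0                       # amplitude modulation and carrier frequencies (Hz)

t0 = t % T                                 # time since the last impact (t_0 - KT)
# Equation (21): exponentially decaying shock train
y = y0 * np.exp(-2 * np.pi * g * fn * t0) * np.sin(np.pi * fn * np.sqrt(1 - g**2) * t0)
# Equation (22): amplitude-modulated carrier
x = (1 + np.cos(2 * np.pi * fr * t)) * np.cos(2 * np.pi * fz * t)
sim = y + x                                # superimposed simulation signal
```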
The envelope spectra of the high-resonance component and the low-resonance component are analyzed as shown in Figure 20. The overall trend of the envelope spectrum of the high-resonance component is no different compared to the simulated signal envelope spectrum, except for a slight decrease in amplitude. The overall amplitude of the shock component in the low-frequency band is not prominent, and features reflecting the decline in the remaining useful life of the bearing cannot be better extracted in the subsequent feature extraction. As shown in Figure 21, the overall amplitude of the low-resonance component envelope spectrum decreases compared with the broad simulated signal envelope spectrum. However, the fault shock component in the low-frequency band of the envelope spectrum is more evident.
The multiscale permutation entropy matrix of signals with low-resonance components was calculated, and the obtained multiscale permutation entropy matrix is shown in Figure 22. Using the locally linear embedding algorithm, the high-dimensional multiscale permutation entropy value matrix is downscaled to a one-dimensional vector. The multiscale fused permutation entropy feature vector reflecting the remaining useful life decline trend is obtained, as shown in Figure 23. The red curve facilitates the observation of the multiscale fusion permutation entropy trend feature curve. The envelope of the multiscale fusion permutation entropy characteristic curve is drawn. With the occurrence of bearing failure, the remaining useful life of the bearing declines over time. The overall multiscale fusion permutation entropy value shows an increasing trend and negatively correlates with the remaining useful life. Due to the more intensive frequency of the impact fault signal in the simulation signal, the multiscale fusion permutation entropy value fluctuates more.
Figure 23. Simulated signal multiscale fusion permutation entropy curve.


Cincinnati Data Validation
To verify the proposed method's effectiveness, the Cincinnati open dataset was used as the experimental data [35]. Features were extracted as input using the multiscale fusion permutation entropy feature-extraction method. Then, degradation features were extracted by the deep convolution modules at each scale of the model. The bearing degradation features were fused by the attention module. Finally, the life prediction results were obtained from the fully connected layer. The test stand was fitted with four Rexnord ZA-2115 bearings, each with 16 rollers in the raceway. The roller diameter was 0.331 cm, the pitch diameter was 2.815 cm, the contact angle was 15.17°, the speed was 2000 r/min, and the radial load was 26.66 kN. One acceleration sensor was installed in the axial and radial directions of each bearing, and the sampling frequency was 20 kHz, as shown in Figure 24. According to the characteristics of the input data, the parameters of this experiment were set as shown in Table 2.
Figure 25.
After preprocessing the vibration data of the No. 3 bearing, resonance sparse decomposition was performed. The low-resonance component, containing more impact fault information, was selected for the subsequent feature-extraction operation, as shown in Figure 26a,b.
After decomposing to obtain low-resonance components, the calculation of multiscale permutation entropy was performed. The multiscale permutation entropy feature matrix was obtained, as shown in Figure 27. The LLE was used to reduce and fuse the high-dimensional entropy matrix to obtain multiscale fused permutation entropy values, as shown in Figure 28.
The prediction results shown in Figure 29 show that the overall life prediction performed by the MACNN model was better. The general trend was the same as the actual life curve, and the life prediction value deviated less from the actual value. Compared with the MACNN life prediction model, the MCNN model deviated from the predicted trend at the end of the bearing life, and the overall model deviated from the actual value. The CNN model and attention-CNN model failed to reflect the actual bearing life trend at the end of the bearing life, and the overall life prediction value deviated from the actual value. The life prediction result comparison between different models showed that the overall performance of the CNN model was improved by adding the multiscale feature-extraction module. The attention mechanism can bring the life prediction results closer to the actual values and improve the accuracy of life prediction.
Three metrics, MAE, RMSE, and model score, were used to quantitatively analyze the model prediction results, as shown in Table 3 and Figure 30. It can be seen that the MACNN model improved by 12.47%, 39.07%, and 22.54% in total score compared with the MCNN model, CNN model, and attention-CNN model, respectively. This shows that the MACNN model had better prediction accuracy and comprehensive performance in life prediction. The MAE and RMSE evaluation indexes were reduced by 45.31%, 57.94%, and 52.86% compared with those of the MCNN model, the CNN model, and the attention-CNN model. These data show that the MACNN model has better generalization ability.
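MAE and RMSE, two of the three metrics above, are straightforward to compute; the model score function is competition-specific and not reproduced here. The life curves below are synthetic placeholders, not the paper's data.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted remaining useful life."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root-mean-square error, penalizing large prediction deviations more heavily."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

actual = np.linspace(1.0, 0.0, 100)    # normalized RUL declining to failure (synthetic)
predicted = actual + 0.05              # a hypothetical model with a constant +0.05 bias
errs = mae(actual, predicted), rmse(actual, predicted)  # both ~0.05 for a constant bias
```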


XJTU-SY Bearing Data Validation
The XJTU-SY bearing dataset [36] was used for the experimental validation. This empirical dataset was collected by the empirical bench shown in Figure 31. The empirical bench mainly contained a motor speed controller, an acceleration sensor, a hydraulic loading system, the AC motor, and other components. The experimental bench can simulate various working conditions by adjusting the load and the speed.


Figure 31. Data acquisition lab bench.
As shown in Figure 32, in the experiment, the bearing data were collected by acceleration sensors in the horizontal and vertical directions. Two PCB 352C33 accelerometers were positioned at 90° to monitor the degradation of the bearings. The sampling frequency was set to 25.6 kHz, the sampling interval was set to 1 min, and the duration of each sampling was 1.28 s. The experimental bearings were LDK UER204 rolling bearings, whose relevant parameters are shown in Table 4. The failure locations were labeled on different bearings. The experiments were designed with three types of working conditions, containing four types of faults, as shown in Figure 33, with five bearings under each kind of working condition. The specific experimental data are shown in Table 5. Bearing1_1 data were used as the model validation set, and Bearing1_2 data were used as the model training set in the experimental validation.
The feature-extraction operation was performed on the collected data. Firstly, the resonance sparse decomposition of the data was processed, as shown in Figure 34a,b.
From Figure 35a,b, it can be seen that the characteristic information in the low-resonance component is more prominent. Although the overall amplitude of the low-resonance components obtained after the resonance sparse decomposition decreased, the feature information in the signal was more prominent. The ratio of frequency components in the 200–1000 Hz frequency interval to the highest frequency component became smaller, and the feature information in these frequency bands can be better extracted.
The low-resonance component of the resonance sparse decomposition was selected, and the multiscale permutation entropy value of the low-resonance component was calculated. The multiscale permutation entropy matrix was obtained, as shown in Figure 36. As the remaining useful life of the bearing declined, the multiscale permutation entropy value showed a significant upward trend. Additionally, the corresponding amplitude in the bearing vibration signal rose stepwise. This reflects the occurrence of severe bearing degradation well. The low-resonance component of the resonance sparse decomposition was selected, and the multiscale permutation entropy value of the low-resonance component was calculated. The multiscale permutation entropy matrix was obtained, as shown in Figure 36. As the remaining useful life of the bearing declined, the multiscale permutation entropy value showed a significant upward trend. Additionally, the corresponding amplitude in the bearing vibration signal rose stepwise. This reflects the occurrence of severe bearing degradation well. MFPE features were obtained by fusing the high-dimensional entropy matrix using the LLE dimensionality reduction algorithm, as shown in Figure 37. MFPE features can reduce the feature redundancy while retaining the primary remaining useful life trend information in the multiscale permutation entropy matrix. The same step-up in the MFPE features occurred when the bearing was severely degraded. The low-resonance component of the resonance sparse decompo and the multiscale permutation entropy value of the low-resonance c culated. The multiscale permutation entropy matrix was obtained, as As the remaining useful life of the bearing declined, the multiscale p value showed a significant upward trend. Additionally, the correspo the bearing vibration signal rose stepwise. This reflects the occurren degradation well. 
From the life prediction results, it can be seen that the overall life prediction of the MACNN model was better. Its trend matched the actual life curve, and the predicted life values deviated less from the actual values. Compared with the MACNN life prediction model, the MCNN model deviated from the predicted trend at the end of the bearing's life, and its overall deviation from the actual values was more significant. The CNN model and the attention-CNN model failed to reflect the actual bearing life trend at the end of the bearing's life, and their overall life prediction values deviated from the actual values. A comparison of the life prediction results between the different models showed that the overall performance of the CNN model improved with the addition of the multiscale feature-extraction module. The attention mechanism brings the life prediction results closer to the real values and improves the accuracy of the life prediction.
Three metrics, the MAE, the RMSE, and the model score, were used to quantitatively analyze the models' prediction results, as shown in Table 6 and Figure 39. From Table 6, it can be seen that the total score of the MACNN model improved by 13.17%, 51.01%, and 25.36% compared with the MCNN model, the CNN model, and the attention-CNN model, respectively, showing that the MACNN model has better accuracy and comprehensive performance in predicting remaining useful life. The MAE was reduced by 9.91%, 37.41%, and 32.5% compared with the MCNN model, the CNN model, and the attention-CNN model, respectively, showing that the MACNN model has a better fitting ability. The RMSE was reduced by 15.03%, 41.98%, and 38.02% compared with the same three models, showing that the MACNN model has better generalization ability.
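The two error metrics used above are standard and can be stated compactly; the RUL values in the usage example are illustrative, not data from the paper:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted RUL."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted RUL."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative RUL values (minutes).
y_true = np.array([100.0, 80.0, 60.0, 40.0, 20.0])
y_pred = np.array([98.0, 83.0, 55.0, 42.0, 18.0])
print(mae(y_true, y_pred))   # 2.8
print(rmse(y_true, y_pred))  # ~3.03
```

Because the RMSE squares the residuals before averaging, it penalizes the large end-of-life deviations seen in the CNN and attention-CNN models more heavily than the MAE does, which is why the two metrics are reported together.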
The MACNN-based bearing life prediction model was constructed, and the MFPE feature matrix was reshaped to (992, 1, 12) to meet the model's input and training requirements. The MFPE features were input into the MACNN life prediction model, and the prediction results are shown in Figure 38a. The prediction results of the CNN model, the MCNN model, and the attention-CNN model are shown in Figure 38b. To further demonstrate the effectiveness of the proposed method, it was compared and validated against different RUL methods from three studies. The specific results are shown in Table 7. As presented in Table 7, a fused CNN-based method for predicting the remaining life of rolling bearings was proposed in [37]; its MAE was 8.5176 on the accelerated life dataset from a test of XJTU-SY rolling bearings. This study used the MFPE combined with resonance sparse decomposition and then applied the MACNN prediction model to the remaining useful life, yielding an MAE of 5.47256. Compared with the other methods, the MFPE-MACNN model has improved fitting ability and prediction accuracy.
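The data flow described above, a (992, 1, 12) MFPE tensor passed through parallel convolutions at several kernel sizes, attention-weighted fusion, and a regression head, can be sketched as below. Only the input shape comes from the text; the kernel sizes, channel widths, and attention form are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MACNNSketch(nn.Module):
    """Minimal sketch of a multiscale convolutional attention network
    for RUL regression. Layer sizes are assumptions for illustration."""
    def __init__(self, in_ch=1, feat=12):
        super().__init__()
        # Multiscale module: parallel 1-D convolutions with different
        # kernel sizes capture features at different time scales.
        self.branches = nn.ModuleList([
            nn.Conv1d(in_ch, 8, k, padding=k // 2) for k in (3, 5, 7)
        ])
        # Attention module: channel weights over the fused feature maps.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(24, 24), nn.Sigmoid())
        # RUL prediction module: a single regression output.
        self.head = nn.Linear(24 * feat, 1)

    def forward(self, x):
        fused = torch.cat([b(x) for b in self.branches], dim=1)  # (N, 24, 12)
        w = self.attn(fused).unsqueeze(-1)                       # (N, 24, 1)
        return self.head((fused * w).flatten(1))                 # (N, 1)

x = torch.randn(992, 1, 12)   # MFPE matrix reshaped to (992, 1, 12)
model = MACNNSketch()
rul = model(x)                # one predicted RUL value per sample
```

The key design point mirrored from the text is that feature extraction at several scales happens in parallel branches, and the attention weights decide how much each fused channel contributes before the regression head.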

Conclusions
In this paper, an MFPE-MACNN model was proposed for predicting the remaining useful life of rolling bearings. This study addressed the difficulty that convolution-based deep learning models face in extracting feature information from complex time series, removed redundant information from the rolling bearing degradation features, and improved the prediction accuracy of the rolling bearing life. The multiscale fusion permutation entropy-based feature-extraction method extracts the MFPE features from the low-resonance component, quantifies the time complexity of the evaluated signal, and reflects the decline trend of the remaining useful life. The remaining useful life prediction model for rolling bearings, based on the multiscale convolutional attention neural network, can extract the feature information of the MFPE features at different time scales, fuse the multiscale features, improve the fitting ability of the model, and reduce the prediction bias. The XJTU-SY rolling bearing complete lifecycle dataset was used for experimental validation, and the proposed model was compared with other remaining useful life prediction models. The MAE was reduced by 9.91%, 37.41%, and 32.5% compared with the MCNN model, the CNN model, and the attention-CNN model, respectively, and the RMSE was reduced by 15.03%, 41.98%, and 38.02% compared with the same three models, indicating that the MACNN model has improved fitting and generalization ability. The prediction error of the MACNN model is within 5 min, which means that the model captures the life decline characteristics well, with suitable accuracy for remaining useful life prediction.
A limitation of this article is that the proposed feature extraction fluctuates wildly when the bearing operating conditions are more complex, which may lead to significant deviations in subsequent remaining useful life predictions. In future research, a more stable feature-extraction method will be investigated to evaluate the remaining useful life of rolling bearings. Another shortcoming is that the proposed prediction model for the remaining useful life of rolling bearings has many training parameters, so the model training time is long, and the model needs to be retrained after changing the bearing type or working conditions. In future research, transfer learning will be used to solve this problem and to improve the overall generalization and prediction efficiency of the model.