Time Series Forecasting of Motor Bearing Vibration Based on Informer

Electric energy, as an economical and clean energy source, plays a significant role in scientific, technological and economic development. The motor is the core equipment of a power station; monitoring motor vibration and forecasting the time series of bearing vibration can therefore help avoid hazards such as bearing heating and reduce energy consumption. Time series forecasting methods for motor bearing vibration based on sliding window forecasting, such as CNN and LSTM, suffer from error accumulation: the longer the forecast horizon, the larger the error. To address this problem, this paper introduces Informer into time series forecasting of motor bearing vibration. Building on Transformer, Informer introduces ProbSparse self-attention and self-attention distilling; in this work, random search is additionally applied to optimize the model parameters. Together these reduce error accumulation in forecasting, lower the time and space complexity and improve forecasting performance. A comparison of Informer with other forecasting models on three publicly available datasets verifies that Informer performs well in time series forecasting of motor bearing vibration, with forecasting errors on the order of $10^{-2}$ to $10^{-6}$.


Introduction
Electric energy plays an essential role in human life and technological development. The motor is the core equipment of the power station; therefore, monitoring the motor conditions can effectively avoid the occurrence of hazards and improve the safety. In recent years, there have been many bearing health monitoring technologies, such as noise monitoring, temperature monitoring, current detection and vibration monitoring, etc. [1][2][3][4][5]. Among them, vibration monitoring can detect, locate and distinguish faults before serious failures of bearings occur. For the research of bearing fault diagnosis and bearing remaining useful life (RUL) prediction, time series forecasting of motor bearing vibration is a crucial prerequisite step. Therefore, it is of great significance to study the vibration prediction of motor bearings. The vibration signal of the motor bearing obtained by the sensor can reflect the fault characteristics [6][7][8]. Different fault types will produce different frequencies, amplitudes and corresponding vibrations in different parts of the apparatus [9]. The fault prediction based on motor bearing vibration data, which is applied to the monitoring of the sensing technology, can effectively avoid hazards such as bearing heating, thus saving maintenance costs [10].
Time series forecasting of motor bearing vibration aims to determine the possibility of future failure by analyzing the historical data of its components. Conventional methods can be broadly classified into three categories: classical time series forecasting methods and their optimized variants, forecasting methods based on sliding windows and forecasting methods based on the encoder-decoder structure.
Classical time series forecasting methods [11,12] achieve forecasting mainly through fixed time dependence and a single factor. The time series analysis method proposed by Box et al. [13] predicts subsequent data from a known data series. Nikovski et al. [14] verified experimentally that classical time series forecasting methods have some advantages in single-factor short-term forecasting. However, these methods rely on linear relationships and do not include complex nonlinear dynamic models. As a result, their learning and expressive abilities are inadequate and their forecasting results are poor for complex, weakly periodic motor bearing vibration data.
Time series forecasting methods for motor bearing vibration based on sliding window forecasting, such as CNN [15], RNN [16] and LSTM [17], can model nonlinear functions and dynamic dependencies [18,19], which has brought new results for complex time series forecasting with multiple covariate inputs. CNN and its improved variants have been widely used for time series forecasting. Shao et al. [20] used a light-weight 1D-CNN model combined with an auto-encoder structure and adopted a correlation alignment (CORAL) method to reduce domain offset. Luo et al. [21] used the conditional mutual information method to filter variables and a Pair-Copula model incorporating kernel density estimation to address the limitation that the traditional Copula model can only handle two-dimensional variables, and finally combined these with SVM and a BP neural network to realize data prediction. Carroll et al. [22] used artificial neural networks, SVM and logistic regression to demonstrate that gearbox failures can be predicted by models trained on vibration data. Rahmoune et al. [23] applied a residual neural network model to a gas turbine system to predict the vibration frequency of the bearing from vibration frequency data obtained by a sensor at the bearing.
As a model specialized for sequence data, RNN has its own advantages in time series forecasting. Senjyu et al. [24] used RNNs, obtaining the input and output data of the network by differential calculations, to better predict the power variation of wind turbine bearings. Liu et al. [25] used an RNN in the form of an auto-encoder to diagnose bearing faults, forecasting rolling bearing data from the previous cycle to the next cycle through a GRU-based nonlinear predictive denoising auto-encoder (GRU-NP-DAE). Che et al. [26] proposed a fault prediction model based on an RNN variant, the gated recurrent unit (GRU), and a hybrid auto-encoder: the original signals are fed into a multi-layer GRU model to achieve time series forecasting, and fault detection is then achieved by variational auto-encoders and stacked denoising auto-encoders. The effectiveness of this method was verified on the bearing dataset of Case Western Reserve University.
The LSTM model solved the long-term dependence problem of general RNN models and further improved time series forecasting. Ma et al. [27] proposed a model based on optimized maximum correlation kurtosis deconvolution (MCKD) and an LSTM network for time series forecasting of motor bearing vibration to realize early bearing fault warnings. Liu et al. [28] proposed a multilayer long short-term memory-isolation forest model (MLSTM-iForest) that predicts the future bearing temperature and then feeds the calculated deviation index of the predicted temperature into iForest to realize early warning of bearing faults. ElSaid et al. [29] proposed improving the LSTM cell structure with the ant colony optimization (ACO) algorithm for forecasting engine data, and the new model showed an improvement of 1.35%. Fu et al. [30] used CNN to extract features and then LSTM for gearbox bearing forecasting, achieving monitoring of the high-speed-side bearing and over-temperature warning.
Forecasting methods based on sliding windows suffer from error accumulation in time series forecasting. If these models are further combined with other methods, the training time becomes longer, so timely forecasting of motor bearing vibration cannot be achieved. In addition, some of the above methods are suitable for small datasets and their forecasting results are unsatisfactory on big data.
Time series forecasting methods of motor bearing vibration based on encoder-decoder structure, such as the Transformer model [31], used the attention mechanism to improve model training speed, which was suitable for parallelized calculation and higher than RNN in accuracy and performance. The unique output mechanism of the Transformer model can largely reduce the error accumulation during forecasting. Tang et al. [32] used discrete wavelet transform (DWT) and continuous wavelet transform (CWT) to convert vibration signals into a time-frequency representation (TFR) map and performed preliminary prediction analysis of TFR map by multiple individual ViT models [33] which had better results compared with integrated CNN and individual ViT. Zhang et al. [34] proposed a self-attention-based perception and prediction framework based on Transformer, called DeepHealth. Xu et al. [35] proposed a prediction model (HNCPM) that combines encoder, GRU regression module and decoder, through which the prediction of vibration data is realized. This model deploys an enhanced attention mechanism to capture global dependency from vibrational signals to forecast future signals and predict facility health. However, the training time of time series forecasting methods of motor bearing vibration based on encoder-decoder structure was long; what is more, these above research methods used a single dataset, which could not well illustrate the robustness of the proposed methods.
Based on the above problems and analysis, in this paper, the Informer model [36] is innovatively introduced into the prediction of motor bearing vibration and a time series forecasting method of motor bearing vibration based on random search [37] to optimize the Informer model is proposed. In this paper, we mainly focus on solving the problems of error accumulation, time and space complexity, optimization of model parameters and singleness of the dataset. Three publicly available datasets are selected and divided to form ten new datasets to compare the robustness of different models. The structure of Informer is improved for time series forecasting of motor bearing vibration and the parameters of Informer are optimized by random search. The main contributions of this paper are summarized as follows: (1) Informer is innovatively introduced into time series forecasting of motor bearing vibration. (2) For time series forecasting of motor bearing vibration, Informer is optimized and random search is used to optimize the model parameters to improve the model prediction effect.
The rest of this paper is organized as follows. Section 2 describes CNN, Deep RNNs, LSTM and Transformer and illustrates the problems of applying the above four models to time series forecasting of motor bearing vibration. Section 3 introduces Informer and its model optimization. Section 4 presents three publicly available datasets, compares the forecasting results of Informer with the other four models, illustrates the experimental results and conducts analyses. Section 5 presents the conclusion.

Conventional Methods Applied to Time Series Forecasting of Motor Bearing Vibration
This section introduces four models (CNN, Deep RNNs, LSTM and Transformer) applied to time series forecasting of motor bearing vibration and analyzes their limitations.

Convolutional Neural Networks (CNN)
The nonlinear mapping provided by the activation function addresses two limitations of classical time series prediction methods: they cannot incorporate exogenous variables and they rely on linear relationships. The motor bearing vibration data contain positive and negative values that fluctuate around 0. Given these characteristics, this paper selects the tanh function as the activation function of CNN, which maps the input values to the range (−1, 1). The equation is as follows:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \quad (1)$$

Other common activation functions include softmax and ELU. The softmax function is as follows:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{c=1}^{C} e^{x_c}} \quad (2)$$

where C is the length of the input sequence and $x_i$ (1 ≤ i ≤ C) is the i-th element in the input sequence. The ELU function is as follows:

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ a\,(e^{x} - 1), & x \le 0 \end{cases} \quad (3)$$

where a is a positive decimal close to 0.
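As a small illustration of the three activation functions named above, the following sketch evaluates them on a few made-up amplitudes that fluctuate around 0. It assumes the standard textbook definitions of tanh, softmax and ELU; the sample values and the choice a = 0.1 are illustrative assumptions, not taken from the datasets or from the paper's configuration.

```python
import math

def tanh(x: float) -> float:
    """Hyperbolic tangent: maps any real input into (-1, 1)."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def softmax(xs):
    """Softmax over a sequence; subtracting the max avoids overflow."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def elu(x: float, a: float = 0.1) -> float:
    """ELU: identity for x > 0, a * (exp(x) - 1) otherwise (a = 0.1 is illustrative)."""
    return x if x > 0 else a * (math.exp(x) - 1.0)

# Made-up vibration-like amplitudes fluctuating around 0 (not from the datasets).
sample = [-0.8, -0.1, 0.0, 0.3, 0.9]
tanh_out = [tanh(x) for x in sample]
softmax_out = softmax(sample)
elu_out = [elu(x) for x in sample]
print(tanh_out)
print(softmax_out)
print(elu_out)
```

Note that tanh keeps the sign of the input and is symmetric about 0, which is the property the text relies on when choosing it for zero-centered vibration data.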

Deep Recurrent Neural Networks (Deep RNNs)
Deep RNNs [38,39] are models designed specifically for sequence data. In view of the long-sequence and big-data characteristics of motor bearing vibration data, this paper selects an input window of 100 to verify the long-sequence forecasting effect of this model. According to the motor bearing vibration data characteristics described in Section 2.1, the tanh function (Equation (1)) is selected as the activation function of Deep RNNs. The input data of the cell at the i-th layer and t-th time step come from two directions. One is the output $h_t^{i-1}$ from the (i − 1)-th layer, and its equation is as follows: The other is the memory data of the i-th layer at time step t − 1, and its equation is as follows: The equation of the output $h_t^{i}$ of the cell is as follows:
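The two input directions described above can be sketched as follows. This is a minimal illustration that assumes the common stacked-RNN form $h_t^{i} = \tanh(W_{in} h_t^{i-1} + W_{rec} h_{t-1}^{i} + b)$; the names `W_in`, `W_rec`, `b` and all weight values are hypothetical and do not reproduce the paper's exact equations or trained parameters.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_cell(below, prev, W_in, W_rec, b):
    """One cell: combine the output of the layer below with this layer's previous state."""
    a = matvec(W_in, below)   # contribution from layer i-1 at time t
    r = matvec(W_rec, prev)   # contribution from layer i at time t-1
    return [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b)]

def deep_rnn_forward(xs, layers, hidden):
    """Run a stack of cells over a sequence; return the top-layer state at each step."""
    states = [[0.0] * hidden for _ in layers]
    outputs = []
    for x in xs:
        inp = x
        for i, (W_in, W_rec, b) in enumerate(layers):
            states[i] = rnn_cell(inp, states[i], W_in, W_rec, b)
            inp = states[i]
        outputs.append(inp)
    return outputs

# Hypothetical two-layer stack with hidden size 2 and scalar inputs.
layers = [
    ([[0.5], [-0.3]], [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0]),
    ([[0.4, 0.2], [-0.2, 0.6]], [[0.1, 0.0], [0.0, 0.1]], [0.0, 0.0]),
]
xs = [[0.02], [-0.05], [0.03]]  # made-up vibration samples
outs = deep_rnn_forward(xs, layers, hidden=2)
print(outs)
```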

Long Short-Term Memory (LSTM)
Generally, motor bearing vibration data are sampled at a relatively high frequency, and adjacent samples collected within a very short period differ very little, which results in data redundancy during learning. LSTM selects and discards part of the information through the forget gate and determines how much historical information enters, i.e., it filters out extremely similar adjacent motor bearing vibration data while preserving the trend of the original data. The forget gate reads $h_{t-1}$ and $x_t$ and outputs a value between 0 and 1 for each number in the cell state $C_{t-1}$. The equation is as follows: where $h_{t-1}$ is the output of the previous cell; $x_t$ is the input of the current cell; and σ is the sigmoid function, which maps its input to (0, 1). The old cell state is updated with the following equation: The result is output through the output gate and the equation is as follows:
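The gating steps described above can be sketched with a scalar LSTM cell. This is a minimal illustration assuming the standard LSTM formulation (sigmoid gates, tanh candidate state); the parameter dictionary `p` and all weight values are hypothetical, and the input samples are made up to mimic nearly identical adjacent readings.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(w, h_prev, x_t):
    """w = (weight on h_prev, weight on x_t, bias)."""
    return w[0] * h_prev + w[1] * x_t + w[2]

def lstm_step(x_t, h_prev, c_prev, p):
    f_t = sigmoid(affine(p["f"], h_prev, x_t))      # forget gate: how much of c_prev to keep
    i_t = sigmoid(affine(p["i"], h_prev, x_t))      # input gate: how much new content enters
    c_cand = math.tanh(affine(p["c"], h_prev, x_t)) # candidate cell content
    c_t = f_t * c_prev + i_t * c_cand               # update the old cell state
    o_t = sigmoid(affine(p["o"], h_prev, x_t))      # output gate
    h_t = o_t * math.tanh(c_t)                      # cell output
    return h_t, c_t, f_t

# Hypothetical parameters; a trained model would learn these.
p = {"f": (0.5, 1.0, 0.1), "i": (0.3, -0.8, 0.0),
     "c": (0.2, 1.5, 0.0), "o": (0.4, 0.6, 0.0)}

h, c = 0.0, 0.0
history = []
for x in [0.0101, 0.0102, 0.0101, -0.0400]:  # near-identical samples, then a jump
    h, c, f = lstm_step(x, h, c, p)
    history.append((h, c, f))
print(history)
```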

Transformer
Motor bearings are extremely delicate components in machines; for various reasons, only a small fraction of them reach their design life [40,41]. Therefore, it is important to perform long-term vibration monitoring of motor bearings as well as to record recent abnormal vibrations. The Transformer model, based on multi-head self-attention, can model long-term and short-term time series features simultaneously, which makes it applicable to long-term motor bearing vibration data while still learning short-term vibration features. This paper selects an input window of 100 to verify the Transformer's ability to model time series data. The equation of the multi-head self-attention mechanism is as follows: where the weight matrices of head i are the parameters that can be learned. The attention method is as follows: where the softmax function is shown in Equation (2), K is the key matrix, Q is the query matrix and V is the value matrix. The equation of layer normalization is as follows: To ensure that the decoder cannot see inputs after the current moment, Transformer uses a masked attention mechanism, which keeps behavior consistent during training and forecasting. Because self-attention by itself does not preserve the relative order of the inputs, Transformer adds a position encoding to the input at the Positional Encoding layer before sending it into the self-attention layer. The specific calculation equation is as follows: where pos is the position of the current element in the whole input sequence, i is the dimension index of the value currently being calculated (at most d), d is the dimension of the input sequence and L is the length of the sequence.
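The attention computation and the mask described above can be sketched as follows. This is a minimal single-head illustration assuming the standard scaled dot-product form, softmax(QK^T / sqrt(d)) V, with an optional causal mask; the matrices are made-up values, and learned projection weights, multiple heads and layer normalization are omitted.

```python
import math

def softmax(row):
    """Row-wise softmax; masked entries (-inf) receive weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; with causal=True, position i ignores positions j > i."""
    d = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# Made-up 3-step sequence with model dimension 2; self-attention uses Q = K = V.
X = [[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1]]
out_full, w_full = attention(X, X, X)
out_masked, w_masked = attention(X, X, X, causal=True)
print(w_full)
print(w_masked)
```

With the mask enabled, the first position can only attend to itself, so its output equals its own value vector.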

Insufficiency of Sliding Window Forecasting
Forecasting methods for motor bearing vibration time series based on the sliding window mechanism have some defects [42]. The commonly used sliding window leads to spatial and temporal deviations in the feature map or the feature sequence, which in turn cause ambiguity and offset in the feature sequence. When the sliding window is applied to motor vibration data with long-sequence and big-data characteristics, the error accumulates continuously. The sliding window mechanism is shown in Figure 1, and real bearing vibration data [43] are used for illustration in Figure 2. Time series forecasting methods for motor bearing vibration based on CNN, Deep RNNs and LSTM also have their own shortcomings. The CNN-based method captures short-term local dependency, so its forecasting effect depends on the degree of correlation of the short-term data. Normal bearing vibrations have a certain periodicity in the short term, but this model cannot forecast irregular abnormal vibrations. Although Deep RNNs have stronger expressive ability, the model is computationally intensive and its training is time-consuming, so it cannot give timely forecasting results when facing new data, i.e., it cannot give ideal forecasting results for future abnormal vibrations. In addition, as the scale and depth of the Deep RNN model increase, learning becomes more difficult. Therefore, for motor bearing vibration data with big-data characteristics, building a matching Deep RNN remains an open problem. LSTM is also computationally time-consuming and difficult to parallelize, and it cannot give reasonable prediction results because of the poor correlation between abnormal vibration data and the preceding data.
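The error accumulation described above can be illustrated with a toy experiment. The sketch below is an assumption-laden stand-in, not one of the paper's models: the "trained model" is the exact recurrence of a sinusoid plus a small hypothetical bias `BIAS`, and the signal is a synthetic sine wave. When the model is always fed true history, its error stays at the bias; when its own forecasts are fed back in, as in multi-step sliding window forecasting, the error grows far beyond the one-step error.

```python
import math

W = 0.1       # angular step of the toy sinusoid (illustrative)
BIAS = 1e-3   # hypothetical systematic one-step error of the stand-in model

def true_value(t: int) -> float:
    return math.sin(W * t)

def one_step_model(prev2: float, prev1: float) -> float:
    """Exact sinusoid recurrence plus a small bias, standing in for a trained model."""
    return 2.0 * math.cos(W) * prev1 - prev2 + BIAS

def forecast_errors(horizon: int, recursive: bool):
    """Absolute error per step; recursive=True feeds forecasts back into the window."""
    prev2, prev1 = true_value(0), true_value(1)
    errors = []
    for t in range(2, 2 + horizon):
        pred = one_step_model(prev2, prev1)
        errors.append(abs(pred - true_value(t)))
        nxt = pred if recursive else true_value(t)
        prev2, prev1 = prev1, nxt
    return errors

one_step = forecast_errors(100, recursive=False)
multi_step = forecast_errors(100, recursive=True)
print(max(one_step), max(multi_step))
```

Models that emit the whole forecast horizon in one forward pass avoid this feedback loop, which is the motivation the text gives for moving away from sliding window forecasting.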

Insufficiency of Transformer
Position encoding is an important part of Transformer and is divided into absolute position encoding and relative position encoding. Currently, relative position encoding operates on the attention matrix before softmax, which has a theoretical drawback [44,45]. The attention matrix with relative position information is a probability matrix in which each row sums to 1. In Transformer, self-attention implements the interaction between tokens, and identical inputs imply that every $v_j$ is the same. As described in Section 2.3, some values of the motor bearing vibration data collected within a very short period differ very little. Consequently, the model outputs at each position are identical or extremely similar, and limited numerical precision can make them exactly the same:

$$o_i = \sum_{j} a_{i,j} v_j \quad (14)$$

where $o_i$ is the output value; $a_{i,j}$ is the softmax value (shown in Equation (2)), with $\sum_j a_{i,j} = 1$ so that each row of the attention matrix sums to 1; and $v_j$ is the value. Transformer also requires a large amount of computation and a long training time. Compared with CNN and RNN, Transformer has a weaker ability to capture local information.
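The drawback described above follows directly from the rows of the attention matrix summing to 1: if every value $v_j$ is the same, any weighted average of them returns that same value, regardless of the attention scores. The sketch below checks this numerically with random, made-up scores; the constant `v` and the score range are arbitrary illustrative choices.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

random.seed(0)
n = 5
v = 0.0123                      # identical value at every position (illustrative)
values = [v] * n

# Arbitrary attention scores, e.g. carrying relative position information.
scores = [[random.uniform(-3.0, 3.0) for _ in range(n)] for _ in range(n)]
attn = [softmax(row) for row in scores]          # each row sums to 1

# o_i = sum_j a_ij * v_j collapses to v for every i.
outputs = [sum(a * vj for a, vj in zip(row, values)) for row in attn]
print(outputs)
```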

Informer Applied to Time Series Forecasting of Motor Bearing Vibration
This section introduces Informer as applied to time series forecasting of motor bearing vibration, describes the shortcomings of using Informer directly and then optimizes it. The structure of Informer is shown in Figure 3.

Informer Introduction
Informer adds positional encoding to the input data to ensure that the model captures the correct order of the input sequence. The position encoding is divided into the Local Time Stamp and the Global Time Stamp. The Local Time Stamp is given by Equations (15) and (16).
After the encoding steps, the input to the encoder layer is obtained as shown below: where $u_i$ is the original data sequence, $i \in \{1, 2, \ldots, L\}$; L is the length of the data sequence; t is the number of series; and α is a factor that balances the magnitude of the mapping vector against the position encoding, taken as 1 when the input sequence has been standardized.
Informer introduces ProbSparse self-attention, which first calculates the KL divergence between the attention distribution of the i-th query and the uniform distribution to measure how much they differ, and then calculates the sparsity score. The formula for calculating the KL divergence is as follows: where $p(k_j \mid q_i)$ is the attention probability distribution of the i-th query over all keys; $q(k_j \mid q_i) = 1/L_K$ is the uniform distribution; d is the dimension of the input sequence after mapping; $L_K$ is the sequence length; and $k(q_i, k_j)$ is the intermediate value for the i-th query and the j-th key in the softmax (Equation (2)) calculation. The sparsity score of the i-th query is as follows: Based on this metric, each key attends only to the u dominant queries, which gives ProbSparse self-attention: where $\bar{Q}$ is a sparse matrix with the same shape as Q that contains only the top u queries under the sparsity measure $M(q_i, K)$. This measure has the following upper and lower bounds:
Informer introduces self-attention distilling, as shown in Figure 4, which adds convolution, activation and maximum pooling operations between each encoder and decoder layer to halve the length of the input sequence from the previous layer, thus avoiding excessive memory use when the input sequence is long. The equation is as follows: where $X_{j+1}^{t}$ is the output of the multi-head ProbSparse self-attention layer in the current layer; $[X_{j}^{t}]_{AB}$ is the result of the multi-head ProbSparse self-attention layer in the previous layer; and ELU (Equation (3)) is used as the activation function.
The Informer model uses batch generative forecasting to output multi-step forecasting results directly, thus improving the speed of long-sequence forecasting. The equation is as follows: where $X_{0}^{t}$ is the placeholder (for the predicted values); $X_{token}^{t} \in \mathbb{R}^{L_{token} \times d_{model}}$ is the start token; $L_{token}$ is the length of the start token sequence; $L_y$ is the length of the predicted sequence; and $d_{model}$ is the model dimension.
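The query selection behind ProbSparse self-attention can be sketched as follows. The sketch assumes the sparsity measurement defined in the Informer paper [36], i.e., the log-sum-exp of the scaled query-key scores minus their mean; the helper names, the toy keys and queries, and the choice u = 1 are illustrative assumptions. A query whose scores are all equal behaves like uniform attention and attains the lower bound ln L_K, whereas a query with one dominant key scores much higher.

```python
import math

def scaled_scores(q, K):
    """Scaled dot products q . k_j / sqrt(d) for every key k_j."""
    d = len(q)
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]

def sparsity(q, K):
    """M(q, K) = log-sum-exp of the scaled scores minus their arithmetic mean."""
    s = scaled_scores(q, K)
    m = max(s)
    lse = m + math.log(sum(math.exp(v - m) for v in s))
    return lse - sum(s) / len(s)

def top_u_queries(Q, K, u):
    """Indices of the u queries with the largest sparsity measurement."""
    ranked = sorted(range(len(Q)), key=lambda i: sparsity(Q[i], K), reverse=True)
    return sorted(ranked[:u])

# Toy keys and queries (made-up values).
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
Q = [[0.0, 0.0],    # uniform attention: "lazy" query
     [5.0, 0.0],    # strongly prefers the first key: dominant query
     [0.1, 0.1]]    # nearly uniform
measures = [sparsity(q, K) for q in Q]
selected = top_u_queries(Q, K, u=1)
print(measures, selected)
```

Only the selected queries would take part in the full attention computation; in the Informer paper, the remaining queries are given the mean of the values as their output.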

Informer Optimization
Informer forms sparse attention from queries and keys in ProbSparse self-attention to reduce the computational complexity of motor vibration feature learning. In Equation (23), $L_Q = L_K = L$, so the total time complexity and space complexity are both $O(L \ln L)$. In self-attention distilling, the input of each cascaded layer is halved to handle very long input sequences and alleviate the error accumulation problem of classical neural network models. Zhou et al. [36] reported long-sequence forecasting results on the order of $10^{-1}$ on the ETT and ECL datasets with the ELU activation function, which does not meet the requirements of time series forecasting of motor bearing vibration. This paper therefore optimizes the Informer model for motor bearing vibration data. The time series forecasting method for motor bearing vibration based on Informer is shown in Figure 5.
The motor bearing vibration data contain positive and negative values that fluctuate around 0. The curves of the GELU activation function and its derivative show that, compared with the ELU activation function, GELU is more consistent with the characteristics of motor bearing vibration data. Therefore, GELU is chosen as the activation function of Informer in this paper. The GELU activation function and its derivative are shown in Figure 6. The equation of the GELU activation function is as follows: The three datasets used in this paper have a high sampling frequency. For this reason, the time feature encoding was set to hour, which allows the model to be trained on and to predict long-sequence data. The verification prediction length was 500 sample points, and the results showed that the model was able to process and forecast data series with long-sequence and big-data characteristics. After several tests, Informer converged at epoch 10 for all three datasets. Given the characteristics of motor bearing vibration data, the conventional methods cannot complete model training quickly when facing newly generated data. Therefore, while ensuring prediction accuracy, this paper reduces the model size and running time by selecting two encoder layers and one decoder layer. In this paper, the hyperparameter λ of Informer was optimized for time series forecasting of motor bearing vibration data. Usually, the ultimate goal of a learning algorithm is to find a function that minimizes the loss function, and the so-called learning of the algorithm is the learning of the hyperparameter. In this paper, random search was used to optimize the hyperparameter λ to determine a better model [34,46-48].
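The hyperparameter λ is as follows:

$$\lambda^{(*)} \approx \underset{\lambda \in \Lambda}{\operatorname{argmin}}\ \underset{x \in X_{\mathrm{valid}}}{\operatorname{mean}}\ L\big(x; A_{\lambda}(X_{\mathrm{train}})\big) \equiv \underset{\lambda \in \Lambda}{\operatorname{argmin}}\ \Psi(\lambda) \approx \underset{\lambda \in \{\lambda^{(1)}, \ldots, \lambda^{(S)}\}}{\operatorname{argmin}}\ \Psi(\lambda) \equiv \hat{\lambda} \quad (28)$$

where Ψ is the hyperparameter response function and $\{\lambda^{(1)}, \ldots, \lambda^{(S)}\}$ is the experimental set.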
where $\Psi_{\mathrm{valid}}$ denotes the performance on the validation set and $\Psi_{\mathrm{test}}$ denotes the performance on the testing set. The equation of the estimated variance of the mean is as follows: When multiple parameter values are close to optimal and do not differ significantly, the choice is made by weighting each candidate $\lambda^{(s)}$ by its probability of being the best. In [34], it was proposed that $X_{\mathrm{valid}}$ is a finite sample of $G_x$; thus, the testing set score of the best model in $\lambda^{(1)}, \ldots, \lambda^{(S)}$ is a random number Z, which is modeled by a Gaussian mixture model whose s-th component has mean $\mu_s = \Psi_{\mathrm{test}}(\lambda^{(s)})$ and variance $\sigma_s^2 = V_{\mathrm{test}}(\lambda^{(s)})$. The weights are: The mean and standard error of Z in the optimal model are: With this method, hypothetical validation scores $Z_s$ are repeatedly drawn from the normal distribution, the corresponding testing scores are calculated, the optimal estimate is selected and the optimal parameters are determined. For time series forecasting of motor bearing vibration, Informer gives the best forecasting result with a batch size of 16 and a learning rate of 0.0001. When the learning rate is too large, the model oscillates near the optimal solution; when it is too small, the model converges too slowly. The choice of dropout is related to whether the model over-weights data correlation and noisy data. To prevent over-fitting, which reduces model robustness, dropout was set to 0.02, the value that gave the best result in testing. The parameters of Informer used in this paper are shown in Table 1.
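The random search procedure described above can be sketched as follows: S configurations are sampled independently from the search space, each is scored on the validation set, and the configuration with the smallest validation loss is kept. The search ranges, the function names and, in particular, `toy_validation_loss` are hypothetical placeholders; in practice the objective would train Informer with the sampled configuration and return its validation error.

```python
import math
import random

def random_search(evaluate, n_trials, seed=0):
    """Sample n_trials configurations at random and keep the one with the lowest score."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = {
            "learning_rate": 10 ** rng.uniform(-5, -2),  # log-uniform (assumed range)
            "batch_size": rng.choice([8, 16, 32, 64]),   # assumed candidates
            "dropout": rng.uniform(0.0, 0.3),            # assumed range
        }
        trials.append((evaluate(config), config))
    best_score, best_config = min(trials, key=lambda t: t[0])
    return best_config, best_score, trials

def toy_validation_loss(config):
    """Placeholder objective, NOT a trained model: an arbitrary bowl-shaped surface."""
    return (math.log10(config["learning_rate"]) + 3.0) ** 2 + config["dropout"]

best_config, best_score, trials = random_search(toy_validation_loss, n_trials=20)
print(best_config, best_score)
```

Sampling each hyperparameter independently means that every trial explores a new value of every dimension, which is the usual argument for random search over grid search when only a few hyperparameters matter.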

Case Western Reserve University Bearing Dataset
This paper uses a publicly available bearing dataset from the Bearing Data Center at Case Western Reserve University (CWRU) in the United States [49]. The experimental rig used to acquire this dataset consisted of a 2 hp motor, a torque transducer/encoder, a dynamometer and control electronics. Accelerometers were placed above the bearing seats on the drive side (DE) and the fan side (FE) of the motor, and a 16-channel DAT recorder was used to collect the vibration signals. Speed and horsepower data were collected using the torque transducer/encoder and recorded by hand. The specifications of the bearings used on the drive side and fan side are shown in Table 2.
The IMS dataset [43] contains life cycle data of bearings, with a vertical and a horizontal accelerometer on the housing of each bearing. There are three datasets, each containing the vibration data of four bearings. The bearing specifications used in this paper are shown in Table 3. The data information is shown in Table 4.

v43hmbwxpm Dataset
The data come from the University of Ottawa. The experiments were performed on the SpectraQuest Mechanical Failure Simulator (MFS-PK5M), and the data consist of vibration signals collected from bearings with different health conditions under time-varying rotational speed conditions [50]. Data were acquired by an NI data acquisition board (NI USB-6212 BNC) for a total of 36 datasets. Each dataset is characterized by two experimental settings: bearing health condition and varying speed condition. The bearing health conditions were (i) healthy, (ii) inner race damage, (iii) outer race damage, (iv) rolling element damage and (v) a combination of inner race damage, outer race damage and rolling element damage. The operating speed conditions were (i) increasing speed, (ii) decreasing speed, (iii) increasing then decreasing speed and (iv) decreasing then increasing speed. Thus, there were 20 different cases for the setup. The bearing parameters are shown in Table 5. Some of the bearing failure information is shown in Table 6.
Table 6. Bearing damage information.

Dataset Selection and Division
From the CWRU dataset, 20,000 sample points were selected from each of the DE side and the FE side to form a new dataset, the CWRU_DF dataset. In the IMS data, 20,000 sample points were selected from each of channels 5 and 7 of the datasets, sets 1-8, to form the new dataset set 1; the 1st to 20,000th sample points and the 100,001st to 200,000th sample points were selected from channel 1 of sets 2-4 to form the new dataset set 2; and the 1st to 20,000th sample points and the 30,001st to 50,000th sample points were selected from channel 3 of sets 3 and 4 to form the new dataset set 3. In the v43hmbwxpm data, 20,000 sample points were selected from each of I-I-1 and I-I-2 of the I-I dataset to form a new dataset; the other new datasets were formed in the same way. The selection of the datasets is shown in Figure 7. Each of the ten resulting datasets was divided into a training set, a validation set and a testing set in the ratio 7:1:2.
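The 7:1:2 division can be sketched as follows. The sketch assumes a chronological split (training data first, then validation, then testing), which is the usual choice for time series so that the model is never evaluated on data that precede its training data; the text does not state the split order, and the function name and placeholder series are illustrative.

```python
def split_7_1_2(series):
    """Split a sequence chronologically into training, validation and testing parts (7:1:2)."""
    n = len(series)
    n_train = n * 7 // 10
    n_valid = n // 10
    train = series[:n_train]
    valid = series[n_train:n_train + n_valid]
    test = series[n_train + n_valid:]
    return train, valid, test

# Placeholder series of 20,000 sample points (indices stand in for vibration values).
series = list(range(20000))
train, valid, test = split_7_1_2(series)
print(len(train), len(valid), len(test))
```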

Experiment and Analysis
Because the epoch times of the five models used in the experiments of this paper vary widely, other convergence properties, such as the speed of loss convergence of the five models on each dataset, are not compared.
Each network model in this paper is implemented based on Python 3.9. The operating system is a 64-bit Windows operating system with 16.00 GB of RAM and a 12th Gen Intel(R) Core(TM) i7-12700KF 3.60 GHz processor.
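The three error metrics used to compare the models in the following experiments (MAE, MSE and RMSE) can be computed as in the sketch below. It assumes the standard definitions of these metrics; the function names and the example values are illustrative and are not results from the paper.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the data."""
    return math.sqrt(mse(y_true, y_pred))

# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because MSE squares each error, it is smaller than MAE in magnitude whenever the errors are well below 1, which is why the reported MSE values sit one or more orders of magnitude below the corresponding MAE and RMSE values.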

Time Series Forecasting of Motor Bearing Vibration Based on Case Western Reserve University Bearing Dataset
CWRU data were selected to test the time series forecasting performance of CNN, Deep RNNs, LSTM, Transformer and Informer on data from the DE side and FE side. Data from both ends were tested to make the experimental results more accurate and convincing.
After training and forecasting with the above five models, their MAE, MSE and RMSE were calculated. The Informer model had the best forecasting performance: compared with the other models, its MAE was lower by $1.711 \times 10^{-3}$, $6.692 \times 10^{-3}$, $6.343 \times 10^{-3}$ and $3.361 \times 10^{-3}$, respectively; its MSE was lower by $1.147 \times 10^{-4}$, $5.069 \times 10^{-4}$, $3.887 \times 10^{-4}$ and $2.084 \times 10^{-4}$, respectively; and its RMSE was lower by $2.511 \times 10^{-3}$, $9.605 \times 10^{-3}$, $7.649 \times 10^{-3}$ and $4.383 \times 10^{-3}$, respectively, as shown in Table 7. The forecasting diagrams are shown in Figure 8. They show that all five models can forecast the next 500 sample points well on the DE and FE sides, but CNN and Deep RNNs were worse and LSTM was better at forecasting extreme values. Informer not only fitted the trend of the data correctly but also forecast the extreme values correctly to the greatest extent, with less offset than the other models, and fitted the original data best among the five models.
The IMS data were selected to test the time series forecasting effect of the five models when different structures fail. Further comprehensive experiments were conducted on data from the outer race, the inner race and the rolling element of the bearing to illustrate the forecasting ability of each model for different structures. The forecasting results of all five models are worse on the IMS-based datasets than on the CWRU-based dataset. The reason is that the IMS data exhibit large oscillations introduced during data collection, which make the collected data fluctuate more in amplitude and frequency. Addressing this problem is left for future research.
After training and forecasting of CNN, Deep RNNs, LSTM, Transformer and Informer, the MAE, MSE and RMSE of the above models were calculated. Compared with other models, the Informer had the best prediction performance, with MAE lower by 1.280 × 10 −4 , 1.896 × 10 −3 , 4.38 × 10 −3 and 1.245 × 10 −3 for set 1, respectively; with MSE lower by 9.900 × 10 −6 , 3.243 × 10 −4 , 7.720 × 10 −4 and 2.032 × 10 −4 , respectively; with RMSE lower by 7.200 × 10 −5 , 2.306 × 10 −3 , 5.372 × 10 −3 and 1.454 × 10 −3 , respectively, as shown in Table 8. The forecasting diagrams are shown in Figure 9. CNN and LSTM had the worst forecasting results with the damaged inner race of bearing 3 and the damaged rolling element of bearing 4 and they could not forecast the trend and extreme values well. It was able to forecast most of the extreme values with the damaged rolling element of bearing 4.  The MAE, MSE and RMSE of the Informer were slightly worse than those of CNN for set 2, with a difference of 2.710 × 10 −4 for MAE, 4.050 × 10 −4 for MSE and 3.25 × 10 −4 for RMSE. The MAE was 4.847 × 10 −3 , 4.973 × 10 −3 and 3.272 × 10 −3 lower than the other models, respectively. The RMSE was 5.745 × 10 −3 , 6.068 × 10 −3 and 4.133 × 10 −3 lower than the other models. The calculation results of MAE, MSE and RMSE for set 3 were the best in terms of forecasting performance compared with other models. The results are shown in Table 8. By comparing the forecasting results of the five models in Figures 10 and 11, it can be seen that Deep RNNs, LSTM and Transformer do not have good forecasting results in the case of damaged outer race of bearing 1 and outer race of bearing 3. The results of the Informer comparing MAE, MSE and RMSE under set 2 were not as good as those of CNN. 
However, Figure 10 shows that CNN did not forecast the trend and extreme values well in the first testing set of set 2, although it improved in the second testing set. Across these two testing sets the Informer performed better, forecasting not only the trend of the data series more accurately but also some of the extreme values. Figures 10 and 11 show that all five models can forecast the basic trend of the data series, but their forecasting of the extreme values is poor.
The v43hmbwxpm data were selected to investigate the time series forecasting capability of the five models under six different conditions. They contain data collected from the inner race, outer race and rolling element of the bearing under accelerated conditions, and data collected from the same three structures under decelerated conditions. These data complement the preceding experiments with time series forecasting under multiple operating conditions for different structures. Training and testing on them allows the robustness of each model to be compared further and provides additional experimental support for the findings of this paper.
After CNN, Deep RNNs, LSTM, Transformer and Informer had been trained and used for forecasting, their MAE, MSE and RMSE were calculated. For the inner race damage (I-I), outer race damage (O-I) and rolling element damage (B-I) datasets under accelerated conditions, the Informer achieved the best forecasting results of the five models, as shown in Table 9. The forecasting diagrams are shown in Figures 12 and 13. They show that Transformer has poor forecasting results, while CNN, Deep RNNs and LSTM are able to forecast the trend of the data and some of the extreme values, but with a certain offset. Informer had the best forecasting results: it not only forecast the trend and the extreme values of the data series better but also had less offset.
The forecasting diagrams for the rolling element damage dataset (B-I) under accelerated conditions are shown in Figure 14. CNN, Deep RNNs and LSTM are able to forecast the trend of the data series, but they are no better than Transformer, which is not specifically designed for time series forecasting. Informer was closest to the real data in terms of trend and also forecast most of the extreme values with minimal offset.
For the inner race damage (I-D) and outer race damage (O-D) datasets under decelerated conditions, Informer again achieved the best forecasting results of the five models, as shown in Table 10. The forecasting diagrams are shown in Figures 15 and 16. Figure 15 shows that the Transformer model forecasts the trend of the data series fairly well, but with an overall upward shift. CNN, Deep RNNs and LSTM forecast the trend and extreme values of the data series poorly, whereas Informer fits the real data more closely.
On the rolling element damage (B-D) dataset under decelerated conditions, the MAE, MSE and RMSE of Informer were slightly worse than those of CNN and Transformer: the differences in MAE are 1.243 × 10⁻³ and 1.261 × 10⁻³, respectively; the differences in MSE are 2.030 × 10⁻³ and 1.948 × 10⁻³, respectively; and the differences in RMSE are 1.623 × 10⁻³ and 1.548 × 10⁻³, respectively. Compared with Deep RNNs and LSTM, the MAE of Informer is lower by 4.377 × 10⁻⁴ and 6.674 × 10⁻⁴, respectively; the MSE is lower by 9.361 × 10⁻⁶ and 1.056 × 10⁻⁵, respectively; and the RMSE is lower by 6.340 × 10⁻³ and 7.113 × 10⁻³, respectively, as shown in Table 10. The forecasting diagrams are shown in Figure 17, which shows that the forecasts of Deep RNNs and LSTM are offset from the data sequence and that some extreme values are not forecast well. Informer differs little from CNN and Transformer in forecasting the trend of the data series, and the overall offset of its forecasts is small. However, its offset on individual extreme values is relatively large, which is why its MAE, MSE and RMSE are not as good as those of these two models.
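The effect of a few poorly forecast extreme values on the three metrics can be illustrated with a small sketch. The two forecasts below are synthetic and hypothetical; they are not the outputs of any model in this paper. One forecast carries a small constant offset everywhere, while the other matches the series at most points but misses five peaks by a large margin.

```python
import numpy as np

def forecast_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for a forecast against the measured series."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    mse = float(np.mean(err ** 2))
    return float(np.mean(np.abs(err))), mse, float(np.sqrt(mse))

# Synthetic "measured" series of 100 points (assumption for illustration only).
measured = np.sin(np.linspace(0.0, 4.0 * np.pi, 100))

# Forecast A: a small constant offset of 0.01 at every point.
uniform_offset = measured + 0.01

# Forecast B: exact at 95 points, but off by 0.5 at five isolated points.
few_large_misses = measured.copy()
few_large_misses[[10, 30, 50, 70, 90]] -= 0.5

mae_a, mse_a, rmse_a = forecast_errors(measured, uniform_offset)
mae_b, mse_b, rmse_b = forecast_errors(measured, few_large_misses)
print(f"A: MAE={mae_a:.4f} MSE={mse_a:.6f} RMSE={rmse_a:.4f}")
print(f"B: MAE={mae_b:.4f} MSE={mse_b:.6f} RMSE={rmse_b:.4f}")
```

In this toy case forecast B is worse on all three metrics (MAE 0.025 against 0.01, RMSE about 0.112 against 0.01) even though it matches 95 of the 100 points exactly. Because MSE and RMSE square each error, they penalize the five large misses much more heavily than MAE does.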

Conclusions
The motor is the core equipment of the power station, and time series forecasting of motor bearing vibration is a crucial step in bearing fault diagnosis and in the prediction of bearing remaining useful life. This paper therefore focuses on time series forecasting of motor bearing vibration. Informer is introduced into this task, the model structure is optimized, and the parameters of Informer are tuned by random search. The CWRU, IMS and v43hmbwxpm datasets were used for time series forecasting of motor bearing vibration, and the experimental results were analyzed. The analysis showed that, compared to the existing work, Informer is able to forecast the future time series quickly and accurately for inner race damage, outer race damage and rolling element damage. It still obtains superior results for damage under accelerated or decelerated conditions, with better forecasting of both the trend and the extreme values of the data series, and it performs well on evaluation indexes such as MAE, MSE and RMSE. The forecasts of conventional models are prone to a certain offset, whereas the forecasts of the proposed method match the real data more closely; the method reduces the error accumulation in forecasting and improves forecasting performance. It can therefore be used in monitoring based on sensing technology.
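The error accumulation of sliding-window forecasting referred to above can be illustrated with a toy sketch. It is not the CNN, LSTM or Informer used in this paper: the signal is a synthetic sine wave, and the one-step predictor returned by `make_step` is a hypothetical linear rule whose coefficient is deliberately perturbed by 0.1% to stand in for a slightly imperfect trained model. Each prediction is fed back into the window, so the small one-step error compounds over the horizon.

```python
import numpy as np

def recursive_forecast(history, step_fn, horizon, window):
    """Sliding-window forecasting: predict one step, feed it back, slide, repeat."""
    buf = [float(v) for v in history[-window:]]
    preds = []
    for _ in range(horizon):
        nxt = float(step_fn(buf))   # one-step prediction from the current window
        preds.append(nxt)
        buf = buf[1:] + [nxt]       # the prediction becomes an input to the next step
    return np.array(preds)

def make_step(coef):
    """Toy one-step model x[t+1] = coef * x[t] - x[t-1] (hypothetical predictor)."""
    return lambda buf: coef * buf[-1] - buf[-2]

omega = 0.2
signal = np.sin(omega * np.arange(400))
history, future = signal[:200], signal[200:]

# For a pure sine wave, coef = 2*cos(omega) reproduces the series exactly.
exact_coef = 2.0 * np.cos(omega)
exact = recursive_forecast(history, make_step(exact_coef), horizon=200, window=2)
# A 0.1% error in the coefficient imitates a slightly imperfect one-step model.
biased = recursive_forecast(history, make_step(0.999 * exact_coef), horizon=200, window=2)

abs_err = np.abs(biased - future)
early_err = float(abs_err[:50].mean())    # mean error over the first 50 steps
late_err = float(abs_err[-50:].mean())    # mean error over the last 50 steps
exact_max_err = float(np.max(np.abs(exact - future)))
print(early_err, late_err, exact_max_err)
```

With the exact coefficient the recursive forecast stays on the true series, while with the perturbed coefficient the mean error over the last 50 steps is several times larger than over the first 50. This is the behaviour that forecasting the whole horizon in one pass, rather than step by step, is intended to avoid.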
In future work, we will continue to study time series forecasting methods. Data with strong oscillation and with large fluctuations in amplitude and frequency will be investigated in more depth in order to reduce the impact of such data on forecasting. Self-collected test data will be added in future experiments to make the evaluation of the model more convincing. Bearing fault diagnosis and bearing remaining useful life prediction will be the next directions of research.