A Dual-Attention Recurrent Neural Network Method for Deep Cone Thickener Underflow Concentration Prediction

This paper focuses on the time series prediction problem for the underflow concentration of the deep cone thickener, which is commonly used in the industrial sedimentation process. We introduce a dual-attention neural network method that models both spatial and temporal features of the data collected from multiple sensors in the thickener to predict underflow concentration, which is the key factor for the subsequent mining process. The model consists of an encoder and a decoder, which capture spatial and temporal importance, respectively, from the input data and output a more accurate prediction. We also consider domain knowledge in the modeling process: in addition to the raw sensor data, several supplementary constructed features are examined to enhance the final prediction accuracy. To test the feasibility and efficiency of this method, we select an industrial case based on the Industrial Internet of Things (IIoT). The Tailings Thickener in this case is from FLSmidth and is equipped with multiple sensors. The comparative results show that this method has favorable prediction accuracy, with errors more than 10% lower than those of other time series prediction models on several common error indices. We also try to interpret our method with additional ablation experiments for different features and attention mechanisms. Using the mean absolute error index to evaluate the models, the experimental results show that the enhanced features and the dual-attention modules reduce the fitting error by ~5% and ~11%, respectively.


Introduction
The deep cone thickener, also called the paste thickener, is an important piece of equipment in the industrial mining process, especially for sustainable mining and environmental protection. It is a large, complex system that generates raw material for backfill paste in processed mines. A general framework of the thickener and its key processing parameters is illustrated in Figure 1. Stable underflow concentration is a fundamental index for assessing the performance and stability of the industrial production process. Many parameters during production affect the stability of underflow concentration. Unstable volume and concentration of the feed flow disturb the mass balance of the mud bed in the thickener, which usually leads to oscillation of the underflow concentration. Other parameters, such as flocculant dosage and underflow volume, also affect the underflow concentration. In the industrial thickener production process, underflow concentration prediction is the top priority for further system control. Current thickener systems depend heavily on a large number of integrated sensors to monitor and control the production process, an arrangement known as a thickener with Industrial Internet of Things (IIoT) [1]. In such a system, data are collected in real time from all sensors and provide decision support for operators and managers [2]. These data are also useful for future equipment diagnosis.
Underflow concentration prediction can be modeled as a typical multidimensional time series prediction problem. The change of underflow concentration obeys an unknown distribution in the time domain, which can be formulated as $p(y_{t+1} - y_t \mid y_1, \ldots, y_{t-1}, y_t)$ with $y_t \in \mathbb{R}$. Besides the underflow concentration itself, other relevant series, monitored by different sensors, provide additional prior knowledge for predicting future underflow concentration. Formally, we assume that n additional sensors are considered and that all sensors capture the process values at the same time.
$x_t \in \mathbb{R}^n$ represents the group of values monitored by the n sensors at time step t. Theoretically, the distribution $p(y_{t+1} - y_t \mid y_1, \ldots, y_{t-1}, y_t, x_1, \ldots, x_{t-1}, x_t)$ has entropy no higher than that of $p(y_{t+1} - y_t \mid y_1, \ldots, y_{t-1}, y_t)$. This paper focuses on the construction of such a multidimensional time series prediction model, which predicts $y_{t+1}$ from the previously observed spatial features $(x_1, \ldots, x_{t-1}, x_t)$ and temporal features $(y_1, \ldots, y_{t-1}, y_t)$.
Most existing studies model the thickener system with either mathematical methods [3,4] or data-driven methods [5]. Mathematical models give state equations of underflow concentration, deduced from physical and structural laws. However, these methods suffer from the complexity of the thickener system and from external environmental disturbance, which limits their accuracy and universality. Data-driven system identification has better adaptability and better performance than conventional mathematical model-based methods [6,7]. In this paper, for the problem setting, we have collected massive sensor data from a real industrial process. After discussion with domain experts, the aim is to build the relationship between the sensor data and the underflow concentration values. For that reason, we need an end-to-end regression model based on sufficient training data.
Conventional time series prediction models are widely used in industrial analysis, such as autoregressive (AR) [8], autoregressive moving average (ARMA) [9], recurrent neural network (RNN), and Long Short-Term Memory (LSTM) [10]. These methods have achieved much success in various industrial fields. However, cone thickener systems pose two main challenges:
• Long time delay. It occurs inevitably during the change of underflow concentration. In practice, a change in one parameter may affect the concentration only after a long time interval. In addition, the level of influence can vary over time.
• Unknown spatial sensor correlations. Different parameters in the system can affect the underflow concentration in distinct and complex forms. The challenge is that these complex interactions are still unknown from domain knowledge.
To overcome these challenges, we seek a model that can both encode long time series and adaptively explore useful features from abundant high-dimensional data. Therefore, in this paper, we propose a dual-attention recurrent neural network method to solve this problem. It includes two mechanisms, an encoder and a decoder, which capture the spatial and temporal features of the original sensor data and predict the underflow concentration of the thickener accurately. To further enhance the accuracy of the model, we also introduce domain knowledge of the thickener system into the model design. The numerical relationships between concentration, density, volume, and mass are considered in our feature design. Our industrial case study results show that the dual-attention mechanisms and the added features play an important role in this problem. In addition, this method outperforms other commonly used time series prediction models in accuracy and efficiency.
The contributions of our work are listed as follows.
• We propose a dual-attention time series prediction model to predict the underflow concentration in the thickener system. It consists of an encoder and a decoder. The encoder captures the spatial importance of the input high-dimensional series. The decoder captures the temporal importance of the input long time series.
• Feature enhancements are designed based on domain knowledge for underflow concentration prediction.
• This method is applied in a concrete case study with a Tailings Thickener from FLSmidth. The data are collected directly from the industrial mining process. The prediction results show that this method outperforms the compared models in both accuracy and efficiency.
The remainder of the paper is organized as follows. Section 2 reviews related studies on thickener system identification, data-driven data analysis methods, and attention-based recurrent neural networks. Section 3 introduces the details of the proposed method, including the basic formulation, feature enhancement methods, and model structure. Section 4 presents extensive experiments to evaluate the proposed method and verify the effectiveness of its components. Section 6 gives the conclusion and discusses meaningful directions for future work.

Related Work
The thickening of tailing slurry is the primary process of paste filling. It is a critical procedure in modernized mining [11]. In the thickening process, a concentration that is too high can lead to accidents such as pipe plugging. Conversely, a concentration that is too low decreases the strength of the backfilled paste and further reduces the safety level of the whole mining process. Therefore, it is important to predict the change of underflow concentration so that operators can keep the concentration stable. Underflow concentration prediction can be seen as a system identification problem for the thickener, which has a complex physical process inside. Here, we discuss two general research categories: model-based simulation and data-driven system identification.

Model-Based Thickener System Simulation
One typical solution is to build a mathematical function relating the system input to the underflow concentration in order to predict the dynamic thickening process. This function usually takes the form of differential equations. Based on this model, the future underflow concentration can be calculated directly or by numerical integration. A thickener dynamic model based on the sedimentation consolidation theory is proposed in [4,12]. The authors of [3] extend a one-dimensional model for the dynamics of a flocculated suspension in a clarifier-thickener to include the discharge yield stress and particle size distribution in a manner that is computationally tractable.
Mathematical methods are interpretable, and an accurate dynamical equation can be helpful for other tasks, such as fault detection and optimal control. However, they usually suffer from the complexity of slurry particle dynamics and from unknown external environmental disturbance. Most dynamical models are built on many idealized hypotheses, which often cannot be satisfied in a practical industrial process.

Data-Driven Thickener System Identification
In contrast, another approach is widely used in current IIoT systems. Refs. [13-16] adopted data-driven methods to learn a parameterized model from the trajectories of the real system. This approach lessens the difficulty of theoretical analysis and learns from data directly. Normally, a learned parameterized model performs better than a conventional purely model-based method on a specific dataset. In the Internet of Things (IoT) setting, Xiao et al. [5] analyzed the characteristics of the thickener washing process and proposed a hybrid model combining mechanism modeling with an error compensation model based on the Extreme Learning Machine algorithm [17]. The results show that the prediction error of the hybrid model is lower than that of the mechanism model. Zhang et al. [18] designed a deep neural network model to predict equipment running data and improved the accuracy by systematic feature engineering and optimal hyperparameter searching.
Inspired by some theories of human attention [19], encoder-decoder recurrent neural networks with attention have been used in industrial systems [20]. Attention mechanisms can capture long-term temporal dependencies appropriately and select the relevant feature series to assist the prediction module. In this work, we follow the basic structure of the encoder-decoder model to construct our recurrent neural network.
From the perspective of data, feature enhancement is a key process of feature engineering in machine learning tasks [21]. A trained model can perform much better by learning from sophisticated features. In this paper, we also build several additional features according to prior knowledge of the thickening system. Table 1 compares the detailed properties and contributions of each reference and the proposed method. It suggests that the proposed DARNN method has better accuracy, benefiting from the design of the network structure and input features. However, the pure deep neural network framework makes the model less interpretable, and it is hard to transfer the model from one thickener to another.

Methods
This section first introduces the mathematical formulation of the problem and then presents the model details from two aspects: feature enhancement and the dual-attention mechanism for high-dimensional time series prediction. An overall illustration of the proposed method is shown in Figure 2.

Problem Formulation and Variable Definition
The underflow concentration prediction problem belongs to the field of time series analysis. The n sensors installed in the thickener monitor the parameters $x_t = [x_t^1, x_t^2, \ldots, x_t^n]^T$ and the underflow concentration $y_t$ through physical signal transmitter modules. Details of the state parameters x are shown in Table 2. All of the employed monitoring points are designed from an industrial perspective and have a direct or indirect impact on the future change of underflow concentration. The statistical relationships among the sensors installed in separate positions are named spatial relationships. The statistical relevance of the sensors in the time dimension is named the temporal relationship. Both kinds of relationships are employed in the proposed model to predict the future underflow concentration. Collected data are stored in a historical database, which is usually installed in the Distributed Control System (DCS). To predict the future unknown underflow concentration, the historical data $(x_{t-T+1}, \ldots, x_{t-1}, x_t)$ and $(y_{t-T+1}, \ldots, y_{t-1}, y_t)$ are exploited to estimate $\hat{y}_{t+1} \in \mathbb{R}$. Our goal is to make $\hat{y}_{t+1}$ close to $y_{t+1}$. The problem above can be summarized as the minimization problem shown in (1).
An optimal model f is desired that minimizes the mean square error between the estimated $\hat{y}_{t+1}$ and the real $y_{t+1}$ over the probability distribution of the inputs, which is determined by the thickener system.
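A form of the objective that is consistent with this description can be written as follows. It is a sketch inferred from the surrounding definitions, offered as an assumption rather than a transcription of (1); the expectation is taken over the input distribution induced by the thickener system.

```latex
\min_{f}\; \mathbb{E}\left[\left(\hat{y}_{t+1} - y_{t+1}\right)^{2}\right],
\qquad
\hat{y}_{t+1} = f\left(x_{t-T+1}, \ldots, x_{t},\; y_{t-T+1}, \ldots, y_{t}\right)
```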

Feature Enhancement
Many researchers have demonstrated that the solid mass of the mud bed, $m(t)$, has a strong impact on underflow concentration. Meanwhile, based on the law of mass balance, changes in the total solid mass of the mud bed mainly depend on the solid mass flows of feeding and discharging [22]. Therefore, the change in solid mass can be calculated by (2).
We assume that the flow rate and concentration change linearly between samples, and let $I$ be the data sampling interval. Therefore, the current solid mass in the tank can be simplified to (3), where $\varphi_U(t)$ and $\varphi_F(t)$ are the real-time densities of the underflow and the feed flow, respectively. The relationship between density and concentration for tailing slurry usually obeys the quadratic function in (4). We adopt physical detection methods to measure the concentration and density of a large number of slurry samples. The parameters in the equation are fitted, and the result is a = 1.2198, b = 0.2390, c = 1.0510. Finally, we add six additional features that represent the properties of the solid mass, listed in Table 3.
Table 3. Detailed monitoring point list in thickener system
• The increment of solid mass from feed slurry.
• The decrease of solid mass by discharging underflow.
• The changes of solid mass in tank.
• Cumulative changes of solid mass in tank.
In the remainder of the paper, the features used for prediction are the original sensor series together with these enhanced features.
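The mass-balance features described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the coefficients a, b, c are assumed to be the quadratic, linear, and constant terms of the density relation, concentration is assumed to be a mass fraction in [0, 1], units are assumed to be consistent, and all function and key names are hypothetical.

```python
from itertools import accumulate
from typing import Dict, List

# Coefficients fitted in the text. Mapping them to the quadratic, linear and
# constant terms (in that order) is an assumption of this sketch.
A, B, C = 1.2198, 0.2390, 1.0510


def density(conc: float) -> float:
    """Slurry density for a concentration given as a mass fraction in [0, 1]."""
    return A * conc ** 2 + B * conc + C


def solid_mass_per_interval(flow: List[float], conc: List[float],
                            interval: float) -> List[float]:
    """Solid mass moved between consecutive samples (trapezoidal rule).

    The instantaneous solid mass rate is flow * density * concentration and is
    taken to vary linearly between two samples.
    """
    rate = [q * density(c) * c for q, c in zip(flow, conc)]
    return [0.5 * (r0 + r1) * interval for r0, r1 in zip(rate, rate[1:])]


def solid_mass_features(feed_flow: List[float], feed_conc: List[float],
                        under_flow: List[float], under_conc: List[float],
                        interval: float) -> Dict[str, List[float]]:
    """Increment, decrease, net change and cumulative change of solid mass."""
    mass_in = solid_mass_per_interval(feed_flow, feed_conc, interval)
    mass_out = solid_mass_per_interval(under_flow, under_conc, interval)
    change = [m_in - m_out for m_in, m_out in zip(mass_in, mass_out)]
    return {
        "mass_in": mass_in,
        "mass_out": mass_out,
        "change": change,
        "cumulative_change": list(accumulate(change)),
    }
```

When the feed and the underflow carry the same solid mass, the net and cumulative changes are zero, which is the steady mass balance of the mud bed.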

Dual-Stage Attention-Based Mechanism for High-Dimensional Time Series Prediction
This paper employs a time series prediction model named DARNN for predicting underflow concentration. In this subsection, the structure of DARNN is introduced first, and then we explain how to model the underflow concentration prediction problem based on the DARNN model.
To simplify the notation in this part, we make a small change to the input series. For the given input sequences $X = (x_{t-T+1}, \ldots, x_{t-1}, x_t)$ and $y = (y_{t-T+1}, \ldots, y_{t-1}, y_t)$, we rewrite the indexes of each series to construct the equivalent $X = (x_1, \ldots, x_{T-1}, x_T)$ and $y = (y_1, \ldots, y_{T-1}, y_T)$. Correspondingly, our goal becomes estimating $\hat{y}_{T+1}$ as accurately as possible.

The Relationship Between DARNN and RNNs Family
RNNs are a family of architectures that have been used to model sequential problems, as their hidden states carry information about past inputs. As one of the most popular architectures, the encoder-decoder framework splits the sequence translation process into two phases and is widely used in machine translation and sequence generation. The architecture is built from two stacked RNNs. The first, named the encoder, encodes an input series of arbitrary dimension into a vector representation in a fixed-length space. The second, named the decoder, decodes this vector representation into a target sequence. The two modules are trained together to minimize the loss penalty of the output target sequence. The two processes can be formulated as functions $f_1$ for the encoding stage (5) and $f_2$ for the decoding stage (6).
Some references [23,24] show that when the dimension of the input sequence increases, a fixed-length representation cannot encode the high-dimensional sequence well, and performance drops rapidly. To confront this problem, a mechanism named attention is employed in the decoding stage, which assigns the weights of the hidden states $h_j$ dynamically at each time step. The formulation of the decoding stage is changed to (7), together with (8). The attention weight $\beta_t^i$ represents the temporal importance of the encoded information. It is calculated by (9) and (10), where $[d_{t-1}; h_i] \in \mathbb{R}^{p+m}$ is the concatenation of the previous hidden state of the decoding stage and the output of the encoder mechanism, and $v_d \in \mathbb{R}^m$ and $W_d \in \mathbb{R}^{m \times (p+m)}$ are parameters to learn. These parameters determine the fully connected neural network that scores each encoder state. The decoder predicts the target sequence conditioned on the time-varying hidden vector $c_t$.
The many successes in sequence modeling tasks have led to the encoder-decoder framework being used in almost all advanced recurrent architectures. Some theories of human attention [19] argue that behavioral results are best modeled by a two-stage attention mechanism.
The human attention system can select elementary stimulus features in the early stages of processing. Based on the encoder-decoder framework, a new network structure named the dual-stage attention-based recurrent neural network (DARNN) is proposed in [25]. Compared with the single-attention encoder-decoder architecture, DARNN additionally considers the weighted importance of the input relevant series. In the encoding stage, an input attention mechanism is used to adaptively select the importance of every component $x_t^k$ at each time step t. The encoding process (5) is updated to (11), in which each original component is transformed into a weighted one by (12). The attention weight $\alpha_t^k$ is determined by the hidden state $h_{t-1}$ and the complete kth relevant sequence $x^k = [x_1^k, x_2^k, \ldots, x_T^k]$ over all time steps. Here, another fully connected network and a softmax normalization are employed in this second attention model, given by (13) and (14), where $h_{t-1}$ is the hidden state of the encoder, and $v_e \in \mathbb{R}^T$ and $W_e \in \mathbb{R}^{T \times (m+T)}$ are learnable parameters shared across all relevant sequences $x^k$. With the above attention mechanism, $h_t$ carries the deeply encoded information of $x_t$ together with input information from other time steps $x_i$, where $i \neq t$.
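As an illustration of the two attention stages, the sketch below computes the input attention weights over the n relevant series and the temporal attention weights over the T encoder states in plain Python. It assumes that each score takes the additive form $v^T \tanh(W[\text{state}; \text{input}])$ followed by a softmax, which is what the parameter shapes stated above suggest. This form and all function names are assumptions of the sketch, not the authors' implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax; the returned weights sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def score(v: Vector, W: Matrix, z: Vector) -> float:
    """Additive attention score v^T tanh(W z)."""
    hidden = [math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in W]
    return sum(vi * hi for vi, hi in zip(v, hidden))


def input_attention(h_prev: Vector, series: List[Vector],
                    v_e: Vector, W_e: Matrix) -> Vector:
    """Weights over the n relevant series; series[k] is x^k with length T.

    h_prev has length m, so W_e has shape T x (m + T) and v_e has length T.
    """
    return softmax([score(v_e, W_e, h_prev + x_k) for x_k in series])


def temporal_attention(d_prev: Vector, enc_states: List[Vector],
                       v_d: Vector, W_d: Matrix) -> Tuple[Vector, Vector]:
    """Weights over the T encoder states and the resulting context vector.

    d_prev has length p and each encoder state has length m, so W_d has
    shape m x (p + m) and v_d has length m.
    """
    beta = softmax([score(v_d, W_d, d_prev + h_i) for h_i in enc_states])
    m = len(enc_states[0])
    context = [sum(b * h[j] for b, h in zip(beta, enc_states)) for j in range(m)]
    return beta, context
```

With all-zero parameters every score is zero, so both attention stages reduce to uniform weights and the context vector is the plain average of the encoder states. Training moves the weights away from this uniform starting point.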

Modelling Underflow Concentration Prediction Problem Based on DARNN Model
This paper follows the concept of the DARNN framework and solves the high-dimensional underflow concentration prediction problem with a temporal and spatial attention mechanism. A graphical illustration of the model is shown in Figure 3. The output of the encoder mechanism is the input of the decoder mechanism. The encoder embeds the history series into encoded features $h_t$, which are inferred by an LSTM unit in the encoder module. The encoded features are then decoded by the decoder module to produce a new hidden state $d_t$. A third neural network estimates the difference between $y_{t+1}$ and $y_t$ from $d_t$ and the context features $c_t$.
Figure 3. (a) Overall framework of Encoder mechanism; (b) Overall framework of Decoder mechanism and output neural network.
As Figure 2 shows, the complete model is a learnable chain that consists of three main parts: the encoder, the decoder, and a global residual network for predicting the underflow concentration. The workflow of the encoder and decoder has been introduced in the previous part. There is one slight difference in the proposed method: the underflow concentration sequence $y = (y_1, \ldots, y_{T-1}, y_T)$ is not encoded by the encoder mechanism. Because the sequence y is a shallow feature that has a straightforward statistical relationship with the predicted $\hat{y}_{T+1}$, it does not need to be encoded like the other relevant series X. Instead, we make it part of the input of the decoder mechanism. Therefore, the equation of the decoding process (7) is changed to (15).
The key reason for using an LSTM unit is that it can overcome the problem of vanishing gradients and better capture the long-term dependencies of time series. This advantage is especially useful for thickener system prediction because a long time delay often occurs when the system changes. Finally, the encoder and decoder modules transform the original input sequences y and X into the high-dimensional feature sequences $(d_1, d_2, \ldots, d_T)$ and $(c_1, c_2, \ldots, c_T)$. Another network module takes the feature representation at the last time step T and produces the desired $\hat{y}_{T+1}$ in (27).
In (27), $\hat{y}_{T+1} = F(y_1, \ldots, y_T, x_1, \ldots, x_T)$, and $[d_T; c_T] \in \mathbb{R}^{p+m}$ is the concatenation of the decoder hidden state and the context vector. A single-hidden-layer neural network composed of a learnable input layer $(W_y, b_w)$ and a hidden layer $(v_y, b_v)$ is used to produce the final prediction result. The use of $c_T$ in the last prediction phase can be explained from the perspective of multi-level feature fusion [26]. Because $c_T$ is the weighted sum of $(h_1, h_2, \ldots, h_T)$, it includes all the embedded information from the encoder module. This skip connection plays a role in maintaining the range of the gradient similar to that of a res-block or dense-block [27]. Furthermore, there is a bias term $y_T$ in (27), which means that the model does not learn the underflow concentration $y_{T+1}$ itself, but the difference $\Delta y = y_{T+1} - y_T$. The underflow concentration changes in an almost continuous way, so for two adjacent time steps the next value $y_{T+1}$ is approximately equal to the current value $y_T$. This trick lets the model use the prior information in $y_T$ more fully. Experimental results show that, compared with the no-bias scheme, the bias term gives a much lower initial model error before training, and the model converges to the best parameters rapidly.
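The effect of the bias term can be seen in a small sketch of the output layer. The code below is an illustration under assumptions: the hidden layer is taken to be linear (no activation), the two layers follow the parameter names $(W_y, b_w)$ and $(v_y, b_v)$ from the text, and the function name is hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def predict_next(y_last: float, d_T: Vector, c_T: Vector,
                 W_y: Matrix, b_w: Vector, v_y: Vector, b_v: float) -> float:
    """Residual prediction: the network outputs delta_y, and y_T is added back."""
    features = d_T + c_T  # concatenation [d_T; c_T], length p + m
    hidden = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(W_y, b_w)]
    delta_y = sum(v * h for v, h in zip(v_y, hidden)) + b_v
    return y_last + delta_y
```

If the output-layer parameters start near zero, the prediction starts near $y_T$, which is already a reasonable guess for a slowly changing concentration. This is consistent with the low initial error reported above.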

Model Training
All operations in our model are smooth and differentiable, so we can train the model by the standard backpropagation algorithm with the loss function defined in (22), where N is the number of training samples. More details of the training are introduced in the next section. The code is implemented in PyTorch, and the source code can be found on GitHub (https://github.com/Kyrie-Hu/Thickener-Underflow-Concentration-Prediction).
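Given the mean square error objective stated in the problem formulation, a loss consistent with the description of (22) can be written as below. The symbol $\mathcal{O}$ and the sample index i are notational assumptions of this sketch, not a transcription of the original equation.

```latex
\mathcal{O}\left(y_{T+1}, \hat{y}_{T+1}\right)
  = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}^{\,i}_{T+1} - y^{\,i}_{T+1}\right)^{2}
```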

Industrial Case Study
In this section, we first describe the dataset collected from our thickener IIoT platform. Detailed experimental settings are then given, together with comparative results against LightGBM, RNN, and LSTM on prediction accuracy. To help explain the method, ablation tests are conducted for further analysis of the attention mechanisms.

IIoT Platform
This study is based on an IIoT platform that supports communication among sensors, industrial equipment, the distributed control system, and a high-performance computing server. The topology of the framework is shown in Figure 4. Details of the sensors deployed in the factory are listed in Table 4. A sample of the dataset is shown in Table 5. In our case, the system adopts the SIMATIC Process Control System PCS 7 APL. All training data are real production data collected from the IIoT platform.
Table 4. Details of sensors in data collection system.

Data Preprocessing and System Set-Up
To verify the performance of the proposed method and the baselines adequately and fairly, batches of data from different time periods are used to train and test the models separately. We construct the training data set from production data recorded from May to June 2018. The test data set corresponds to original data produced in September 2019.
We apply several preprocessing procedures to the original dataset derived from the thickener system, including removing outliers, deleting the intervals when the system is out of service, and normalizing the data so that each series has zero mean and unit variance. About 14,800 clean data points are left after preprocessing, and the sampling period between two adjacent points is 2 minutes. Each data point has a total of eight parameters, including the underflow concentration column. Then, according to the correlation analysis between features, we create six additional features for each record by using the method introduced in Section 3.2.
Finally, we obtain a dataset that has 14 features in each data point. In our study, underflow concentration is the predicted target series, and the other 13 features are relevant series. The first 8847 data points of the training set are used to train the model, and the following 2949 data points form the validation set, which helps us find the best experimental parameters and stop the training iterations properly. The test data set has 2949 data points, all of which are used for testing. A diagram illustrating the data preprocessing process is shown in Figure 5. We use minibatch stochastic gradient descent (SGD) together with the Adam optimizer [28]. The batch size is 128 and the learning rate is fixed at 0.001.
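The preprocessing steps above (standardization, windowing into length-T histories, and a chronological train/validation split) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the window length T is left as a parameter because its value is not stated here, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# One sample: (history of relevant series, history of target, next target).
Sample = Tuple[List[List[float]], List[float], float]


def standardize(series: Sequence[float]) -> List[float]:
    """Rescale a series to zero mean and unit variance (series must not be constant)."""
    mu, sigma = mean(series), pstdev(series)
    return [(v - mu) / sigma for v in series]


def make_windows(X: List[List[float]], y: List[float], T: int) -> List[Sample]:
    """Pair each length-T history ending at step t with the target y[t + 1]."""
    samples: List[Sample] = []
    for t in range(T - 1, len(y) - 1):
        samples.append((X[t - T + 1:t + 1], y[t - T + 1:t + 1], y[t + 1]))
    return samples


def chronological_split(samples: List[Sample], n_train: int,
                        n_val: int) -> Tuple[List[Sample], List[Sample]]:
    """Keep time order: the first n_train samples train, the next n_val validate."""
    return samples[:n_train], samples[n_train:n_train + n_val]
```

Keeping the split chronological, as in the 8847/2949 division described above, prevents later observations from leaking into the training windows.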

Accuracy Analysis of Underflow Concentration Prediction
To demonstrate the effectiveness of our method, we compare it against three other methods. Among them, LightGBM [29] is a gradient boosting decision tree (GBDT) algorithm. It contains two novel techniques, gradient-based one-side sampling and exclusive feature bundling, which deal with large numbers of data instances and of features, respectively. The recurrent neural network (RNN) is a classical method for time series prediction. Long short-term memory (LSTM), one of the most popular methods for time series prediction, mitigates the exploding and vanishing gradient problems of the RNN.
To measure the effectiveness of the various methods for time series prediction, we consider four evaluation metrics. Among them, root mean squared error (RMSE), root mean squared logarithmic error (RMSLE) [30], and mean absolute error (MAE) are scale-dependent measures, and mean absolute percentage error (MAPE) is a scale-independent measure. Specifically, assuming $y_t$ is the target at time t, $\hat{y}_t$ is the predicted value at time t, and N is the number of evaluated samples, RMSE is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(y_t - \hat{y}_t\right)^2}$$
and MAE is denoted as
$$\mathrm{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|y_t - \hat{y}_t\right|$$
When comparing prediction performance, mean absolute percentage error is popular because it measures the prediction deviation as a proportion of the true values, i.e.,
$$\mathrm{MAPE} = \frac{1}{N}\sum_{t=1}^{N}\left|\frac{y_t - \hat{y}_t}{y_t}\right| \times 100\%$$
RMSLE is an evaluation metric from the Kaggle competition, calculated as
$$\mathrm{RMSLE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(\log(\hat{y}_t + 1) - \log(y_t + 1)\right)^2}$$
The results of the baseline methods and ours over the dataset are shown in Table 6.
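For reference, the four metrics can be computed with the standard library as in the sketch below. It uses the usual textbook definitions. The function names are illustrative, MAPE is returned in percent, and targets are assumed to be nonzero for MAPE and greater than -1 for RMSLE.

```python
import math
from typing import Sequence


def rmse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mape(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; targets must be nonzero."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


def rmsle(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Root mean squared logarithmic error; log1p(v) computes log(v + 1)."""
    return math.sqrt(
        sum((math.log1p(b) - math.log1p(a)) ** 2 for a, b in zip(y, y_hat)) / len(y)
    )
```

Because RMSLE and MAPE compare values relative to the target, they are less sensitive to the absolute scale of the concentration than RMSE and MAE.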
In Table 6, we observe that the MAE of LightGBM is generally worse than that of the RNN-based approaches. Because the input of the LightGBM model does not include historical data points, the model cannot make full use of the historical information in the sequences. Among the RNN-based approaches, LSTM performs better than RNN, illustrating that LSTM is more capable of capturing the long-term temporal dependence that is essential in our problem. Our method goes further: it not only uses an input attention mechanism to extract relevant feature series, but also employs a temporal attention mechanism to select relevant hidden features across all time steps. Both attention mechanisms preserve meaningful features and suppress useless features during the feedforward stage. This is a significant improvement because the attention branches mean that the model no longer infers $\hat{y}_{T+1}$ with a fixed, static scheme. The comparison of the prediction results of the different algorithms is shown in Figure 6.
To further investigate the importance of input features, we designed a comparative experiment. Specifically, we generate six additional feature series by analyzing the operating characteristics of the deep cone thickener. Then, we use these six enhanced feature series together with the eight original feature series as the input and test the effectiveness of our method. In Table 6, we can see clearly that, with either LSTM or our method, the performance with the enhanced feature series is significantly better than that with the original feature series alone.

Comparison of Temporal Attention and Spatial Attention
To verify the effectiveness of the two attention mechanisms in our model, we conduct an ablation experiment that studies the contribution of each attention part by removing one or both attention modules. The experimental results are shown in Table 7. In Table 7, the spatial attention RNN outperforms the no-attention RNN. This suggests that adaptively extracting feature series provides more reliable input features for accurate prediction. Likewise, the temporal attention RNN performs better than the no-attention RNN. This shows that weighting the importance of different time points in the series provides effective support for the prediction. Our method combines temporal attention and spatial attention and, as a result, achieves the best prediction results.

A Study on the Effect of Global Residual Connection
In this subsection, an ablation experiment is conducted to study the effect of the global residual connection in Equation (27). The skip connection is removed in the compared model, and both models are trained from randomly initialized parameters. The validation losses of the two models during the training phase are illustrated in Figure 7. The improvement that comes from the skip connection can be explained by the properties of the thickening system. In the industrial control field, the dynamical system of the thickener is usually formulated as an ordinary differential equation (ODE) [3]:
$$\frac{dy(t)}{dt} = h\left(y(t), x(t)\right)$$
Relevant parameters $x(t)$, such as mud pressure, feed flow rate, and the other dynamical variables, together with the underflow concentration $y(t)$, directly affect the derivative of underflow concentration, which is defined by $h(y(t), x(t))$. In the proposed method, the global residual connection makes the DARNN model learn the current derivative $h(y(t_0), x(t_0))$, which can be viewed as discretizing the continuous thickening system. When $t_0$ is approximately equal to $t_1$, the difference of underflow concentration $y(t_1) - y(t_0)$ is approximately equal to $(t_1 - t_0)\,h(y(t_0), x(t_0))$. In our method, the interval between two adjacent time steps is 2 minutes, which is very short for the thickening process. The discretization error is therefore relatively small, and prediction accuracy can be improved by simplifying the target function.
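The discretization argument can be checked numerically on a toy system. The sketch below is only an illustration: the first-order dynamics h, the rate constant, and the numeric values are hypothetical choices made so that an exact solution is available, and they do not describe the real thickener.

```python
import math


def h(y: float, target: float, k: float) -> float:
    """Toy derivative: concentration relaxes toward `target` at rate k (per minute)."""
    return -k * (y - target)


def euler_step(y0: float, dt: float, target: float, k: float) -> float:
    """One forward-Euler step: y(t1) is approximated by y(t0) + (t1 - t0) * h."""
    return y0 + dt * h(y0, target, k)


def exact_step(y0: float, dt: float, target: float, k: float) -> float:
    """Closed-form solution of the toy ODE after a time dt."""
    return target + (y0 - target) * math.exp(-k * dt)


def euler_error(y0: float, dt: float, target: float, k: float) -> float:
    """Absolute discretization error of a single Euler step."""
    return abs(euler_step(y0, dt, target, k) - exact_step(y0, dt, target, k))
```

For this toy system the one-step error grows roughly with the square of the step length, so a short sampling interval keeps the residual target $y(t_1) - y(t_0)$ close to the scaled derivative.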

Discussion
This study provides evidence that the dynamic attention branches correspond to the dynamic properties of the thickener. For example, a change in feed concentration does not influence the underflow concentration at once; the effect takes place after a while. Moreover, the time delay is not constant but is closely related to the height of the mud bed. Many similar phenomena exist in the thickening process. Therefore, a simple sequential network without dynamic branches can hardly fit these dynamic properties well. From the perspective of data quality, sensors monitor industrial data by converting physical signals into electrical signals and generating numerical values. In this process, various kinds of noise degrade the performance of the sensors. In the thickener system, the prediction model must not only learn to estimate the future underflow concentration, but also counteract the noisy input and the noisy feedback loss. Data of poor quality can hardly yield high-quality models that predict concentration well over a long horizon. Compared with other models, DARNN has additional parameters and a dynamic branch, which improve its ability to filter high-frequency noise from the input.
Furthermore, thickening is a slow process, and the underflow concentration almost never changes abruptly. Compared with DARNN, the other time series prediction methods all produce an estimated underflow concentration $\hat{y}_{t+1}$ that is extremely close to the current underflow concentration $y_t$. This behavior gives the model a relatively low loss penalty, but it has no value for industrial needs. Because of the global residual connection, DARNN fits these tiny changes of concentration well, which improves the accuracy and gives important indications that help the operator evaluate the current production and apply feedforward control.

Conclusions
In this paper, we present a dual-attention method for predicting the future underflow concentration of the thickener system. The method also includes a feature enhancement stage based on domain knowledge. By considering the properties of the thickener system, we produce six additional derived features from the original sensor data so that the model can more easily learn the latent regularity of underflow concentration changes in the thickener. The dual-attention method is implemented as a composition of encoder and decoder mechanisms, which capture both temporal information and relevant-series information from the input history data.
We applied this method on an industrial IIoT platform. The results show that the enhanced features improve the prediction accuracy significantly and that the proposed method outperforms other commonly used time series models. Meanwhile, two ablation experiments show that the contributions of the different attention mechanisms and of the global residual connection are significant. This method also has potential uses in other industrial time series problems that have pronounced temporal and high-dimensional properties. However, the numerous parameters and complex operations restrict the efficiency of the model, which makes it unsuitable for real-time applications. A more lightweight network structure that achieves similar performance is expected in future studies.