A Multivariate Temporal Convolutional Attention Network for Time-Series Forecasting

Abstract: Multivariate time-series forecasting is one of the crucial and persistent challenges in time-series forecasting. Because multivariate time series exhibit both correlation between variables and volatility, they present the forecasting model with highly nonlinear temporal characteristics. In this paper, we propose a new multivariate time-series forecasting model, the multivariate temporal convolutional attention network (MTCAN), which is based on a self-attention mechanism. MTCAN builds on the convolutional neural network (CNN) model, using 1D dilated convolution as the basic unit to construct asymmetric blocks; feature extraction is then performed by the self-attention mechanism to obtain the prediction results. The input and output lengths of the network can be set flexibly. The method is validated on three different multivariate time-series datasets. The reliability and accuracy of the prediction results are compared with those of Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Long Short-Term Memory (ConvLSTM), and Temporal Convolutional Network (TCN) models. The results show that the proposed model significantly improves prediction accuracy and generalization.


Introduction
A multivariate time series is an important data object: a series of observations of multiple variables recorded in chronological order. Multivariate time series are used in a growing number of fields, such as the environment [1,2], finance [3,4], transportation [5][6][7], healthcare [8], and energy [9,10]. In these fields, time-series prediction is used to monitor critical data and to avoid unforeseen situations that cause economic losses. Early solutions for multivariate time-series prediction mainly relied on recurrent networks, but recurrent neural networks (RNNs) suffer from vanishing and exploding gradients, so they cannot solve the long-term dependence problem [11,12]. Their sequential structure, on the one hand, hinders efficient parallel computation (the computation of the current state depends not only on the current input but also on the previous state) and, on the other hand, makes RNN models, including variants such as LSTM [13] and GRU [14], behave more like a Markov decision process [15], which makes it difficult to extract global information. In addition, CNN models [16] have started to be applied to sequence modeling. For multivariate time-series [17] problems, these models also have difficulty capturing the mapping relationships between multiple variables and adapting to complex data features.
Traditional CNNs are generally considered less suitable for modeling time series, mainly because the limited size of the convolutional kernel prevents them from capturing long-term dependencies well. With the development of deep learning, some specially designed convolutional neural networks can also achieve good results in time-series modeling. The TCN model [18], which is based on the CNN model, uses causal and dilated convolutions together with residual modules [19] to make it suitable for temporal modeling tasks, and TCN can match or even surpass RNN models in many tasks. In contrast to the RNN model, the CNN model has no recurrent structure and can perform parallel computations to make full use of computing power. Our goal is to explore a better CNN-based architecture, combined with an attention mechanism and feedforward neural networks, that can approximately replace recurrent networks and improve training efficiency while remaining effective for multivariate time-series problems.
Inspired by TCN and the attention mechanism [20], in this paper, we propose the MTCAN model for multivariate time-series prediction. In MTCAN, we use a feedforward neural network as the base unit to construct residual blocks; we then enhance the interpretation of features with an asymmetric residual block network [21] and finally perform feature extraction with a self-attention mechanism. The paper is organized as follows: Section 2 reviews the background of the work. Section 3 describes the modeling approach. The experiments are analyzed and discussed in Section 4. Finally, conclusions and an outlook are given in Section 5.

Background and Related Work
Researchers have proposed many solutions to the task of time-series prediction [22,23], ranging from the earliest classical statistical methods to current deep learning algorithms. In 1927, the British statistician G. U. Yule proposed the AR (autoregressive) model [24,25]. In 1931, G. T. Walker proposed the MA (moving average) model and the ARMA model [26,27], which formed the basis of time-series analysis. Subsequently, Box and Jenkins discussed the ARIMA (autoregressive integrated moving average) model [28]. All four models are linear and assume a univariate, homoskedastic time series. In recent years, techniques such as machine learning and neural networks have developed rapidly, and these new methods have been applied to time-series forecasting. In 1998, White applied the neural network approach to time-series forecasting. Vladimir N. Vapnik proposed the original support vector machine [29] and used it in capital cost estimation. In 2006, Geoffrey Hinton and Ruslan Salakhutdinov proposed a solution to the vanishing gradient problem in deep network training, and deep neural networks came back into the limelight. In particular, CNNs and RNNs have received widespread attention. Convolutional neural networks are widely used in image recognition, and recurrent neural networks are widely used in sequence modeling.
The main deep learning approach to prediction problems is the recurrent neural network. However, since recurrent neural networks suffer from vanishing or exploding gradients, they cannot solve the long-range dependence problem. To address this, Hochreiter and Schmidhuber proposed the long short-term memory (LSTM) network [13], which uses a gate mechanism to mitigate the vanishing gradient problem. To keep short-term memory from overwriting long-term memory, LSTM adopts a cell state that preserves long-term memory and works together with the gate mechanism to filter information and thereby control long-term memory. The gated recurrent unit (GRU) network was proposed by Cho et al. and can be regarded as a simplified version of LSTM. Compared with LSTM, GRU has fewer parameters and converges faster, so the iterative training process can be greatly accelerated.
Although CNNs are generally used for image classification, with dedicated design they have proven to be effective tools for sequence modeling and prediction. Bai et al. proposed temporal convolutional networks (TCNs), which consist of dilated, causal 1D convolutional layers with the same input and output length. They showed that in many tasks, convolutional networks can outperform RNNs while avoiding common drawbacks of recurrent models, such as exploding/vanishing gradients or the lack of memory retention. In a multivariate time-series task, the data point at time step t contains several variables and depends on previous data points; any two data points may be correlated, and the variables within a data point may also be correlated. Therefore, a feasible multivariate time-series model should describe not only the temporal correlation between data points, as in a univariate time series, but also the correlation of the variables within each data point.
Wan et al. proposed the M-TCN model [30] for multivariate time-series forecasting problems. It has been shown to perform better on multivariate time-series tasks, but M-TCN uses fully connected layers, which leads to a large number of parameters and thus makes the model more time-consuming.
The self-attention mechanism is an attention mechanism proposed by Ashish Vaswani et al. The attention mechanism essentially assigns a weight factor to each element of the sequence, which can also be understood as soft addressing. If each element in the sequence is stored as a pair (K, V), then the attention mechanism performs addressing by computing the similarity between Q and K. This similarity reflects the importance of the retrieved value V. The difference between the self-attention mechanism and the general attention mechanism is that self-attention reduces the dependence on external information: the elements in the sequence find the similarity among themselves. Such a mechanism enhances the capture of dependencies. We adopt the self-attention mechanism for feature extraction here to improve the effectiveness of the model and reduce the number of parameters.
In this context, our model uses a TCN-based design with a self-attention mechanism for feature extraction. It is tested on three datasets covering PM2.5 prediction, electricity prediction, and weather prediction.

Methodology
For the multivariate time-series forecasting task, we first describe the definition and construction of a sequence model. We then highlight the idea and structure of the proposed MTCAN model, which incorporates a self-attention mechanism.

Sequence Problem Statement
Deep learning essentially uses deep neural networks to fit the complex nonlinear relationships between data and labels. To obtain the desired nonlinear mapping during model learning, a large amount of data is needed to extract the features of the data. A multivariate time-series forecast is also a sequential prediction problem [31]. Suppose the input sequence is $x_{1:T} = x_1, x_2, \ldots, x_T$ with length $T$ and the target sequence is $y_{1:H} = y_1, y_2, \ldots, y_H$ with length $H$. The goal of the model is to construct a nonlinear mapping from the current state to the predicted time series:

$$y_{1:H} = \mathrm{SeqMod}(x_{1:T})$$

Note that $y_h$ should satisfy the causal constraint, which prevents future information $x_{t>h}$ from leaking. The lengths of the input and output need not be the same. Building SeqMod essentially means finding a neural network with the best prediction result. In the traditional time-series modeling process, RNNs are generally chosen for sequence modeling because sequence modeling relies on past information. However, with the development of feedforward models, researchers have found that, with special treatment, feedforward models can also be used for sequence modeling and can take advantage of parallel training.
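As a hedged illustration of how input/target pairs of lengths T and H might be cut from a recorded series, the following minimal sketch uses plain Python lists. The function name `make_windows` and the parameter `target_index` are hypothetical and chosen only for this example; the sliding-window construction is a common convention and is assumed here, not taken from the paper.

```python
def make_windows(series, input_len, horizon, target_index=0):
    """Slice a multivariate series into (input window, target window) pairs.

    series: list of time steps, each a list with one value per variable.
    Each x holds `input_len` consecutive steps of all variables; each y holds
    the next `horizon` values of the single variable at `target_index`.
    """
    pairs = []
    last_start = len(series) - input_len - horizon
    for start in range(last_start + 1):
        split = start + input_len
        x = series[start:split]                      # only past observations
        y = [row[target_index] for row in series[split:split + horizon]]
        pairs.append((x, y))
    return pairs


# Toy series: 6 time steps, 2 variables per step.
toy = [[t, 10 * t] for t in range(6)]
pairs = make_windows(toy, input_len=3, horizon=2)
```

Because every target window starts strictly after its input window ends, no future value enters the input, which is one simple way to respect the causal constraint stated above.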

Model Structure
Time-series prediction requires memory of past information, and CNNs in general do not have the ability to remember. RNN models, such as LSTM, are generally regarded as the standard approach to time-series prediction; however, because CNN models can perform parallel operations, they can save a lot of time compared with RNN models. Based on these considerations, we designed a general framework inherited from CNNs. The goal is to obtain the best convolutional network design as a flexible and stable framework for multivariate time-series forecasting. We denote the proposed network structure as the multivariate temporal convolutional attention network, MTCAN. The salient features of MTCAN are (1) a one-dimensional convolution is used instead of causal convolution, (2) a residual connection structure is used, (3) an attention mechanism is used to capture features, and (4) the input and output lengths of the model can be determined flexibly. The MTCAN network is, in general, a multi-head network structure. In this work, we use asymmetric residual blocks and the attention mechanism to construct an effective network. The detailed structure of the model is as follows.

1D Dilated Convolution
In causal convolution [32], the output at time t is only convolved with elements from the previous layer at time t and earlier. One major drawback of this design is that obtaining a receptive field large enough to cover a long period of time requires a very deep network or a very large filter, which would make the model too large. Causal convolution also imposes a one-to-one causal relationship between input and output. For time-series problems, this design prevents parallel computation during the operation, which makes feature learning inefficient. The one-dimensional convolutional network is used to avoid this situation and to improve the feature learning efficiency.
However, traditional 1D convolutional networks use pooling, which is designed to expand the receptive field and reduce the size of the sequences. An obvious drawback of pooling is the loss of sequence features in the information-merging process. Dilated convolution increases the range of the filter without increasing the number of weight parameters in it. Thus, dilated convolution enlarges the receptive field of the neural network without increasing the computational cost, and it is well suited to long-range dependence problems inside a time series. It is because of these advantages that we adopt one-dimensional dilated convolution [33] as the basic unit of our model. We can define the 1D dilated convolution as:

$$\mathrm{Out}(k, q) = \sum_{c} \sum_{s} \mathrm{In}(c, q + d \cdot s) \cdot \mathrm{Weight}(k, c, s)$$

where In(c, w) is the input vector, Out(k, q) is the output vector, Weight(k, c, s) is the filter weight, and d is the dilation factor. Figure 1 shows the process of a dilated convolution.
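To make the indexing concrete, the sketch below assumes the common convention Out(k, q) = sum over c and s of In(c, q + d·s) · Weight(k, c, s), with stride 1 and no padding. These conventions, and the function name `dilated_conv1d`, are assumptions made for illustration; the paper's exact padding and stride settings are not stated in this section.

```python
def dilated_conv1d(inp, weight, dilation=1):
    """1D dilated convolution on nested lists (stride 1, no padding).

    inp:    [C][W]     input with C channels and width W
    weight: [K][C][S]  K filters, each with C channels and S taps
    Returns [K][Q] with Q = W - dilation * (S - 1).
    """
    channels, width = len(inp), len(inp[0])
    taps = len(weight[0][0])
    out_width = width - dilation * (taps - 1)
    if out_width <= 0:
        raise ValueError("input too short for this kernel size and dilation")
    out = []
    for filt in weight:                      # one output channel per filter
        row = []
        for q in range(out_width):
            total = 0.0
            for c in range(channels):
                for s in range(taps):
                    # Taps are spaced `dilation` samples apart.
                    total += inp[c][q + dilation * s] * filt[c][s]
            row.append(total)
        out.append(row)
    return out


signal = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]   # one channel, width 6
kernel = [[[1.0, 0.0, -1.0]]]               # one filter, one channel, 3 taps
dense = dilated_conv1d(signal, kernel, dilation=1)
dilated = dilated_conv1d(signal, kernel, dilation=2)
```

With the same three weights, the filter spans 1 + d·(S − 1) input samples: 3 samples for d = 1 and 5 samples for d = 2, which is how dilation widens the receptive field without adding parameters.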

Residual Connect
Using convolutional neural networks for time-series prediction requires addressing the vanishing gradient problem caused by the large number of layers in multilayer convolutional networks. We build the model from residual blocks [34] to avoid this problem and to improve the model by increasing the number of layers. A residual block contains a branch that applies a series of transformations, whose output is added to the input of the residual block. We design a novel structure from multi-layer sequential residual networks and parallel residual networks. The core of the residual block is a shortcut between earlier and later layers, which does not introduce additional parameters or computational complexity. Because the skip connections in ResNet mean that not all residual blocks are functional, we use direct connections to ensure that each residual block learns useful information.
To increase the effectiveness of the neural network, the size of the convolutional kernel or the depth of the network can be increased, but this would make the computation very expensive. We instead introduce the asymmetric block structure [35], which creates an asymmetric factor in the structure of the whole network and has a positive effect on the whole model.
The structure of our residual block is shown in Figure 2. The input goes into two channels; in each, we first apply the 1D dilated convolution, then pass the result through a rectified linear unit (ReLU), and finally sum the output. This process is repeated three times in residual block 1 and four times in residual block 2. In this way, residual block 1 and residual block 2 form an asymmetric structure. Since the dimensionality of the final feature mapping may be different, a 1 × 1 convolution is introduced to adjust the dimensionality. Each repetition cell thus consists of a dilated convolution followed by a ReLU activation.
The last step is to add the input to the output of the block.
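The repetition-and-shortcut pattern described above can be sketched as follows. This is only an illustration under several assumptions that are not taken from the paper: a single channel, zero padding so that the length is preserved, one shared dilation rate per block, and no 1 × 1 convolution. The names `dilated_conv_same` and `residual_block` are hypothetical.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dilated_conv_same(x, kernel, dilation):
    """Single-channel dilated convolution, zero-padded so len(out) == len(x)."""
    span = dilation * (len(kernel) - 1)
    left = span // 2
    padded = [0.0] * left + list(x) + [0.0] * (span - left)
    return [
        sum(kernel[s] * padded[q + dilation * s] for s in range(len(kernel)))
        for q in range(len(x))
    ]


def residual_block(x, kernels, dilation=1):
    """Apply (dilated conv -> ReLU) once per kernel, then add the block input."""
    h = list(x)
    for kernel in kernels:
        h = relu(dilated_conv_same(h, kernel, dilation))
    return [a + b for a, b in zip(x, h)]   # shortcut: input + transformed output


identity = [0.0, 1.0, 0.0]                 # toy kernel that copies its input
x = [-1.0, 2.0, 3.0]
block_1 = residual_block(x, [identity] * 3)   # three repetitions
block_2 = residual_block(x, [identity] * 4)   # four repetitions
```

The two calls differ only in the number of repetitions, mirroring the three-versus-four asymmetry; how the two block outputs are combined is left out because this section does not specify it.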

Self-Attention
The attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, key, value, and output are all vectors. The dot product of the keys and queries gives the corresponding attention weights, and the weights are then multiplied with the values to obtain the final output. For self-attention, the three matrices Q (Query), K (Key), and V (Value) all come from the same input. We first compute the dot product between Q and K and then divide the result by a scale factor $\sqrt{d_k}$ to prevent the result from being too large, where $d_k$ is the dimensionality of a query and key vector. The result is then normalized using the Softmax operation and multiplied by the matrix V to obtain the weighted-sum representation. This computational procedure can be expressed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Figure 3 shows the process of calculating attention weights. The self-attention mechanism is a variation of the attention mechanism, which reduces the reliance on external information and is better at capturing the internal relevance of data or features. Compared with RNN and its variant models, it is able to compute in parallel and can make better use of computing power. We use the self-attention mechanism in MTCAN to capture the internal correlation of long time series, which can better describe the internal correlation of time series and improve the performance of the whole model.
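The computation just described (dot product of Q and K, scaling by the square root of d_k, Softmax, multiplication by V) can be sketched in plain Python as below. The learned projections that normally produce Q, K, and V are omitted, and the example passes the same matrix for all three; that simplification is an assumption for illustration, not the model's actual layer.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Self-attention on a toy input: Q, K, and V are all the same matrix.
x = [[1.0, 2.0], [1.0, 2.0]]
output, weights = scaled_dot_product_attention(x, x, x)
```

Each row of `weights` sums to one, so every output row is a weighted average of the value rows; with two identical input rows the weights are equal and the output reproduces the input.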

MTCAN Model
A multivariate time series is actually made up of multiple univariate time series, but there is a certain mapping relationship between them. Because of this structural feature, we adopt a structure similar to that of a multi-head attention network. The difference from the multi-head attention network is that we split the multivariate time series into multiple univariate time series and use each univariate time series as the input of one repeat block. This way, instead of using an identical sequence as the input of each repeat block, the mapping relationship between the multivariate time series can be captured more effectively.
We use a convolutional neural network as the basis of the model, and we need a network that is as deep as possible to capture long-term dependencies. However, deeper convolutional neural networks may suffer from vanishing gradients, or the added layers may fail to improve the model. To solve this problem, residual blocks are used, which allow the convolutional neural network to be deep enough to retain information over a long time.
The information learned in each repeat block has different features for each individual variable. In the repeat block, we first interpret the input with residual blocks, then extract features from this interpretation with self-attention, and finally merge the output vectors of all repeat blocks. The merged output is interpreted by the fully connected layer to obtain the prediction results. The structure of the multivariate temporal convolutional attention base network is shown in Figure 4.

Experiments
To evaluate the effectiveness of MTCAN, we perform experiments on a multivariate time-series forecasting task. We first describe the three datasets used for the empirical study; all data can be downloaded online. Then, the parameter tuning and the evaluation criteria are described. Finally, the proposed MTCAN model is compared with different models.
The ISO-NE dataset includes hourly demand, price, temperature and other features. The time frame of this dataset is from 2003 to 2014. We use two variables from this dataset, namely, hourly electricity demand and dry bulb temperature. For this dataset, hourly electricity demand is used as the forecast value.
The Beijing PM2.5 dataset has eight features that include dew point, temperature, atmospheric pressure, etc. The time range of this dataset is from 2010 to 2014. For this dataset, temperature is used as the predicted value.
The Jena Climate dataset contains 14 different features such as temperature, atmospheric pressure, and humidity, among others. We used only the data collected between 2009 and 2016, and these data were collected every 10 min. We used the first 10 variables of this dataset and extracted the data so that the time interval was 1 h. For this dataset, hourly temperatures were used as predicted values. Table 1 shows the length of the time series, the number of variables, and the sampling interval for the three datasets. There are cases where the values in the dataset are null, so preprocessing is required. For some "NA" values in the dataset, we use 0 instead.
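The two preprocessing steps mentioned above, replacing "NA" entries with 0 and reducing 10-minute readings to a 1 h interval, might look like the sketch below. It assumes that the hourly series is obtained by keeping every sixth sample (rather than, for example, averaging), which the text does not state; `fill_missing` and `downsample` are hypothetical helper names.

```python
def fill_missing(rows, missing="NA", fill=0.0):
    """Convert raw string fields to floats, replacing missing markers with `fill`."""
    return [
        [fill if value == missing else float(value) for value in row]
        for row in rows
    ]


def downsample(rows, step):
    """Keep every `step`-th row, e.g. step=6 turns 10-minute data into hourly data."""
    return rows[::step]


# Toy raw data: 12 readings at 10-minute spacing, 2 variables each.
raw = [[str(float(i)), "NA" if i % 4 == 0 else "1.0"] for i in range(12)]
clean = fill_missing(raw)
hourly = downsample(clean, step=6)
```

Filling with 0 is simple but can distort variables whose valid range does not include 0, so the choice is worth revisiting per variable.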

Experimental Details
Algorithm 1 is the learning rate schedule: the learning rate is decreased after every eight epochs without improvement in the validation score, until the minimum learning rate is reached. This algorithm is used for the learning rate updates of all models. For the multivariate time-series forecasting task, the sequence length for most models is chosen from {24, 72, 144}, with a batch size of 100. MSE is used as the default loss function for the forecasting task. The optimization strategy uses Adam, and the initial learning rate is set to 0.001.
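A minimal sketch of such a patience-based schedule is given below. The class name `ReduceOnPlateau`, the reduction factor 0.5, the value of `min_lr`, and the choice to reset the wait counter after each reduction are all assumptions for illustration; only the patience of eight and the initial rate of 0.001 come from the text.

```python
class ReduceOnPlateau:
    """Lower the learning rate after `patience` checks without improvement."""

    def __init__(self, lr, factor, patience=8, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")   # best (lowest) validation score so far
        self.wait = 0              # checks since the last improvement

    def step(self, score):
        """Record a validation score (lower is better).

        Returns True when the score improved, which is the point at which
        the caller would save the model.
        """
        if score < self.best:
            self.best = score
            self.wait = 0
            return True
        self.wait += 1
        if self.wait >= self.patience and self.lr > self.min_lr:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.wait = 0
        return False


sched = ReduceOnPlateau(lr=0.001, factor=0.5)
history = [1.0] + [1.0] * 8        # one improvement, then eight stagnant checks
improved = [sched.step(score) for score in history]
```

After the eight stagnant checks in this toy run the learning rate has been reduced once, from 0.001 to 0.0005.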
An LSTM model with a hidden layer of {50, 100, 200} cells is defined. The number of cells in the hidden layer is independent of the number of time steps in the input and output sequences. The final output represents the prediction results for the next 24 h. The optimizer is SGD [36]. The learning rate is set to 0.05, and the reduction rate is 0.3. For the GRU model, we used a hidden layer of {64, 128, 256} units. The final output is a vector with 24 elements, which is the predicted result for the next 24 h. Adam [37] is used as the optimizer. The learning rate and reduction rate are the same as those of the LSTM model.
In the ConvLSTM encoder-decoder model, the shape of the input data is [timestep, row, column, channel]. Timestep is selected from {1, 3, 7}. Row is set to 1. Column is selected from {24, 72, 144}. Channel is selected from {2, 8, 10}. The optimization algorithm uses SGD. The learning rate is also set to 0.05. The size of all input-to-state and state-to-state kernels is 1 × 3.
For the TCN network, a hidden layer of {30, 50, 100} units is defined. The size of the convolution kernel is 1 × 3.
The MTCAN model uses Adam as the optimization strategy, and the initial learning rate is set to 0.001.
Algorithm 1 (excerpt):
4: if best_score > cur_score then
5:     best_score = cur_score; wait = 0
6:     save model
7: else
8:     if wait == 8 && new_lr > min_lr then
9:         new_lr = new_lr * factor

We use three evaluation metrics here, root-mean-square error (RMSE), root relative squared error (RRSE), and empirical correlation coefficient (CORR) of multivariate prediction, to assess the performance of the model. In the formulas for the three indicators, $P_{it}$ is the predicted value and $T_{it}$ is the actual value, $\bar{P}_i$ is the mean of the predicted values, and $\bar{T}_i$ is the mean of the actual values. The smaller the values of RMSE and RRSE, the more accurate the prediction result of the model. The larger the value of CORR, the stronger the correlation between the predicted result and the actual value. From the values of these three evaluation metrics, we can evaluate the validity and accuracy of a model more clearly.
The training process of the model is shown in Figure 5.
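For readers who want to compute these metrics, the sketch below uses their commonly cited definitions: RMSE as the square root of the mean squared error, RRSE as the root of the squared error divided by the squared deviation of the actual values from their overall mean, and CORR as the Pearson correlation averaged over series. These definitions and the data layout `values[i][t]` (series i, time t) are assumptions; the paper's exact normalization is not reproduced here.

```python
import math


def rmse(pred, true):
    """Root-mean-square error over all series and time steps."""
    errors = [(p - t) ** 2 for pr, tr in zip(pred, true) for p, t in zip(pr, tr)]
    return math.sqrt(sum(errors) / len(errors))


def rrse(pred, true):
    """Root relative squared error; undefined if the actual values are constant."""
    flat_true = [t for row in true for t in row]
    mean_true = sum(flat_true) / len(flat_true)
    num = sum((p - t) ** 2 for pr, tr in zip(pred, true) for p, t in zip(pr, tr))
    den = sum((t - mean_true) ** 2 for t in flat_true)
    return math.sqrt(num / den)


def corr(pred, true):
    """Pearson correlation per series, averaged over series.

    Undefined if any predicted or actual series is constant.
    """
    total = 0.0
    for pr, tr in zip(pred, true):
        mean_p = sum(pr) / len(pr)
        mean_t = sum(tr) / len(tr)
        cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pr, tr))
        var_p = sum((p - mean_p) ** 2 for p in pr)
        var_t = sum((t - mean_t) ** 2 for t in tr)
        total += cov / math.sqrt(var_p * var_t)
    return total / len(pred)


actual = [[1.0, 2.0, 3.0]]
shifted = [[2.0, 3.0, 4.0]]        # every prediction is off by exactly 1
scores = (rmse(shifted, actual), rrse(shifted, actual), corr(shifted, actual))
```

The toy case shows why the metrics complement each other: a constant offset leaves CORR at its maximum while RMSE and RRSE still register the error.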
Step 1: Preprocess the time-series dataset. The original time-series dataset is divided into a training set and a test set, while the inputs and outputs are extracted from them separately.
Step 2: Initialize the parameters and hyperparameters of the proposed model.
Step 3: Train the MTCAN time-series prediction model. The Adam optimization algorithm and the mean squared error loss function are used; the loss is computed from the predicted and actual values. The RRSE is calculated so that the model with the smallest RRSE is saved, and the learning rate is changed if the loss value has stopped improving.
Step 4: Terminate the model training or repeat the third step. If the maximum number of training iterations is reached, the training of MTCAN ends, and the optimized weights and biases of the model are obtained. If not, repeat Step 3 until the termination condition is met.

Experimental Results
In this section, we evaluate the performance of the prediction model over different time horizons. The ISO-NE dataset, Beijing PM2.5 dataset, and Jena Climate dataset are used for prediction. We use three datasets with different numbers of variables, different domains, and different amounts of data, so that we can evaluate a model more comprehensively. The prediction horizon is between 1 and 24 h. This section compares the performance of our proposed MTCAN method with LSTM, TCN, ConvLSTM, and GRU. Table 2 shows the results of the overall RRSE, CORR, and RMSE metrics for the multivariate test set from 1 to 24 h. The output sequence length is set to 24, representing the prediction time period from 1 to 24 h. In multivariate time-series prediction tasks, the longer the prediction horizon, the more difficult the prediction is. Therefore, our experiments analyze the results in detail within this time frame. The best results in each dataset are highlighted in bold. For RMSE and RRSE, lower values are better, while for CORR, higher values are better.
The effectiveness and generalization of the MTCAN model are stronger than those of the comparison models. The GRU model works well on the Beijing PM2.5 dataset but less well on the ISO-NE dataset and the Jena Climate dataset. The LSTM is more balanced on the three datasets, but it is not as good as the MTCAN model. The ConvLSTM model works well on the ISO-NE dataset but not well enough on the other two datasets. The TCN model, on the other hand, generalizes even worse: it has good results only on the ISO-NE dataset, and its results on the other two datasets are very poor and not meaningful for comparison. Figure 6 shows in detail the performance of the four models on the three datasets. From this figure, it can be concluded that the MTCAN model has smoother curves and is superior to the other models in terms of both generalizability and accuracy. We do not plot the experimental curves of the TCN model because its generalizability is too poor.

Ablation Experiments
For complex model structures, it is not obvious whether an added structure has a beneficial effect on the performance of the model. To better explain the validity of our proposed model, we conducted a further study. An ablation experiment was designed to demonstrate the effectiveness of the self-attention layer. Figure 7 shows the repeat block structure of this model in detail. In this repeat block, we remove the self-attention layer. In this test, there is no change in the network except for this point. From Table 3, we can conclude that adding the self-attention layer helps the network achieve higher accuracy and better generalization, and it reduces the number of model parameters. The number of parameters of the MTCAN model is about 25% of that of the model without the self-attention layer, which greatly improves the model training efficiency. All components of the MTCAN model contribute to the effectiveness and generalization of the model, reduce the size of the model, and improve the learning efficiency.

Conclusions
We propose the MTCAN model and highlight its network structure design, which incorporates a one-dimensional dilated convolutional network as the basic unit, asymmetric residual blocks, and a self-attention network at the end to better capture the mapping relationships within the time series and improve the performance of multivariate time-series prediction. The model is validated on three datasets (Beijing PM2.5, ISO-NE, and Jena Climate) and compared with the existing models LSTM, GRU, ConvLSTM, and TCN. The prediction results show that the proposed model significantly improves calculation efficiency, prediction accuracy, and generalization. This improvement is attributed to the explainability of the self-attention mechanism, which shall be further investigated.