A3T-GCN: Attention Temporal Graph Convolutional Network for Traffic Forecasting

Accurate real-time traffic forecasting is a core technological problem in the implementation of intelligent transportation systems. However, it remains challenging because of the complex spatial and temporal dependencies among traffic flows. In the spatial dimension, the connectivity of the road network means that traffic flows on linked roads are closely related. In the temporal dimension, although adjacent time points generally share a tendency, distant past points are not necessarily less important than recent ones because traffic flows are also affected by external factors. In this study, an attention temporal graph convolutional network (A3T-GCN) is proposed for traffic forecasting to capture global temporal dynamics and spatial correlations simultaneously. The A3T-GCN model learns the short-term trend in time series by using gated recurrent units and learns the spatial dependence based on the topology of the road network through a graph convolutional network. Moreover, an attention mechanism is introduced to adjust the importance of different time points and assemble global temporal information to improve prediction accuracy. Experimental results on real-world datasets demonstrate the effectiveness and robustness of the proposed A3T-GCN. The source code is available at https://github.com/lehaifeng/T-GCN/A3T.


INTRODUCTION
Traffic forecasting is an important component of intelligent transportation systems and a vital part of transportation planning, management, and traffic control [12, 15, 16, 17]. Accurate real-time traffic forecasting has been a great challenge because of complex spatiotemporal dependencies. Temporal dependence means that the traffic state changes with time, which is manifested by periodicity and tendency. Spatial dependence means that changes in traffic state are subject to the structural topology of road networks, which is manifested by the transmission of the upstream traffic state to downstream sections and the retrospective effects of the downstream traffic state on upstream sections [10]. Hence, considering both the complex temporal features and the topological characteristics of the road network is essential for the traffic forecasting task.
Existing traffic forecasting models can be divided into parametric and non-parametric models. Common parametric models include historical average, time series [1,14], linear regression [27], and Kalman filtering models [23]. Although traditional parametric models use simple algorithms, they depend on the assumption of stationarity. These models can neither reflect the nonlinearity and uncertainty of traffic states nor overcome the interference of random events, such as traffic accidents. Non-parametric models handle these problems well because they can learn the statistical regularities of the data automatically when given adequate historical data. Common non-parametric models include k-nearest neighbor [2], support vector regression (SVR) [11,30], fuzzy logic [34], Bayesian network [28], and neural network models.
Recently, deep neural network models have attracted wide attention from scholars because of the rapid development of deep learning [22,26]. Recurrent neural networks (RNNs), long short-term memory (LSTM) [13], and gated recurrent units (GRUs) [7] have been successfully used in traffic forecasting because their recurrent structure can model temporal dependence [20,25]. However, these models only consider the temporal variation of traffic state and neglect spatial dependence. Many scholars have therefore introduced convolutional neural networks (CNNs) into their models to characterize spatial dependence. Wu et al. [31] designed a feature fusion framework for short-term traffic flow forecasting by combining a CNN with LSTM. The framework captured the spatial characteristics of traffic flow through a one-dimensional CNN and explored short-term variations and periodicity of traffic flow with two LSTMs. Cao et al. [6] proposed an end-to-end model called ITRCN, which transformed the interactive network flow into images, captured network flows using a CNN, and extracted temporal features by using a GRU. An experiment showed that the forecasting error of this method was 14.3% and 13.0% lower than those of GRU and CNN, respectively. Yu et al. [36] proposed SRCN, which captures spatial correlation and temporal dynamics by using a DCNN and LSTM, respectively, and demonstrated its superiority on traffic network data from Beijing. Although CNNs are well suited to Euclidean data [9], such as images and grids, they have limitations on traffic networks, which possess non-Euclidean structures. In recent years, the graph convolutional network (GCN) [18], which can overcome these limitations and capture the structural characteristics of networks, has developed rapidly [19,35,37].
In addition, RNNs and their variants process sequences step by step and are more apt to remember the latest information, so they are suitable for capturing evolving short-term tendencies. However, the importance of different time points cannot be determined by temporal proximity alone. Mechanisms that are capable of learning global correlations are needed.
For this reason, an attention temporal GCN (A3T-GCN) is proposed for the traffic forecasting task. The A3T-GCN combines GCNs and GRUs and introduces an attention mechanism [29,33]. It can not only capture spatiotemporal dependencies but also adjust and assemble global variation information. The A3T-GCN is used for traffic forecasting on the basis of urban road networks.

Definition of problems
In this study, traffic forecasting is performed to predict the future traffic state from historical traffic states on urban roads. Generally, traffic state can refer to traffic flow, speed, and density. In this study, traffic state refers only to traffic speed.
Definition 1. Road network G: The topological structure of the urban road network is described as $G = (V, E)$, where $V = \{v_1, v_2, \cdots, v_N\}$ is the set of road sections, and N is the number of road sections. E is the set of edges, which reflects the connections between road sections. The connectivity information is stored in the adjacency matrix $A \in R^{N \times N}$, whose rows and columns are indexed by road sections and whose entries indicate the connectivity between the corresponding road sections. An entry is 0 if no link exists between two road sections; otherwise, it is 1 (unweighted graph) or a non-negative weight (weighted graph).
Definition 2. Feature matrix $X^{N \times P}$: Traffic speed on a road section is viewed as the attribute of network nodes, and it is expressed by the feature matrix $X \in R^{N \times P}$, where P is the number of node attribute features, that is, the length of the historical time series. $X_i$ denotes the traffic speed on all sections at time i.
Therefore, traffic forecasting that models temporal and spatial dependencies can be viewed as learning a mapping function f on the basis of the road network G and the feature matrix X of the road network. Traffic speeds at the future T moments are calculated as follows:
$$[X_{t+1}, \cdots, X_{t+T}] = f(G; (X_{t-n}, \cdots, X_{t-1}, X_t)) \qquad (1)$$
where n is the length of the given historical time series, and T is the length of the time series to be forecasted.
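The mapping in Eq. (1) implies that training data are cut into (history, target) pairs. The following is a minimal sketch of that slicing, not the authors' preprocessing code; the function name `make_samples` and the toy values are illustrative assumptions.

```python
from typing import List, Tuple

Snapshot = List[float]  # speeds of all N road sections at one time step


def make_samples(
    series: List[Snapshot], n: int, T: int
) -> List[Tuple[List[Snapshot], List[Snapshot]]]:
    """Slice a speed series into (history, target) pairs following Eq. (1).

    history = [X_{t-n}, ..., X_t]      (n + 1 snapshots)
    target  = [X_{t+1}, ..., X_{t+T}]  (T snapshots)
    """
    samples = []
    # t must leave n snapshots before it and T snapshots after it.
    for t in range(n, len(series) - T):
        history = series[t - n : t + 1]
        target = series[t + 1 : t + T + 1]
        samples.append((history, target))
    return samples


# Toy series: 6 time steps, 2 road sections (values are made up).
toy = [[float(10 * step + road) for road in range(2)] for step in range(6)]
pairs = make_samples(toy, n=2, T=1)
```

With six time steps, `n=2`, and `T=1`, the sketch yields three pairs; the first uses steps 0 to 2 as history and step 3 as the target.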

GCN model
GCNs are semi-supervised models that can process graph structures and extend CNNs to the graph domain. GCNs have made considerable progress in many applications, such as image classification [5], document classification [9], and unsupervised learning [18]. Convolution in GCNs can be performed in the spectral or the spatial domain [5]; the former is applied in this study. Spectral convolution can be defined as the product of a signal x on the graph and a graph filter $g_\theta(L)$ constructed in the Fourier domain: $g_\theta(L) * x = U g_\theta(\Lambda) U^T x$, where θ is a model parameter, L is the graph Laplacian matrix, U is the matrix of eigenvectors of the normalized Laplacian $L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^T$, Λ is the diagonal matrix of its eigenvalues, and $U^T x$ is the graph Fourier transform of x. The signal x can also be generalized to $X \in R^{N \times C}$, where C is the number of features.
Given the feature matrix X and adjacency matrix A, GCNs replace the convolution of conventional CNNs with a spectral convolution over each graph node and its first-order neighborhood to capture the spatial characteristics of the graph. Moreover, a layer-wise propagation rule is applied to stack multiple layers. A multilayer GCN model can be expressed as:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} \theta^{(l)}\right)$$
where $\tilde{A} = A + I_N$ is the adjacency matrix with self-connections, $I_N$ is an identity matrix, $\tilde{D}$ is a degree matrix with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $H^{(l)} \in R^{N \times l}$ is the output of layer l, $\theta^{(l)}$ is the parameter of layer l, and $\sigma(\cdot)$ is an activation function used for nonlinear modeling.
Generally, a two-layer GCN model [18] can be expressed as:
$$f(X, A) = \sigma\left(\hat{A}\, ReLU(\hat{A} X W_0)\, W_1\right)$$
where X is the feature matrix; A is the adjacency matrix; and $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ is a preprocessing step, where $\tilde{A} = A + I_N$ is the adjacency matrix of graph G with self-connections. $W_0 \in R^{P \times H}$ is the weight matrix from the input layer to the hidden layer, where P is the length of the historical time series, and H is the number of hidden units. $W_1 \in R^{H \times T}$ is the weight matrix from the hidden layer to the output layer. $f(X, A) \in R^{N \times T}$ denotes the output with a forecasting length of T, and $ReLU(\cdot)$ is a common nonlinear activation function.
GCNs can encode the topological structure of the road network and the attributes of road sections simultaneously by determining the topological relationship between a central road section and its surrounding road sections. Spatial dependence can be captured on this basis. In short, this study learns spatial dependence through the GCN model [18].
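The preprocessing step and the two-layer forward pass described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the outer activation σ is taken as the identity, and the helper names (`matmul`, `normalized_adjacency`, `two_layer_gcn`) and toy weights are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(A: Matrix) -> Matrix:
    """Return A_hat = D~^{-1/2} (A + I) D~^{-1/2}."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]  # D~_ii = sum_j A~_ij
    return [
        [A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]


def relu(m: Matrix) -> Matrix:
    return [[max(0.0, v) for v in row] for row in m]


def two_layer_gcn(X: Matrix, A: Matrix, W0: Matrix, W1: Matrix) -> Matrix:
    """A_hat ReLU(A_hat X W0) W1, with the outer sigma assumed to be identity."""
    A_hat = normalized_adjacency(A)
    hidden = relu(matmul(matmul(A_hat, X), W0))
    return matmul(matmul(A_hat, hidden), W1)


# Toy example: 2 connected road sections, P = 2 history points, H = 1, T = 1.
A = [[0.0, 1.0], [1.0, 0.0]]
X = [[1.0, 2.0], [3.0, 4.0]]
W0 = [[1.0], [1.0]]
W1 = [[1.0]]
out = two_layer_gcn(X, A, W0, W1)
```

In this toy graph every entry of the normalized adjacency is 0.5, so each section's output mixes its own history with its neighbor's in equal parts.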

GRU model
Modeling the temporal dependence of traffic state is another key problem in traffic forecasting. RNNs are neural network models that process sequential data. However, traditional RNNs are limited in long-term forecasting because of vanishing and exploding gradients [4]. LSTM [13] and GRUs [7] are variants of RNNs that mitigate these problems effectively. LSTM and GRUs share the same fundamental principles: both use gating mechanisms to maintain long-term information, and they perform similarly in various tasks [8]. However, LSTM is more complicated and takes longer to train, whereas the GRU has a simpler structure, fewer parameters, and trains faster.
In the present model, temporal dependence is captured by a GRU model. The calculation process is as follows:
$$u_t = \sigma(W_u [x_t, h_{t-1}] + b_u)$$
$$r_t = \sigma(W_r [x_t, h_{t-1}] + b_r)$$
$$c_t = \tanh(W_c [x_t, (r_t * h_{t-1})] + b_c)$$
$$h_t = u_t * h_{t-1} + (1 - u_t) * c_t$$
where $h_{t-1}$ is the hidden state at t-1, $x_t$ is the traffic speed at the current moment, and W and b are the weights and biases learned in training. $r_t$ is the reset gate, which controls the degree to which the state information at the previous moment is neglected, so that information unrelated to forecasting can be abandoned. If the reset gate outputs 0, then the traffic information at the previous moment is neglected. If the reset gate outputs 1, then the traffic information at the previous moment is brought into the next moment completely. $u_t$ is the update gate, which controls how much state information from the previous moment is brought into the current state. Meanwhile, $c_t$ is the memory content stored at the current moment, and $h_t$ is the output state at the current moment.
GRUs determine the traffic state at the current moment by using the hidden state at the previous moment and the traffic information at the current moment as input. Because of the gating mechanism, GRUs retain the variation trends of historical traffic information while capturing the traffic information at the current moment. Hence, this model can capture dynamic temporal variation features from the traffic data; that is, this study applies a GRU model to learn the temporal variation trends of the traffic state.
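The gate computations described above can be written as a single update step. The sketch below assumes the standard GRU form with an update gate, a reset gate, and a tanh candidate; the function name `gru_step`, the weight layout, and the toy values are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def affine(W: Matrix, v: Vector, b: Vector) -> Vector:
    """Compute W v + b for a dense weight matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def gru_step(
    x_t: Vector, h_prev: Vector,
    W_u: Matrix, b_u: Vector,
    W_r: Matrix, b_r: Vector,
    W_c: Matrix, b_c: Vector,
) -> Vector:
    """One GRU update; each W has shape (hidden, len(x_t) + hidden)."""
    xh = x_t + h_prev  # concatenation [x_t, h_{t-1}]
    u = [sigmoid(v) for v in affine(W_u, xh, b_u)]  # update gate
    r = [sigmoid(v) for v in affine(W_r, xh, b_r)]  # reset gate
    x_rh = x_t + [ri * hi for ri, hi in zip(r, h_prev)]
    c = [math.tanh(v) for v in affine(W_c, x_rh, b_c)]  # candidate memory
    # u_t keeps the previous state; (1 - u_t) admits the new candidate.
    return [ui * hi + (1.0 - ui) * ci for ui, hi, ci in zip(u, h_prev, c)]


# With all-zero parameters both gates equal 0.5 and the candidate is 0,
# so the new state is half of the previous one.
zeros_W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # hidden = 2, input = 1
zeros_b = [0.0, 0.0]
h_t = gru_step([60.0], [0.4, -0.2], zeros_W, zeros_b, zeros_W, zeros_b, zeros_W, zeros_b)
```

The all-zero case is only a sanity check: it isolates the interpolation between the previous state and the candidate that the update gate performs.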

Attention model
The attention model is realized on the basis of the encoder-decoder model and was initially used in neural machine translation tasks [3]. Nowadays, attention models are widely applied in image caption generation [33], recommendation systems [32], and document classification [24]. With the rapid development of such models, existing attention models can be divided into multiple types, such as soft and hard attention [3], global and local attention [21], and self-attention [29]. In the current study, a soft attention model is used to learn the importance of the traffic information at every moment, and a context vector that expresses the global variation trends of the traffic state is then calculated for future traffic forecasting tasks.
Suppose that a time series $x_i$ $(i = 1, 2, \cdots, n)$ is given, where n is the time series length. The design process of soft attention models is as follows. First, the hidden states $h_i$ $(i = 1, 2, \cdots, n)$ at different moments are calculated using CNNs (and their variants) or RNNs (and their variants), and they are expressed as $H = \{h_1, h_2, \cdots, h_n\}$. Second, a scoring function is designed to calculate the score/weight of each hidden state. Third, an attention function is designed to calculate the context vector $C_t$ that describes global traffic variation information. Finally, the final output is obtained using the context vector. In the present study, these steps are followed in the design process, with a multilayer perceptron applied as the scoring function.
Specifically, the hidden state $h_i$ at each moment is used as the input of the scoring function f, and the corresponding score $e_i$ is obtained through two hidden layers:
$$e_i = w_{(2)}(w_{(1)} h_i + b_{(1)}) + b_{(2)}$$
The weight $\alpha_i$ of each hidden state is then calculated by the Softmax normalized exponential function (eq. (8)):
$$\alpha_i = \frac{\exp(e_i)}{\sum_{k=1}^{n} \exp(e_k)}$$
where $w_{(1)}$ and $b_{(1)}$ are the weight and bias of the first layer, and $w_{(2)}$ and $b_{(2)}$ are the weight and bias of the second layer, respectively.
Finally, the attention function is designed. The context vector $C_t$, which covers global traffic variation information, is calculated as the weighted sum of the hidden states, as shown in Equation (10):
$$C_t = \sum_{i=1}^{n} \alpha_i h_i$$
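The scoring, Softmax, and weighted-sum steps can be sketched as follows. This is a minimal sketch under assumptions: the two scoring layers are applied without a nonlinearity between them, the second layer maps to a single scalar score, and the names `soft_attention`, `w1`, `b1`, `w2`, and `b2` are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def affine(W: Matrix, v: Vector, b: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def soft_attention(
    H: List[Vector], w1: Matrix, b1: Vector, w2: Vector, b2: float
) -> Tuple[Vector, Vector]:
    """Return (alpha, C_t) for hidden states H = [h_1, ..., h_n]."""
    # Score each hidden state with a two-layer perceptron (no nonlinearity
    # between the layers here; that choice is an assumption).
    scores = []
    for h in H:
        layer1 = affine(w1, h, b1)
        scores.append(sum(w * v for w, v in zip(w2, layer1)) + b2)
    # Softmax, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Context vector: attention-weighted sum of the hidden states.
    dim = len(H[0])
    context = [sum(a * h[d] for a, h in zip(alpha, H)) for d in range(dim)]
    return alpha, context


# Three hidden states of size 2; zero scoring weights give uniform attention.
H = [[1.0, 0.0], [2.0, 3.0], [3.0, 6.0]]
alpha, C_t = soft_attention(H, [[0.0, 0.0]], [0.0], [0.0], 0.0)
```

With zero scoring weights every moment receives the same weight, so the context vector reduces to the plain average of the hidden states; trained weights would shift that average toward the more informative moments.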

A3T-GCN model
The A3T-GCN is an improvement of our previous work, T-GCN [37]. The attention mechanism is introduced to re-weight the influence of historical traffic states and thus to capture the global variation trends of traffic state. The model structure is shown in Fig. 2.5. A temporal GCN (T-GCN) model is constructed by combining the GCN and the GRU. The n historical time series of traffic data are input into the T-GCN model to obtain n hidden states (h) that cover spatiotemporal characteristics: $\{h_{t-n}, \cdots, h_{t-1}, h_t\}$. The calculation of the T-GCN is shown in eq. (11):
$$u_t = \sigma(W_u [GC(A, X_t), h_{t-1}] + b_u)$$
$$r_t = \sigma(W_r [GC(A, X_t), h_{t-1}] + b_r)$$
$$c_t = \tanh(W_c [GC(A, X_t), (r_t * h_{t-1})] + b_c)$$
$$h_t = u_t * h_{t-1} + (1 - u_t) * c_t$$
where $h_{t-1}$ is the output at t-1, GC is the graph convolutional process, $u_t$ and $r_t$ are the update and reset gates at t, respectively, $c_t$ is the stored content at the current moment, $h_t$ is the output state at moment t, and W and b are the weights and biases learned in the training process, respectively.
Then, the hidden states are input into the attention model to determine the context vector that covers the global traffic variation information. Specifically, the weight of each h is calculated by Softmax using a multilayer perceptron: $\{\alpha_{t-n}, \cdots, \alpha_{t-1}, \alpha_t\}$. The context vector that covers global traffic variation information is calculated as the weighted sum. Finally, the forecasting results are output using a fully connected layer.
In sum, we propose the A3T-GCN for traffic forecasting. The urban road network is modeled as a graph, and the traffic state on different sections is described as node attributes. The topological characteristics of the road network are captured by a GCN to obtain spatial dependence. The dynamic variation of node attributes is captured by a GRU to obtain the local temporal tendency of the traffic state. The global variation trend of the traffic state is then captured by the attention model, which is conducive to accurate traffic forecasting.
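To make the combination of graph convolution and gating concrete, the sketch below runs one T-GCN update for all nodes at once. It follows the description in this section rather than the released code: the graph convolution is reduced to a single layer, and `tgcn_step`, `W_gc`, and the toy parameter values are assumptions introduced for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def hconcat(a: Matrix, b: Matrix) -> Matrix:
    """Concatenate two matrices row by row: [a, b]."""
    return [ra + rb for ra, rb in zip(a, b)]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gate(inp: Matrix, W: Matrix, b: List[float], act) -> Matrix:
    """act(inp W + b), applied to every node (row)."""
    return [[act(v + bj) for v, bj in zip(row, b)] for row in matmul(inp, W)]


def tgcn_step(A_hat, X_t, h_prev, W_gc, W_u, b_u, W_r, b_r, W_c, b_c):
    """One T-GCN update for all N nodes.

    A_hat: (N, N) normalized adjacency; X_t: (N, F) node features at t;
    h_prev: (N, H); W_gc: (F, G); W_u, W_r, W_c: (G + H, H).
    """
    gc = matmul(matmul(A_hat, X_t), W_gc)  # GC(A, X_t), single layer (assumed)
    u = gate(hconcat(gc, h_prev), W_u, b_u, sigmoid)  # update gate
    r = gate(hconcat(gc, h_prev), W_r, b_r, sigmoid)  # reset gate
    reset_h = [[ri * hi for ri, hi in zip(rr, hr)] for rr, hr in zip(r, h_prev)]
    c = gate(hconcat(gc, reset_h), W_c, b_c, math.tanh)  # candidate memory
    return [
        [ui * hi + (1.0 - ui) * ci for ui, hi, ci in zip(ur, hr, cr)]
        for ur, hr, cr in zip(u, h_prev, c)
    ]


# Toy setup: N = 2 sections, F = G = H = 1 (all values are made up).
A_hat = [[0.5, 0.5], [0.5, 0.5]]
X_t = [[40.0], [60.0]]
h_prev = [[0.2], [0.4]]
zero = [[0.0], [0.0]]
W_c = [[0.01], [0.0]]
h_t = tgcn_step(A_hat, X_t, h_prev, [[1.0]], zero, [0.0], zero, [0.0], W_c, [0.0])
```

Running this step over the n historical snapshots would produce the hidden states that the attention model then re-weights into the context vector.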

Loss function
Training aims to minimize the error between the real and predicted speeds in the road network. The real and predicted speeds on different sections at time t are denoted by $Y_t$ and $\hat{Y}_t$, respectively. The objective function of A3T-GCN is therefore:
$$loss = \|Y_t - \hat{Y}_t\| + \lambda L_{reg}$$
The first term minimizes the error between the real and predicted speeds. The second term $L_{reg}$ is a regularization term that helps avoid overfitting, and λ is a hyper-parameter.
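As a worked illustration of the two-term objective, the sketch below assumes that the error term is the Frobenius norm of the residual and that $L_{reg}$ is an L2 penalty on the trainable weights; neither choice is stated in this section, and the function name `a3tgcn_loss` is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def a3tgcn_loss(Y: Matrix, Y_hat: Matrix, weights: List[Matrix], lam: float) -> float:
    """||Y - Y_hat|| + lam * L_reg, with L_reg assumed to be an L2 penalty."""
    # Prediction error: Frobenius norm of the residual (assumed norm).
    sq_err = sum((y - p) ** 2 for ry, rp in zip(Y, Y_hat) for y, p in zip(ry, rp))
    # Regularization: sum of squared trainable weights (assumed form of L_reg).
    l_reg = sum(w * w for W in weights for row in W for w in row)
    return math.sqrt(sq_err) + lam * l_reg


# Residual (3, -4) has norm 5; weights (1, 2) give L_reg = 5.
Y = [[30.0, 40.0]]
Y_hat = [[27.0, 44.0]]
loss = a3tgcn_loss(Y, Y_hat, weights=[[[1.0, 2.0]]], lam=0.1)
```

Here the error term contributes 5 and the penalty contributes 0.1 × 5, so a larger λ trades prediction fit for smaller weights.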

Evaluation Metrics
To evaluate the prediction performance of the model, the error between the real traffic speed and the predicted results is evaluated on the basis of the following metrics:
(1) Root Mean Squared Error (RMSE):
$$RMSE = \sqrt{\frac{1}{MN} \sum_{j=1}^{M} \sum_{i=1}^{N} (y_i^j - \hat{y}_i^j)^2}$$
(2) Mean Absolute Error (MAE):
$$MAE = \frac{1}{MN} \sum_{j=1}^{M} \sum_{i=1}^{N} \left| y_i^j - \hat{y}_i^j \right|$$
(3) Accuracy:
$$Accuracy = 1 - \frac{\|Y - \hat{Y}\|_F}{\|Y\|_F}$$
(4) Coefficient of Determination ($R^2$):
$$R^2 = 1 - \frac{\sum_{j=1}^{M} \sum_{i=1}^{N} (y_i^j - \hat{y}_i^j)^2}{\sum_{j=1}^{M} \sum_{i=1}^{N} (y_i^j - \bar{Y})^2}$$
(5) Explained Variance Score (var):
$$var = 1 - \frac{Var\{Y - \hat{Y}\}}{Var\{Y\}}$$
where $y_i^j$ and $\hat{y}_i^j$ are the real and predicted traffic information of temporal sample j on road i, respectively. N is the number of nodes (road sections), and M is the number of temporal samples. $Y$ and $\hat{Y}$ are the sets of $y_i^j$ and $\hat{y}_i^j$, respectively, and $\bar{Y}$ is the mean of $Y$.
Specifically, RMSE and MAE measure prediction error: small RMSE and MAE values reflect high prediction precision. Accuracy measures forecasting precision, and a high accuracy value is preferred. $R^2$ and var calculate the correlation coefficient, which measures the ability of the prediction result to represent the actual data: the larger the value is, the better the prediction effect is.
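The five metrics can be computed together from the real and predicted speed matrices. The sketch below assumes the usual definitions (Frobenius norms for accuracy, population variance for the explained variance score); the function name `evaluate` and the toy data are illustrative.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]  # rows: temporal samples j, columns: road sections i


def evaluate(Y: Matrix, Y_hat: Matrix) -> Dict[str, float]:
    y = [v for row in Y for v in row]
    p = [v for row in Y_hat for v in row]
    count = len(y)  # M * N
    res = [a - b for a, b in zip(y, p)]
    sse = sum(e * e for e in res)
    mean_y = sum(y) / count
    mean_res = sum(res) / count
    var_y = sum((a - mean_y) ** 2 for a in y) / count
    var_res = sum((e - mean_res) ** 2 for e in res) / count
    return {
        "rmse": math.sqrt(sse / count),
        "mae": sum(abs(e) for e in res) / count,
        "accuracy": 1.0 - math.sqrt(sse) / math.sqrt(sum(a * a for a in y)),
        "r2": 1.0 - sse / sum((a - mean_y) ** 2 for a in y),
        "var": 1.0 - var_res / var_y,
    }


Y = [[1.0, 2.0], [3.0, 4.0]]
Y_hat = [[2.0, 3.0], [4.0, 5.0]]  # every prediction is 1 too high
scores = evaluate(Y, Y_hat)
```

The toy case shows how the metrics differ: a constant offset of 1 gives RMSE and MAE of 1 and lowers $R^2$ to 0.2, while the explained variance score stays at 1 because the residual does not vary.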

Experimental result analysis
The hyper-parameters of A3T-GCN include the learning rate, the number of epochs, and the number of hidden units. In the experiment, the learning rate and the number of epochs were set empirically to 0.001 and 5000 for both datasets. The number of hidden units was set to 64 for SZ-taxi and 100 for Los-loop.
In the present study, 80% of the traffic data are used as the training set, and the remaining 20% of the data are used as the test set. The traffic information in the next 15, 30, 45, and 60 min is predicted. The predicted results are compared with results from the historical average model (HA), auto-regressive integrated moving average model (ARIMA), SVR, GCN model, and GRU model. The A3T-GCN is analyzed from perspectives of precision, spatiotemporal prediction capabilities, long-term prediction capability, and global feature capturing capability.
(1) High prediction precision. Table 1 compares the prediction precision of the different models on the two real datasets under different prediction horizons. The prediction precision of neural network models (e.g., A3T-GCN and GRU) is higher than that of the other models (e.g., HA, ARIMA, and SVR). For the 15-minute horizon, the RMSE and accuracy of HA are approximately 9.22% higher and 4.24% lower than those of A3T-GCN, respectively. The RMSE and accuracy of ARIMA are approximately 46.15% higher and 39.01% lower than those of A3T-GCN, respectively. The RMSE and accuracy of SVR are approximately 5.95% higher and 2.81% lower than those of A3T-GCN, respectively. Compared with GRU, the RMSE and accuracy of HA are approximately 6.88% higher and 3.32% lower, those of ARIMA are approximately 44.76% higher and 38.07% lower, and those of SVR are approximately 3.52% higher and 1.87% lower, respectively. These results are mainly caused by the poor nonlinear fitting abilities of HA, ARIMA, and SVR on complicated and changing traffic data. ARIMA has difficulty processing long-term non-stationary data. Moreover, the ARIMA error is obtained by averaging the errors of different sections, and the data of some sections might fluctuate greatly and increase the final error. Hence, ARIMA shows the lowest forecasting accuracy. Similar conclusions can be drawn for Los-loop. In short, the A3T-GCN model obtains the best prediction performance on all metrics on the two real datasets, which demonstrates its validity and superiority in spatiotemporal traffic forecasting tasks.
(2) Effectiveness of modelling both spatial and temporal dependencies. To test the benefits of depicting the spatiotemporal characteristics of traffic data simultaneously in A3T-GCN, the model is compared with GCN and GRU. Fig. 2 shows the results based on SZ-taxi. Compared with GCN (which considers spatial characteristics only), A3T-GCN achieves approximately 31.11%, 31.08%, 30.94%, and 30.78% lower RMSEs for the 15-, 30-, 45-, and 60-minute prediction horizons, respectively. The prediction error of A3T-GCN thus stays lower than that of GCN at every horizon. Because GCN lacks a temporal component, this gap indicates the benefit of the temporal modelling that A3T-GCN adds on top of spatial features.
Compared with GRU (which considers temporal characteristics only), A3T-GCN achieves approximately 2.51%, 4.19%, 4.99%, and 2.55% lower RMSEs for the 15-, 30-, 45-, and 60-minute prediction horizons, respectively. The prediction error of A3T-GCN thus stays lower than that of GRU at every horizon. Because GRU lacks a spatial component, this gap indicates the benefit of the spatial modelling that A3T-GCN adds on top of temporal features.
Results based on Los-loop, which are similar to those based on SZ-taxi, are shown in Fig. 3. In short, the A3T-GCN has good spatiotemporal prediction capabilities. In other words, the A3T-GCN model can capture the spatial topological characteristics of urban road networks and the temporal variation characteristics of traffic state simultaneously.
(3) Long-term prediction capability. The long-term prediction capability of A3T-GCN was tested by forecasting traffic speed for the 15-, 30-, 45-, and 60-minute prediction horizons. Forecasting results based on SZ-taxi are shown in Fig. 4. The RMSE comparison of the different models under different prediction horizons is shown in Fig. 4(a). The RMSE of the A3T-GCN is the lowest at all horizons. The variation trends of the RMSE and accuracy of the A3T-GCN, which reflect prediction error and precision, respectively, under different prediction horizons are shown in Fig. 4(b). RMSE increases as the horizon lengthens, whereas accuracy declines only slightly and remains relatively stable.
The forecasting results based on Los-loop are shown in Fig. 5, and consistent patterns are found. In sum, A3T-GCN has good long-term prediction capability. It obtains high accuracy when trained for the 15-, 30-, 45-, and 60-minute prediction horizons. The forecasting results of A3T-GCN change only slightly as the prediction horizon changes and therefore remain relatively stable. Therefore, the A3T-GCN is applicable to both short-term and long-term traffic forecasting tasks.
(4) Effectiveness of introducing attention to capture global variation. A3T-GCN and T-GCN were compared to test the benefit of capturing global variation. Results are shown in Table 2. The A3T-GCN model shows approximately 0.86% lower RMSE and approximately 0.32% higher accuracy than the T-GCN model under the 15-minute prediction horizon. Hence, the prediction error of A3T-GCN is lower than that of T-GCN, and its accuracy is higher.

Perturbation analysis
Noise is inevitable in real-world datasets. Therefore, perturbation analysis is conducted to test the robustness of A3T-GCN. In this experiment, two types of random noise are added to the traffic data. Random noise obeys the Gaussian distribution $N(0, \sigma^2)$, where σ ∈ (0. [...] Fig. 6(b). The values of the different evaluation metrics remain basically the same regardless of the changes in σ/λ. Hence, the proposed model is robust to noise and can handle strongly noisy data.
The experimental results based on Los loop are consistent with experimental results based on SZ taxi (Fig. 7). Therefore, the A3T-GCN model can remarkably resist noise and still obtain stable forecasting results under Gaussian and Poisson perturbations.
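A perturbation experiment of this kind can be sketched by adding Gaussian or Poisson noise to the speed readings before evaluation. The noise levels below (`sigma=0.5`, `lam=4.0`) and the function names are assumptions for illustration only, not the settings used in the study; the Poisson sampler uses Knuth's multiplication method because the Python standard library does not provide one.

```python
import math
import random
from typing import List


def add_gaussian_noise(speeds: List[float], sigma: float, rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in speeds]


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def add_poisson_noise(speeds: List[float], lam: float, rng: random.Random) -> List[float]:
    """Add Poisson(lam) distributed noise to every reading."""
    return [v + sample_poisson(lam, rng) for v in speeds]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
clean = [50.0] * 2000
noisy_gauss = add_gaussian_noise(clean, sigma=0.5, rng=rng)
noisy_poisson = add_poisson_noise(clean, lam=4.0, rng=rng)
```

Re-running the evaluation metrics on predictions made from `noisy_gauss` or `noisy_poisson` for several noise levels would give curves of the kind described for Fig. 6 and Fig. 7.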

Visualized analysis
The forecasting results of the A3T-GCN model on the two real datasets are visualized to better explain the model.
(1) SZ-taxi: We visualize the result for one road on January 27, 2015. Visualization results for the 15-, 30-, 45-, and 60-minute prediction horizons are shown in Fig. 8.
(2) Los-loop: Similarly, we visualize the data of one loop detector in the Los-loop dataset. Visualization results for the 15-, 30-, 45-, and 60-minute prediction horizons are shown in Fig. 9.
In sum, the predicted traffic speed shows a variation trend similar to that of the actual traffic speed under different prediction horizons, which suggests that the A3T-GCN model is competent in the traffic forecasting task. The model can also capture the variation trends of traffic speed and recognize the start and end points of rush hours. The A3T-GCN model forecasts traffic jams accurately, which supports its validity in real-time traffic forecasting.

CONCLUSIONS
A traffic forecasting method called A3T-GCN is proposed to capture global temporal dynamics and spatial correlations simultaneously. The urban road network is modeled as a graph, and the traffic speed on roads is described as attributes of nodes on the graph. In the proposed method, the spatial dependencies are captured by a GCN based on the topological characteristics of the road network. Meanwhile, the dynamic variation of the sequential historical traffic speeds is captured by a GRU. Moreover, the global temporal variation trend is captured and assembled by the attention mechanism. Finally, the proposed A3T-GCN model is tested on the urban road network-based traffic forecasting task using two real datasets, namely, SZ-taxi and Los-loop. The results show that the A3T-GCN model is superior to HA, ARIMA, SVR, GCN, GRU, and T-GCN in terms of prediction precision under different prediction horizons, which demonstrates its validity in real-time traffic forecasting.