Dynamic Correlation Adjacency-Matrix-Based Graph Neural Networks for Traffic Flow Prediction

Modeling complex spatial and temporal dependencies in multivariate time series data is crucial for traffic forecasting. Graph convolutional networks have proved to be effective in predicting multivariate time series. Although a predefined graph structure can help the model converge to good results quickly, it also limits the further improvement of the model due to its stationary state. In addition, current methods may not converge on some datasets due to the graph structure of these datasets being difficult to learn. Motivated by this, we propose a novel model named Dynamic Correlation Graph Convolutional Network (DCGCN) in this paper. The model can construct adjacency matrices from input data using a correlation coefficient; thus, dynamic correlation graph convolution is used for capturing spatial dependencies. Meanwhile, gated temporal convolution is used for modeling temporal dependencies. Finally, we performed extensive experiments to evaluate the performance of our proposed method against ten existing well-recognized baseline methods using two original and four public datasets.


Introduction
The transportation system is one of the most critical infrastructures in modern cities, supporting the daily life of cities for millions of people to commute and travel. With rapid urbanization and population growth, the transportation system has become more complex. It includes road vehicles, rail transit, and various modes of shared travel (i.e., bike sharing) that have emerged in recent years.
However, urban expansion also faces many corresponding problems, such as air pollution, traffic congestion, etc. Early intervention based on traffic prediction is considered to be key to improving the transportation system's efficiency and alleviating the above problems. With the development of smart cities and smart transportation systems, sensors installed on roads (with regard to loop detectors) can sense traffic conditions and record transactions on subway and bus systems, deriving from traffic surveillance videos and even smartphones equipped with GPS receivers. Traffic forecasts are based on these historical traffic data, as well as external factors that influence traffic conditions, such as weather and holidays. Figure 1a, the road network in the map can be considered as a graph. Sensors are nodes with time series of their states as shown in Figure 1b, and the road segments are edges. Therefore, the datasets of the sensors can be thought of as multivariate time series that have long time spans. These kinds of data can easily print a heat map that shows the features of the road segments. Current navigation applications are able to show a real-time heat map, but, using the prediction of multivariate time series, we can create the heat map of the next hour. The feature can be the flow, speed, or travel time that helps the driver to choose the faster route. It can also be an advanced and composite index such as the passenger car unit (PCU) and saturation for management control. There are many practical application scenarios that can be studied in the field of traffic prediction, such as the travel time [1][2][3], speed [4][5][6][7][8], congestion [9,10], and traffic flow, which can be divided into taxi flow [11][12][13][14], bicycle flow [15,16], and highway flow [17][18][19][20][21], which are all studied in this paper.

Example 1. As shown in
The traditional way to predict time series uses the whole datasets as the input, which is slower and difficult to converge. Deep learning models such as graph convolutional networks use the historical data to train themselves, and then use fewer data as the input for the prediction. For example, if we train a model using 1 month of data, the model is able to predict the next hour using only 1 hour of data. Once the training is carried out, the model can be both faster and more accurate than the traditional method. Graph convolutional networks have proved to be effective in predicting multivariate time series. In addition to physical-level traffic networks, other latent graph structures can be found to improve the prediction of multivariate time series that do not contain an obvious graph structure. Graph structures that come with datasets are not easy to build, and are probably not the most helpful way to express the graph. The DCRNN [22] uses diffusion convolution on the graph to extract spatial dependencies and GRU to extract temporal dependencies, and it also uses PeMS-BAY and METR-LA datasets for the first time. STGCN [23] uses spectral graph convolution and gated one-dimension convolution to extract spatial and temporal dependencies, respectively. They provide fundamental methods for this area in the early stages.
Graph convolution requires an adjacency matrix, which is predefined in the datasets. However, the predefined graph structure has its limitations: it makes the process of information transfer on the graph fixed, but the spatial dependence is dynamic and changes with time. Therefore, some methods, such as the attention mechanism, are introduced to capture the dynamic dependence. The ASTGCN [24] adds spatial and temporal attention to capture dynamics based on former studies. Graph Wavenet [25] constructs an adaptive adjacency matrix through node embedding to assist the original matrix in the graph convolution. Furthermore, the MTGNN [26] and AGCRN [27] abandon the original graph structure and introduce a graph learning module to generate an adjacency matrix with random node embedding.
When the learning environment is harsh, such as an extensive range of data and insignificant node features, the above methods may face failure. Thus, we introduce a new way to construct dynamic adjacency matrices and put them in the encoder-decoder structure. The main contributions of this paper can be summarized as follows: • We propose a novel method for constructing an adjacency matrix using a correlation coefficient for graph convolution. • We constructed the adjacency matrix using input data accordingly and used a dynamic correlation convolution module to capture spatial and temporal dependencies with an improved GCN and TCN. • We made new datasets based on raw traffic data, and experimental results on both original and public datasets show that our model outperforms the baseline methods.
Organization. The remainder of this paper is organized as follows. First, we discuss the related works in Section 2. Then, we present the preliminaries and formally define our research problem in Section 3. We further propose the DCGCN model for traffic flow prediction in Section 4. After that, the experimental evaluation and results are reported in Section 5. Finally, we conclude this work in Section 6.

Multivariate Time Series Prediction
After years of effort, the research on traffic prediction has made great progress. According to the processing, these methods can be roughly divided into two categories: classical methods and deep-learning-based methods. Classical methods include statistical methods and traditional machine learning methods. The statistical method is to build a data-driven statistical model for prediction. The most representative algorithms are historical average (HA), auto-regressive comprehensive moving average ARIMA [28], and vector auto-regression [29]. However, these methods require data to satisfy certain assumptions, and time-varying traffic data are too complex to satisfy these assumptions. In addition, these methods are only applicable to relatively small datasets. For traffic prediction, some traditional machine learning methods are proposed. These methods have the ability to process high-dimensional data and capture complex nonlinear relations.
The deep0learning-based approaches studies how to learn a layer model and transform the original input directly into the expected output. Typically, deep learning models stack up basic learnable blocks, or layers, to form a deep architecture that trains the entire network end-to-end. Several architectures have been developed to handle large and complex spatialtemporal data. Generally, a convolutional neural network (CNN) is used to extract the spatial correlation of grid structure data described in images or videos. Graph convolutional networks (GCNs) extend the convolution operation to more general graph structure data, which are more suitable for representing the traffic network structure. In addition, recurrent neural networks (RNNs) and their variants, LSTM [30] or GRU [31], are often used to model temporal correlation.

Graph Adjacency Matrix for Traffic Forecasting
In former studies, the most common way to construct a predefined adjacency matrix needs the distance between each pair of nodes using a threshold Gaussian kernel [32]: where d ij is the geometric distance or distance on the road network between node i and j, σ is the standard deviation of distances, and α is the threshold that is used to control the sparsity of the matrix. Although a predefined graph structure can help the model converge to a good result quickly, it also limits the further improvement of the model due to its stationary state, which has two meanings: the existence of edges is fixed, as well as the weights of the current edges. To address this problem, many methods have been brought up in former studies that are able to construct an adjacency matrix in the model. These methods can be divided into two parts, partly independent methods and completely independent methods, according to whether they need the predefined existences of edges. Partly independent methods include those brought in the ASTGCN, STSGCN [33], and STFGNN [34], which preserve the existences of edges and learn the weights with a masked matrix or attention mechanism. Using a learnable matrix W ∈ R N×N is an obvious choice for completely independent methods, but it is extremely difficult for a random matrix to converge to an adjacency matrix that shows the graph structure correctly, so various methods that use node embedding to construct the adjacency matrix have been brought up in AGCRN, Graph Wavenet [25] and MTGNN, as shown in Table 1.
Using node embedding can decrease the number of learning parameters, thus lowering the difficulty. It also uses a threshold to control the sparsity of the adjacency matrix. These methods are able to learn the existences of edges, as well as the weight. Although a certain effect is made, an assumption is also made, which is that the graph structures of all samples are the same. The physical meaning of the differences between the samples is that the beginning time of the time series is different, which means that the above assumption can be interpreted as suggesting that the graph structure of multivariate time series is not time-variant. This goes against well-known facts. For example, peak or off-peak hours, day or night, sunny or rainy day, and workday or weekend and holidays are all time-variant external factors that affect the features in the datasets of traffic domain; thus, it is probably a better method for generating a dynamic adjacency matrix. Several methods that can generate the adjacency matrix by multiplying input data with a smaller matrix in the middle are brought up in the MTGNN and SLCNN [35], as shown in Table 1, where σ is one of the more commonly used activation functions, such as ReLU, tanh, sigmoid, or their combinations, and hyper-parameter α is used as a threshold to control the sparsity of the matrix. Ws are learnable parameters, W 1 , W 2 ∈ R N×K in the AGCRN, GWN, and MTGNN, and W 1 , W 2 ∈ R K×K in the SLCNN and MTGNN, where K N is the preset dimension. Table 1. Methods for constructing adjacency matrix.

Model Method Equation
When the learning environment is harsh, such as when there is a large range of data and insignificant node features, the above methods that generate the adjacency matrix by learning the node feature with node embedding may face failure. In fact, the adjacency matrix is just used to present spatial connections among nodes. Besides the given geological information and learning from ground zero, we can also use a statistic method to measure the correlation among nodes.

Preliminary
In this section, we introduce the statistical method that we used to construct an adjacency matrix for graph convolution, and the problem definition of multi-time-series forecasting using graph neural networks.

Correlation Matrix in Multiple Regression Analysis
In statistics, the Pearson correlation coefficient r can be used to measure the similarity between two variables X, Y ∈ R N , where its value is between −1 to 1. The two variables are positively correlated when r > 0, and negatively correlated when r < 0. The closer |r| gets to 1, the stronger the correlation. Normally, it can be divided into three parts: the variables have low correlation when |r| < 0.4, significant correlation when 0.4 ≤ |r| < 0.7, and high correlation when 0.7 ≤ |r| < 1.

Note that
Then, correlation coefficient r can be defined as When extended from two to multiple variables, the correlation coefficient between each pair of variables X i , X j ∈ R N can be formulated as a correlation matrix. Firstly, it can be defined as Apparently, |r ij | < 1. Then, the correlation matrix can be defined as

Problem Definition
If the variable changes through time, its value can be formulated as a time series. This variable can be one of the useful features in the traffic domain, such as the traffic flow of specific road segments, the average speed of moving vehicles, the occupancy of the road segment, etc. Each monitor or sensor is usually used to collect data for a unique road segment in real life, and thus can be considered as a node, and the whole road network can be considered as multiple time-variant variables X with a graph G. Denote the whole time span of the datasets as T ; then, X ∈ R N×T ×F , where N, F stand for spatial nodes and features, separately. Denote the history horizon of every sample as T in , and predict the horizon as T out . Then, sample sets can be generated from datasets X , which contain T − T in samples noted as X = X t−T in +1:t , where t stands for the latest time slice of each sample. The graph can be defined as G = (V, E, A), where V stands for spatial node sets, E stands for edge sets, and A stands for the adjacency matrix of the graph, where the values of the elements of A define both the existence and weight of the edges in E.

Definition 1.
Given multiple time-variant variables X with a graph G, the problem of T out steps forecasting using T in steps historical data of V time series can be defined aŝ

Dynamic Correlation Graph Convolutional Neural Networks
In this section, we introduce the DCGCN model proposed in this paper. First, we construct a correlation adjacency matrix using the statistic method mentioned in Section 3.1. Then, we put it in the graph convolution in the framework of the graph convolutional neural networks.

Method Overview
When using graph convolutional neural networks to deal with traffic prediction problems, a graph adjacency matrix is required to be input into graph convolution first. The core idea of this paper is also related to the construction of a graph adjacency matrix.
Due to the lack of a graph adjacency matrix in some data sets or the limitations of the adjacency matrix itself, we extended the method of calculating the correlation coefficient in statistics to constructing the adjacency matrix, and make it related to the input sample to maintain its dynamics during construction. The above work is represented in the overall structure by "dynamic correlation matrices". After obtaining the adjacency matrix, dynamic graph convolution and gated temporal convolution were carried out in the corresponding modules in the dynamic correlation layer. Finally, the results of each layer were processed through skip connection to complete the whole process.

Construction of Multiple Regression Dynamic Correlation Adjacency Matrix
The correlation matrix in multivariate time series can be formulated as where t is the time slice in samples. Then, the correlation can be interpreted as similarity among different trends of time series of different nodes. We also used α as a threshold to control the sparsity.
Since it is generated using input data X, A DR ∈ R B×N×N is sample-variant, which can be called a dynamic correlation adjacency matrix. The generated progress is in the dynamic correlation module (DCM) as shown in Figure 2.

Graph Convolutional Neural Networks for Traffic Forecasting
The whole model has an encoder-decoder structure. It has a dynamic correlation module to generate the adjacency matrices as shown in Figure 2, and a skip connection to prevent smoothing. The encoder and decoder are simple linear layers used to transform the shape of data. The hidden layers are several DCLayers, where each one contains a dynamic graph convolution module (DGCM, as shown in Figure 3), the difference between dynamic convolution and static convolution is shown in Figure 4. The DClayer also contains a gated temporal convolution module (GTCM, as shown in Figure 5), and the DClayer itself is shown in Figure 6. In addition, the whole structure is shown in Figure 7. . DGCM has a dynamic correlation graph convolution and an adaptive graph convolution that use adjacency matrices A DR that DCM generated and adaptive adjacency matrix A, respectively. DGCM combines the result of the two operations so that the shape of the input and output stays the same.

Dynamic Graph Convolution Module
After generating the dynamic correlation adjacency matrix, it can be used for graph convolution to extract the spatial dependency between nodes. This kind of dynamic graph convolution can be defined as where A ∈ R B×N×N , H ∈ R B×N×T×C , and B is the batch size, which is the number of samples that are actually put in the model in a single iteration. W and b are learnable parameters. In comparison, as shown in Figure 4b, normal graph convolution (with regard to not being dynamic) can be defined as where A ∈ R N×N , which means that it is the same in every sample in the iteration due to lacking the B dimension.  DGCM as shown in Figure 3 can be defined as where A A is generated using the methods in Table 1.

Gated Temporal Convolution Module
After using graph convolution to extract the spatial dependency, gated temporal convolution was used to extract temporal dependency as shown in Figure 5. It can be defined as where σ 1 , σ 2 activate functions tanh and sigmoid, ⊗ means the Hadamard product, and W and b are learnable parameters. During gated temporal convolution, parameter matrix W 1 , W 2 ∈ R T l ×T l+1 are used to transform the amount of time slice T l in the current hidden layer to the next hidden layer T l+1 . Without padding, T l+1 = T l − 2 after 1-D convolution.

Layer Model and Overall Structure
After a graph convolution through the spatial dimension and gated convolution through the temporal dimension, we added a residual connection to avoid over smoothing, constituting a DCLayer in hidden layers as shown in Figure 6, which can be defined as where H l ∈ R B×N×T l+1 ×C is the middle T l+1 time slices from H l ∈ R B×N×T l ×C . The overall structure of the model is the encoder-decoder structure. The encoder is an input layer that transforms the input feature F to hidden channel C, which can be defined as After the encoder, L DCLayers were used to extract spatial and temporal dependencies. To acquire the result from different time scales to prevent over smoothing, a skip connection was added, which can be defined as where H l ∈ R B×N×T L ×C is the middle T l+1 time slices from H l ∈ R B×N×T l ×C . Finally, the decoder has two output layers, used to transform the time slice T and hidden channel C, which can be defined as where Thus, the whole model as shown in Figure 7 can be defined as

Experiment
We conducted an extensive experimental study to evaluate the effectiveness and efficiency of the proposed DCGCN models. The following sections are introduced in the following order. An introduction to the generated original data set is described in Section 5.1. In Section 5.2, experimental indicators, a comparison model, experimental results, and a comparative analysis are introduced in detail. All of these experiments were tested on Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz and NVIDIA GeForce RTX 2060. The algorithms were implemented using Python 3.7 with PyTorch 1.7 and CUDA 10.1.

Calculation of Features from Traffic Stream
The traffic stream has a certain pattern that is influenced by the roads' status, space, and time. Typically, speed, flow, and density are used to describe traffic streams. These basic features can decrease the influence of roads, thus focusing more on space and time. The public datasets from PEMS [22] have all of these features, where the occupancy is similar to the density. The most common experiment from the traffic domain are speed prediction and flow prediction. Due to the easier and more accurate method of calculation, we chose flow as the feature of datasets.
The raw data of the original datasets are from an expressway networking toll data transmission system. The whole system produces millions of data per day, and each item has a large amount of specific information of vehicles, gantries, time stamps, etc. Thus, we cannot publish the datasets due to customers' and geometric information. The calculation of traffic flow does not need specific data, but only the number of items, because flow is defined as the number of vehicles passing through a cross-section in the interval of time. The system has gantries to play the role of the cross-section, and the interval can be set to 5 min, which is the same as PEMS datasets. We named them HBD2 and HBD5.
The datasets of the experiment consisted of original datasets and public datasets. We introduce how we generated the original datasets in the next section. The statistics of the datasets are given in Table 2. As shown in Table 2, we constructed the original dataset according to the standards of public data sets, and compared it in terms of several aspects: time slices, spatial vertices, features, the time span of the data source, included days, and the source. The number of time slices of the original data set is twice that of the public data set. This is because the data source of the original data set has a maximum time span of one month, but we make full use of it. They maintain the same order of magnitude in space vertices and the same feature.

Calculation of Adjacency Matrix
Regarding the matrix or tensor of features, datasets used for graph convolution may also need an adjacency matrix. Although we do not have the distance on the graph to calculate the connectivity shown in Formula (6), we have the location of the gantries to calculate the distance between each pair of gantries using the haversine formula: where δ θ = |δ i − θ j | is the longitude difference, δ φ = |φ i − φ j | is the latitude difference, and R is the radius of the Earth, which is set to 6371 km. After calculating the distance matrix, we used Formula (19) to generate adjacency matrix A.

Experimental Setting
Mean absolute error (MAE): Root mean square error (RMSE): where Ω is the set that participates in the averages, and changes in different situations, such as |Ω| = N or |Ω| = N × T, which confuses the prediction of whether it is the average of all nodes from a single time slice or the average of nodes from multiple time slices. The experiments in this paper belong to the latter situation. Another commonly used evaluation matrix is the mean average percentage error (MAPE): However, the feature of the datasets is the flow, which contains many zeros caused by rainy days, holidays, and other special circumstances (e.g., COVID-19 lockdown). Some of the codes of baseline models have not managed the dividing zero error, so the results only use the former two evaluation matrices.

Baseline Models
Baseline models are listed by the published time as follows: • DCRNN [22]: diffusion convolution recurrent neural network, which integrates graph convolution into an encoder-decoder gated recurrent unit. • STGCN [23]: spatio-temporal graph convolutional network, which integrates graph convolution into a 1D convolution unit. • ASTGCN [24]: attention-based spatial temporal graph convolutional network, which introduces spatial and temporal attention mechanisms into a model. • GWN [25]: Graph WaveNet, which combines an adaptive adjacency matrix and 1D dilated convolution, which can handle long sequences. • STSGCN [33]: spatial-temporal synchronous graph convolutional network, which uses localized spatial-temporal graphs to model localized correlations independently. • MTGNN [26]: multivariate time series graph neural network, which uses graph learning, graph convolution, and temporal convolution modules to extract uni-directed relations in an end-to-end framework. • AGCRN [27]: adaptive graph convolutional recurrent network, which uses node adaptive parameter learning module to capture node-specific patterns, and data-adaptive graph generation module to infer the inter-dependencies among different series. • STFGNN [34]: spatial-temporal fusion graph neural network, which generates a "temporal graph" to compensate for correlations that spatial graphs may not reflect, and uses the fusion operation of various spatial and temporal graphs to learn hidden spatial-temporal dependencies. • ST-Norm [36]: spatial and temporal normalization, which separately refine the highfrequency component and the local component underlying the raw data. Both modules can be integrated into other architectures. • STGODE [37]: spatial-temporal graph ordinary differential equation network, which captures spatial-temporal dynamics through a tensor-based ordinary differential equation.
All of the results on the PEMS datasets of the above models come from papers in AAAI-21 and AAAI-22 [34,38].

Main Results
The results of the original datasets, HBD2 and HBD5, are in Table 3. The results on public datasets, PEMS datasets, are in Table 4, where 15/30/60 means the number of minutes after the current time in the prediction, and the results in Table 4 are all predictions after one hour. We ran the codes that were provided by papers with original datasets; those with * mean that their model does not need a predefined adjacency matrix. GWN* is a special version of the ablation study in its paper. Although those models with * may have worse results, their requirements for datasets are lower, and the range of applications is expanded. Unless conditions such as AGCRN with HBD2 occur, the model is unavailable in this datasets because the results do not converge. In this case, the possible disadvantages of learning the graph structure completely through randomly initialized node embedding are reflected, which is not as stable as a predefined or calculated graph structure. Our model has two ways to choose from, so we can determine the approach based on the results. The STGCN and ST-Norm achieved better results on the HBD5 dataset than other datasets, comparing HBD2 with the public datasets in their respective papers, but this may be a special case. Their codes split the datasets by days, resulting in the need to drop part of the dataset to make the number of items divisible in order to run correctly. In addition, it can be divided into a training set, validation set, and test set with a correct ratio of 7:1:2. However, the economy and traffic of HBD5 may be easily affected by short-term external factors, or the excluded data may be be unstable data with large fluctuations. In short, this phenomenon makes this part of the results less credible, but should not affect the overall experimental results. The results on original datasets show that our model is better than baseline models, and the results on public datasets show that our model is 2% to 14% better than baseline models except for the AGCRN. Due to the condition mentioned above, the DCGCN has a wide application range, and is thus a better model. To summarize the results, we recognize that the DCGCN did not perform as well in the Table 3 original dataset as it did in the Table 4 public dataset with the best results from the DCGCN* or DCGCN in all datasets. However, the advantage of our model is that it has two choices. On the one hand, it can achieve the best accuracy on two original data sets, whereas the GWN, AGCRN, and ST-Norm can only achieve it on one data set. On the other hand, it can achieve good results even without an input of the adjacency matrix. Those models with * mean that they don't need a predefined adjacency matrix. Those models with * mean that they don't need a predefined adjacency matrix.

Component Analysis
To verify the effectiveness of the components of the proposed model, we conducted a component analysis on the PEMS08 dataset to validate the effectiveness of key components. The models without different components are as follows: • Without PA: the model without a predefined matrix uses a dynamic correlation matrix in the DGCM.  Table 5. We can draw three conclusions: (1) the model without PA performs better than the model without DC, which means that the dynamic correlation matrix can play the role of a predefined matrix when the latter is unavailable under certain circumstances without affecting the performance of the model; (2) the models without GC or TC perform far worse than other models because these two are key components in most GNN models in the traffic domain. Furthermore, we can see from the comparison of the results of the train loss, valid loss and test MAE that the former model is under-smooth and the latter is over-smooth; (3) the train loss and valid loss of the model without SP are similar to the final model, but the test MAE is approximately 17% higher, which means that the model is severely over-smooth. This verifies the effectiveness of special connections to prevent over-smoothing.  More details of the results of the above models are shown in Figure 9. (1) The curves of the model without PA are similar to the final model, which shows that the DCGCN can achieve a certain effect without a predefined adjacency matrix. In comparison, the model without DC only use a predefined matrix, which limits the spatial dependencies to have inferior results. (2) Both models without GC or TC cannot have good results, but their curves are very different due to the different circumstances of under-smoothing and oversmoothing. The former's curves are gentle and slow to converge whereas the latter's curves are fast to converge in training loss and validation loss but unstable in test MAE. (3) The model without SP is severely over-smooth; thus, its curve in test MAE becomes unstable after only 30 epochs, which shows that special connections are crucial as well.

Conclusions
In this paper, we studied the multivariate time series prediction problem and present a novel framework for spatial-temporal traffic data forecasting. The core idea of the model is to construct different similarity adjacency matrices according to different dynamic samples to carry out graph convolution, so as to capture the dynamics of space variably. Our model can deal with the dataset regardless of whether it has a predefined graph structure or not. By combining a predefined adjacency matrix, dynamic correlation matrix, and adaptive matrix, the DCGCN can learn global and localized spatial-temporal dependencies through spatial graph convolution and temporal convolution. The extensive tests on two original datasets and four public datasets verified the superiority of the proposed solutions in this paper. In the future, we can try to extend this method to the time dimension to further capture the dynamics of space-time in all aspects.