Spatiotemporal Adaptive Fusion Graph Network for Short-Term Traffic Flow Forecasting

Abstract: Traffic flow forecasting is challenging because it requires analyzing intricate spatial-temporal dependencies from incomplete information about spatial-temporal connections. Existing frameworks mostly build spatial and temporal models on a fixed graph structure and a given time series. However, a fixed adjacency matrix limits the learning of effective spatial-temporal correlations of the network because it carries incomplete information and misses genuine relations. To address this difficulty, we design a novel spatial-temporal adaptive fusion graph network (STFAGN) for traffic prediction. First, our model combines fusion convolution layers with a novel adaptive dependency matrix, learned by end-to-end training, to capture the hidden spatial-temporal dependencies in the data and complete the missing information. Second, STFAGN acquires, in parallel, hidden spatial-temporal dependencies through a fusion operation and temporal trends through fast-DTW. Meanwhile, we use the ReZero connection, a simple modification of deep residual networks, to facilitate deep signal propagation and faster convergence. Lastly, we conduct comparative experiments on two public traffic network datasets, whose results demonstrate the superiority of our algorithm over state-of-the-art baselines. Ablation experiments also confirm the rationality of the STFAGN framework.


Introduction
The number of vehicles on roads increases with the fast development of urbanization and the improvement of people's living standards. Ubiquitous deployment of intelligent traffic systems (ITS) is one of the effective ways to alleviate urban traffic congestion [1]. Intelligent traffic systems are growing fast with the development of sensor technology, which enables dynamic traffic data collection for predicting the future traffic flow of a road network [2]. Accurate traffic flow forecasting is promising for improving urban transportation [3,4].
Traffic flow forecasting is a challenging task. The methods in traffic forecasting can be divided into two categories: knowledge-driven methods and data-driven methods. In the early stage, queuing theory and behavior simulation were applied in knowledge-driven methods [5]. With the rapid growth of traffic data collection and storage technologies, data-driven methods, such as statistical and machine learning methods, have become increasingly popular. The main contributions of this paper are as follows:
• We design the temporal adjacency matrix to effectively capture temporal distances of the traffic flow, and the adaptive matrix to exploit hidden spatial dependencies beyond the static graph structure.
• We propose the spatial-temporal adaptive fusion graph network (STFAGN) to exploit spatial-temporal dependencies simultaneously by fusing the spatial and temporal graphs into a large adjacency matrix.
• We evaluate our model on two real-world traffic datasets with extensive experiments. The case study demonstrates that STFAGN outperforms the state-of-the-art methods.

Traffic Flow Forecasting
A road network can be represented as G = (V, E, A), where V is the set of nodes with |V| = N, and E is the set of edges in the network. The spatial adjacency matrix is represented as A ∈ R^{N×N}: if there is an edge between v_i and v_j, then A_{ij} = 1, and otherwise A_{ij} = 0. At each time step t, X_t ∈ R^{N×D} denotes the traffic status, e.g., road network occupancy, traffic speed, and capacity, in the road network. Traffic flow forecasting is to learn a function f that maps the historical traffic flow X_{(t−P+1):t} to that of the future, X̂_{(t+1):(t+Q)}:

X̂_{(t+1):(t+Q)} = f(X_{(t−P+1):t}; G),
where X_{(t−P+1):t} = (X_{t−P+1}, X_{t−P+2}, . . . , X_t) ∈ R^{P×N×D} and X̂_{(t+1):(t+Q)} = (X̂_{t+1}, X̂_{t+2}, . . . , X̂_{t+Q}) ∈ R^{Q×N×D}. Traffic flow forecasting is a spatial-temporal forecasting problem [25]. The methods in spatial-temporal forecasting are classified into two categories, RNN-based [17,26] and CNN-based [21,24]. Recently, many studies have employed graph convolutional networks in spatial-temporal forecasting, which has advanced the field by exploiting spatial-temporal dependencies more effectively. The dynamic graph convolutional recurrent network (DGCRN) [17] uses a hyper-network to generate a dynamic adjacency matrix, which is integrated with the static graph in a GCN model for training. Graph WaveNet [24] uses a self-adaptive adjacency matrix to preserve implicit spatial dependencies and stacked dilated causal convolutions to exploit temporal dependencies. Spatial-temporal fusion graph neural networks (STFGNN) [21] construct several graphs, which are integrated into a spatial-temporal fusion graph to explore spatial-temporal relationships simultaneously.
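To make the problem setup concrete, the sketch below builds a tiny toy road network and applies a trivial placeholder forecaster that simply repeats the last observation. All names, sizes, and the naive predictor are illustrative, not the paper's model; the point is only the tensor shapes of the formulation above.

```python
import numpy as np

# Toy road network: N nodes, binary adjacency A, history of P steps with
# D features per node. Sizes here are illustrative, not from the paper.
N, D, P, Q = 4, 2, 12, 12

A = np.zeros((N, N), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3)]:   # edges of a simple chain graph
    A[i, j] = A[j, i] = 1               # A_ij = 1 iff an edge exists

X_hist = np.random.rand(P, N, D)        # X_(t-P+1):t in R^{P x N x D}

def naive_forecast(X_hist, Q):
    """Placeholder for the learned mapping f: repeat the last observation."""
    return np.repeat(X_hist[-1:], Q, axis=0)   # shape (Q, N, D)

X_future = naive_forecast(X_hist, Q)    # X^_(t+1):(t+Q) in R^{Q x N x D}
```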

Graph Convolution Networks
Graph convolution networks can be viewed as a process of graph-based representation learning, aiming to apply deep learning to structured data. They are widely applied in node classification [27], graph classification [28], and link prediction [29]. Spectral-domain-based and spatial-domain-based methods are the two main approaches to GCN. The spectral-domain-based method applies the graph Fourier transform to process the graph signal in the spectral domain. Graph spectral filtering, via the eigendecomposition of the Laplacian matrix, exploits irregular graph data as follows:

y = U γ(Λ) U^T x,

where U ∈ R^{n×n} contains the eigenvectors of the Laplacian matrix L = I_n − D^{−1/2} W D^{−1/2} = U Λ U^T ∈ R^{n×n} (I_n is the identity matrix, and D ∈ R^{n×n} is the diagonal degree matrix with D_{ii} = ∑_j W_{ij}), Λ ∈ R^{n×n} is the diagonal matrix of eigenvalues of L, and γ(Λ) is the spectral filter, which is also a diagonal matrix.
Different from the spectral domain, the spatial-based method aggregates features from spatial neighbors to learn a high-dimensional representation. GraphSAGE [30] performs node-centered mini-batch training by aggregating each node's neighbors, enabling distributed training on large-scale data. GAT [31] uses the attention mechanism to aggregate neighbor nodes, realizing adaptive allocation of weights to different neighbors.

Fast-DTW
Fast dynamic time warping (fast-DTW) is a modified algorithm based on dynamic time warping (DTW) [32]. DTW is a classical algorithm for measuring time series similarity, like the Euclidean distance [33]. However, in most situations, two time series with very similar overall shapes are not aligned on the x axis. So, before comparing time series similarity, one of the time series needs to be warped along the timeline for better alignment. DTW is an effective way to achieve this warping distortion [34]. It calculates the similarity between two time series by extending and shortening them.
DTW fills a state-transition matrix dp ∈ R^{n×m} by the recurrence

dp(i, j) = |x_i − y_j| + min(dp(i−1, j), dp(i, j−1), dp(i−1, j−1)),

from which the warping path Φ = (w_1, w_2, . . . , w_λ, . . .) can be generated, where w_λ = (i, j) denotes the matchup between x_i and y_j. However, since real traffic time series are long, applying DTW to measure series similarity on real traffic data is challenging: the computational complexity is up to O(n^2). To address this problem, STFGNN [21] restricts the search length to T, yielding the fast-DTW algorithm. The search range is restricted as follows:

|i − j| ≤ T. (3)

With the restriction in Equation (3), the computational complexity declines from O(n^2) to O(Tn), making it feasible to process large, long traffic data.
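The banded recurrence above can be sketched as follows. This is an illustrative implementation of DTW with the search band |i − j| ≤ T, not the paper's code; function and variable names are our own.

```python
import numpy as np

def fast_dtw_banded(x, y, T):
    """DTW distance restricted to the band |i - j| <= T, as in fast-DTW.
    Filling only the band drops the cost from O(n^2) to O(T n)."""
    n, m = len(x), len(y)
    dp = np.full((n + 1, m + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - T), min(m, i + T)   # banded search range
        for j in range(lo, hi + 1):
            cost = abs(x[i - 1] - y[j - 1])
            dp[i, j] = cost + min(dp[i - 1, j],      # insertion
                                  dp[i, j - 1],      # deletion
                                  dp[i - 1, j - 1])  # match
    return dp[n, m]
```

Because DTW may stretch one series against the other, two series with the same shape but a small shift can still reach zero distance, unlike the pointwise Euclidean distance.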

ReZero
Trainability is related to dynamical isometry [35]. ReZero (residual with zero initialization) is a way to ensure initial dynamical isometry in deep networks [36]. It adds learnable parameters to the architecture of a deep residual network in order to dynamically promote well-behaved gradients and arbitrarily deep signal propagation. A skip connection and residual weights α_i are used to realize the non-trivial transformation of a layer F(x). The propagation is shown below:

x_{i+1} = x_i + α_i F(x_i).
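A minimal sketch of the update above, with the residual weight initialized to zero so that every block starts as the identity map. The class and attribute names are illustrative, not from the ReZero reference implementation.

```python
import numpy as np

class ReZeroBlock:
    """Sketch of a ReZero residual block: x_{i+1} = x_i + alpha_i * F(x_i).
    alpha_i is a learnable scalar initialized to zero, so at initialization
    the block is exactly the identity, which preserves signal propagation."""
    def __init__(self, fn):
        self.fn = fn        # the layer transformation F
        self.alpha = 0.0    # learnable residual weight, zero-initialized

    def __call__(self, x):
        return x + self.alpha * self.fn(x)

block = ReZeroBlock(np.tanh)   # np.tanh stands in for an arbitrary layer F
x = np.array([0.5, -1.0])
```

At initialization the block returns its input unchanged; as training moves alpha away from zero, the branch F gradually contributes.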

Methodology
As shown in Figure 1, our spatial-temporal adaptive fusion graph convolution network consists of three modules: the spatial-temporal adaptive fusion construction module (STAFCM), the spatial-temporal adaptive fusion graph neural module (STAFM) and the gated convolution module (GCM). First, the STAFCM constructs the spatial-temporal fusion adjacency matrix M_F to integrate spatial-temporal information. The proposed M_F contains the temporal adjacency matrix A_t computed by fast-DTW [21], the spatial adjacency matrix A_s and the temporal connectivity graph A_c, representing the given spatial-temporal connections in the traffic graph. The combination of M_F is displayed in the STAFCM of Figure 1, with blue for A_s, orange for A_t, and gray for A_c. Second, the STAFM completes the incomplete fusion adjacency matrix M_F in the fusion self-adaptive convolution layer and extracts hidden spatial-temporal features in gated multiplication layers. Basically, the STAFM is composed of a fusion self-adaptive convolution layer and stacked gated multiplication layers with a max pooling layer. The fusion self-adaptive convolution layer learns the adjacency matrix from data through end-to-end supervised training to construct the self-adaptive fusion adjacency matrix M̃_F. The gated multiplication layers aggregate the spatial-temporal dependencies by matrix multiplication with M̃_F. Third, the GCM extracts long-range spatial-temporal dependencies with a large dilation rate, as a gating mechanism in recurrent neural networks does. We stack k STFAGN layers to capture hidden spatial-temporal dependencies.

Spatial-Temporal Adaptive Fusion Convolution Layer
In this paper, we design a spatial-temporal adaptive fusion convolution layer to extract long-term spatial-temporal dependencies in the GCM module. The layer can also establish the self-adaptive fusion adjacency matrix to supplement spatial-temporal connections in the STAFM module. In Figure 1, the STFAGN layer represents the spatial-temporal adaptive fusion convolution layer. The layer mainly consists of the GCM and STAFM modules.
A fixed adjacency matrix is not always relevant to prediction tasks, which may cause considerable biases. However, it is costly to collect complete and precise road information with sensors. To adjust the incomplete adjacency matrix, Wu et al. [24] introduced a self-adaptive adjacency matrix that constructs and complements the adjacency matrix without prior knowledge through learnable node embeddings. Given two node embeddings M_i, M_j ∈ R^{N×D}, the self-adaptive adjacency matrix is M̃ = σ(ϕ(M_i M_j^T)), where σ(·) and ϕ(·) denote the softmax and ReLU activation functions, respectively. Supposing the graph is directed, diffusion proceeds in two directions, forward and backward. They define the forward transition matrix as S_f = A / rowsum(A) and the backward transition matrix as S_b = A^T / rowsum(A^T) [24]. They integrate the diffusion convolution layer with the self-adaptive adjacency matrix and define the diffusion self-adaptive convolution layer as

Z = ∑_{k=0}^{K} S_f^k X W_{k1} + S_b^k X W_{k2} + M̃^k X W_{k3}.
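The self-adaptive adjacency construction, M̃ = softmax(ReLU(M_i M_j^T)), can be sketched as below. The embeddings are random stand-ins for learnable parameters; names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def self_adaptive_adj(E1, E2):
    """Self-adaptive adjacency as in Graph WaveNet:
    A_adp = softmax(ReLU(E1 @ E2.T)).
    ReLU sparsifies weak connections; row-wise softmax normalizes each
    node's outgoing weights to a probability distribution."""
    return softmax(np.maximum(E1 @ E2.T, 0.0), axis=1)

rng = np.random.default_rng(0)
E1 = rng.normal(size=(5, 3))   # source-node embeddings (learnable in practice)
E2 = rng.normal(size=(5, 3))   # target-node embeddings (learnable in practice)
A_adp = self_adaptive_adj(E1, E2)
```

Because the embeddings are trained end-to-end, the resulting matrix can encode dependencies absent from the distance-based graph.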

Spatial-Temporal Adaptive Fusion Construction Module
We present the STAFCM module to aggregate spatial-temporal dependencies in the spatial-temporal fusion adjacency matrix. As displayed in Figure 1, the spatial-temporal fusion adjacency matrix M_F ∈ R^{KN×KN} consists of the temporal adjacency matrix A_t ∈ R^{N×N} computed by fast-DTW [21], the spatial adjacency matrix A_s ∈ R^{N×N} and the temporal connectivity graph matrix A_c ∈ R^{N×N} [21]. Fast-DTW is applied to construct the temporal adjacency matrix A_t: we add the similarity of temporal trends into an adjacency matrix with fast-DTW from Equations (3) and (4). A_c represents the connection of the same node across adjacent time steps: for each node l ∈ {1, 2, . . . , N}, M_{F,ij} = 1 when i = t × N + l and j = (t + 1) × N + l, where t is the current time step. Each node can integrate the spatial relevance from A_s, temporal pattern information from A_t, and its own correlation with the proximal time step from A_c by matrix multiplication with M_F. In this paper, K defaults to 4. A_t captures the temporal information of the time sequence, while A_s is given by the fixed dataset. Finally, as Figure 1 shows, the spatial-temporal fusion adjacency matrix M_F is constructed.
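One plausible assembly of M_F, under the description above, is sketched below: A_s and A_t share the diagonal N×N blocks, and identity blocks stand in for the A_c connections linking the same node at adjacent time steps. The exact block layout in the paper may differ; this is an assumption for illustration.

```python
import numpy as np

def build_fusion_graph(A_s, A_t, K=4):
    """Sketch of a KN x KN spatial-temporal fusion matrix M_F (illustrative):
    - diagonal blocks carry the spatial graph A_s and temporal graph A_t;
    - off-diagonal identity blocks encode A_c, i.e. M_F[t*N+l, (t+1)*N+l] = 1,
      connecting node l at step t with itself at step t+1."""
    N = A_s.shape[0]
    M = np.zeros((K * N, K * N))
    I = np.eye(N)
    for t in range(K):
        M[t*N:(t+1)*N, t*N:(t+1)*N] = A_s + A_t   # blue + orange blocks
    for t in range(K - 1):
        M[t*N:(t+1)*N, (t+1)*N:(t+2)*N] = I       # gray A_c blocks
        M[(t+1)*N:(t+2)*N, t*N:(t+1)*N] = I
    return M
```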

The input of the STAFM module is formulated as X ∈ R^{T×N×D×C}, where C denotes the number of channels in the STAFM module.

Gated Convolution Module
We design the GCM module to capture long-term spatial-temporal information with a large dilation rate. Gating mechanisms in recurrent neural networks help extract the long-term relevance of traffic flow, as do gated temporal convolutions [37]. A gated TCN with dilation factors (e.g., 1, 2, and 4) can learn complicated temporal correlations [24]. However, because the GCM employs a larger dilation rate, it differs from the TCNs in Graph WaveNet [24] and STGCN [38] by extracting more long-range spatial-temporal dependencies. Given the whole input data X ∈ R^{T×N×D×C}, it takes the following form:

h = φ(Φ_1 ⋆ X) ⊙ σ(Φ_2 ⋆ X),

where σ(·) and φ(·) are the sigmoid and tanh functions, and ⊙ is the Hadamard product. Importantly, Φ_1 and Φ_2 stand for two 1D convolutions with dilation factor K − 1, which controls the skipping distance to enlarge the receptive field along the time axis [24].
Figure 2 displays the STAFM module, a deep learnable convolution model that captures hidden spatial-temporal features to complement the incomplete spatial-temporal connections. The STAFM module is made up of the fusion self-adaptive convolution layer and stacked gated multiplication layers followed by a max pooling layer with ReZero connections. Li et al. [21] introduced the STFGN module in STFGNN to capture complicated correlations by applying STFGN modules independently in parallel. However, the STFGN module cannot adaptively complete missing connections of the graph. Compared with STFGNN, the novel STAFM module constructs the fusion adjacency matrix of spatial-temporal relationships to complete the incomplete spatial-temporal adjacency matrix.
Fusion Self-Adaptive Convolution Layer: Based on the spatial-temporal fusion adjacency matrix and the diffusion self-adaptive convolution layer, we propose a fusion self-adaptive convolution (FSAC) layer to adaptively learn the self-adaptive fusion adjacency matrix M̃_F ∈ R^{KN×KN} through an end-to-end learnable convolution layer:
M̃_F = softmax(ReLU(M_i M_j^T)),

where M_i ∈ R^{KN×d} and M_j ∈ R^{KN×d} are spatial-temporal fusion node embeddings of source and target nodes; both are learnable parameters. We adopt the ReLU activation function to alleviate overfitting. We define S_f = M_F / rowsum(M_F) and S_b = M_F^T / rowsum(M_F^T). Lastly, the graph convolution layer with M̃_F can be summed up as

Z = ∑_{k=0}^{K} S_f^k H W_{k1} + S_b^k H W_{k2} + M̃_F^k H W_{k3}.
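The GCM gating described above, tanh and sigmoid branches over dilated 1D convolutions, can be sketched on a single time series as follows. The hand-rolled convolution and the fixed weights are illustrative, not the paper's trained filters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dilated_conv1d(x, w, dilation):
    """Valid 1D convolution along time with the given dilation rate.
    Output length is len(x) - (len(w) - 1) * dilation."""
    K = len(w)
    T_out = len(x) - (K - 1) * dilation
    return np.array([sum(w[k] * x[t + k * dilation] for k in range(K))
                     for t in range(T_out)])

def gated_tcn(x, w1, w2, dilation=3):
    """Sketch of the gated temporal convolution in the GCM:
    tanh(Phi_1 * X) ⊙ sigmoid(Phi_2 * X). A larger dilation rate widens
    the receptive field along the time axis."""
    return np.tanh(dilated_conv1d(x, w1, dilation)) * \
           sigmoid(dilated_conv1d(x, w2, dilation))

out = gated_tcn(np.arange(12.0), [0.5, 0.5], [0.1, 0.1], dilation=3)
```

The tanh branch produces candidate features in (−1, 1); the sigmoid branch acts as a soft gate in (0, 1) deciding how much of each feature passes through.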

Spatial-Temporal Adaptive Fusion Graph Neural Module
Gated Multiplication Layer: In a gated multiplication layer, we replace the spectral filter with matrix multiplication to integrate complicated spatial-temporal correlations. The gated multiplication layer captures hidden spatial-temporal correlations by matrix multiplication; therefore, in the STAFM layer, a graph multiplication layer combines matrix multiplication and the GCM. In the graph multiplication layer, a gated linear unit (GLU) summarizes global characteristics after nonlinear activation. We introduce the parameters of the GLU as W_1, W_2 ∈ R^{C×C} and b_1, b_2 ∈ R^C, with M̃_F ∈ R^{KN×KN}. Let H^l, H^{l+1} ∈ R^{KN×C} denote the l-th and (l+1)-th hidden features. Then, we formulate the gated multiplication as

H^{l+1} = (M̃_F H^l W_1 + b_1) ⊙ σ(M̃_F H^l W_2 + b_2),

where ⊙ is the Hadamard product in the GLU and σ is the sigmoid function. Different from residual connections [39,40], stacked layers with ReZero connections [36] quickly obtain the complex spatial correlations of each layer. The next layer is the max pooling layer, which concatenates the hidden states as H_P = maxPool([H^1, H^2, . . . , H^L]) ∈ R^{K×N×D×C}. As illustrated in Figure 2, after the max pooling layer, H_P is cut into the shape R^{1×N×D×C}, which can represent complicated anisotropy [21]. There are T/(K−1) − 1 layers stacked in the STAFM. As a consequence, the cropped connection of the intermediate time step is organized as H_p = H_m[K/2 : K/2 + 1, :, :, :] ∈ R^{1×N×D×C}.
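The gated multiplication step, aggregation over the fusion graph followed by a GLU, can be sketched as below. Shapes and the random example inputs are illustrative; in the model, M̃_F and the weights are learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_multiplication(H, M_F, W1, b1, W2, b2):
    """Sketch of one gated multiplication layer:
    first aggregate spatial-temporal neighbors via the fusion matrix
    (Z = M_F @ H), then apply a GLU: (Z W1 + b1) ⊙ sigmoid(Z W2 + b2)."""
    Z = M_F @ H                                     # graph aggregation
    return (Z @ W1 + b1) * sigmoid(Z @ W2 + b2)     # GLU gating

rng = np.random.default_rng(1)
KN, C = 8, 4                                        # K*N fused nodes, C channels
H = rng.normal(size=(KN, C))                        # hidden features H^l
M_F = rng.random((KN, KN))                          # stand-in fusion matrix
W1, W2 = rng.normal(size=(C, C)), rng.normal(size=(C, C))
b1, b2 = np.zeros(C), np.zeros(C)
H_next = gated_multiplication(H, M_F, W1, b1, W2, b2)
```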
ReZero Connection: We adopt the ReZero connection in STFAGN for faster training and convergence. Bachlechner et al. [40] demonstrated that ReZero has several benefits, including wide usability, deeper learning and faster convergence. Compared with the residual connection, ReZero (residual with zero initialization) is a simple change to deep residual networks that facilitates dynamical isometry and enables the efficient training of extremely deep networks [36]. So, we substitute ReZero for the residual connection [40]. Given F[W_i], which includes the STAFM layer, the STFAGN layer and so on, the signal now propagates according to

x_{i+1} = x_i + θ_i F[W_i](x_i),

where θ_i represents the i-th learnable parameter, named the residual weight [36]. Multiple STAFM layers operate on the input signal in parallel to extract spatial-temporal dependencies by gated multiplication. The output data of shape R^{T×N×D×C} are transformed into R^{(T−K+1)×N×D×C}. Let ϑ denote a hyperparameter that controls the sensitivity of the squared error. We apply the Huber loss as the loss function:

L(X̂, X) = (1/2)(X̂ − X)^2 if |X̂ − X| ≤ ϑ, and ϑ|X̂ − X| − (1/2)ϑ^2 otherwise.
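The Huber loss used for training, quadratic for small errors and linear beyond the threshold ϑ, can be sketched as:

```python
import numpy as np

def huber_loss(y_pred, y_true, delta=1.0):
    """Huber loss with threshold delta (the hyperparameter theta above):
    0.5 * err^2 when |err| <= delta, else delta * |err| - 0.5 * delta^2.
    The linear tail makes it less sensitive to outliers than squared error."""
    err = np.abs(y_pred - y_true)
    quad = 0.5 * err ** 2
    lin = delta * err - 0.5 * delta ** 2
    return np.mean(np.where(err <= delta, quad, lin))
```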

Datasets and Baseline
Under the same hardware environment and the same datasets, we conduct comparative experiments against other advanced baselines. We verify the effectiveness of STFAGN on two traffic signal datasets, METR-LA and PEMS-BAY [18]. METR-LA is constructed from records of highways in Los Angeles County, which measure traffic speed with sensors over a period of four months. PEMS-BAY comprises traffic speeds of the Bay Area over a period of six months. For both datasets, sensors record traffic speed every 5 min, so adjacent time steps differ by 5 min. METR-LA and PEMS-BAY contain 207 and 325 sensors, respectively, with 1515 and 2369 edges. Before training, the data are pre-processed in the same way as in [18]. The adjacency matrix of both datasets is established on a distance-based graph with a Gaussian kernel threshold [17]. The datasets are split into 70% for training, 20% for validation, and 10% for testing. For more details, see Table 1. We compare STFAGN with the following models.
• Graph WaveNet: a spatial-temporal graph model with a stacked dilated 1D convolution component and self-adaptive adjacency layers [24].
• STFGNN: spatial-temporal fusion graph neural networks, with a gated dilated CNN module and a spatial-temporal fusion graph module in parallel [21].
• ARIMA: autoregressive integrated moving average [6,41] with a Kalman filter, widely used in time series analysis, which fits time series data to predict future points in the series.
• SVR: support vector regression, using a support vector machine to regress traffic sequences, characterized by the use of kernels, sparse solutions, and VC control of the margin and the number of support vectors [8].

Experiments Results and Analysis
Our experiments are run on an Intel(R) Xeon(R) Gold 6139 CPU @ 2.30 GHz, with NVIDIA driver version 455.45.01 and CUDA version 11.1. The temporal adjacency matrix A_t is generated by fast-DTW in Algorithm 1. The sparsity of A_t is 0.01. The batch size is 32 and the number of training epochs is 100. The learning rate of the Adam optimizer is 1.0 × 10^{−3}. In the STAFM, the number of gated multiplication layers is 3. The STFAGN includes 8 parallel STAFM layers with dilation rate 3 and 1 gated multiplication layer. The filter size in the model is 3 × 3, with 64 channels. We set the dilation rate to 3 because the size K of the spatial-temporal fusion adjacency matrix is 4, aggregating information through the neighbors. For all experiments, we use 12 past time steps of the traffic signal to predict 3, 6 and 12 time steps in the future. Table 2 compares the prediction accuracy of each model. Each experiment trains on the past 60 min of data to predict the next 15, 30, and 60 min of traffic speed on the METR-LA and PEMS-BAY datasets. On the mean result over the 60 min horizon on METR-LA, STFAGN improves on Graph WaveNet by 3.56%, on STFGNN by 1.00%, on ARIMA by 7.30% and on SVR by 6.60%. On the mean result on PEMS-BAY over the three horizons, STFAGN improves by roughly 0.1% to 2.5% over the other baselines. Thus, STFAGN surpasses classical data-driven approaches such as ARIMA and SVR. Moreover, STFAGN outperforms the previous convolution-based models, including Graph WaveNet and STFGNN.
Traffic flow is nonlinear data with complex spatiotemporal correlation. ARIMA only captures linear relationships, and SVR fails to exploit the spatial correlation of the traffic network, so both perform poorly in traffic prediction. Graph WaveNet performs worse because it cannot construct an adaptive fusion spatial-temporal adjacency matrix from incomplete spatial-temporal relevance. Compared with STFGNN, the second best framework, STFAGN is slightly better on the 15 min and 30 min horizons but significantly better on the 60 min horizon. We consider two reasons for the effectiveness of STFAGN. First, our model adjusts the spatial-temporal adjacency matrix more adaptively by constructing the spatial-temporal adaptive fusion layer. Second, STFAGN is more effective in extracting long-term temporal correlations by integrating the STAFM module with a gated CNN module.
In general, the average result of the presented STFAGN model is superior to the baselines in extracting spatial-temporal relevance. The standard deviations show that the training results are also relatively stable.

Ablation Experiments
To verify the significance of different components in STFAGN, we conduct ablation experiments on METR-LA and PEMS-BAY. "Model Elements" denotes the different configurations. From the experiments in Table 3 and Figure 3, we draw the following conclusions concerning the proposed ideas:
• For the component M̃, the fusion self-adaptive convolution layer constructs the adaptive fusion adjacency matrix, which can complete the missing information of the adjacency matrix in the traffic network. Distance-based traffic networks do not guarantee that adjacent nodes share a traffic information association. The self-adaptive fusion adjacency matrix M̃ makes up for this information and also accelerates convergence.
• ReZero, a simple architectural modification, facilitates signal propagation in deep networks and helps the network maintain dynamical isometry. Applying ReZero to STFAGN yields significantly improved convergence speed.
• STFAGN with the adaptive fusion adjacency matrix and ReZero connection not only adjusts spatiotemporal dependencies but also trains efficiently. Therefore, the design of these components is reasonable.

Conclusions
In this paper, we propose an innovative spatial-temporal network to forecast traffic data. We design the spatial-temporal adaptive fusion graph network to capture hidden spatial-temporal heterogeneity effectively. First, the learnable spatial-temporal fusion adjacency matrix adaptively adjusts the spatiotemporal connections. Second, we integrate the STAFM module with a gated CNN module, which effectively broadens the receptive field in the time dimension. Lastly, we replace the residual connection with the ReZero connection to enable faster convergence. Ablation experiments show that the design of the fusion adjacency matrix and ReZero connection is reasonable and effective. Extensive experiments and analysis reveal the advantages and weaknesses of previous models, which in turn demonstrate the effectiveness and superiority of STFAGN.
We found that it is still challenging to effectively extract the dynamics of traffic data in both temporal and spatial dimensions. The proposed spatial-temporal graph convolution network fails to capture many dynamic spatial relations hiding in the traffic data. In the future, we plan to further analyze the dynamic characteristics of traffic networks to capture the dynamic spatial-temporal correlation.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
fast-DTW  fast dynamic time warping
STAFCM    spatial-temporal adaptive fusion construction module
STAFM     spatial-temporal adaptive fusion graph neural module
GCM       gated convolution module
STFAGN    spatial-temporal adaptive fusion graph network
FSAC      fusion self-adaptive convolution layer