Spatial–Temporal Fusion Gated Transformer Network (STFGTN) for Traffic Flow Prediction

Traffic flow prediction is essential for smart city management and planning, aiding in optimizing traffic scheduling and improving overall traffic conditions. However, due to the correlation and heterogeneity of traffic data, effectively integrating the captured temporal and spatial features remains a significant challenge. This paper proposes the spatial–temporal fusion gated transformer network (STFGTN), an attention-based model that integrates temporal and spatial features to model the complex spatial–temporal dependencies in road networks. The self-attention mechanism enables the model to capture long-term dependencies and a global representation of time series data. Regarding temporal features, we incorporate a time embedding layer and a temporal transformer to learn temporal dependencies, which contributes to a more comprehensive and accurate understanding of spatial–temporal dynamic patterns throughout the entire time series. As for spatial features, we utilize a dynamic graph convolutional network (DGCN) and a spatial transformer to capture local and global spatial dependencies, respectively. Additionally, we propose two fusion gate mechanisms to effectively accommodate the complex correlation and heterogeneity of spatial–temporal information, resulting in a more accurate reflection of actual traffic flow. Experiments on three real-world datasets illustrate the superior performance of our approach.


Introduction
Intelligent transportation systems (ITS) [1] are integral to the development of smart cities. Within ITS, traffic flow prediction [2] plays a crucial role by accurately forecasting future traffic conditions based on historical observations. Studies have demonstrated that precise traffic flow prediction is essential for tasks such as alleviating congestion and forecasting taxi demand [3]. In the transportation domain, "demand" typically refers to the travel or transportation needs of people or goods between different locations. This encompasses factors like the quantity of trips or the volume of goods transported from one location to another. Short-term origin–destination (OD) flow prediction [4,5] is particularly significant for urban rail transit operation planning, control, and freight transport management.
The typical spatial and temporal characteristics observed in traffic flow prediction data pertain to the variations in data across time and location. This trait underscores the prevalence of correlation and heterogeneity within traffic flow data. Correlation primarily entails autocorrelation across both temporal and spatial dimensions. For instance, as depicted in Figure 1, a traffic incident occurring at a specific road node may have a prolonged impact on adjacent road segments' traffic flow, persisting over multiple time intervals. Heterogeneity, conversely, is evidenced by diverse patterns observed at different temporal or spatial scales. For example, certain holidays or highly trafficked areas may exhibit distinct traffic flow features. The primary challenge in predicting traffic flow is therefore effectively capturing and modeling the complex and dynamic correlation and heterogeneity of traffic data. Traditional methods, such as support vector regression (SVR) [6], Bayesian methods [6,7], and vector autoregressive models [8], often rely on complex feature engineering and have poor generalization capabilities.
In recent years, the field of traffic flow prediction has seen widespread use of hybrid neural networks based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [8-12] due to the development of deep learning techniques. Examples of such networks include ConvLSTM [13] and PredRNN [14]. However, these methods have a limitation in that they cannot directly address non-Euclidean data inherent in urban systems, such as vehicle flows on road networks. Graph neural networks have rapidly developed to fill this gap in traffic flow prediction. For complex spatial-temporal dependencies, some approaches take into account the existence of multiple spatial relationships by constructing multiple graphs, such as STMGCN [15] and STFGNN [16]. Graph WaveNet [17] utilizes adaptive graph learning to learn spatial dependencies. Regarding temporal dependencies, there are various approaches, including TCNs that use receptive fields of different sizes, such as MTGNN [18]. There are also approaches that capture spatial-temporal dependencies by integrating different learning networks, such as ASTGCN [19]. Additionally, spatial-temporal synchronization graphs can be constructed to establish unified spatial-temporal dependencies across time steps.
However, such an adjacency matrix is based on road adjacency or time series similarity and only takes into account the static spatial dependence between roads. Spatial relationships on real roads undergo dynamic changes influenced by factors such as weather conditions, holidays, and emergencies. Capturing this dynamic change is difficult with a single temporal or spatial module. Furthermore, spatial-temporal relationships in traffic flow tasks are often complex and diverse, involving multiple patterns at different time scales and spatial locations. To accurately and comprehensively fuse this complex information, it is crucial to design model structures that can adapt to different modes and dynamic changes. Consequently, enhancing the accuracy and robustness of traffic flow prediction hinges on the model's adeptness at effectively fusing spatial-temporal information.
Compared to previous traffic flow prediction models based on encoder-decoder transformer architectures, STFGTN makes the following improvements:
• We introduce a novel spatial-temporal dependency fusion model (STFGTN) for traffic flow prediction, leveraging an attention mechanism. This model effectively captures spatial-temporal correlations, aggregates relevant information, and notably enhances traffic flow prediction accuracy.

Deep Learning for Traffic Prediction
In recent years, there has been a surge in the development of deep learning frameworks aimed at addressing traffic flow prediction challenges, with the primary goal of enhancing prediction accuracy. Initially, spatial regions were divided into two- or three-dimensional grids, serving as input windows for convolutional neural networks to forecast traffic flow. ST-ResNet [9] utilized residual convolution techniques for crowd flow prediction. Yao et al. [12] employed convolutional neural networks (CNNs) in the spatial domain and long short-term memory (LSTM) in the temporal domain to capture spatial-temporal dependencies in traffic flow data. These methods operate on gridded traffic data and apply convolution operations in a structured manner with respect to spatial dimensions to capture spatial correlations. However, they do not account for non-Euclidean dependencies between nodes.
Graph neural networks (GNNs) have showcased remarkable performance in modeling graph data, rendering them a favored choice for various graph-related tasks like graph classification [20], node classification [21], and recommender systems [22]. Recent studies have integrated spatial graphs into traffic prediction by employing spatial-temporal graph models to handle the graph structure of spatial-temporal data; this approach has been explored by numerous researchers [18,23-27]. The traffic data is organized as a graph using a spatial-temporal graph neural network (STGNN) and utilized for prediction. STGNN models divide into two families, RNN-based and CNN-based, which use RNNs and CNNs, respectively, to conduct forward computation along the temporal dimension. Attention mechanisms have gained popularity in this domain due to their efficacy in capturing dynamic dependencies in traffic data [19,28-31]. However, these models do not comprehensively address the dynamic spatial-temporal dependencies between nodes within the road network at both local and global scales.
Additionally, deep learning techniques aid in origin-destination (OD) estimation for traffic flow prediction [32], where deep learning methods and global sensitivity analysis are used to solve OD estimation and sensor placement issues.

Graph Convolution Networks
Graph convolutional networks (GCNs) are powerful tools for capturing spatial dependencies in non-Euclidean spaces. The crux of GCNs lies in the adjacency matrix, which provides inherent topological information to the model. However, real-world traffic dynamics can vary significantly, even among locations with the same central point, due to factors such as temporal variations. Therefore, the static adjacency matrix used by GCNs fails to accurately capture the dynamic nature of traffic diffusion. Popular GCNs can be classified into three main types: spectral GCN [33,34], ChebNet [35], and GAT [36]. With regard to spectral GCN, Bruna et al. used the Fourier transform to convert graph signals from the spatial domain to the spectral domain for convolution computations. Building on ChebNet, and to overcome the dependence on the graph Laplacian matrix, Kipf et al. performed message passing in the spatial domain to simplify the graph convolution operation. To account for the importance of neighboring nodes in learning spatial dependencies, GAT integrates an attention mechanism into the node aggregation operation.

Transformer
The attention mechanism aims to enhance model performance by efficiently assigning weights and focusing on different parts of the information, making it more adaptable to various tasks and data contexts. This concept has been successfully applied in several deep learning models, including the Transformer architecture in natural language processing [37] and the Swin Transformer in computer vision [38], which achieved remarkable image classification performance. Additionally, numerous variants of the Transformer have demonstrated promising results in computer vision tasks.
Recent studies have shown that introducing the Transformer architecture to traffic flow prediction [27,30,39] addresses the limitations of static structures. The original Transformer employs an encoder-decoder structure, utilizing encoder and decoder stacks to extract deep features, along with a multi-head mechanism to capture long-term dependencies in sequences. For example, in TFormer [40], a K-hop adjacency matrix guides the model to focus on nearby neighboring nodes and ignore distant ones; it describes the adjacency between nodes in the graph and helps the model accurately capture local spatial features. Traffic Transformer [41] consists of a global encoder and a global-local decoder, integrating global and local spatial features through multi-head attention. It utilizes temporal embedding blocks to extract temporal features, positional encoding and embedding blocks to understand node locations, and concludes with a linear layer for prediction. In our approach, we leverage the Transformer solely as a spatial-temporal dependency extractor, deviating from the original encoder-decoder structure.

Problem Definition
Definition 1 (Road network). We denote the road network as G = (V, E, A), where V = {v_1, v_2, ..., v_N} denotes the set of N nodes (|V| = N) in the road network, E is the set of edges describing connectivity between nodes, and A ∈ R^{N×N} is the adjacency matrix of the road network, used to describe the spatial distance between nodes.

Definition 2 (Traffic signal matrix). The traffic state at any time step t can be regarded as a graph signal X_t ∈ R^{N×D}, where D is the dimension of the traffic state, which includes traffic flow, speed, etc. We use X = (X_1, X_2, ..., X_T) ∈ R^{T×N×D} to denote the traffic flow tensor of all nodes over T time steps.

Problem Formalization
Traffic flow prediction forecasts traffic flow in a future time period by analyzing observed historical traffic data. In a transportation system, we have the observed historical traffic flow X = (X_{t-T+1}, ..., X_t) ∈ R^{T×N×D} and a known spatial graph A. Our objective is to learn a mapping function f from the observed historical traffic flow over the past T steps to the traffic flow at the future T′ steps:

(X_{t+1}, ..., X_{t+T′}) = f(X_{t-T+1}, ..., X_t; A)

Methods
Figure 2 illustrates the architecture of STFGTN, comprising a data embedding layer, multiple stacked spatial-temporal modules connected in sequence, and an output layer. We provide a detailed description of each module below.

Data Embedding Layer
In this study, we introduce an input embedding, a widely employed and effective technique. The data embedding layer's main task is to transform input data into a high-dimensional representation. Specifically, it first transforms the original input X into E_f ∈ R^{T×N×d_f}, a vector in a higher-dimensional space, through a fully connected layer:

E_f = FC(X_{t-T+1:t})

where d_f is the dimension of the embedding, X_{t-T+1:t} is the traffic series over the previous T timestamps, and FC(·) denotes a fully connected layer.
In order to better model the periodicity of traffic flow, we design an embedding mechanism that effectively incorporates time-periodicity information into the model. Urban traffic flow is highly periodic, as it is influenced by people's travel patterns and lifestyles. For the time periodicity information, we introduce two embeddings to represent the weekly and daily periodicity, respectively, denoted as T_w ∈ R^{N_w×d_f} and T_d ∈ R^{N_d×d_f}, where N_w = 7 is the number of days in a week and N_d = 288 is the number of timestamps in a day. W_t ∈ R^T and D_t ∈ R^T denote the day-of-week and time-of-day data of the traffic flow sequence, respectively. We extract the corresponding temporal embeddings E_w ∈ R^{T×d_f} and E_d ∈ R^{T×d_f} by using them as indexes. The periodicity embedding E_p ∈ R^{T×N×2d_f} for the traffic time series is obtained by concatenating and broadcasting them.
In addition, we use the temporal position encoding E_tpe ∈ R^{T×d_f} from the original Transformer to introduce the position information of the input sequence.
Eventually, by concatenating the embeddings above, we obtain the hidden spatial-temporal representation X_emb ∈ R^{T×N×d}, where the final embedding dimension d = 3d_f.
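The embedding assembly above can be sketched at the shape level as follows (NumPy; random weights stand in for learnable parameters, and the temporal position encoding E_tpe is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

T, N, D, d_f = 12, 5, 1, 8           # time steps, nodes, input dim, embed dim
N_w, N_d = 7, 288                    # days per week, 5-min slots per day

X = rng.standard_normal((T, N, D))   # raw traffic series
W_fc = rng.standard_normal((D, d_f)) # fully connected projection (bias omitted)

# feature embedding: E_f = FC(X)
E_f = X @ W_fc                       # (T, N, d_f)

# learnable periodicity tables (random initialization here, an assumption)
T_w = rng.standard_normal((N_w, d_f))
T_d = rng.standard_normal((N_d, d_f))

W_t = rng.integers(0, N_w, size=T)   # day-of-week index per time step
D_t = rng.integers(0, N_d, size=T)   # time-of-day index per time step

E_w = T_w[W_t]                       # (T, d_f), table lookup by index
E_d = T_d[D_t]                       # (T, d_f)
E_p = np.concatenate([E_w, E_d], axis=-1)               # (T, 2*d_f)
E_p = np.broadcast_to(E_p[:, None, :], (T, N, 2 * d_f)) # share across nodes

X_emb = np.concatenate([E_f, E_p], axis=-1)             # (T, N, 3*d_f)
print(X_emb.shape)  # (12, 5, 24)
```

The lookup-and-broadcast step makes explicit why d = 3d_f: one d_f slice from the feature projection and 2d_f from the weekly and daily tables.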

Spatial-Temporal Block Layer
We present a spatial-temporal block composed of parallel components in Figure 3: a temporal transformer, a spatial transformer, and a dynamic spatial graph convolutional network. Within the spatial-temporal block, two gating mechanisms, the Spatial Fusion Gate (SFG) and the Spatial-Temporal Fusion Gate (STFG), are positioned to fuse spatial-temporal features.

Temporal Transformer
Traffic flow in cities typically exhibits a wide range of variations over extended periods, including daily traffic patterns, weekly periodic fluctuations, and other changes under diverse conditions. To capture these variations effectively, we utilize the temporal transformer, which excels at extracting long-term dependencies within traffic flow data. By leveraging global information, this component adeptly discerns trends and periodic variations, offering a critical advantage for traffic flow prediction. Formally, within the multi-head self-attention mechanism [42], the core operation is scaled dot-product attention. Here, queries, keys, and values equivalently represent time sequences of identical sliding windows; in other words, Q = K = V. A time sequence X^T ∈ R^{T×d} with T timestamps and d dimensions is input into the temporal transformer and projected into high-dimensional subspaces Q^(T) ∈ R^{T×d′}, K^(T) ∈ R^{T×d′}, and V^(T) ∈ R^{T×d′} via linear mappings to learn the complex time dependence. The subspaces are generated by the linear transformations, as follows:

Q^(T) = X^T W_Q^T,  K^(T) = X^T W_K^T,  V^(T) = X^T W_V^T

where W_Q^T, W_K^T, W_V^T ∈ R^{d×d′} are learnable parameters and d′ is the dimension of the query, key, and value matrices.
To capture the temporal dependencies between all time slices of a node, the self-attention operation is applied along the time dimension, as follows:

TSA(Q^(T), K^(T), V^(T)) = softmax( Q^(T) (K^(T))^⊤ / √d′ ) V^(T)

where Q^(T), K^(T), V^(T), and d′ are the queries, keys, values, and their dimension, respectively, and softmax is an activation function. TSA refers to the weights obtained by the scaled dot product. Temporal self-attention has been demonstrated to effectively detect dynamic temporal patterns across various nodes in traffic data. Moreover, it exhibits global adaptability and can capture long-range temporal dependencies spanning all time slices. The output of the temporal self-attention module can be expressed as:

T_att = Concat(head_1, ..., head_h) W^O

where head_i is the output of the i-th attention head, h is the number of attention heads, and W^O is the final output projection matrix.
Furthermore, we employ a position-wise fully connected feedforward network on the output of the temporal multi-head self-attention block to produce the final output. To retain information from the original inputs, we integrate layer normalization and residual connections by combining the output of the temporal attention module with the original inputs.
During the final stage, layer normalization and residual connections are once again applied to the output of the temporal transformer. The residual connection facilitates information flow throughout the entire temporal attention module, thereby ensuring model stability. This process is illustrated below:

T_output = LN(X′ + T_att)
T_trans = LN(T_output + FFN(T_output))

where T_output ∈ R^{T×N×d} is the output of the temporal attention after the residual connection, T_trans ∈ R^{T×N×d} is the final output of the temporal transformer, FFN is the position-wise feed-forward network with learnable parameters W_1^T, W_2^T, and W_3^T, LN is layer normalization, and ReLU is the activation function used within FFN.
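The scaled dot-product operation at the heart of the temporal transformer can be sketched for a single node and a single head (NumPy; the random projection matrices stand in for the learnable W parameters):

```python
import numpy as np

def temporal_self_attention(X_t, W_q, W_k, W_v):
    """Scaled dot-product attention over the time axis for one node.

    X_t: (T, d) time series; W_q/W_k/W_v: (d, d_prime) projections.
    """
    Q, K, V = X_t @ W_q, X_t @ W_k, X_t @ W_v
    d_prime = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_prime)            # (T, T) attention logits
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    return A @ V                                   # (T, d_prime)

rng = np.random.default_rng(1)
T, d, d_prime = 12, 16, 8
X_t = rng.standard_normal((T, d))
out = temporal_self_attention(
    X_t,
    rng.standard_normal((d, d_prime)),
    rng.standard_normal((d, d_prime)),
    rng.standard_normal((d, d_prime)),
)
print(out.shape)  # (12, 8)
```

Every time step attends to every other time step, which is how the module captures long-range dependencies across the whole window.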

Spatial Transformer
Spatial transformers provide diverse representations of node relationships, enabling the model to flexibly learn various aspects of spatial features. This capability is particularly beneficial for handling correlated and heterogeneous node relationships in urban transportation scenarios. Therefore, we utilize a spatial self-attention module as a feature extractor to capture dynamic correlations in traffic time series.
Formally, given an input X′^S ∈ R^{T×N×d} with spatial features, we slice X′^S by node to obtain X^S ∈ R^{N×d}. To introduce information about the topology of the traffic network, we use the adjacency matrix A to generate an initialization matrix W_f. We then add W_f to the spatial input sequence of the query, so that the dynamic feature embedding is integrated into the inputs of the model:

Q^(S) = (X^S + W_f) W_Q^S,  K^(S) = X^S W_K^S,  V^(S) = X^S W_V^S

where W_f ∈ R^{N×d} is a learnable feature embedding of nodes initialized from the adjacency matrix, and W_Q^S, W_K^S, W_V^S ∈ R^{d×d′} are learnable parameters, with d′ the dimension of the query, key, and value matrices. In previous multi-head-attention-based models, all queries, keys, and values are represented as the same sequence, i.e., Q = K = V. However, this does not sufficiently consider the structural characteristics of dynamic graphs; adding feature embeddings in multi-head spatial attention introduces additional node information, which helps improve the modeling of relationships between nodes. As in the temporal transformer, the input data are linearly mapped, projecting the data into a high-dimensional subspace to learn complex spatial dependencies. We then apply the self-attention operation along the spatial dimension to model the spatial dependency between nodes and obtain the attention scores between all nodes as follows:

A_t^(S) = softmax( Q^(S) (K^(S))^⊤ / √d′ )

where A_t^(S) ∈ R^{N×N} captures the spatial relations between different spatial nodes at time step t.
It is evident that the spatial dependency matrix between nodes undergoes dynamic changes across different time segments; the SSA module can effectively capture these dynamic spatial dependencies. Finally, by multiplying the attention scores by the value matrix, we obtain the output of the spatial self-attention (SSA) module for each head:

head_i = A_t^(S) V^(S)

The final output is obtained by concatenating the head outputs and projecting them further. Formally,

S_att = Concat(head_1, ..., head_h) W^O

where head_i is the output of the i-th attention head, h is the number of attention heads, and W^O is the final output projection matrix.
Using the multi-head attention mechanism, the model can learn multiple potential subspaces that capture different spatial dependency patterns, and the feedforward neural network applies the dynamically learned representation of spatial relations between nodes, represented by W_f, to each node. This enables W_f to dynamically affect the model's learning of spatial features between nodes. To ensure model stability, we incorporate residual and layer normalization operations in the output of the model, similar to the temporal transformer. Finally, the output of the spatial transformer is obtained as follows:

S_output = LN(X′^S + S_att)
S_trans = LN(S_output + FFN(S_output))

where S_output, S_trans ∈ R^{T×N×d}, FFN is the position-wise feed-forward network with learnable parameters W_1^s, W_2^s, and W_3^s, LN is layer normalization, and ReLU is the activation function used within FFN.
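The query-side feature embedding can be sketched for a single head and a single time slice (NumPy; the adjacency-based initialization of W_f shown here is a simplified assumption):

```python
import numpy as np

def spatial_self_attention(X_s, W_f, W_q, W_k, W_v):
    """Attention over nodes at one time slice, with a node feature
    embedding W_f added on the query side only.

    X_s: (N, d) node features; W_f: (N, d); projections: (d, d_prime).
    """
    Q = (X_s + W_f) @ W_q      # node topology information enters via queries
    K = X_s @ W_k
    V = X_s @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (N, N) node-to-node logits
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)             # row-wise softmax
    return A @ V, A

rng = np.random.default_rng(2)
N, d, d_prime = 6, 16, 8
adj = (rng.random((N, N)) < 0.4).astype(float)     # toy adjacency matrix
W_f = adj @ rng.standard_normal((N, d)) / N        # adjacency-seeded embedding
out, A = spatial_self_attention(
    rng.standard_normal((N, d)), W_f,
    rng.standard_normal((d, d_prime)),
    rng.standard_normal((d, d_prime)),
    rng.standard_normal((d, d_prime)),
)
print(out.shape, A.shape)  # (6, 8) (6, 6)
```

Because W_f is added only to the queries, keys and values still describe the raw node features, while the attention scores become topology-aware.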

Dynamic Spatial Graph Convolution
The connectivity and global nature of road networks remain vital aspects of transportation infrastructure. To capture spatial dynamics effectively, we employ a dynamic graph convolutional network (DGCN). The GCN derives node features by aggregating information from neighboring nodes [19,43,44], enabling a more thorough exploration of the transportation network's topological structure. Building upon this approach, we integrate a traditional convolution operation, transitioning from structured data processing to graph data analysis, thereby capturing unstructured patterns inherent in graphs. Specifically, the GCN first gathers information surrounding each node to form an intermediate representation, which is then refined using a linear projection and nonlinear activation functions.
The input to the GCN comprises two components: the raw time series input to multi-head attention, and the feature matrix W_f representing the relationships between nodes after conducting spatial multi-head attention across all nodes. When combined, these components yield a matrix

X_G = X′ + W_f ∈ R^{T×N×d}   (17)

where the normalized adjacency Ã ∈ R^{N×N} represents the interplay between nodes, defined as

Ã = D̃^{-1/2} (A + I_N) D̃^{-1/2}

where A is the adjacency matrix of the graph, I_N is the identity matrix, and D̃ is the degree matrix of A + I_N. Traditional graph convolution operations are static, whereas in traffic road networks the relationships between nodes may change over time; a simple application of a static GCN cannot capture these dynamic changes. Hence, we introduce the node feature matrix W_f, which dynamically changes in the multi-head spatial self-attention mechanism. This enables the model to learn a different weight matrix at each time step in the GCN, resulting in changes to the adjacency relationships, so that different spatial relationships are captured and integrated:

Z = ReLU( Ã X_G W_g )

In addition, we enhance the complexity of node representations by stacking two graph convolutional layers. Each GCN layer can be perceived as a mechanism for aggregating and propagating information about nodes and their neighbors. By stacking multiple GCN layers, the model progressively extracts higher-level abstract features and enhances its representation of graph structures. In our case:

X_GCN = Ã ReLU( Ã X_G W_1 ) W_2

With this design, the model gleans information about the spatial relationships between nodes from the data, rather than depending on a predefined static adjacency matrix.
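A minimal sketch of the two stacked graph convolutions (NumPy; Kipf-style symmetric normalization with self-loops, weight shapes are assumptions):

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2}(A + I)D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def two_layer_gcn(X, A_norm, W1, W2):
    """Stacked graph convolutions: each layer aggregates neighbor features,
    so two layers reach two-hop neighborhoods."""
    H = np.maximum(A_norm @ X @ W1, 0.0)   # layer 1 + ReLU
    return A_norm @ H @ W2                 # layer 2

rng = np.random.default_rng(3)
N, d = 6, 16
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.maximum(A, A.T)                     # make the toy graph undirected
X = rng.standard_normal((N, d))            # stand-in for one time slice of X_G
out = two_layer_gcn(X, normalized_adjacency(A),
                    rng.standard_normal((d, d)),
                    rng.standard_normal((d, d)))
print(out.shape)  # (6, 16)
```

In the full model this runs per time step on X_G, so the W_f component inside X_G changes the aggregation from one step to the next even though the adjacency itself is fixed.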

Gate Mechanism for Feature Fusion
As shown in Figure 2, we employ two gating mechanisms: one merges the local spatial features derived from spatial graph convolution with the global spatial features acquired through multi-head attention; the other combines the fused spatial features with the temporal features learned via temporal multi-head attention.

Spatial Gate Mechanism
To comprehensively integrate the diverse spatial features learned by the model through spatial graph convolution and the multi-head attention mechanism, and to dynamically allocate weight shares between the DGCN and the multi-head attention for various scenarios, we employ a classical gating mechanism.
The outputs of the DGCN and of the multi-head attention mechanism, X_GCN^S and X_att^S ∈ R^{T×N×d}, are each passed through a fused MLP unit, and a gate weight is obtained as follows:

P_s = MLP(X_GCN^S),  R_s = MLP(X_att^S)
y = sigmoid(P_s + R_s)

The output Y′_s ∈ R^{N×d} is obtained by weighting P_s and R_s with the gate y:

Y′_s = y ⊙ P_s + (1 - y) ⊙ R_s

where the final output Y_s ∈ R^{T×N×d} is the collection of Y′_s over all T time steps, fusing the dynamic spatial graph convolution and the spatial transformer.
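A minimal sketch of such a gated fusion (NumPy; the single-matrix maps `W_p`/`W_r` stand in for the fused MLP units, an assumption, and the gate here weights the two branch outputs directly):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_fusion_gate(X_gcn, X_att, W_p, W_r):
    """Gate two spatial feature maps of the same shape.

    y in (0, 1) weights the local (GCN) branch; 1 - y weights the
    global (attention) branch, so the mix adapts per element.
    """
    y = sigmoid(X_gcn @ W_p + X_att @ W_r)   # element-wise gate
    return y * X_gcn + (1.0 - y) * X_att

rng = np.random.default_rng(4)
T, N, d = 12, 6, 16
X_gcn = rng.standard_normal((T, N, d))       # local spatial features
X_att = rng.standard_normal((T, N, d))       # global spatial features
Y_s = spatial_fusion_gate(X_gcn, X_att,
                          rng.standard_normal((d, d)) * 0.1,
                          rng.standard_normal((d, d)) * 0.1)
print(Y_s.shape)  # (12, 6, 16)
```

Because the gate is computed from both branches, the weight assigned to the DGCN versus the attention path can differ per time step, node, and channel.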

Spatial-Temporal Bilinear Gate Mechanism
To comprehensively capture both spatial and temporal dependencies in traffic flow prediction, we introduce gating nonlinearity. This design enables the model to more flexibly capture the intricate relationship between spatial-temporal features, which is crucial for addressing the complex spatial-temporal dependencies inherent in dynamic systems like urban traffic. Specifically, the output Y_s of the fused spatial information and the output Y_T of the temporal multi-head attention mechanism are each passed through a neural network unit comprising linear layers and an activation function to generate gating weights.
S_gate = 2 sigmoid(Linear(ReLU(Linear(Y_s))))
T_gate = 2 sigmoid(Linear(ReLU(Linear(Y_T))))

The gating mechanism is constructed using a doubled sigmoid, which expands the range of gating weights to [0, 2] compared with the conventional sigmoid, increasing the sensitivity to the inputs and better accounting for the effect of spatial-temporal information. The gated features are then fused as

X_out = S_gate ⊙ Y_s + T_gate ⊙ Y_T

where ⊙ denotes element-wise multiplication, and the fused spatial-temporal output X_out is used as input to the subsequent spatial-temporal blocks stacked by the model.

Output Layer
After passing through several stacked spatial-temporal blocks, the input undergoes two additional convolution layers to generate the final output. Specifically, the output features first undergo a convolution operation and activation function to extract higher-level features and introduce nonlinearity. A dimension permutation then ensures that the tensor's dimensions match the expectation of the subsequent convolutional layer. A second convolution operation captures further feature information, and a final dimension permutation restores the tensor to its original order. This process is illustrated below:

X̂ = Conv_2( ReLU( Conv_1(Y) ) )

where the prediction results for T′ steps, denoted as X̂ ∈ R^{T′×N×D}, are produced by 1×1 convolutions Conv_1 and Conv_2 (dimension permutations omitted). We opt for a direct rather than a recursive method for multi-step prediction, to avoid cumulative errors while prioritizing model efficiency. This series of operations properly extracts features and integrates the Transformer output to prepare the model output for the final task.
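Since a 1×1 convolution acts as a per-position linear map over channels, the direct multi-step head can be sketched as follows (NumPy; the hidden width of 64 and the permute-reshape scheme are assumptions):

```python
import numpy as np

def conv1x1(X, W):
    """A 1x1 convolution over the channel axis is a per-position linear
    map: (..., C_in) @ (C_in, C_out)."""
    return X @ W

rng = np.random.default_rng(6)
T, N, d = 12, 6, 16            # block output: T steps, N nodes, d channels
T_out, D = 12, 1               # predict T_out future steps of D features
Y = rng.standard_normal((T, N, d))

# permute so time folds into channels, then predict all steps at once
Y_flat = Y.transpose(1, 0, 2).reshape(N, T * d)        # (N, T*d)
H = np.maximum(conv1x1(Y_flat, rng.standard_normal((T * d, 64)) * 0.1), 0.0)
out = conv1x1(H, rng.standard_normal((64, T_out * D)) * 0.1)
X_hat = out.reshape(N, T_out, D).transpose(1, 0, 2)    # (T_out, N, D)
print(X_hat.shape)  # (12, 6, 1)
```

Emitting all T′ steps in one pass is what makes the head "direct": no prediction is fed back in, so errors cannot accumulate across steps.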

Datasets
We conducted comparative experiments on four real-world highway traffic public datasets: PeMS04, PeMS07, PeMS08, and PEMS-BAY [44]. The raw traffic data were aggregated into 5 min intervals and normalized to zero mean. In addition, a spatial adjacency graph was constructed for each dataset based on the actual road network. Table 1 provides further details on the datasets.
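A typical preprocessing pipeline for these datasets can be sketched as below. Function names and the exact normalization convention are our assumptions; the chronological split ratios match those reported later:

```python
import numpy as np

def zscore_normalize(data: np.ndarray):
    """Normalize raw flow readings to zero mean (and unit variance)."""
    mean, std = data.mean(), data.std()
    return (data - mean) / std, mean, std

def split_dataset(data: np.ndarray, ratios=(0.6, 0.2, 0.2)):
    """Chronological train/val/test split, e.g. 6:2:2 for the PeMS sets."""
    n = len(data)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
```

The split is chronological rather than random, as is standard for traffic forecasting: shuffling would leak future observations into the training set.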

Baseline
We compare STFGTN with the following baselines:
• VAR: captures the relationships between multiple time series.
• SVR [6]: support vector regression, which uses linear support vector machines for prediction.
• DCRNN [45]: integrates diffusion graph convolution into a GRU to predict graph sequence data.
• STGCN [43]: uses ChebNet and 2D convolution to capture spatial and temporal correlations, respectively.
• GWNET [15]: combines graph convolution with temporal convolution to capture spatial-temporal correlations.
• STSGCN [44]: captures spatial and temporal correlations simultaneously by constructing spatial-temporal synchronous graphs.
• MTGNN [18]: an adaptive graph learning method that learns spatial correlation from feature initialization.
• STFGNN [16]: learns hidden spatial-temporal dependencies through a novel fusion of multiple spatial and temporal graphs.
• GMAN [28]: learns spatial and temporal correlations and integrates them using self-attention mechanisms.
• TFormer [40]: a Transformer-based model in which stacked encoders and decoders extract deep features.
• STGODE [46]: applies continuous graph neural networks to traffic prediction in multivariate time series forecasting.
• STGNCDE [47]: a spatial-temporal GNN combined with neural controlled differential equations (neural CDEs) for better continuous modeling.
• HDCFormer [48]: an evolved Transformer network based on hybrid dilated convolutions.
• DSTAGNN [30]: uses data-driven dynamic spatial-temporal-aware graphs instead of traditional static graph convolution.
• EGFormer [49]: replaces dynamic decoding with a generative decoding mechanism to reduce time and memory complexity.

Experimental Settings
All experiments were trained and tested on a Windows server (CPU: Intel(R) Core(TM) i7-13700KF; GPU: NVIDIA GeForce RTX 4090) using the PyTorch 1.11.0 framework. We divided the three PeMS datasets into training, validation, and test sets in a 6:2:2 ratio; the PEMS-BAY dataset was divided 7:1:2. We used data from the preceding hour (12 time steps) for multi-step prediction, forecasting the traffic flow for the subsequent hour (12 time steps); that is, both input and prediction lengths were set to 1 h, with T = T′ = 12. The model was trained with the following hyperparameter configuration: both the spatial and temporal transformers used 3 layers, each with 4 attention heads. We trained with the Adam optimizer at a learning rate of 0.001 and a batch size of 64, applying early stopping if the validation error did not improve for 30 consecutive epochs. To evaluate model performance, we used three widely used metrics: mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE).
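The three evaluation metrics can be computed as follows. This is the standard formulation; on the PeMS datasets MAPE is typically computed with near-zero ground-truth points masked out, which we assume here:

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-8) -> float:
    """Mean absolute percentage error, masking near-zero ground truth
    to avoid division by zero (a common convention for traffic flow)."""
    mask = np.abs(y_true) > eps
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)
```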

Experimental Results
Table 2 reports the predictive performance of the different models under the three evaluation metrics on the three datasets. STFGTN achieves the best performance against many different types of baseline models. SVR considers only temporal correlation and neglects spatial correlation, making it the least effective model. Conversely, while VAR accounts for both temporal and spatial correlations, it cannot effectively capture nonlinear and dynamic spatial-temporal relationships, so its predictions are often highly unstable. DCRNN is a typical RNN-based method for traffic flow prediction; however, its recursive structure requires sequential computation at each time step, leading to significantly higher computational cost and notably lower prediction accuracy than our STFGTN, particularly for long-term prediction.
STGCN, Graph WaveNet, and STSGCN are representative CNN-based approaches that employ 1D CNNs or TCNs along the time dimension to capture temporal correlations. Although they effectively mitigate the problem of high computational cost, the one-dimensional kernel of a 1D CNN only slides along the time axis, making it difficult to handle dependencies over extended periods. TCNs employ dilated convolutions to enlarge the receptive field, so the number of layers required grows only logarithmically with the receptive field; nevertheless, capturing sequence dependencies through convolution remains less accurate for long-term prediction than the self-attention mechanism used in our model.
In the temporal dimension, our approach leverages an attention mechanism that dynamically adjusts across time segments, facilitating the effective capture of long-term correlations. Across all three datasets, STFGTN outperforms MTGNN and STGNCDE in terms of MAE, MAPE, and RMSE, underscoring the improvement in traffic flow prediction capability brought by the integration of spatial-temporal features.
To validate the generalization ability of our model across the traffic conditions of different cities, we conducted experiments on traffic speed datasets from varying perspectives. The results, shown in Table 3, demonstrate that our model maintains good predictive performance compared to the baselines. We also evaluated the competitiveness of STFGTN against other Transformer-based models, focusing on their ability to capture spatiotemporal correlations. The results in Table 4 highlight STFGTN's competitive performance, achieved by integrating temporal feature embeddings with both global and local spatial features, and underscore the effectiveness of leveraging both temporal and spatial information for improved prediction accuracy. Unlike models such as GMAN, which employ a traditional gate fusion mechanism, we use a nonlinear gate fusion approach; and unlike Transformer-based frameworks such as TFormer and HDCFormer, our approach ensures more comprehensive spatiotemporal feature extraction with simpler computations. Figure 4 compares our model with different types of methods over 12 time steps on the PEMS04 and PEMS08 datasets. Generally, as the prediction interval increases the task becomes more challenging, and performance degrades across all models. However, STFGTN exhibits the smallest performance decrease compared with both GNN-based and attention-based models, highlighting its superiority in long-term prediction and demonstrating the effectiveness of the improvements in our Transformer-based model.

Ablation Study
In order to further assess the validity of each component of STFGTN, we conducted an ablation study of four variants of our model on the PEMS04 and PEMS08 datasets:
• w/o s_gate: removes the spatial fusion gating and simply concatenates the GCN output with the output of spatial attention.
• w/o st_fusion_gate: removes the spatial-temporal fusion gating mechanism and concatenates the fused spatial features with the output of temporal attention.
• w/o Ttrans: removes the time transformer.
• w/o Strans: removes the spatial transformer.
Figure 5 illustrates a comparison of these variants, from which we draw the following conclusions: the integration of these components enables improved capture of spatial-temporal interaction information, validating the effectiveness of the overall framework. The significant degradation in performance upon removing the temporal or spatial attention mechanism indicates their crucial roles in capturing long-range temporal dependencies and global spatial dependencies among different roads, respectively. The spatial fusion gating mechanism is important for adjusting spatial information, and its absence degrades model performance. By introducing the gating nonlinearity, the model captures spatial and temporal dependencies more flexibly, significantly improving the modeling effect. Through a comparative analysis in which the dynamic GCN is replaced with a traditional GCN, shown in Table 5, we can evaluate the role and effectiveness of the dynamic GCN in our model more clearly. Traditional GCNs use a static graph to fix the relationships between nodes, whereas our dynamic GCN initializes them from the road matrix and then, through attention mechanisms, dynamically adjusts its weight matrices, focusing attention on the nodes and features most relevant to the prediction task. Additionally, by stacking multiple GCN layers, the model gradually extracts higher-level feature representations, improving its representational capability and prediction performance.
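The contrast with a static GCN can be sketched as follows: the road-network adjacency seeds the graph, and attention scores re-weight node correlations per input. This is a simplified illustration of the idea, not the authors' exact DGCN, and all names are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGCNLayer(nn.Module):
    """A GCN layer whose aggregation weights are recomputed from the input
    via attention, instead of being fixed by a static normalized adjacency."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.theta = nn.Linear(d_in, d_out)   # feature transform
        self.query = nn.Linear(d_in, d_out)
        self.key = nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_in); adj: (N, N) static road adjacency with self-loops.
        q, k = self.query(x), self.key(x)
        # Attention coefficients between nodes, masked by the road graph so
        # only physically connected nodes exchange information.
        scores = torch.bmm(q, k.transpose(1, 2)) / (q.shape[-1] ** 0.5)
        scores = scores.masked_fill(adj == 0, float("-inf"))
        dyn_adj = torch.softmax(scores, dim=-1)   # dynamically adjusted weights
        # Aggregate features from the most correlated neighbors.
        return F.relu(torch.bmm(dyn_adj, self.theta(x)))
```

Stacking several such layers lets later layers aggregate information from multi-hop neighborhoods, which is the "higher-level feature representation" effect described above.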

Visualization
To demonstrate the prediction performance of our model, we conducted predictions on the test sets of sensor 100 from the PeMS04 dataset and sensor 80 from the PeMS08 dataset for both daily and weekly traffic flows. Figure 6 shows that our predicted sequences closely align with the actual traffic flow on the fitting curves for both the PEMS04 and PEMS08 datasets, maintaining consistency despite variations in the field of view. This indicates that our model produces accurate forecasts by comprehensively considering the traffic flow characteristics of the real road network.
Figures 7 and 8 display the absolute error of STFGTN for the 15 min, 30 min, 45 min, and 60 min prediction tasks on PEMS04 and PEMS08.
The model demonstrates proficiency in both short-term and long-term forecasting, effectively capturing temporal trends within the traffic flow data. Nonetheless, its predictive accuracy diminishes as the prediction horizon extends, owing to the heightened complexity and variability of actual traffic conditions.

Effect of Hyperparameters
In our study, we explored the impact of variations in hyperparameters, including the number of attention heads, the number of layers in the spatiotemporal module, and changes in dimensionality.
In Figure 9, we observe a clear trend in traffic flow prediction performance as the model dimensionality increases: performance first improves and then degrades. Optimal performance is achieved at a dimensionality of 64, where the model is able to capture intricate traffic features while mitigating the risk of overfitting. Increasing the dimensionality further exacerbates overfitting and computational complexity, harming overall performance and generalization. Similarly, for the number of layers of the spatial-temporal module, the model achieves optimal performance with three layers; at four layers, performance begins to decline. This suggests that adding layers to the spatial-temporal module does not yield additional performance gains, but instead introduces excessive complexity and reduces computational efficiency, making the model too deep to train or generalize well when handling spatial-temporal relationships.

In Tables 6 and 7, we examine the effects of the number of attention heads and the number of spatial-temporal module layers on the performance of STFGTN (h, l), where h is the number of attention heads and l is the number of ST-block layers. The (*) indicates the parameter settings at which our model achieved optimal performance. The results show a gradual increase in performance as the number of attention heads first increases; however, when the number of heads reaches eight, performance drops markedly. This suggests that adding attention heads does not by itself improve the accuracy of traffic flow prediction: an excessive number of heads introduces redundant information, potentially leading to overfitting or underutilization of the attention mechanism.

Conclusions
In this study, we introduced a traffic flow prediction model that integrates spatial-temporal features using attention mechanisms. In this model, an embedding layer incorporates periodic time features; a spatial fusion gating module integrates spatial dependencies; and a spatial-temporal bilinear gating module combines spatial-temporal dependencies. We further improved the original GCN by combining it with an attention mechanism to learn different spatial dependency patterns. Experiments on four real-world datasets, including ablation and parameter studies of the individual modules, demonstrate the superiority of our model. In future work, we plan to explore alternative attention mechanisms and their broader application to traffic flow prediction, focusing in particular on the impact of different types of attention on model performance and generalization, with the aim of further enhancing the adaptability and prediction accuracy of the model in new environments.

Figure 1.
Figure 1. (a) A real-time road condition map from the highway traffic detection system in the Los Angeles area. The system installs detectors across the road network to collect real-time traffic flow data, including metrics such as vehicle speed, vehicle density, and traffic volume, along with their variations over time and spatial location. (b) A scenario of traffic flow correlations and heterogeneity in the road network: the traffic condition at one node influences other nodes over time and space.

Figure 2.
Figure 2. Framework of the STFGTN, comprising a data embedding layer, several stacked spatial-temporal modules, and an output layer.

Figure 3.
Figure 3. The structure of the spatial-temporal block. (a) The spatial-temporal block. (b) The two gating modules used for fusing spatial-temporal features, referred to as the Spatial Fusion Gate and the Spatial-Temporal Fusion Gate, respectively.

Figure 4.
Figure 4. Performance of the different models at different prediction time steps on the two datasets. (a-c) Performance differences between STFGTN and the baseline models across time steps on the PEMS04 dataset; (d-f) the same comparison on the PEMS08 dataset.

Figure 5.
Figure 5. Ablation study on PEMS04 and PEMS08. (a-c) Performance variation of each key component across the evaluation metrics on the PEMS08 dataset; (d-f) the same on the PEMS04 dataset.

Figure 6.
Figure 6. Visualization of traffic flow. (a) The fitting curve of sensor 100 in the PEMS04 dataset. (b) The fitting curve of sensor 80 in the PEMS08 dataset.

Figure 7.
Figure 7. Heatmap of the absolute errors between true and predicted values for different prediction horizons on PEMS04.


Figure 8.
Figure 8. Heatmap of the absolute errors between true and predicted values for different prediction horizons on PEMS08.

Figure 9.
Figure 9. Performance of different model dimensions: (a) performance on the PEMS04 dataset; (b) performance on the PEMS08 dataset.

• A novel dynamic graph convolutional network (DGCN) is employed to capture evolving spatial dependencies among traffic flow data nodes, complemented by an attention mechanism. This network adeptly mines spatial data correlations by dynamically adjusting node correlation coefficients and aggregating information from highly correlated nodes.
• Two gating mechanisms are incorporated to integrate the components of our model. First, we fuse local spatial features from the DGCN with global spatial features from spatial multi-head attention. Second, we introduce a gating nonlinearity to fuse the previously integrated spatial features with temporal features obtained through temporal multi-head attention.

Table 1.
Summary of datasets.

Table 4.
Performance of the transformer-based model on the PEMS04 and PEMS08 datasets.

Table 5.
Performance of traditional GCN and DGCN on the PEMS04 and PEMS08 datasets.

Table 6.
MAE, MAPE (%), and RMSE obtained by STFGTN (h, l) with varying numbers of attention heads and ST-block layers on the PEMS04 dataset.

Table 7.
MAE, MAPE (%), and RMSE obtained by STFGTN (h, l) with varying numbers of attention heads and ST-block layers on the PEMS08 dataset.