1. Introduction
Traffic flow prediction is one of the core technologies of Intelligent Transportation Systems (ITSs). By analyzing historical data and utilizing prediction models, it helps traffic managers forecast future traffic volume and road conditions [
1]. With accurate traffic predictions, traffic management departments can develop strategies for road condition optimization, plan and navigate in advance, thereby improving traffic operational efficiency, ensuring travel safety, and optimizing urban transportation networks [
2]. However, due to the complex temporal and spatial dependencies of traffic flow, achieving accurate predictions remains challenging.
Traffic flow data, typically collected from road sensors as time-series data, has traditionally been analyzed using linear methods such as Autoregressive Integrated Moving Average (ARIMA) [
3], Vector Autoregression (VAR) [
4], and Support Vector Regression (SVR) [
5]. However, these methods struggle to effectively capture the nonlinear characteristics of traffic patterns and the spatial attributes of the transportation network, limiting their application in modern traffic flow prediction. With advancements in deep learning techniques and improvements in hardware capabilities, an increasing number of neural network-based models have been employed to more effectively capture the dynamic temporal and spatial dependencies in traffic flow prediction. In the temporal dimension, Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) [
6] and Gated Recurrent Units (GRU) [
7], are employed to capture the dynamic temporal features of traffic flow. In the spatial dimension, some researchers utilize Convolutional Neural Networks (CNNs) to partition the road network into distinct grids to extract spatial correlations. Researchers have combined these models to extract spatial–temporal features [
8]. However, this approach is only applicable to grid-like Euclidean spatial graphs, while traffic road networks are typically non-Euclidean. Subsequently, Graph Neural Networks (GNNs) have been shown to be more suitable for modeling non-Euclidean traffic road networks [
9]. For instance, Zhao et al. proposed the T-GCN model [
10], which combines Graph Convolutional Networks (GCNs) and GRU to simultaneously capture both spatial and temporal dependencies in traffic data, achieving significant results.
Despite the significant achievements in exploring the spatial–temporal features of traffic networks, several challenges still remain to be addressed. Firstly, the spatial–temporal features of traffic flow are highly complex and intertwined. In the input embedding stage, previous studies typically input spatial and temporal data into separate spatial and temporal modules for feature extraction, and then fuse the extracted results in subsequent stages. However, this approach fails to effectively integrate spatial–temporal features, leading to insufficient capture and utilization of deeper information correlations. Additionally, existing multidimensional embedding methods have notable limitations, particularly the lack of effective fusion strategies. These methods typically treat data from different dimensions equally, without considering the relative importance and interdependencies of each dimension, thereby limiting the model’s performance to some extent. Moreover, most early models primarily rely on a single data source (such as historical traffic flow data) for predictions, neglecting other factors that significantly influence traffic flow. For example, an increase in traffic occupancy during peak hours may lead to congestion, significantly affecting the distribution of traffic flow within the region. Ignoring these critical influencing factors may prevent the model from fully addressing the complexity of real-world scenarios, thereby reducing its reliability and practical applicability.
Secondly, traffic networks exhibit significant spatial heterogeneity. Nodes that are spatially adjacent (e.g., different locations on the same road segment) naturally tend to share similar traffic flow and variation trends. However, as shown in
Figure 1b, even when two nodes are far apart and not on the same road segment, they may still exhibit similar traffic patterns, often driven by shared traffic characteristics or demand patterns. This indicates that spatial correlations in traffic networks are not solely dependent on physical distance. However, numerous studies have demonstrated that GNNs tend to suffer from over-smoothing during node feature aggregation [
11], which undermines the model’s ability to capture long-range dependencies and limits its performance in modeling spatial heterogeneity.
Finally, traffic flow exhibits significant temporal heterogeneity. For example,
Figure 1c shows that the traffic flow correlation between Location A and Location B is relatively low during the morning peak period, reflecting a disparity in their traffic patterns [
12]. However, during the evening peak period, their traffic patterns tend to align. Conversely, as shown in
Figure 1d, the traffic flow correlation between Node B and Node C is stronger during the morning peak, but weaker at other times. Furthermore,
Figure 1a shows that even at the same location, traffic patterns on weekdays may differ significantly from those on non-working days. Weekday traffic flows exhibit clear periodicity, whereas non-working days lack such regularity, reflecting differences in travel behaviors on different types of days. Notably, given the significant temporal heterogeneity in traffic flow across different periods, a single temporal processing method may be insufficient to comprehensively capture its complex dynamic variations.
To address the aforementioned complex traffic flow scenarios, this paper proposes a model based on the Transformer encoder architecture—MMHFormer. To provide enriched input representations for subsequent spatial–temporal encoding, the input embedding layer utilizes gated convolution to integrate traffic features, temporal characteristics, spatial attributes, and external environmental factors. To capture spatial dependencies among traffic nodes from multiple perspectives, we introduce a hierarchical multi-view spatial attention module that explicitly models both global and local spatial dependencies. Additionally, to address cross-node temporal pattern variations, we propose a hierarchical two-stage temporal attention module. In the first stage, a temporal attention mechanism captures global temporal dependencies. In the second stage, convolution is performed at the node level as a secondary query, followed by an additional round of temporal attention to emphasize traffic pattern differences across nodes. The main contributions of MMHFormer are summarized as follows:
This paper proposes a novel MMHFormer model, which effectively addresses challenges in multi-source information fusion, multi-view interaction, and dynamic temporal dependency modeling.
MMHFormer is designed as a unified framework that integrates multi-source feature fusion, multi-view spatial reasoning, and hierarchical temporal modeling. A multi-source gated embedding layer fuses spatial Laplace, time-period, and traffic occupancy embeddings to represent complex traffic conditions. On this basis, a hierarchical multi-view spatial attention mechanism captures global, geospatial, and dynamic-similarity dependencies, while a two-stage temporal attention module learns both global temporal correlations and node-specific dynamics, enabling adaptive modeling of evolving traffic patterns across time and space.
Experimental results on four real-world traffic datasets show that MMHFormer outperforms current state-of-the-art methods, validating the model’s effectiveness and generalization capability. Furthermore, we conducted a single-step evaluation of the hourly forecast, demonstrating the model’s effectiveness in long-term prediction.
4. Methodology
Figure 2 illustrates the framework of MMHFormer, which primarily consists of three modules: the multi-source gated embedding layer, the spatial–temporal encoder layer, and the output layer.
To effectively capture the spatial–temporal features of traffic flow while accounting for road conditions, the multi-source gated embedding layer integrates multidimensional data inputs, including raw traffic flow, spatial Laplacian embeddings, temporal periodic embeddings, and traffic occupancy embeddings, using gated convolutions.
The spatial–temporal encoder layer employs a hierarchical multi-view spatial attention module. A global spatial attention mechanism captures global spatial dependencies between nodes. To enhance sensitivity to critical local information, a geospatial attention mechanism focuses on the local dynamic features of neighboring nodes while discarding connections to distant nodes. Additionally, dynamic similarity spatial attention is incorporated, allowing the model to ignore distance and extract long-range spatial features by capturing the dynamic similarity of traffic patterns between nodes.
To adapt to the varying traffic patterns across different nodes, the hierarchical two-stage temporal attention module first captures global temporal dependencies, then refines the query focus and re-applies temporal attention to emphasize traffic pattern differences between nodes.
The output layer employs skip connections and convolutional layers to transform the outputs into the final dimensions required for prediction. In this section, we provide a detailed description of the MMHFormer architecture.
4.1. Multi-Source Gated Embedding Layer
At each time interval, traffic flow in different urban areas is influenced by various factors, including traffic flow in neighboring areas and external environmental conditions. For instance, congestion in one area can lead to a sudden decrease in traffic flow across the city. Similarly, during holidays, urban traffic flow often increases significantly compared to weekdays. Based on this observation, we propose a multi-source gated embedding layer that integrates multidimensional data to capture the spatial–temporal dependencies and the impact of road conditions on traffic flow.
Specifically, the raw traffic flow input X is transformed via a linear layer and combined with three types of embeddings: spatial Laplacian embeddings to encode the spatial structural features of the road network, temporal periodic embeddings to capture the periodic variations in traffic flow, and traffic occupancy embeddings to reflect external environmental impacts. Finally, a gated convolutional network achieves efficient fusion of the multi-source information, effectively integrating the spatial, temporal, and external influences on traffic flow.
Spatial–temporal embedding: The Laplacian matrix is used to learn the correlations between nodes in the road network, embedding the graph into Euclidean space to obtain the spatial embedding representation. Considering the periodicity of urban traffic flow, we introduce weekly and daily periodic embeddings [30], obtained by converting the time step t into week and minute indices. These temporal embeddings are added to the spatial embedding to obtain the final spatial–temporal representation.
Traffic occupancy embedding: Variations in traffic flow and occupancy across different nodes may reflect the characteristics of the roads in those regions (e.g., major or minor roads), providing the model with the ability to recognize differences between nodes. To achieve this, we introduce the traffic occupancy embedding mechanism. Initially, raw traffic occupancy data is processed to extract key features, and a linear layer is applied to generate the traffic occupancy embedding representation.
Information fusion: The traffic flow input representation, the spatial–temporal embedding, and the traffic occupancy embedding are concatenated along the hidden dimension to produce the fused representation. The gated convolution mechanism then dynamically adjusts feature weights according to the input data, enabling a more comprehensive fusion: a convolution combined with a temporal positional encoding produces the gate, the Sigmoid Linear Unit (SiLU) serves as the gating activation, and ⊙ denotes the Hadamard product.
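For concreteness, the concatenate-then-gate fusion described above can be sketched in NumPy. This is an illustrative simplification, not the released implementation: the function and parameter names (`gated_fusion`, `W_feat`, `W_gate`) are hypothetical, and the paper's 1 × 1 convolutions are represented as shared per-node linear maps, which they are mathematically equivalent to.

```python
import numpy as np

def silu(x):
    # Sigmoid Linear Unit: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gated_fusion(x_flow, e_st, e_occ, W_feat, W_gate):
    """Concatenate the three sources along the feature axis, then gate.

    x_flow, e_st, e_occ: (N, d) traffic-flow, spatial-temporal, and
    occupancy embeddings for N nodes.
    W_feat, W_gate: (3d, d) projections standing in for the paper's
    1x1 convolutions (a 1x1 conv is a shared linear map per node).
    """
    h = np.concatenate([x_flow, e_st, e_occ], axis=-1)   # (N, 3d)
    feat = h @ W_feat                                    # candidate features
    gate = silu(h @ W_gate)                              # gating signal
    return feat * gate                                   # Hadamard product

rng = np.random.default_rng(0)
N, d = 4, 8
out = gated_fusion(rng.normal(size=(N, d)), rng.normal(size=(N, d)),
                   rng.normal(size=(N, d)),
                   rng.normal(size=(3 * d, d)), rng.normal(size=(3 * d, d)))
print(out.shape)  # (4, 8)
```

The gate rescales each fused feature element-wise, so dimensions that are uninformative for the current input are suppressed rather than averaged in.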
4.2. Hierarchical Multi-View Spatial Attention Module
As illustrated in
Figure 3, global spatial attention is first introduced to capture the dependencies between each traffic node and all other nodes. For the input feature representation, the query, key, and value matrices Q, K, and V are generated at each time step t through convolutional operations with learnable parameter matrices W_Q, W_K, and W_V, where d_k is the dimensionality of the queries, keys, and values. The attention score matrix is computed as the scaled dot-product A = QK^T / √d_k. The softmax function is applied to A, generating attention weights for each node relative to all other nodes, and the global spatial representation is then obtained by applying these weights to V.
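The global spatial attention step can be sketched as follows, assuming standard scaled dot-product attention over the node axis at a single time step (the projections `Wq`, `Wk`, `Wv` stand in for the paper's convolutional Q/K/V generation):

```python
import numpy as np

def global_spatial_attention(x, Wq, Wk, Wv):
    """Scaled dot-product attention over the node axis at one time step.

    x: (N, d) node features; Wq/Wk/Wv: (d, dk) learnable projections.
    Returns the attended features and the (N, N) attention weights.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)                  # (N, N) attention scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights
```

Each row of `weights` sums to one, so every node's new representation is a convex combination of all node values, which is what gives the mechanism its global receptive field.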
In practical traffic scenarios, the dynamic features of a target node are often significantly influenced by its neighboring nodes, particularly during abrupt changes. To better capture the local dynamic characteristics of nodes, geospatial attention is introduced. This mechanism focuses on information from neighboring nodes while discarding connections with distant nodes. Based on the global spatial attention output, a new query is generated at each time step t using a convolution, and a secondary attention weight is computed against the original keys. To ensure that the attention mechanism focuses only on nearby nodes, an undirected geospatial mask matrix is defined: only node pairs whose distance falls below a threshold are retained when extracting important features. Applying the masked attention weights to the values yields the geospatial representation.
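A minimal sketch of the distance-threshold mask and its use in a masked softmax follows; the threshold value and helper names are illustrative, since the paper does not fix them here:

```python
import numpy as np

def geospatial_mask(dist, threshold):
    """Undirected 0/1 mask keeping only node pairs closer than threshold.

    dist: (N, N) pairwise distances between sensors.
    """
    mask = (dist < threshold).astype(float)
    return np.maximum(mask, mask.T)  # enforce symmetry (undirected)

def masked_softmax(scores, mask):
    """Row-wise softmax with masked-out entries forced to zero weight."""
    scores = np.where(mask > 0, scores, -1e9)       # suppress distant pairs
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores) * (mask > 0)
    return w / np.clip(w.sum(axis=-1, keepdims=True), 1e-12, None)
```

Masking before the softmax (rather than zeroing weights afterwards) keeps each row a proper probability distribution over the surviving neighbors.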
Although some traffic nodes are not geographically adjacent, they may exhibit similar traffic patterns driven by shared traffic characteristics or demand patterns. Dynamic similarity spatial attention is introduced to dynamically adjust attention weights based on the similarity of traffic patterns between nodes, identifying potential correlations among non-adjacent nodes. We use a sliding 12-step time window to analyze traffic data and construct a dynamic similarity mask matrix using the fast Fourier transform (FFT). Specifically, we apply FFT to the past 12 steps of each node to extract short-term periodicity and local oscillation patterns, and then compute the Euclidean distances between the resulting frequency-domain features. This matrix encodes the relationships between each node and its K-most similar nodes, with a weight of 1 assigned to the most similar nodes and 0 to others.
The choice of a 12-step window is driven by the forecasting task, which involves predicting the next 12 steps based on the previous 12 steps. A window of 12 steps corresponds to one hour of traffic data, offering a sufficient segment for capturing stable frequency components, while also being short enough to account for transient variations and peak-hour fluctuations.
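The FFT-based similarity mask described above can be sketched as follows. This is a plausible reading under stated assumptions: the exact frequency features and distance handling in the paper are not shown, so `dynamic_similarity_mask` uses magnitude spectra from a real FFT and Euclidean distance between them, marking each node's K most similar peers with 1.

```python
import numpy as np

def dynamic_similarity_mask(window, k):
    """Top-K similarity mask from FFT features of a 12-step window.

    window: (N, 12) recent traffic readings per node.  Each window is
    mapped to frequency-domain magnitudes; nodes with the smallest
    Euclidean distance in that space are marked 1, all others 0.
    """
    feats = np.abs(np.fft.rfft(window, axis=-1))      # (N, 7) magnitudes
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))               # (N, N) distances
    np.fill_diagonal(dist, np.inf)                    # exclude self-matches
    mask = np.zeros_like(dist)
    nearest = np.argsort(dist, axis=-1)[:, :k]        # K most similar nodes
    np.put_along_axis(mask, nearest, 1.0, axis=-1)
    return mask
```

Because the mask is recomputed from each sliding window, two distant nodes are linked only while their recent traffic rhythms actually coincide.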
Based on the preceding representation, a new query is generated at each time step t using a convolution. Under the dynamic similarity mask, the dynamic similarity attention weight is calculated against the keys, and applying it to the values gives the dynamic similarity representation.
Finally, the model integrates the global spatial, geospatial, and dynamic similarity attention mechanisms to model multi-perspective spatial dependencies: the three attention outputs are combined through learnable weights to form the aggregated spatial representation. The feed-forward network (FFN) performs nonlinear transformations with two linear layers and a GELU activation function, while LN denotes the layer normalization operation.
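One plausible reading of this aggregation, sketched below, is a softmax-normalized learnable weighting of the three views followed by the FFN and layer normalization; the normalization choice and all names (`aggregate_views`, `alpha`) are assumptions, not the paper's stated formula:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def aggregate_views(h_glob, h_geo, h_sim, alpha, W1, W2):
    """Weighted fusion of the three spatial views, then FFN + LN.

    h_*: (N, d) outputs of the three attention branches.
    alpha: 3 learnable scalars, softmax-normalized so the views
    compete for influence.  W1: (d, d_ff), W2: (d_ff, d).
    """
    w = np.exp(alpha) / np.exp(alpha).sum()
    h = w[0] * h_glob + w[1] * h_geo + w[2] * h_sim
    ffn = gelu(h @ W1) @ W2           # two linear layers with GELU
    return layer_norm(h + ffn)        # residual connection + layer norm
```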
4.3. Hierarchical Two-Stage Temporal Attention
4.3.1. Global Temporal Attention
In the first stage, to capture global dynamic temporal patterns, we first project the input tensor into temporal query, key, and value matrices using convolutions with learnable parameters. Next, we compute the scaled dot-product between the queries and keys at each node to obtain the temporal attention scores. The output of the first-stage temporal attention module is then calculated by applying the attention scores to the values.
4.3.2. Spatially-Informed Temporal Attention
Time-series traffic data not only exhibit long-term dependencies but also show complex cross-node variations, owing to the heterogeneity and dynamics of each node's temporal characteristics. A single temporal attention mechanism struggles to describe this heterogeneous evolution. To this end, as shown in
Figure 4, we propose the spatially-informed temporal attention module, which combines attention with adaptive graph convolution to model dynamic interactions between nodes. Specifically, the adaptive graph convolution learns to adjust the adjacency matrix dynamically, fusing spatial topology information with the dynamic interactions between nodes and flexibly modeling their interdependence. When cross-time variation between nodes is large, it can effectively distinguish and reinforce the temporal patterns at critical moments. This mechanism enables the model not only to capture the dynamic evolution in time but also to identify temporal pattern differences between nodes, further enhancing its sensitivity to cross-node changes.
In the second stage, based on the output of the first stage, we capture temporal patterns that span multiple nodes by incorporating adaptive graph learning to infer latent inter-node dependencies. These learned dependencies are then used to compute the query matrix through a graph convolution step.
Here, the graph convolution is parameterized by a learnable projection matrix together with learnable node embeddings that define the adaptive adjacency. In this stage, the first-stage key and output serve as the key K and value V, respectively, to compute the secondary temporal attention weights. The normalized weight matrix is applied to V, and the result passes through a linear layer and an FFN to generate the final temporal context, completing the two-stage aggregation of temporal features.
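The adaptive-graph query computation can be sketched as follows. The adjacency construction `softmax(relu(E @ E.T))` is a common adaptive-graph formulation (as in Graph WaveNet/AGCRN) assumed here for illustration, since the paper's exact formula is not reproduced in the text; all names are hypothetical.

```python
import numpy as np

def adaptive_query(h, E, Wp):
    """Second-stage query via one adaptive graph convolution step.

    h: (N, d) first-stage temporal attention output.
    E: (N, e) learnable node embeddings defining the latent graph.
    Wp: (d, d) learnable projection.
    """
    scores = np.maximum(E @ E.T, 0.0)            # relu similarity
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)           # row-normalized adjacency
    return A @ h @ Wp                            # propagate, then project
```

Because `A` is learned rather than read from the road map, the query for each node can draw on whichever nodes the training data reveal to be predictive, not just physical neighbors.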
4.4. Output Layer
To resize the spatial–temporal encoder output, we apply 1 × 1 convolutional skip connections, summing each layer's output to obtain an aggregated representation. For multi-step prediction, a final projection adjusts this representation to the prediction window dimension, reducing error accumulation and enhancing model effectiveness.
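The output layer reduces to a sum of projected per-layer outputs followed by a horizon-sized projection, sketched below. As a simplification, a single shared skip projection is used, whereas the paper may use a distinct 1 × 1 convolution per layer; the names are illustrative.

```python
import numpy as np

def output_layer(layer_outputs, W_skip, W_out):
    """Sum per-layer projected outputs, then map to the horizon.

    layer_outputs: list of (N, d) encoder-layer outputs.
    W_skip: (d, d) shared skip projection (stand-in for 1x1 convs).
    W_out: (d, horizon) projection to the prediction window.
    """
    agg = sum(h @ W_skip for h in layer_outputs)  # skip-connection sum
    return agg @ W_out                            # (N, horizon) predictions
```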
5. Experiments
5.1. Datasets
The datasets used for the experiments in this paper include PeMS03, PeMS04, PeMS07, and PeMS08, which are all derived from the Caltrans Performance Measurement System (PeMS), collected by the California Department of Transportation (Caltrans) through its traffic monitoring network. The data cover highway traffic conditions across various regions, collected every 5 min by real-time sensors deployed on the roadways, providing multidimensional features such as traffic volume, speed, and roadway occupancy for timely monitoring of traffic flow. These multidimensional features serve as inputs to the traffic flow prediction model, which effectively captures the traffic patterns and dynamic changes in each region, revealing regional traffic differences. Due to its large size and high sampling frequency, the PeMS dataset has become a cornerstone of ITS research, providing a valuable testbed for data-driven traffic decision-making and system optimization. Specific information about each dataset is detailed in
Table 1.
5.2. Baselines
We compare MMHFormer with the following baseline models: traditional time series forecasting models, graph neural network-based models, and transformer-based models.
VAR [
4]: VAR is based on the assumption that traffic flow follows an autoregressive pattern, meaning that future values can be predicted using past data in the series.
SVR [
5]: SVR leverages the principles of Support Vector Machines (SVMs) to effectively model traffic flow in a nonlinear way, unlike traditional linear regression methods.
DCRNN [
34]: DCRNN models traffic flow as a diffusion process on a directed graph, capturing spatial dependencies through bidirectional random walks. It also models temporal dependencies using a sequence-to-sequence structure combined with scheduled sampling.
GraphWaveNet [
20]: GraphWaveNet automatically generates the graph adjacency matrix by adaptively learning node embeddings, which enables more accurate capturing of hidden spatial dependencies. Additionally, it utilizes stacked dilated causal convolutions to effectively handle long-term temporal dependencies.
AGCRN [
21]: AGCRN introduces the NAPL and DAGG modules, which are designed to capture node-specific patterns and automatically infer interdependencies across different traffic series. This enables AGCRN to effectively capture fine-grained spatial and temporal correlations within traffic series data, enhancing its ability to model complex traffic dynamics.
STGCN [
18]: STGCN is a model that combines graph convolutions to capture spatial dependencies between traffic nodes and 1D convolutions to model temporal dynamics, effectively handling the complex spatio-temporal correlations in traffic flow prediction.
MTGNN [
35]: MTGNN exploits the underlying spatio-temporal dependencies by automatically learning the relationships between variables. The framework includes a graph learning layer, a graph convolution module, and a temporal convolution module, which adaptively learns the graph structure to capture spatial dependencies between variables and integrates multi-frequency temporal patterns to enhance prediction performance.
GMAN [
36]: GMAN adopts an encoder-decoder structure, with both the encoder and decoder comprising multiple spatial–temporal attention blocks to model the complex spatio-temporal correlations of traffic systems. By incorporating a transformer attention mechanism, GMAN aims to reduce error propagation in long-term forecasting.
ASTGCN [
19]: ASTGCN consists of three independent components that model the recent, daily, and weekly dependencies of traffic flow. Each component includes spatial–temporal attention mechanisms and spatial–temporal convolution modules to capture the dynamic spatio-temporal correlations and features of traffic data.
STFGNN [
37]: STFGNN is a novel spatiotemporal fusion graph neural network for traffic flow prediction. Using data-driven “temporal graphs” to complement traditional spatial graphs, it effectively captures hidden spatiotemporal dependencies. STFGNN’s fusion operation processes multiple spatial and temporal graphs in parallel, integrating with a gated convolution module to handle long sequences and capture richer dependencies.
STID [
38]: By adding spatial and temporal identity information, this model addresses the issue of sample indistinguishability in the spatio-temporal dimensions, thereby enhancing the model’s predictive capability.
GDGCN [
39]: GDGCN systematically explores spatial, temporal, and feature dimensions of data by combining parameter-sharing and independent modules. It designs a novel temporal graph convolution block to process the dynamic relationships of historical time slices in graph form. Additionally, a dynamic graph constructor is introduced to model time-specific spatial dependencies and the dynamic interaction relationships between different time slices.
PDFormer [
30]: PDFormer addresses the limitations of current GNN-based models in static modeling, short-range spatial dependencies, and the neglect of propagation delays. PDFormer introduces a dynamic spatial self-attention module to capture dynamic spatial dependencies and employs both geographic and semantic graph mask matrices to simultaneously capture short-range and long-range dependencies.
DDGformer [
33]: DDGformer captures the directional and relative positional relationships in traffic data using a direction- and distance-aware self-attention module. It also uses a dynamically enhanced adaptive graph convolution network to capture dynamic patterns in traffic systems.
STGAFormer [
32]: STGAFormer effectively integrates both local and global dynamic spatio-temporal features and employs a distance-based self-attention module to capture critical features between different regions. The model incorporates multidimensional inputs, including traffic flow attributes, periodicity, proximity adjacency matrices, and adaptive adjacency matrices, to better capture the spatio-temporal characteristics of traffic flow.
5.3. Experimental Settings
In this experiment, data from the past 12 time steps (one hour) are used to predict the traffic flow for the next 12 time steps, based on current mainstream traffic flow prediction methods. The dataset is divided into training, validation, and test sets in the ratio of 6:2:2 to fully evaluate the generalization ability of the model. The experiments were conducted on an NVIDIA RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA). The batch size was set to 16 for PeMS03, PeMS04, and PeMS08, and to 6 for PeMS07, with 200 epochs of training to ensure sufficient model convergence and to avoid overfitting. The model’s spatial–temporal encoder comprises 6 layers (L), with the hidden dimension (d) set to 64. The optimizer uses AdamW with an initial learning rate of 0.001 and combines the weight decay mechanism to effectively prevent overfitting. In addition, overfitting is further prevented during training by employing an early stopping criterion, which halts training once the validation performance stops improving.
5.4. Evaluation Metrics
To comprehensively assess the model performance, we introduced three commonly used evaluation metrics: mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE). To ensure the accuracy of the results, we filtered the missing data.
Here, y_i and ŷ_i denote the actual and predicted traffic flow values at node i for a given time step, respectively, and n represents the total number of nodes.
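The three metrics with missing-value filtering can be computed as below. Treating entries equal to a `null_val` of 0 as missing mirrors a common PeMS convention; the paper's exact filtering rule is an assumption here.

```python
import numpy as np

def masked_metrics(y_true, y_pred, null_val=0.0):
    """MAE, RMSE, and MAPE (in %) with missing readings filtered out."""
    mask = y_true != null_val                 # drop missing ground truth
    err = y_pred[mask] - y_true[mask]
    mae = np.mean(np.abs(err))                # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))         # root mean square error
    mape = np.mean(np.abs(err / y_true[mask])) * 100  # percentage error
    return mae, rmse, mape
```

Note that filtering before computing MAPE also avoids division by zero on missing (zero-valued) readings.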
5.5. Comparison Results
Table 2 presents the performance of our proposed MMHFormer model compared to other baseline models on next-hour predictions across four real-world datasets. The best-performing results are highlighted in bold, while the second-best results are underlined. To validate the superiority of the model, most baseline results were obtained from official documentation and related studies [
28]. Our model achieved the best performance across all metrics on the four datasets: PeMS03, PeMS04, PeMS07, and PeMS08.
To evaluate the long-term prediction capability of the model, we visualized its MAE, RMSE, and MAPE across the next 12 steps, as shown in
Figure 5. Compared to other models, our model demonstrates smaller variations in all metrics throughout the prediction horizon, highlighting its superior stability and robustness. This consistent performance further confirms its suitability for long-term prediction tasks.
DDGformer, PDFormer, and STGAFormer are three representative attention-based models. However, they fuse features across different dimensions through simple addition during data input embedding. In contrast, MMHFormer employs a concatenation approach across dimensions, which not only preserves the complete representation of spatial–temporal features but also incorporates traffic occupancy as an additional data source, enabling a more accurate representation of real-world traffic conditions. Additionally, the introduction of a gating mechanism allows the model to dynamically adjust its focus on each dimension under different scenarios, thereby effectively retaining and utilizing information from each dimension.
Although PDFormer and STGAFormer are designed to address short-range and long-range spatial correlations, their focus is limited to local spatial relationships, failing to adequately capture global spatial features. In contrast, MMHFormer introduces two masking matrices to capture global, local, and long-distance dynamic similarity patterns of nodes simultaneously, enabling multi-view modeling of spatial features. Furthermore, in the temporal dimension, other models use single temporal attention to capture temporal features, which may be insufficient to adapt to significant temporal heterogeneity. In contrast, MMHFormer employs a hierarchical two-stage temporal attention mechanism to flexibly adjust to the temporal features of different nodes, dynamically adapting to varying traffic patterns.
To further demonstrate the effectiveness and practicality of MMHFormer, the computational complexity of the model is also analyzed in this section. For STGAFormer's spatial–temporal encoder, the distance-based spatial self-attention module requires O(lTN²d) operations, the gated temporal self-attention module consumes O(lNT²d), and the position-wise feed-forward network contributes O(lTNd²). Consequently, the overall time complexity of STGAFormer's spatial–temporal encoder amounts to O(l(TN²d + NT²d + TNd²)), where l denotes the number of encoding layers, T represents the number of time steps, N indicates the number of sensors in the traffic network, and d corresponds to the hidden dimension size. In the MMHFormer framework, the hierarchical multi-view spatial attention module has a complexity of O(lTN²d), with a larger constant factor arising from its three attention views; the hierarchical two-stage temporal attention mechanism requires O(lNT²d) operations; and the feed-forward network maintains O(lTNd²) complexity, resulting in a total time complexity of O(l(TN²d + NT²d + TNd²)).
While MMHFormer demonstrates a moderately higher theoretical complexity compared to the best-performing baseline, STGAFormer, primarily due to its multi-view spatial attention design, this increased computational overhead is strategically justified. The hierarchical architecture enables more comprehensive spatial–temporal representation learning by simultaneously capturing global, local, and dynamic similarity patterns. The model remains practically feasible for real-world applications, as the polynomial complexity is manageable for typical urban traffic networks where N and T are constrained by physical infrastructure. The additional computational investment results in significant improvements in prediction accuracy and model interpretability, as shown by the experimental results across multiple benchmark datasets.
5.6. Long-Range Forecasting
To further evaluate the long-horizon forecasting capability of different models, we conduct experiments on the PeMS04 and PeMS08 datasets under an extended prediction setting. Specifically, we increase the prediction horizons to 36, 48, and 60 steps and report the performance at each horizon as well as their average.
When the horizon is extended to 60 steps, models that perform competitively at short horizons, such as DCRNN, GraphWaveNet, and MTGNN, exhibit pronounced degradation in MAE, RMSE, and MAPE. In contrast, attention-based architectures such as GMAN and the Transformer-based PDFormer show greater robustness and consistently outperform the recurrent and convolutional baselines, highlighting the advantage of explicitly modeling global temporal dependencies for long-range traffic forecasting. The detailed numerical results are summarized in
Table 3.
Compared with these baselines, MMHFormer achieves the lowest errors across all long horizons on both PeMS04 and PeMS08. On PeMS04, it reduces the average long-horizon MAE by about 4% compared with PDFormer, while on PeMS08 it yields relative improvements of around 8–9% at the most challenging 60-step horizon. These results demonstrate that the multi-source gated embedding, hierarchical multi-view spatial attention, and two-stage temporal attention enable MMHFormer to effectively capture long-range spatial–temporal dependencies, resulting in consistently superior long-horizon prediction performance.
5.7. Ablation Study
The MMHFormer model comprises three key modules: the multi-source gated embedding layer, the hierarchical multi-view spatial attention module, and the hierarchical two-stage temporal attention module. To validate the effectiveness of each component, we conducted ablation experiments on the PeMS08 dataset. We compared MMHFormer against the following variants:
MMHFormer w/o multi-source embedding: This variant removes all auxiliary spatial, temporal, and traffic occupancy embeddings, using only raw traffic flow data as input.
MMHFormer w/o gated embedding: This variant retains all the embedding sources but removes the gating mechanism, instead fusing the features through simple addition.
MMHFormer w/o hierarchical multi-view spatial attention: The multi-view spatial attention mechanism is replaced by a global spatial attention mechanism for direct spatial feature extraction.
MMHFormer w/o hierarchical two-stage temporal attention: The second stage temporal attention structure is removed, and only the first stage temporal attention mechanism is retained.
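To make the difference between the gated fusion and the simple-addition variant concrete, the following NumPy sketch weights each embedding source by a learned sigmoid gate before summation. The single-layer gate projection, names, and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(embeddings, W_g, b_g):
    """Fuse per-source embeddings with learned gates instead of plain addition.

    embeddings: list of S arrays of shape (N, d), e.g. flow, spatial Laplacian,
    temporal-periodic, and occupancy embeddings; W_g: (S*d, S), b_g: (S,).
    Each source is scaled by its own gate in [0, 1] before summation, so the
    model can dynamically adjust how much each source contributes per node.
    """
    stacked = np.concatenate(embeddings, axis=-1)   # (N, S*d)
    gates = sigmoid(stacked @ W_g + b_g)            # (N, S): one gate per source
    return sum(g[:, None] * e for g, e in zip(gates.T, embeddings))  # (N, d)
```

The "w/o gated embedding" ablation corresponds to replacing the gated sum with `sum(embeddings)`, i.e. all gates fixed to 1.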
Figure 6 illustrates the comparison results of these variants. We can observe that removing the multi-source embedding leads to a significant increase in MAE, RMSE, and MAPE, highlighting the importance of integrating diverse embeddings, such as spatial Laplacian, temporal periodic, and traffic occupancy, in effectively capturing spatial–temporal dependencies. Replacing the gated embedding mechanism with simple feature addition results in a slight performance decline, suggesting that the gating mechanism is crucial for dynamically adjusting feature contributions, thereby improving model accuracy.
The inclusion of the hierarchical multi-view spatial attention module enables MMHFormer to account for global spatial dependencies while also capturing geographic and dynamic similarity-based spatial dependencies. When this module is removed, the RMSE, MAE, and MAPE metrics increase, underscoring the importance of multi-view spatial information for modeling the spatial dependencies of traffic flow.
The hierarchical two-stage temporal attention module also proves highly impactful. Removing the second-stage temporal attention structure increases RMSE, MAE, and MAPE, demonstrating its effectiveness in capturing temporal dependencies within traffic flows. The second stage is particularly advantageous in handling sudden traffic flow fluctuations and in adapting the model to the diverse temporal patterns of different nodes.
5.8. Parameter Sensitivity Analysis
To further investigate the influence of different parameter settings on our proposed model for the traffic forecasting task, we conduct a parameter sensitivity analysis for MMHFormer. Specifically, we explore values for each hyperparameter within predefined search spaces: {4, 5, 6, 7} for the geographic distance threshold, and {6, 7, 8, 9} for the number of nearest neighbors K in the dynamic similarity mask matrix. These hyperparameters are selected using only the training set data, ensuring no information leakage during training or inference. This analysis allows us to evaluate the impact of different configurations on the performance of MMHFormer.
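A leakage-free sweep over these two search spaces can be sketched as a simple grid search, where the scoring function is evaluated only on (a held-out slice of) the training data. The function names and the fake scorer below are our own illustrative assumptions.

```python
import itertools

def grid_search(train_eval, thresholds=(4, 5, 6, 7), ks=(6, 7, 8, 9)):
    """Pick (threshold, K) by the error returned by train_eval.

    train_eval(threshold, k) -> validation error (e.g. MAE) computed on
    training-set data only, so no test information leaks into the selection.
    """
    return min(itertools.product(thresholds, ks),
               key=lambda cfg: train_eval(*cfg))
```

With 4 x 4 = 16 configurations, exhaustive search is cheap relative to a single training run, which is why a full grid rather than random search is reasonable here.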
The results, as shown in Figure 7, reveal the following observations: (1) An intermediate geographic distance threshold best preserves spatial dependencies, enabling the model to capture relevant relationships without unnecessary complexity. Smaller threshold values place too much emphasis on local dependencies, resulting in a loss of broader spatial context, while larger values lead to overfitting and reduce efficiency by including irrelevant or weak connections. (2) Increasing the number of nearest neighbors K based on FFT-calculated similarity improves model performance up to K = 7; beyond this point, further increases provide minimal gains. Specifically, when K is smaller than 7 (e.g., K = 6), the model may fail to capture long-range dependencies adequately, leading to underfitting and less accurate traffic flow prediction. Conversely, when K exceeds 7 (e.g., K = 8 or K = 9), the model becomes overly sensitive to noise, which can lead to overfitting and reduced generalization ability. A value of K = 7 strikes a balance, capturing both local and global patterns effectively while avoiding unnecessary complexity.
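The FFT-based top-K neighbor selection discussed above can be sketched as follows: compare nodes by the cosine similarity of their magnitude spectra and keep each node's K most similar peers in a boolean mask. The spectrum normalization and similarity measure are our assumptions of one plausible realization, not the paper's exact construction.

```python
import numpy as np

def dynamic_similarity_mask(flow, k=7):
    """Build a top-K neighbor mask from FFT-domain similarity of node series.

    flow: (T, N) traffic series. Similarity is the cosine similarity between
    the magnitude spectra of each pair of nodes; each node keeps its K most
    similar other nodes (plus itself), regardless of geographic distance.
    """
    spec = np.abs(np.fft.rfft(flow, axis=0))                  # (F, N) spectra
    spec = spec / (np.linalg.norm(spec, axis=0, keepdims=True) + 1e-8)
    sim = spec.T @ spec                                       # (N, N) cosine sim
    np.fill_diagonal(sim, -np.inf)                            # exclude self from top-K
    n = sim.shape[0]
    idx = np.argsort(-sim, axis=1)[:, :k]                     # K most similar nodes
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    np.fill_diagonal(mask, True)                              # keep self-connection
    return mask
```

Varying `k` here reproduces the trade-off observed in the sensitivity analysis: too few neighbors miss long-range dependencies, too many admit noisy correlations.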
5.9. Traffic Occupancy Embedding
To highlight the importance of traffic occupancy data as an input embedding, this study conducts an in-depth analysis of the daily variations in both traffic flow and traffic occupancy. Twenty specific sensors were selected from the PeMS08 dataset, and heatmaps of both traffic flow and traffic occupancy were generated.
Figure 8 visually demonstrates that, during peak periods (such as 7–9 a.m. and 5–7 p.m.), both traffic flow and traffic occupancy reach peak values at multiple nodes simultaneously. These significant peak regions indicate strong synchronization between flow and road occupancy during peak hours. Moreover, data from Node 16 shows that, although traffic flow decreases during certain periods, traffic occupancy remains relatively high. This may be due to factors such as reduced speed and increased vehicle density, leading to localized congestion. In such cases, relying solely on traffic flow may not accurately reflect congestion conditions. Traffic occupancy, as a supplementary indicator, can assist the model in identifying abnormal features under varying congestion states. Additionally, variations in traffic flow and occupancy across different nodes may reflect the features of the roads in the region, providing the model with the ability to identify differences between nodes. Therefore, embedding traffic occupancy into the model helps capture the complex spatial–temporal dependencies of traffic flow, thereby improving the model’s prediction accuracy and adaptability.
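The role of occupancy as a supplementary input can be illustrated with a minimal sketch: stack occupancy alongside flow as a second channel, and flag the low-flow/high-occupancy intervals (as at Node 16) that flow alone would misread as light traffic. The function names and thresholds are hypothetical.

```python
import numpy as np

def build_inputs(flow, occupancy):
    """Stack occupancy next to flow as an extra channel: two (T, N) arrays
    become one (T, N, 2) tensor, so the embedding layer sees both signals."""
    assert flow.shape == occupancy.shape
    return np.stack([flow, occupancy], axis=-1)

def congestion_flags(flow, occupancy, flow_thr, occ_thr):
    """Flag intervals where occupancy stays high while flow drops — the
    localized-congestion pattern that flow alone cannot distinguish from
    genuinely light traffic."""
    return (occupancy >= occ_thr) & (flow <= flow_thr)
```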
5.10. Case Study
To gain insights into the model’s spatial reasoning capabilities, we visualized the normalized attention weights of the geographic spatial attention during the morning peak (9:00–10:00) and of the dynamic similarity spatial attention during the off-peak period (15:00–16:00).
Figure 9 demonstrates that the two attention mechanisms capture distinct spatial patterns: During the morning peak (9:00–10:00), traffic congestion is typically higher on major arterial roads and key intersections, as people commute to work or school. The Geographic Spatial Attention mechanism captures this phenomenon by emphasizing the nodes (roads or intersections) with immediate neighbors that are highly trafficked, thus modeling the flow patterns in densely populated areas. As seen in
Figure 9a, the attention heatmap predominantly highlights these localized clusters of road nodes, reflecting the high traffic density and the strong temporal dependencies at this time. The corresponding geospatial mask (
Figure 9b) further refines this attention by suppressing connections that are distant or irrelevant to the immediate traffic flow. Dynamic similarity spatial attention (
Figure 9c) reveals interesting long-range dependencies between nodes sharing similar traffic patterns, such as residential areas that experience synchronized morning outbound traffic flows. Similarly, by comparing the dynamic similarity attention heatmap (
Figure 9c) with its mask matrix (
Figure 9d), we observe that the mask strategically preserves connections between functionally similar nodes regardless of geographical distance, while filtering out irrelevant correlations. This enables the model to identify and leverage synchronized traffic behaviors across the network, capturing complex urban mobility patterns that transcend physical connectivity.
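The masking behavior described above — suppressing attention to nodes beyond the geographic threshold before the softmax — can be sketched as follows. The function signature and the use of raw distances are our own illustrative assumptions.

```python
import numpy as np

def geographic_attention(scores, dist, threshold):
    """Apply a distance-threshold mask before softmax, confining attention
    to geographically close nodes.

    scores: (N, N) raw attention logits; dist: (N, N) pairwise distances.
    Entries beyond the threshold are set to -inf, so their softmax weight
    is exactly zero — the suppression visible in the masked heatmaps.
    """
    masked = np.where(dist <= threshold, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))  # row-stable softmax
    return e / e.sum(axis=1, keepdims=True)
```

The dynamic similarity branch uses the same masked-softmax mechanics but with a top-K similarity mask in place of the distance test, which is why it can retain distant but functionally similar nodes.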
5.11. Forecasting Results and Visualization
After selecting two specific nodes from the PeMS08 dataset, we visualized the prediction results for the test set (horizon = 12), as shown in
Figure 10a. The true traffic flow curve exhibits relatively smooth variations with a clear trend. In this scenario, the model performs exceptionally well, almost perfectly capturing the overall trend of traffic flow, especially at peak and trough points. Furthermore, as shown in
Figure 10b, despite the larger and more frequent fluctuations in the actual traffic flow, the model successfully identifies various abrupt changes and responds quickly, closely approximating the true values even in high-frequency fluctuation intervals. This stable prediction of extreme fluctuations further highlights the model’s superiority in handling complex and unstable traffic patterns.
6. Conclusions
In this paper, we proposed MMHFormer, a novel multi-source and multi-view hierarchical Transformer model designed for traffic flow prediction, effectively tackling the challenges of complex spatial–temporal dependencies, dynamic variations, and external influences in traffic networks. MMHFormer incorporates a multi-source gated embedding layer to dynamically fuse multidimensional features, including spatial, temporal, and external conditions, enhancing the representation of traffic scenarios. Additionally, it employs a hierarchical multi-view spatial attention module to capture global, local, and similarity-based spatial dependencies, effectively addressing spatial heterogeneity. To further improve adaptability, the model leverages a hierarchical two-stage temporal attention mechanism, which models global temporal patterns while adapting to node-specific variations. Extensive experiments on four benchmark datasets (PeMS03, PeMS04, PeMS07, and PeMS08) demonstrate that MMHFormer consistently outperforms state-of-the-art methods across various evaluation metrics, including MAE, RMSE, and MAPE. The model also exhibits remarkable long-term prediction stability and effectiveness, making it a reliable tool for intelligent transportation systems. Future work will aim to enhance MMHFormer by incorporating a broader range of external contextual features, such as various weather conditions and traffic incidents, and extending its capabilities to support multimodal data inputs.