Article

MMHFormer: Multi-Source and Multi-View Hierarchical Transformer for Traffic Flow Prediction

1 School of Computer Science and Engineering, Chongqing University of Science and Technology, Chongqing 400054, China
2 School of Electrical and Electronic Engineering, North China Electric Power University, Beijing 102206, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12804; https://doi.org/10.3390/app152312804
Submission received: 23 October 2025 / Revised: 21 November 2025 / Accepted: 1 December 2025 / Published: 3 December 2025
(This article belongs to the Special Issue Advanced Methods for Time Series Forecasting)

Abstract

Traffic flow prediction is a vital component of Intelligent Transportation Systems (ITSs), playing a key role in proactive traffic management and the optimization of urban mobility. However, the complex spatial–temporal dependencies, dynamic variations, and external factors in traffic networks present significant challenges for accurate predictions. In this paper, we propose MMHFormer, a novel multi-source, multi-view hierarchical Transformer model specifically designed for traffic flow prediction. MMHFormer incorporates three key innovations: (1) a multi-source gated embedding layer that integrates diverse multidimensional inputs, including spatial Laplacian embeddings, temporal periodic embeddings, and traffic occupancy embeddings, to better capture the complex dynamics of traffic conditions; (2) a hierarchical multi-view spatial attention module that models global, local, and dynamic similarity-based spatial dependencies, effectively addressing the spatial heterogeneity of traffic flows; (3) a hierarchical two-stage temporal attention mechanism that captures global temporal dependencies while adapting to node-specific temporal variations. Extensive experiments conducted on four benchmark traffic datasets demonstrate that MMHFormer outperforms state-of-the-art methods, achieving significant improvements in prediction accuracy.

1. Introduction

Traffic flow prediction is one of the core technologies of Intelligent Transportation Systems (ITSs). By analyzing historical data and utilizing prediction models, it helps traffic managers forecast future traffic volume and road conditions [1]. With accurate traffic predictions, traffic management departments can develop road-condition optimization strategies and plan routes in advance, thereby improving traffic operational efficiency, ensuring travel safety, and optimizing urban transportation networks [2]. However, due to the complex temporal and spatial dependencies of traffic flow, achieving accurate predictions remains challenging.
Traffic flow data, typically collected from road sensors as time-series data, has traditionally been analyzed using linear methods such as Autoregressive Integrated Moving Average (ARIMA) [3], Vector Autoregression (VAR) [4], and Support Vector Regression (SVR) [5]. However, these methods struggle to effectively capture the nonlinear characteristics of traffic patterns and the spatial attributes of the transportation network, limiting their application in modern traffic flow prediction. With advancements in deep learning techniques and improvements in hardware capabilities, an increasing number of neural network-based models have been employed to more effectively capture the dynamic temporal and spatial dependencies in traffic flow prediction. In the temporal dimension, Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) [6] and Gated Recurrent Units (GRU) [7], are employed to capture the dynamic temporal features of traffic flow. In the spatial dimension, some researchers utilize Convolutional Neural Networks (CNNs) to partition the road network into distinct grids to extract spatial correlations. Researchers have combined these models to extract spatial–temporal features [8]. However, this approach is only applicable to grid-like Euclidean spatial graphs, while traffic road networks are typically non-Euclidean. Subsequently, Graph Neural Networks (GNNs) have been shown to be more suitable for modeling non-Euclidean traffic road networks [9]. For instance, Zhao et al. proposed the T-GCN model [10], which combines Graph Convolutional Networks (GCNs) and GRU to simultaneously capture both spatial and temporal dependencies in traffic data, achieving significant results.
Despite the significant achievements in exploring the spatial–temporal features of traffic networks, several challenges still remain to be addressed. Firstly, the spatial–temporal features of traffic flow are highly complex and intertwined. In the input embedding stage, previous studies typically input spatial and temporal data into separate spatial and temporal modules for feature extraction, and then fuse the extracted results in subsequent stages. However, this approach fails to effectively integrate spatial–temporal features, leading to insufficient capture and utilization of deeper information correlations. Additionally, existing multidimensional embedding methods have notable limitations, particularly the lack of effective fusion strategies. These methods typically treat data from different dimensions equally, without considering the relative importance and interdependencies of each dimension, thereby limiting the model’s performance to some extent. Moreover, most early models primarily rely on a single data source (such as historical traffic flow data) for predictions, neglecting other factors that significantly influence traffic flow. For example, an increase in traffic occupancy during peak hours may lead to congestion, significantly affecting the distribution of traffic flow within the region. Ignoring these critical influencing factors may prevent the model from fully addressing the complexity of real-world scenarios, thereby reducing its reliability and practical applicability.
Secondly, traffic networks exhibit significant spatial heterogeneity. Nodes that are spatially adjacent (e.g., different locations on the same road segment) often exhibit similar traffic flows and variation trends. However, as shown in Figure 1b, even when two nodes are far apart and not on the same road segment, they may still exhibit similar traffic patterns, often driven by shared traffic characteristics or demand patterns. This indicates that spatial correlations in traffic networks are not solely dependent on physical distance. Moreover, numerous studies have demonstrated that GNNs tend to suffer from over-smoothing during node feature aggregation [11], which undermines the model’s ability to capture long-range dependencies and limits its performance in modeling spatial heterogeneity.
Finally, traffic flow exhibits significant temporal heterogeneity. For example, Figure 1c shows that the traffic flow correlation between Location A and Location B is relatively low during the morning peak period, reflecting a disparity in their traffic patterns [12]. However, during the evening peak period, their traffic patterns tend to align. Conversely, as shown in Figure 1d, the traffic flow correlation between Node B and Node C is stronger during the morning peak, but weaker at other times. Furthermore, Figure 1a shows that even at the same location, traffic patterns on weekdays may differ significantly from those on non-working days. Weekday traffic flows exhibit clear periodicity, whereas non-working days lack such regularity, reflecting differences in travel behaviors on different types of days. Notably, given the significant temporal heterogeneity in traffic flow across different periods, a single temporal processing method may be insufficient to comprehensively capture its complex dynamic variations.
To address the aforementioned complex traffic flow scenarios, this paper proposes a model based on the Transformer encoder architecture—MMHFormer. To provide enriched input representations for subsequent spatial–temporal encoding, the input embedding layer utilizes gated convolution to integrate traffic features, temporal characteristics, spatial attributes, and external environmental factors. To capture spatial dependencies among traffic nodes from multiple perspectives, we introduce a hierarchical multi-view spatial attention module that explicitly models both global and local spatial dependencies. Additionally, to address cross-node temporal pattern variations, we propose a hierarchical two-stage temporal attention module. In the first stage, a temporal attention mechanism captures global temporal dependencies. In the second stage, convolution is performed at the node level as a secondary query, followed by an additional round of temporal attention to emphasize traffic pattern differences across nodes. The main contributions of MMHFormer are summarized as follows:
  • This paper proposes a novel MMHFormer model, which effectively addresses challenges in multi-source information fusion, multi-view interaction, and dynamic temporal dependency modeling.
  • MMHFormer is designed as a unified framework that integrates multi-source feature fusion, multi-view spatial reasoning, and hierarchical temporal modeling. A multi-source gated embedding layer fuses spatial Laplacian, temporal periodic, and traffic occupancy embeddings to represent complex traffic conditions. On this basis, a hierarchical multi-view spatial attention mechanism captures global, geospatial, and dynamic-similarity dependencies, while a two-stage temporal attention module learns both global temporal correlations and node-specific dynamics, enabling adaptive modeling of evolving traffic patterns across time and space.
  • Experimental results on four real-world traffic datasets show that MMHFormer outperforms current state-of-the-art methods, validating the model’s effectiveness and generalization capability. Furthermore, we conducted a single-step evaluation of the hourly forecast, demonstrating the model’s effectiveness in long-term prediction.

2. Related Works

2.1. Traffic Flow Prediction

Traffic flow prediction is a critical component of intelligent transportation systems, enabling the forecasting of future traffic conditions based on historical data, which significantly aids traffic management and control. Researchers typically use statistical methods, machine learning approaches, and deep learning techniques to predict traffic flow. Traditional statistical models, such as exponential smoothing, ARIMA, VAR, and historical average (HA) [13], approach traffic forecasting by assuming linear dependencies. However, they are less effective at capturing the complex nonlinear variations inherent in traffic flow.
Recognizing the limitations of linear models, researchers shifted their focus to treating traffic prediction as a nonlinear problem. Machine learning techniques, such as support vector machines (SVM) [14], Bayesian networks, autoencoders, and k-nearest neighbors (KNN) [15], were introduced to extract correlations from the data. However, these methods relied on manually crafted features, making feature engineering extremely time-consuming for large datasets, which further limited their effectiveness for large-scale traffic flow prediction.
In recent years, deep learning has gradually become the mainstream method for traffic flow prediction due to its powerful modeling capabilities. For example, Zhang et al. proposed a hybrid deep learning framework combining CNN and LSTM for short-term traffic flow forecasting [15]. The 1D CNN captures spatial features, while LSTM learns short-term variations and periodic features. Liu et al. introduced a module combining convolution and LSTM, referred to as the Conv-LSTM module, to extract spatial–temporal features from traffic flow data [16]. Furthermore, the model includes a Bidirectional LSTM (Bi-LSTM) module to capture the periodic characteristics of traffic flow. Lv et al. proposed the LC-RNN model, which employs a CNN to extract dynamic features from surrounding areas, while an RNN learns long-term temporal patterns [17]. Additionally, LC-RNN integrates periodic information and contextual factors (such as weather and holidays) and adaptively fuses this information to improve prediction accuracy. These models, primarily based on CNNs and RNNs designed for Euclidean data such as grids or sequences, struggle to model traffic networks effectively, as these networks are inherently non-Euclidean and exhibit complex, irregular topologies. Consequently, these methods fail to fully capture the spatial dependencies of road networks, leading to suboptimal performance.

2.2. Graph Convolution Network

Graph Convolutional Networks are designed to analyze non-Euclidean data. Unlike CNNs, which are limited to regular grid-based data, GCNs excel at aggregating feature information from neighboring nodes to compute node representations, effectively leveraging graph topology. Consequently, researchers have applied GCNs to traffic flow prediction to extract spatial features from non-Euclidean traffic networks. For example, Yu et al. proposed the STGCN, which leverages graph convolutional layers to capture spatial topology and temporal convolutional layers to model temporal dynamics, thereby overcoming the limitations of recurrent networks [18]. Guo et al. proposed the ASTGCN [19], which consists of three independent components that model recent, daily, and weekly dependencies in traffic flow, respectively. Each component includes a spatial–temporal attention mechanism to effectively capture dynamic spatial–temporal correlations and a spatial–temporal convolution module that combines graph convolution and standard convolution to capture spatial patterns and temporal features. Although previous studies acknowledge the importance of capturing both spatial and temporal features for traffic forecasting, most methods still rely on predefined adjacency matrices to model spatial correlations, neglecting dynamic features that evolve over time. Therefore, in recent years, some researchers have focused more on the dynamic construction of adjacency matrices to better capture spatial–temporal dependencies between nodes. Wu et al. proposed Graph WaveNet [20], which uses an adaptive dependency matrix to capture hidden spatial dependencies in the data, thereby avoiding reliance on predefined adjacency matrices. In addition, Graph WaveNet employs stacked dilated causal convolutions to capture temporal dependencies, enabling it to handle long time series. For example, Bai et al. proposed the NAPL module and the DAGG module to enhance the capabilities of GCNs [21]. 
The NAPL module is used to learn node-specific patterns, while the DAGG module automatically generates dependencies between nodes based on the data.
With the advancement of research, the limitations of a single adjacency matrix in capturing complex spatial dependencies have become increasingly apparent. Researchers have begun to explore the use of multiple graph adjacency matrices to more comprehensively capture spatial dependencies. Guo et al. proposed the HGCN [22], which combines micro-level and macro-level graph layers. The micro-level layer represents nodes in a road network and their connections, while the macro-level layer is constructed by clustering the micro-level layer to form larger traffic regions. HGCN uses pooling methods to achieve a hierarchical structure and introduces a dynamic transfer module that facilitates interaction between micro and macro features, thereby enhancing the model’s accuracy and generalization ability in traffic forecasting. Ye et al. proposed the DMGNN [23], which constructs multiple types of spatial–temporal graphs to represent different relationships between road nodes, providing richer prior knowledge and better capturing complex relationships within the traffic system. To dynamically reflect changes between nodes, DMGNN includes a dynamic graph adjustment module that updates the adjacency matrix during each training iteration, allowing the model to adapt to temporal variations in dependency relationships. Yin et al. introduced the M-SDCGCN [24], which incorporates multiple types of graphs (including adaptive, dynamic, and static graphs) to model complex dependencies between road nodes, thus capturing interactions at different levels. The model leverages meta-learning to enhance connections between spatial–temporal features and jointly models both static and dynamic factors. Huang et al. presented the MVDGCN [25], which integrates multi-view encoder-decoder modules, dynamic relationship matrix generation, and coupled graph convolutions to capture spatial–temporal dependencies at different time scales. 
The model extracts traffic flow patterns from hourly, daily, and weekly perspectives, and uses a dynamic fusion module to integrate these features for more accurate traffic flow forecasting.
Beyond the aforementioned graph structures, several studies have increasingly focused on spatial–temporal modeling of heterogeneous traffic flow with heterogeneous spatial–temporal graphs. Zhong et al. [26] proposed a heterogeneous spatial–temporal graph convolution framework that constructs multiple graphs to jointly encode geographical and dynamic correlations, while simultaneously handling missing traffic data via spatial–temporal completion. Xu et al. [27] built a global heterogeneous traffic spatial–temporal graph and introduced HTSTGC to capture interactions among different traffic elements within a unified framework. More recently, Wu et al. [28] presented MHGNet, a multi-heterogeneous graph neural network that jointly represents multiple heterogeneous spatial–temporal graphs and performs prediction on clustered heterogeneous subgraphs. Collectively, these works underscore the importance of explicitly modeling heterogeneous nodes, edges, and relations in traffic networks. However, they mainly focus on structural heterogeneity at the graph level and rarely integrate rich multi-source traffic attributes or explicitly address temporal heterogeneity.

2.3. Attention Mechanism

Transformers [29], known for their powerful sequence modeling capabilities and multi-head attention mechanisms, have achieved significant success in natural language processing and have recently been introduced into the domain of traffic flow forecasting. Compared to traditional models such as GCNs and RNNs, Transformers have a greater advantage in capturing long-term spatial–temporal dependencies, especially when handling dynamic changes and complex patterns in traffic flow. For instance, Jiang et al. proposed PDFormer [30], which employs a spatial self-attention module to capture dynamic spatial dependencies and combines short-range and long-range graph masks to capture local and remote dependencies. Furthermore, PDFormer introduces a delay-aware feature transformation module to explicitly model the delay characteristics of spatial information propagation. Cai et al. introduced LCDFormer [31], which incorporates a temporal aggregation approach that aggregates and compresses historical data over long time windows, effectively retaining long-term historical information while reducing the impact of redundant data. Additionally, LCDFormer proposes a novel spatial–temporal attention module that combines topology-based local attention and node similarity-based global attention, enabling the model to capture local short-term spatial features as well as to discover long-distance dynamic spatial correlations. Geng et al. presented STGAFormer [32], which adopts a Transformer encoder architecture combined with a gated temporal self-attention module and a distance-based spatial self-attention module to extract complex spatial–temporal features. The gated temporal self-attention module enhances the extraction of both local and global temporal features, while the distance-based spatial self-attention module employs thresholding to selectively extract crucial spatial features. Li et al. proposed DDGFormer [33], which introduces direction and distance-aware self-attention modules to capture relative position and directionality in traffic flow sequences. In addition, the model includes a dynamically enhanced adaptive graph convolutional network module, which improves the capture of dynamic spatial correlations in traffic systems.
Compared with the above existing studies, our proposed MMHFormer offers the following advantages and distinctions: (1) A multi-source gated embedding layer adaptively fuses spatial Laplacian, temporal periodic, and occupancy features, effectively mitigating the limitations of single-source or uniformly weighted embeddings. (2) A hierarchical multi-view spatial attention module explicitly captures global, local, and dynamic-similarity dependencies, enabling comprehensive modeling of spatial heterogeneity. (3) A two-stage temporal attention mechanism jointly models global dependencies and node-specific variations, substantially improving robustness to diverse temporal dynamics.

3. Problem Formalization

Traffic flow is defined as $X = \{X_1, X_2, \dots, X_T\} \in \mathbb{R}^{T \times N \times C}$, where $X_t \in \mathbb{R}^{N \times C}$ represents the observed values at time step $t$ for $N$ nodes with $C$ feature dimensions. Here, $C = 1$ corresponds to the traffic flow feature dimension. Our goal is to train a function $F$ that predicts the sequence for the next $Q$ time steps based on the past $P$ time steps of traffic flow data $X$. Therefore, the traffic prediction problem can be formulated as:
$$[Y^{(t+1)}, \dots, Y^{(t+Q)}] = F(X^{(t-P+1)}, \dots, X^{(t)})$$
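As a shape-level illustration of this formulation, the sketch below shows how a function $F$ maps $P$ past steps to $Q$ future steps; the repeat-last-value "model" is a naive baseline standing in for the learned predictor (not MMHFormer), and the dimensions merely follow a PeMS-style setting:

```python
import numpy as np

# Shape-level sketch of the forecasting problem: F maps (P, N, C) -> (Q, N, C).
# The repeat-last-value function below is a naive baseline, not MMHFormer.
P, Q, N, C = 12, 12, 307, 1  # 12 past / 12 future steps, N sensors, C = 1 (flow)

def forecast_stub(x_past: np.ndarray) -> np.ndarray:
    """Stand-in for the learned function F: repeats the last observation Q times."""
    assert x_past.shape == (P, N, C)
    return np.repeat(x_past[-1:], Q, axis=0)

x = np.random.rand(P, N, C)   # past P steps of traffic flow
y = forecast_stub(x)          # predicted next Q steps
```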

4. Methodology

Figure 2 illustrates the framework of MMHFormer, which primarily consists of three modules: the multi-source gated embedding layer, the spatial–temporal encoder layer, and the output layer.
To effectively capture the spatial–temporal features of traffic flow while accounting for road conditions, the multi-source gated embedding layer integrates multidimensional data inputs, including raw traffic flow, spatial Laplacian embeddings, temporal periodic embeddings, and traffic occupancy embeddings, using gated convolutions.
The spatial–temporal encoder layer employs a hierarchical multi-view spatial attention module. A global spatial attention mechanism captures global spatial dependencies between nodes. To enhance sensitivity to critical local information, a geospatial attention mechanism focuses on the local dynamic features of neighboring nodes while discarding connections to distant nodes. Additionally, dynamic similarity spatial attention is incorporated, allowing the model to ignore distance and extract long-range spatial features by capturing the dynamic similarity of traffic patterns between nodes.
To adapt to the varying traffic patterns across different nodes, the hierarchical two-stage temporal attention module captures global temporal dependencies in its first stage, while its second stage refines the query focus and re-applies temporal attention to emphasize traffic pattern differences between nodes.
The output layer employs skip connections and convolutional layers to transform the outputs into the final dimensions required for prediction. In this section, we provide a detailed description of the MMHFormer architecture.

4.1. Multi-Source Gated Embedding Layer

At each time interval, traffic flow in different urban areas is influenced by various factors, including traffic flow in neighboring areas and external environmental conditions. For instance, congestion in one area can lead to a sudden decrease in traffic flow across the city. Similarly, during holidays, urban traffic flow often increases significantly compared to weekdays. Based on this observation, we propose a multi-source gated embedding layer that integrates multidimensional data to capture the spatial–temporal dependencies and the impact of road conditions on traffic flow.
Specifically, the raw traffic flow input $X$ is transformed into $X_f \in \mathbb{R}^{T \times N \times d_f}$ via a linear layer and combined with three types of embeddings: spatial Laplacian embeddings to encode the spatial structural features of the road network, temporal periodic embeddings to capture the periodic variations in traffic flow, and traffic occupancy embeddings to reflect external environmental impacts. Finally, a gated convolutional network achieves efficient fusion of multi-source information, effectively integrating the spatial, temporal, and external influences on traffic flow.
Spatial–temporal embedding: The Laplacian matrix is used to learn the correlations between nodes in the road network, embedding the graph into Euclidean space to obtain the spatial embedding representation $X_{spe} \in \mathbb{R}^{N \times d_{st}}$. Considering the periodicity of urban traffic flow, we introduce weekly and daily periodic embeddings [30], represented as $t_{w(T)}$ and $t_{d(T)} \in \mathbb{R}^{d}$, where $w(T)$ and $d(T)$ convert time $t$ into week and minute indices. Temporal embeddings $X_w$ and $X_d \in \mathbb{R}^{T \times d_{st}}$ are added to the spatial embedding $X_{spe}$ to obtain the final spatial–temporal representation $X_{st} \in \mathbb{R}^{T \times N \times d_{st}}$.
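As an illustration of the spatial embedding step, the sketch below computes eigenvectors of the symmetric normalized graph Laplacian for a toy road graph; the normalization choice and the number of eigenvectors kept are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

# Sketch of a Laplacian positional embedding for the road graph: eigenvectors
# of the symmetric normalized Laplacian, skipping the trivial smallest one.
def laplacian_embedding(A: np.ndarray, d_st: int) -> np.ndarray:
    deg = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    return eigvecs[:, 1:d_st + 1]         # d_st smallest non-trivial eigenvectors

# Toy 4-node path graph (adjacency matrix of the road network)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X_spe = laplacian_embedding(A, d_st=2)    # (N, d_st) spatial embedding
```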
Traffic occupancy embedding: Variations in traffic flow and occupancy across different nodes may reflect the characteristics of the roads in those regions (e.g., major or minor roads), providing the model with the ability to recognize differences between nodes. To achieve this, we introduce the traffic occupancy embedding mechanism. Initially, raw traffic occupancy data is processed to extract key features, and a linear layer is applied to generate the traffic occupancy embedding representation $X_o \in \mathbb{R}^{T \times N \times d_o}$.
Information fusion: The traffic flow input representation $X_f$, spatial–temporal embedding $X_{st}$, and traffic occupancy embedding $X_o$ are concatenated to produce the fused representation $X_e \in \mathbb{R}^{T \times N \times d}$, where $d = d_f + d_{st} + d_o$ is the hidden dimension:
$$X_{fus} = \mathrm{concat}(X_f, X_{st}, X_o)$$
$$X_e = \mathrm{Conv}\big(\mathrm{Conv}(X_{fus}) \odot \mathrm{silu}(\mathrm{Conv}(X_{fus}))\big)$$
$$X_e = X_e + \mathrm{PE}(X_e)$$
The gated convolution mechanism dynamically adjusts feature weights based on different input data, enabling more comprehensive fusion. Here, $\mathrm{Conv}$ denotes a $1 \times 1$ convolution, $\mathrm{PE}$ represents temporal positional encoding, $\mathrm{silu}$ denotes the Sigmoid Linear Unit, and $\odot$ represents the Hadamard product.
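Since a $1 \times 1$ convolution over the feature axis is equivalent to a per-position linear map, the gated fusion can be sketched as follows (random weights stand in for the learned convolutions; the dimensions are illustrative):

```python
import numpy as np

def silu(x):
    # Sigmoid Linear Unit: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

# Sketch of the gated fusion: 1x1 convolutions reduce to per-position linear
# maps over the feature axis; random weights stand in for the learned kernels.
T, N, d = 12, 5, 8
rng = np.random.default_rng(0)
X_fus = rng.normal(size=(T, N, d))         # concat of X_f, X_st, X_o (width d assumed)
W_a, W_b, W_out = (rng.normal(size=(d, d)) for _ in range(3))

gated = (X_fus @ W_a) * silu(X_fus @ W_b)  # Hadamard gate on the value branch
X_e = gated @ W_out                        # outer 1x1 convolution
```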

4.2. Hierarchical Multi-View Spatial Attention Module

As illustrated in Figure 3, global spatial attention is first introduced to capture the dependencies between traffic nodes and all other nodes. For the input feature representation $X_e \in \mathbb{R}^{T \times N \times d}$, query, key, and value matrices are generated at each time step $t$ through convolutional operations as follows:
$$Q_t^{(S1)} = X_e(t,:,:) W_Q, \quad K_t^{(S1)} = X_e(t,:,:) W_K, \quad V_t^{(S1)} = X_e(t,:,:) W_V$$
Here, $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ are learnable parameter matrices, where $d$ is the dimensionality of the queries, keys, and values. The attention score matrix $A^{(S1)}$ is computed as:
$$A^{(S1)} = \frac{Q_t^{(S1)} (K_t^{(S1)})^{\top}}{\sqrt{d}}$$
The softmax function is applied to $A^{(S1)}$, generating attention weights for each node relative to all other nodes. The global spatial representation $X_s^{(Glo)}$ is then obtained:
$$X_s^{(Glo)} = \mathrm{softmax}(A^{(S1)}) V_t^{(S1)}$$
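A minimal single-head sketch of this global spatial attention at one time step (random matrices stand in for the learned projections):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Single-head sketch of global spatial attention at one time step t; random
# matrices stand in for the learned W_Q, W_K, W_V projections.
T, N, d = 12, 6, 8
rng = np.random.default_rng(1)
X_e = rng.normal(size=(T, N, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

t = 0
Q = X_e[t] @ W_Q
K = X_e[t] @ W_K
V = X_e[t] @ W_V
A_s1 = Q @ K.T / np.sqrt(d)      # (N, N) attention scores
X_glo = softmax(A_s1) @ V        # global spatial representation for step t
```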
In practical traffic scenarios, the dynamic features of a target node are often significantly influenced by its neighboring nodes, particularly during abrupt changes. To better capture the local dynamic characteristics of nodes, geospatial attention is introduced. This mechanism focuses on information from neighboring nodes while discarding connections with distant nodes. Based on the global spatial attention output $X_s^{(Glo)}$, a new query $Q_t^{(S2)}$ is generated at time step $t$ using a $1 \times 1$ convolution and treated as $V_t^{(S2)}$. The secondary attention weight $A^{(S2)}$ is computed as:
$$A^{(S2)} = \frac{Q_t^{(S2)} (Q_t^{(S1)})^{\top}}{\sqrt{d}}$$
To ensure that the attention mechanism focuses only on nearby nodes, a geospatial mask matrix $M_{geo}$ is defined. The matrix is undirected, and only nodes with distances less than $\lambda$ are considered for extracting important features. The geospatial representation $X_s^{(Geo)}$ is obtained as:
$$X_s^{(Geo)} = \mathrm{softmax}(A^{(S2)} \odot M_{geo}) V_t^{(S2)}$$
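One common way to realize such a distance mask is an additive mask that sets the scores of pairs farther apart than the threshold to negative infinity before the softmax; the sketch below uses random coordinates and scores, and this additive formulation is an implementation assumption rather than the paper's exact definition:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Sketch of the geospatial mask: node pairs farther apart than lam receive
# -inf scores, so their attention weights become exactly zero after softmax.
N = 5
rng = np.random.default_rng(2)
coords = rng.uniform(size=(N, 2))                    # toy node coordinates
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
lam = 0.5                                            # distance threshold
M_geo = np.where(dist < lam, 0.0, -np.inf)           # additive mask (undirected)

scores = rng.normal(size=(N, N))                     # stand-in for A^(S2)
weights = softmax(scores + M_geo, axis=-1)
```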
Although some traffic nodes are not geographically adjacent, they may exhibit similar traffic patterns driven by shared traffic characteristics or demand patterns. Dynamic similarity spatial attention is introduced to dynamically adjust attention weights based on the similarity of traffic patterns between nodes, identifying potential correlations among non-adjacent nodes. We use a sliding 12-step time window to analyze traffic data and construct a dynamic similarity mask matrix $M_{dyn}$ using the fast Fourier transform (FFT). Specifically, we apply the FFT to the past 12 steps of each node to extract short-term periodicity and local oscillation patterns, and then compute the Euclidean distances between the resulting frequency-domain features. This matrix encodes the relationships between each node and its K most similar nodes, with a weight of 1 assigned to the most similar nodes and 0 to others.
The choice of a 12-step window is driven by the forecasting task, which involves predicting the next 12 steps based on the previous 12 steps. A window of 12 steps corresponds to one hour of traffic data, offering a sufficient segment for capturing stable frequency components while remaining short enough to account for transient variations and peak-hour fluctuations.
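The mask construction described above can be sketched as follows (using FFT magnitudes as the frequency-domain features and K = 2 for illustration; both choices are assumptions):

```python
import numpy as np

# Sketch of the dynamic-similarity mask: FFT magnitudes of each node's last
# 12 steps are compared by Euclidean distance, and each node keeps its K most
# similar peers as 1-entries in a binary mask.
def dynamic_similarity_mask(window: np.ndarray, K: int) -> np.ndarray:
    """window: (12, N) recent flow; returns binary (N, N) mask."""
    spec = np.abs(np.fft.rfft(window, axis=0)).T      # (N, F) frequency features
    dist = np.linalg.norm(spec[:, None] - spec[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                    # exclude self-matches
    N = window.shape[1]
    M = np.zeros((N, N))
    for i in range(N):
        M[i, np.argsort(dist[i])[:K]] = 1.0           # K most similar nodes
    return M

rng = np.random.default_rng(3)
M_dyn = dynamic_similarity_mask(rng.normal(size=(12, 6)), K=2)
```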
Based on $X_s^{(Glo)}$, a new query $Q_t^{(S3)}$ is generated at time step $t$ using a $1 \times 1$ convolution and treated as $V_t^{(S3)}$. Under the dynamic similarity mask, the dynamic similarity attention weight $A^{(S3)}$ is calculated as:
$$A^{(S3)} = \frac{Q_t^{(S3)} (Q_t^{(S1)})^{\top}}{\sqrt{d}}$$
The dynamic similarity representation $X_s^{(Dyn)}$ is given by:
$$X_s^{(Dyn)} = \mathrm{softmax}(A^{(S3)} \odot M_{dyn}) V_t^{(S3)}$$
Finally, the model integrates global spatial, geospatial, and dynamic similarity attention mechanisms to model multi-perspective spatial dependencies. The aggregated spatial representation is expressed as:
$$X_s = \mathrm{Concat}\big(X_s^{(Dyn)}, X_s^{(Geo)}, X_s^{(Glo)}\big) W_F$$
$$X_s = \mathrm{FFN}(\mathrm{LN}(X_s)) + X_s$$
Here, $W_F \in \mathbb{R}^{3d \times d}$ is a learnable parameter. The feed-forward network (FFN) performs nonlinear transformations with two linear layers and a GELU activation function, while LN denotes the layer normalization operation.

4.3. Hierarchical Two-Stage Temporal Attention

4.3.1. Global Temporal Attention

In the first stage, to capture global dynamic temporal patterns, we first project the input tensor $X_s \in \mathbb{R}^{T \times N \times d}$ into temporal query, key, and value matrices using learnable $1 \times 1$ convolutions:
$$Q_t^{(T)} = X_s(t,:,:) W_Q^{(T)}, \quad K_t^{(T)} = X_s(t,:,:) W_K^{(T)}, \quad V_t^{(T)} = X_s(t,:,:) W_V^{(T)}$$
Here, $W_Q^{(T)}, W_K^{(T)}, W_V^{(T)} \in \mathbb{R}^{d \times d}$ are learnable parameters.
Next, we compute the scaled dot-product between the queries and keys at each node to obtain the temporal attention scores. The output of the first-stage temporal attention module is then calculated by applying the attention scores to the values:
$$X_t^{(T1)} = \mathrm{softmax}\left(\frac{Q_t^{(T)} (K_t^{(T)})^{\top}}{\sqrt{d}}\right) V_t^{(T)}$$

4.3.2. Spatially-Informed Temporal Attention

Time series data not only exhibit long-term dependencies but also show complex cross-node variations arising from the heterogeneity and dynamics of each node’s temporal characteristics. A single temporal attention mechanism is insufficient to describe this heterogeneous evolution. To this end, as shown in Figure 4, we propose the spatially-informed temporal attention module, which combines attention with adaptive graph convolution to model dynamic inter-node relationships. Specifically, the adaptive graph convolution learns to adjust the adjacency matrix dynamically, fusing spatial topology with the dynamic interactions between nodes, and can thus flexibly model the interdependence between different nodes. When cross-time changes between nodes are large, it can effectively distinguish and strengthen temporal patterns at critical moments. This mechanism enables the model not only to capture the dynamic evolution over time but also to identify temporal pattern differences between nodes, further enhancing its sensitivity to cross-node variations.
In the second stage, based on the output of the first stage, we capture temporal patterns that span multiple nodes by incorporating adaptive graph learning to infer latent inter-node dependencies. These learned dependencies are then used to compute the query matrix through a graph convolution step.
$$G = \mathrm{softmax}\!\left(\mathrm{ReLU}(E E^{\top})\right)$$
$$Q_t^{(T2)} = G\, X_t^{(T1)}\, W_Q^{(T2)}$$
Here, $W_Q^{(T2)} \in \mathbb{R}^{d \times d}$ is a learnable projection matrix, and $E \in \mathbb{R}^{N \times a}$ denotes the learnable node embeddings.
In this stage, the first-stage key $K_t^{(T)}$ and output $X_t^{(T1)}$ serve as the key $K$ and value $V$, respectively, to compute the second-stage temporal attention weights. The weight matrix is normalized and applied to $X_t^{(T1)}$, which then passes through a linear layer and an FFN to generate the final temporal context $X_t^{(T2)}$, completing the two-stage aggregation of temporal features:
$$X_t^{(T2)} = \mathrm{softmax}\!\left(\frac{Q_t^{(T2)} (K_t^{(T)})^{\top}}{\sqrt{d}}\right) X_t^{(T1)}.$$
$$X_t = \mathrm{FFN}\!\left(\mathrm{LN}(X_t^{(T2)})\right) + X_t^{(T2)}$$
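The adaptive-graph query and second-stage attention can be sketched the same way. The `softmax` helper, the node-embedding width `a`, and the einsum-based graph convolution are illustrative assumptions consistent with the equations; the final linear layer and FFN are omitted for brevity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatially_informed_attention(X_t1, K_t, E, W_q2):
    """Second-stage temporal attention with an adaptive graph.

    X_t1: (T, N, d) first-stage output (used as value).
    K_t:  (T, N, d) first-stage key.
    E:    (N, a) learnable node embeddings; W_q2: (d, d) projection.
    """
    T, N, d = X_t1.shape
    G = softmax(np.maximum(E @ E.T, 0.0), axis=-1)    # adaptive adjacency (N, N)
    # graph-convolved queries: mix node features at each step, then project
    Q2 = np.einsum('ij,tjd->tid', G, X_t1) @ W_q2     # (T, N, d)
    out = np.empty_like(X_t1)
    for n in range(N):
        scores = Q2[:, n, :] @ K_t[:, n, :].T / np.sqrt(d)   # (T, T)
        out[:, n, :] = softmax(scores, axis=-1) @ X_t1[:, n, :]
    return out
```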

4.4. Output Layer

To resize the spatial–temporal encoder output, we apply $1 \times 1$ convolution skip connections, summing each layer's output to obtain $X_h \in \mathbb{R}^{T \times N \times d_{sk}}$.
$$\hat{X} = \mathrm{Conv}_2(\mathrm{Conv}_1(X_h))$$
For multi-step prediction, $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ adjust $X_h$ to the prediction-window dimension, reducing error accumulation and enhancing model effectiveness.
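A minimal sketch of this output head, assuming the skip connections and the two $1 \times 1$ convolutions act as per-position linear maps that flatten the temporal dimension into the prediction window. The paper only states that the two convolutions resize $X_h$ to the output horizon, so the exact layout here is an assumption:

```python
import numpy as np

def skip_sum(layer_outputs, W_skips):
    """Sum each encoder layer's output after a 1x1-conv-style projection.

    layer_outputs: list of (T, N, d) tensors; W_skips: list of (d, d_sk).
    Returns X_h of shape (T, N, d_sk)."""
    return sum(Y @ W for Y, W in zip(layer_outputs, W_skips))

def output_head(X_h, W1, W2):
    """Map X_h: (T, N, d_sk) to a (N, horizon) prediction.

    Conv1/Conv2 are modeled as linear maps over the flattened (T*d_sk)
    feature vector of each node; the ReLU between them is an assumption."""
    T, N, d = X_h.shape
    Z = X_h.transpose(1, 0, 2).reshape(N, T * d)   # per-node features
    H = np.maximum(Z @ W1, 0.0)                    # "Conv1" + activation
    return H @ W2                                  # "Conv2" -> horizon
```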

5. Experiments

5.1. Datasets

The datasets used for the experiments in this paper include PeMS03, PeMS04, PeMS07, and PeMS08, all derived from the Caltrans Performance Measurement System (PeMS) and collected by the California Department of Transportation (Caltrans) through its traffic monitoring network. The data cover highway traffic conditions across various regions, collected every 5 min by real-time sensors deployed on the roadways, providing multidimensional features such as traffic volume, speed, and roadway occupancy for timely monitoring of traffic flow. These multidimensional features serve as inputs to the traffic flow prediction model, which effectively captures the traffic patterns and dynamic changes in each region, revealing regional traffic differences. Owing to its large size and high sampling frequency, the PeMS dataset has become a cornerstone of ITS research, providing a valuable testbed for data-driven traffic decision-making and system optimization. Specific information about each dataset is detailed in Table 1.

5.2. Baselines

We compare MMHFormer with three categories of baseline models: traditional time series forecasting models, graph neural network-based models, and Transformer-based models.
  • VAR [4]: VAR is based on the assumption that traffic flow follows an autoregressive pattern, meaning that future values can be predicted using past data in the series.
  • SVR [5]: SVR leverages the principles of Support Vector Machines (SVMs) to effectively model traffic flow in a nonlinear way, unlike traditional linear regression methods.
  • DCRNN [34]: DCRNN models traffic flow as a diffusion process on a directed graph, capturing spatial dependencies through bidirectional random walks. It also models temporal dependencies using a sequence-to-sequence structure combined with scheduled sampling.
  • GraphWaveNet [20]: GraphWaveNet automatically generates the graph adjacency matrix by adaptively learning node embeddings, which enables more accurate capturing of hidden spatial dependencies. Additionally, it utilizes stacked dilated causal convolutions to effectively handle long-term temporal dependencies.
  • AGCRN [21]: AGCRN introduces the NAPL and DAGG modules, which are designed to capture node-specific patterns and automatically infer interdependencies across different traffic series. This enables AGCRN to effectively capture fine-grained spatial and temporal correlations within traffic series data, enhancing its ability to model complex traffic dynamics.
  • STGCN [18]: STGCN is a model that combines graph convolutions to capture spatial dependencies between traffic nodes and 1D convolutions to model temporal dynamics, effectively handling the complex spatio-temporal correlations in traffic flow prediction.
  • MTGNN [35]: MTGNN exploits the underlying spatio-temporal dependencies by automatically learning the relationships between variables. The framework includes a graph learning layer, a graph convolution module, and a temporal convolution module, which adaptively learns the graph structure to capture spatial dependencies between variables and integrates multi-frequency temporal patterns to enhance prediction performance.
  • GMAN [36]: GMAN adopts an encoder-decoder structure, with both the encoder and decoder comprising multiple spatial–temporal attention blocks to model the complex spatio-temporal correlations of traffic systems. By incorporating a transformer attention mechanism, GMAN aims to reduce error propagation in long-term forecasting.
  • ASTGCN [19]: ASTGCN consists of three independent components that model the recent, daily, and weekly dependencies of traffic flow. Each component includes spatial–temporal attention mechanisms and spatial–temporal convolution modules to capture the dynamic spatio-temporal correlations and features of traffic data.
  • STFGNN [37]: STFGNN is a novel spatiotemporal fusion graph neural network for traffic flow prediction. Using data-driven “temporal graphs” to complement traditional spatial graphs, it effectively captures hidden spatiotemporal dependencies. STFGNN’s fusion operation processes multiple spatial and temporal graphs in parallel, integrating with a gated convolution module to handle long sequences and capture richer dependencies.
  • STID [38]: By adding spatial and temporal identity information, this model addresses the issue of sample indistinguishability in the spatio-temporal dimensions, thereby enhancing the model’s predictive capability.
  • GDGCN [39]: GDGCN systematically explores spatial, temporal, and feature dimensions of data by combining parameter-sharing and independent modules. It designs a novel temporal graph convolution block to process the dynamic relationships of historical time slices in graph form. Additionally, a dynamic graph constructor is introduced to model time-specific spatial dependencies and the dynamic interaction relationships between different time slices.
  • PDFormer [30]: PDFormer addresses the limitations of current GNN-based models in static modeling, short-range spatial dependencies, and the neglect of propagation delays. PDFormer introduces a dynamic spatial self-attention module to capture dynamic spatial dependencies and employs both geographic and semantic graph mask matrices to simultaneously capture short-range and long-range dependencies.
  • DDGformer [33]: DDGformer captures the directional and relative positional relationships in traffic data using a direction- and distance-aware self-attention module. It also uses a dynamically enhanced adaptive graph convolution network to capture dynamic patterns in traffic systems.
  • STGAFormer [32]: STGAFormer effectively integrates both local and global dynamic spatio-temporal features and employs a distance-based self-attention module to capture critical features between different regions. The model incorporates multidimensional inputs, including traffic flow attributes, periodicity, proximity adjacency matrices, and adaptive adjacency matrices, to better capture the spatio-temporal characteristics of traffic flow.

5.3. Experimental Settings

In this experiment, following current mainstream traffic flow prediction methods, data from the past 12 time steps (one hour) are used to predict the traffic flow for the next 12 time steps. The dataset is divided into training, validation, and test sets in a 6:2:2 ratio to fully evaluate the generalization ability of the model. The experiments were conducted on an NVIDIA RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA). The batch size was set to 16 for PeMS03, PeMS04, and PeMS08, and to 6 for PeMS07, with 200 epochs of training to ensure sufficient model convergence and to avoid overfitting. The model's spatial–temporal encoder comprises 6 layers (L), with the hidden dimension (d) set to 64. We use the AdamW optimizer with an initial learning rate of 0.001 and a weight decay mechanism to effectively prevent overfitting. In addition, overfitting is further prevented during training by an early stopping criterion, which halts training once the validation performance stops improving.
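The early-stopping criterion used here can be sketched as a minimal patience counter; the patience value itself is an assumption, as the text does not report it:

```python
import numpy as np

class EarlyStopping:
    """Stop training once validation loss stops improving for `patience`
    consecutive epochs (patience=10 is an assumed, not reported, value)."""

    def __init__(self, patience=10):
        self.patience, self.best, self.bad = patience, np.inf, 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0   # new best: reset counter
        else:
            self.bad += 1                        # no improvement this epoch
        return self.bad >= self.patience         # True -> halt training
```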

5.4. Evaluation Metrics

To comprehensively assess the model performance, we introduced three commonly used evaluation metrics: mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE). To ensure the accuracy of the results, we filtered the missing data.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$$
Here, y i and y ^ i denote the actual and predicted traffic flow values at node i for a given time step, respectively. The variable n represents the total number of nodes.
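These metrics, with missing observations filtered out as described, can be computed as follows; treating zero readings as missing is an assumption about how the filtering is implemented:

```python
import numpy as np

def masked_metrics(y_true, y_pred):
    """MAE / RMSE / MAPE over valid observations only.

    Positions where y_true is zero are treated as missing and excluded,
    which also avoids division by zero in MAPE."""
    mask = y_true > 0
    err = y_true[mask] - y_pred[mask]
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = np.abs(err / y_true[mask]).mean() * 100
    return mae, rmse, mape
```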

5.5. Comparison Results

Table 2 presents the performance of our proposed MMHFormer model compared to other baseline models on next-hour predictions across four real-world datasets. The best-performing results are highlighted in bold, while the second-best results are underlined. To validate the superiority of the model, most baseline results were obtained from official documentation and related studies [28]. Our model achieved the best performance across all metrics on the four datasets: PeMS03, PeMS04, PeMS07, and PeMS08.
To evaluate the long-term prediction capability of the model, we visualized its MAE, RMSE, and MAPE across the next 12 steps, as shown in Figure 5. Compared to other models, our model demonstrates smaller variations in all metrics throughout the prediction horizon, highlighting its superior stability and robustness. This consistent performance further confirms its suitability for long-term prediction tasks.
DDGformer, PDFormer, and STGAFormer are three representative attention-based models. However, they fuse features across different dimensions through simple addition during data input embedding. In contrast, MMHFormer employs a concatenation approach across dimensions, which not only preserves the complete representation of spatial–temporal features but also incorporates traffic occupancy as an additional data source, enabling a more accurate representation of real-world traffic conditions. Additionally, the introduction of a gating mechanism allows the model to dynamically adjust its focus on each dimension under different scenarios, thereby effectively retaining and utilizing information from each dimension.
Although PDFormer and STGAFormer are designed to address short-range and long-range spatial correlations, their focus is limited to local spatial relationships, failing to adequately capture global spatial features. In contrast, MMHFormer introduces two masking matrices to capture global, local, and long-distance dynamic similarity patterns of nodes simultaneously, enabling multi-view modeling of spatial features. Furthermore, in the temporal dimension, other models use single temporal attention to capture temporal features, which may be insufficient to adapt to significant temporal heterogeneity. In contrast, MMHFormer employs a hierarchical two-stage temporal attention mechanism to flexibly adjust to the temporal features of different nodes, dynamically adapting to varying traffic patterns.
To further demonstrate the effectiveness and practicality of MMHFormer, the computational complexity of the model is also analyzed in this section. For STGAFormer's spatial–temporal encoder, the distance-based spatial self-attention module requires $O(4N^2 d_{\text{model}} + 2N^2)$ operations, the gated temporal self-attention module consumes $O(2T^2 d_{\text{model}} + T^2)$, and the position-wise feed-forward network contributes $O(NT d_{\text{model}}^2)$. Consequently, the overall time complexity of STGAFormer's spatial–temporal encoder amounts to $O\big(l(4N^2 d_{\text{model}} + 2T^2 d_{\text{model}} + T^2 + 2N^2 + NT d_{\text{model}}^2)\big)$, where $l$ denotes the number of encoding layers, $T$ the number of time steps, $N$ the number of sensors in the traffic network, and $d_{\text{model}}$ the hidden dimension size. In the MMHFormer framework, the hierarchical multi-view spatial attention module has a complexity of $O(6N^2 d_{\text{model}} + 3N^2)$, the hierarchical two-stage temporal attention mechanism requires $O(2T^2 d_{\text{model}} + 2T^2)$ operations, and the feed-forward network maintains $O(NT d_{\text{model}}^2)$ complexity, resulting in a total time complexity of $O\big(l(6N^2 d_{\text{model}} + 2T^2 d_{\text{model}} + 2T^2 + 3N^2 + NT d_{\text{model}}^2)\big)$.
While MMHFormer demonstrates a moderately higher theoretical complexity compared to the best-performing baseline, STGAFormer, primarily due to its multi-view spatial attention design, this increased computational overhead is strategically justified. The hierarchical architecture enables more comprehensive spatial–temporal representation learning by simultaneously capturing global, local, and dynamic similarity patterns. The model remains practically feasible for real-world applications, as the polynomial complexity is manageable for typical urban traffic networks where N and T are constrained by physical infrastructure. The additional computational investment results in significant improvements in prediction accuracy and model interpretability, as shown by the experimental results across multiple benchmark datasets.

5.6. Long-Range Forecasting

To further evaluate the long-horizon forecasting capability of different models, we conduct experiments on the PeMS04 and PeMS08 datasets under an extended prediction setting. Specifically, we increase the prediction horizons to 36, 48, and 60 steps and report the performance at each horizon as well as their average.
When the horizon is extended to 60 steps, models that perform competitively at short horizons, such as DCRNN, GraphWaveNet, and MTGNN, exhibit pronounced degradation in MAE, RMSE, and MAPE. In contrast, attention-based architectures such as GMAN and the Transformer-based PDFormer show greater robustness and consistently outperform the recurrent and convolutional baselines, highlighting the advantage of explicitly modeling global temporal dependencies for long-range traffic forecasting. The detailed numerical results are summarized in Table 3.
Compared with these baselines, MMHFormer achieves the lowest errors across all long horizons on both PeMS04 and PeMS08. On PeMS04, it reduces the average long-horizon MAE by about 4% compared with PDFormer, while on PeMS08 it yields relative improvements of around 8–9% at the most challenging 60-step horizon. These results demonstrate that the multi-source gated embedding, hierarchical multi-view spatial attention, and two-stage temporal attention enable MMHFormer to effectively capture long-range spatial–temporal dependencies, resulting in consistently superior long-horizon prediction performance.

5.7. Ablation Study

The MMHFormer model comprises three key modules: the multi-source gated embedding layer, the hierarchical multi-view spatial attention module, and the hierarchical two-stage temporal attention module. To validate the effectiveness of each component, we conducted ablation experiments on the PeMS08 dataset. We compared MMHFormer against the following variants:
  • MMHFormer w/o multi-source embedding: This variant removes all auxiliary spatial, temporal, and traffic occupancy embeddings, using only raw traffic flow data as input.
  • MMHFormer w/o gated embedding: This variant retains all the embedding sources but removes the gating mechanism, instead fusing the features through simple addition.
  • MMHFormer w/o hierarchical multi-view spatial attention: The multi-view spatial attention mechanism is replaced by a global spatial attention mechanism for direct spatial feature extraction.
  • MMHFormer w/o hierarchical two-stage temporal attention: The second-stage temporal attention structure is removed, and only the first-stage temporal attention mechanism is retained.
Figure 6 illustrates the comparison results of these variants. We can observe that removing the multi-source embedding leads to a significant increase in MAE, RMSE, and MAPE, highlighting the importance of integrating diverse embeddings, such as spatial Laplacian, temporal periodic, and traffic occupancy, in effectively capturing spatial–temporal dependencies. Replacing the gated embedding mechanism with simple feature addition results in a slight performance decline, suggesting that the gating mechanism is crucial for dynamically adjusting feature contributions, thereby improving model accuracy.
The inclusion of the hierarchical multi-view spatial attention module enables MMHFormer to account for global spatial dependencies while also capturing geographic and dynamic similarity-based spatial dependencies. When this module is removed, the RMSE, MAE, and MAPE metrics increase, underscoring the importance of multi-view spatial information for modeling the spatial dependencies of traffic flow.
The hierarchical two-stage temporal attention module also proves highly impactful. Removing the second-stage temporal attention structure increases RMSE, MAE, and MAPE, demonstrating its effectiveness in capturing temporal dependencies within traffic flows. This module is particularly advantageous in handling sudden traffic flow fluctuations and varying temporal patterns across nodes, enhancing the model's adaptability to diverse traffic patterns at different nodes.

5.8. Parameter Sensitivity Analysis

To further investigate the influence of different parameter settings on our proposed model for the traffic forecasting task, we conduct a parameter sensitivity analysis for MMHFormer. Specifically, we explored various values for each hyperparameter within predefined search spaces: [4,5,6,7] for the geographic distance threshold λ, and [6,7,8,9] for the number of nearest neighbors K in the dynamic similarity mask matrix. These hyperparameters were tuned using only the training set, ensuring no information leakage during training or inference. This analysis allowed us to evaluate the impact of different configurations on the performance of our MMHFormer model.
The results, as shown in Figure 7, reveal the following observations: (1) The geographic distance threshold λ = 5 best preserves spatial dependencies, enabling the model to capture relevant relationships without unnecessary complexity. Smaller values (e.g., λ = 4 ) place too much emphasis on local dependencies, resulting in a loss of broader spatial context, while larger values (e.g., λ = 7 ) lead to overfitting and reduce efficiency by including irrelevant or weak connections. (2) Increasing the number of nearest neighbors K based on FFT-calculated similarity improves model performance up to K = 7 . Beyond this point, further increases provide minimal gains. Specifically, when K is smaller than 7 (e.g., K = 6 ), the model may fail to capture long-range dependencies adequately, leading to underfitting and a less accurate prediction of traffic flow. On the other hand, when K exceeds 7 (e.g., K = 8 or K = 9 ), the model becomes overly sensitive to noise, which can lead to overfitting and reduced generalization ability. A value of K = 7 strikes a balance, capturing both local and global patterns effectively while avoiding unnecessary complexity.
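The K-nearest-neighbor mask based on FFT-calculated similarity can be sketched as follows. The cosine similarity over magnitude spectra is an illustrative choice, since the text does not spell out the exact similarity measure:

```python
import numpy as np

def topk_similarity_mask(X, K):
    """Dynamic-similarity mask: compare nodes by the magnitude spectra of
    their historical series (FFT) and keep each node's K most similar peers.

    X: (T, N) historical traffic flow; returns a boolean (N, N) mask."""
    spec = np.abs(np.fft.rfft(X, axis=0))               # (F, N) magnitude spectra
    spec /= np.linalg.norm(spec, axis=0, keepdims=True) + 1e-8
    sim = spec.T @ spec                                  # cosine similarity (N, N)
    np.fill_diagonal(sim, -np.inf)                       # exclude self-matches
    idx = np.argsort(-sim, axis=1)[:, :K]                # K nearest neighbors
    mask = np.zeros_like(sim, dtype=bool)
    rows = np.arange(sim.shape[0])[:, None]
    mask[rows, idx] = True
    return mask
```

With K = 7, each node's attention is restricted to its seven most spectrally similar peers, regardless of geographic distance.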

5.9. Traffic Occupancy Embedding

To highlight the importance of traffic occupancy data as an input embedding, this study conducts an in-depth analysis of the daily variations in both traffic flow and traffic occupancy. Twenty specific sensors were selected from the PeMS08 dataset, and heatmaps of both traffic flow and traffic occupancy were generated. Figure 8 visually demonstrates that, during peak periods (such as 7–9 a.m. and 5–7 p.m.), both traffic flow and traffic occupancy reach peak values at multiple nodes simultaneously. These significant peak regions indicate strong synchronization between flow and road occupancy during peak hours. Moreover, data from Node 16 shows that, although traffic flow decreases during certain periods, traffic occupancy remains relatively high. This may be due to factors such as reduced speed and increased vehicle density, leading to localized congestion. In such cases, relying solely on traffic flow may not accurately reflect congestion conditions. Traffic occupancy, as a supplementary indicator, can assist the model in identifying abnormal features under varying congestion states. Additionally, variations in traffic flow and occupancy across different nodes may reflect the features of the roads in the region, providing the model with the ability to identify differences between nodes. Therefore, embedding traffic occupancy into the model helps capture the complex spatial–temporal dependencies of traffic flow, thereby improving the model’s prediction accuracy and adaptability.

5.10. Case Study

To gain insights into the model's spatial reasoning capabilities, we visualized the normalized attention weights of the geographic spatial attention during the morning peak (9:00–10:00) and the dynamic similarity spatial attention during the off-peak period (15:00–16:00). Figure 9 demonstrates that the two attention mechanisms capture distinct spatial patterns. During the morning peak (9:00–10:00), traffic congestion is typically higher on major arterial roads and key intersections, as people commute to work or school. The geographic spatial attention mechanism captures this phenomenon by emphasizing the nodes (roads or intersections) whose immediate neighbors are highly trafficked, thus modeling the flow patterns in densely populated areas. As seen in Figure 9a, the attention heatmap predominantly highlights these localized clusters of road nodes, reflecting the high traffic density and the strong temporal dependencies at this time. The corresponding geospatial mask (Figure 9b) further refines this attention by suppressing connections that are distant or irrelevant to the immediate traffic flow. Dynamic similarity spatial attention (Figure 9c) reveals interesting long-range dependencies between nodes sharing similar traffic patterns, such as residential areas that experience synchronized morning outbound traffic flows. Similarly, by comparing the dynamic similarity attention heatmap (Figure 9c) with its mask matrix (Figure 9d), we observe that the mask strategically preserves connections between functionally similar nodes regardless of geographical distance, while filtering out irrelevant correlations. This enables the model to identify and leverage synchronized traffic behaviors across the network, capturing complex urban mobility patterns that transcend physical connectivity.

5.11. Forecasting Results and Visualization

After selecting two specific nodes from the PeMS08 dataset, we visualized the prediction results for the test set (horizon = 12), as shown in Figure 10a. The true traffic flow curve exhibits relatively smooth variations with a clear trend. In this scenario, the model performs exceptionally well, almost perfectly capturing the overall trend of traffic flow, especially at peak and trough points. Furthermore, as shown in Figure 10b, despite the larger and more frequent fluctuations in the actual traffic flow, the model successfully identifies various abrupt changes and responds quickly, closely approximating the true values even in high-frequency fluctuation intervals. This stable prediction of extreme fluctuations further highlights the model’s superiority in handling complex and unstable traffic patterns.

6. Conclusions

In this paper, we proposed MMHFormer, a novel multi-source and multi-view hierarchical Transformer model designed for traffic flow prediction, effectively tackling the challenges of complex spatial–temporal dependencies, dynamic variations, and external influences in traffic networks. MMHFormer incorporates a multi-source gated embedding layer to dynamically fuse multidimensional features, including spatial, temporal, and external conditions, enhancing the representation of traffic scenarios. Additionally, it employs a hierarchical multi-view spatial attention module to capture global, local, and similarity-based spatial dependencies, effectively addressing spatial heterogeneity. To further improve adaptability, the model leverages a hierarchical two-stage temporal attention mechanism, which models global temporal patterns while adapting to node-specific variations. Extensive experiments on four benchmark datasets (PeMS03, PeMS04, PeMS07, and PeMS08) demonstrate that MMHFormer consistently outperforms state-of-the-art methods across various evaluation metrics, including MAE, RMSE, and MAPE. The model also exhibits remarkable long-term prediction stability and effectiveness, making it a reliable tool for intelligent transportation systems. Future work will aim to enhance MMHFormer by incorporating a broader range of external contextual features, such as various weather conditions and traffic incidents, and extending its capabilities to support multimodal data inputs.

Author Contributions

Conceptualization, H.W. (Han Wu) and G.T.; methodology, H.W. (Han Wu) and G.T.; software, H.W. (Hao Wu); validation, H.W. (Han Wu), G.T., and Z.Q.; formal analysis, H.W. (Hao Wu); investigation, G.T.; data curation, M.Z.; writing—original draft preparation, H.W. (Han Wu); writing—review and editing, H.W. (Han Wu), G.T., and Z.Q.; visualization, M.Z.; supervision, M.Z.; project administration, G.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, H.; Zhu, C.; Zhang, D.; Li, Q. Multi-Scale Spatial-Temporal Recurrent Networks for Traffic Flow Prediction. arXiv 2023, arXiv:2310.08138. [Google Scholar] [CrossRef]
  2. Kim, K.; Jin, S.; Ko, S.; Choo, J. Stgrat: A spatio-temporal graph attention network for traffic forecasting. In Proceedings of the International Conference on Information and Knowledge Management, Online, 19–23 October 2020. [Google Scholar]
  3. Williams, B.M.; Hoel, L.A. Modeling and Forecasting Vehicular Traffic Flow as a Seasonal ARIMA Process: Theoretical Basis and Empirical Results. J. Transp. Eng. 2003, 129, 664–672. [Google Scholar] [CrossRef]
  4. Lu, Z.; Zhou, C.; Wu, J.; Jiang, H.; Cui, S. Integrating granger causality and vector auto-regression for traffic prediction of large-scale WLANs. KSII Trans. Internet Inf. Syst. (TIIS) 2016, 10, 136–151. [Google Scholar] [CrossRef]
  5. Dhiman, H.S.; Deb, D.; Guerrero, J.M. Hybrid machine intelligent SVR variants for wind forecasting and ramp events. Renew. Sustain. Energy Rev. 2019, 108, 369–379. [Google Scholar] [CrossRef]
  6. Hochreiter, S. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 1998, 06, 107–116. [Google Scholar] [CrossRef]
  7. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
  8. Xu, J.; Song, R.; Wei, H.; Guo, J.; Zhou, Y.; Huang, X. A fast human action recognition network based on spatio-temporal features. Neurocomputing 2021, 441, 350–358. [Google Scholar] [CrossRef]
  9. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  10. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Trans. Intell. Transp. Syst. 2020, 21, 3848–3858. [Google Scholar] [CrossRef]
  11. Huang, R.; Li, P. Hub-hub connections matter: Improving edge dropout to relieve over-smoothing in graph neural networks. Knowl. Based Syst. 2023, 270, 110556. [Google Scholar] [CrossRef]
  12. Zheng, G.; Chai, W.K.; Duanmu, J.L.; Katos, V. Hybrid deep learning models for traffic prediction in large-scale road networks. Inf. Fusion 2023, 92, 93–114. [Google Scholar] [CrossRef]
  13. Hamed, M.M.; Al-Masaeid, H.R.; Said, Z.M.B. Short-Term Prediction of Traffic Volume in Urban Arterials. J. Transp. Eng. 1995, 121, 249–254. [Google Scholar] [CrossRef]
  14. Sun, Y.; Leng, B.; Guan, W. A novel wavelet-SVM short-time passenger flow prediction in Beijing subway system. Neurocomputing 2015, 166, 109–121. [Google Scholar] [CrossRef]
  15. Sun, B.; Cheng, W.; Goswami, P.; Bai, G. Flow-aware WPT k-nearest neighbours regression for short-term traffic prediction. In Proceedings of the 2017 IEEE Symposium on Computers and Communications (ISCC), Heraklion, Greece, 3–6 July 2017; pp. 48–53. [Google Scholar] [CrossRef]
  16. Liu, Y.; Zheng, H.; Feng, X.; Chen, Z. Short-term traffic flow prediction with Conv-LSTM. In Proceedings of the 2017 9th International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, 11–13 October 2017; pp. 1–6. [Google Scholar] [CrossRef]
  17. Lv, Z.; Xu, J.; Zheng, K.; Yin, H.; Zhao, P.; Zhou, X. LC-RNN: A deep learning model for traffic speed prediction. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, Stockholm, Sweden, 13–19 July 2018; pp. 3470–3476. [Google Scholar]
  18. Yu, B.; Yin, H.; Zhu, Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, Stockholm, Sweden, 13–19 July 2018; pp. 3634–3640. [Google Scholar] [CrossRef]
  19. Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting. Proc. AAAI Conf. Artif. Intell. 2019, 33, 922–929. [Google Scholar] [CrossRef]
  20. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph WaveNet for Deep Spatial-Temporal Graph Modeling. arXiv 2019, arXiv:1906.00121. [Google Scholar] [CrossRef]
  21. Bai, L.; Yao, L.; Li, C.; Wang, X.; Wang, C. Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 17804–17815. [Google Scholar]
  22. Guo, K.; Hu, Y.; Sun, Y.; Qian, S.; Gao, J.; Yin, B. Hierarchical Graph Convolution Network for Traffic Forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 151–159. [Google Scholar] [CrossRef]
  23. Ye, Y.; Xiao, Y.; Zhou, Y.; Li, S.; Zang, Y.; Zhang, Y. Dynamic multi-graph neural network for traffic flow prediction incorporating traffic accidents. Expert Syst. Appl. 2023, 234, 121101. [Google Scholar] [CrossRef]
  24. Yin, X.; Zhang, W.; Jing, X. Static-dynamic collaborative graph convolutional network with meta-learning for node-level traffic flow prediction. Expert Syst. Appl. 2023, 227, 120333. [Google Scholar] [CrossRef]
  25. Huang, X.; Ye, Y.; Yang, X.; Xiong, L. Multi-view dynamic graph convolution neural network for traffic flow prediction. Expert Syst. Appl. 2023, 222, 119779. [Google Scholar] [CrossRef]
  26. Zhong, W.; Suo, Q.; Jia, X.; Zhang, A.; Su, L. Heterogeneous Spatio-Temporal Graph Convolution Network for Traffic Forecasting with Missing Values. In Proceedings of the 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), Washington, DC, USA, 7–10 July 2021; pp. 707–717. [Google Scholar] [CrossRef]
  27. Xu, J.; Li, Y.; Lu, W.; Wu, S.; Li, Y. A heterogeneous traffic spatio-temporal graph convolution model for traffic prediction. Phys. A Stat. Mech. Its Appl. 2024, 641, 129746. [Google Scholar] [CrossRef]
  28. Wu, M.; Lin, Y.; Jiang, T.; Weng, W. MHGNet: Multi-Heterogeneous Graph Neural Network for Traffic Prediction. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
  29. Guo, S.; Lin, Y.; Wan, H.; Li, X.; Cong, G. Learning Dynamics and Heterogeneity of Spatial-Temporal Graph Data for Traffic Forecasting. IEEE Trans. Knowl. Data Eng. 2022, 34, 5415–5428. [Google Scholar] [CrossRef]
  30. Jiang, J.; Han, C.; Zhao, W.X.; Wang, J. PDFormer: Propagation Delay-Aware Dynamic Long-Range Transformer for Traffic Flow Prediction. Proc. AAAI Conf. Artif. Intell. 2023, 37, 4365–4373. [Google Scholar] [CrossRef]
  31. Cai, J.; Wang, C.H.; Hu, K. LCDFormer: Long-term correlations dual-graph transformer for traffic forecasting. Expert Syst. Appl. 2024, 249, 123721. [Google Scholar] [CrossRef]
  32. Geng, Z.; Xu, J.; Wu, R.; Zhao, C.; Wang, J.; Li, Y.; Zhang, C. STGAFormer: Spatial–temporal Gated Attention Transformer based Graph Neural Network for traffic flow forecasting. Inf. Fusion 2024, 105, 102228. [Google Scholar] [CrossRef]
  33. Li, Y.; Xu, H.; Zhang, T.; Li, X.; Li, G.; Tian, W. DDGformer: Direction- and distance-aware graph transformer for traffic flow prediction. Knowl. Based Syst. 2024, 302, 112381. [Google Scholar] [CrossRef]
  34. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. arXiv 2018, arXiv:1707.01926. [Google Scholar] [CrossRef]
  35. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, Virtual, 6–10 July 2020; pp. 753–763. [Google Scholar] [CrossRef]
  36. Zheng, C.; Fan, X.; Wang, C.; Qi, J. GMAN: A Graph Multi-Attention Network for Traffic Prediction. Proc. AAAI Conf. Artif. Intell. 2020, 34, 1234–1241. [Google Scholar] [CrossRef]
  37. Li, M.; Zhu, Z. Spatial-Temporal Fusion Graph Neural Networks for Traffic Flow Forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 4189–4196. [Google Scholar] [CrossRef]
  38. Shao, Z.; Zhang, Z.; Wang, F.; Wei, W.; Xu, Y. Spatial-Temporal Identity: A Simple yet Effective Baseline for Multivariate Time Series Forecasting. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM ’22, Atlanta, GA, USA, 17–21 October 2022; pp. 4454–4458. [Google Scholar] [CrossRef]
  39. Xu, Y.; Han, L.; Zhu, T.; Sun, L.; Du, B.; Lv, W. Generic Dynamic Graph Convolutional Network for traffic flow forecasting. Inf. Fusion 2023, 100, 101946. [Google Scholar] [CrossRef]
Figure 1. Observations from traffic data. (a) Periodicity of the data, (b) Long-range spatial dependencies between A and D, (c) Dynamic spatial dependencies between A and B, (d) Dynamic spatial dependencies between B and C, (e) Connectivity between nodes, where D and A exhibit similar traffic flow patterns.
Figure 2. The overall architecture of the MMHFormer, which contains L identical layers.
Figure 3. Hierarchical multi-view spatial attention module.
Figure 4. Cross-Node temporal attention module.
Figure 5. Comparison of single-step prediction on different datasets. (a) MAE on PeMS08, (b) RMSE on PeMS08, (c) MAPE on PeMS08, (d) MAE on PeMS04, (e) RMSE on PeMS04, (f) MAPE on PeMS04.
Figure 6. Ablation study of key designs in MMHFormer. (a) MAE on PeMS08, (b) RMSE on PeMS08, (c) MAPE on PeMS08, (d) Color legend mapping each color to a model variant.
Figure 7. Experimental results of the hyperparameter study on the PeMS08 dataset. (a) Effect of geographic distance threshold λ on model performance; (b) Effect of number of nearest neighbors K on model performance.
Figure 8. Heat maps of traffic flow and traffic occupancy of MMHFormer on PeMS08 dataset. (a) Daily traffic flow; (b) Daily traffic occupancy.
Figure 9. Visualization of multi-view spatial attention weights on the PeMS08 dataset. Darker colors in the heat maps denote higher attention weights, indicating stronger spatial dependencies; black regions in the mask matrices mark unselected areas. (a) Heat maps of geographic spatial attention during the morning peak, (b) Visualization of the geospatial mask matrix, (c) Heat maps of dynamic similarity spatial attention during the off-peak period, (d) Visualization of the dynamic similarity mask matrix.
Figure 10. Visualization of prediction results of MMHFormer on PeMS08 dataset. (a) Predicted traffic flow with smooth trends; (b) Predicted traffic flow with abrupt changes and fluctuations.
Table 1. The detailed information of datasets.

| Dataset | Sensors | Edges | Time Range | Time Steps |
|---------|---------|-------|------------|------------|
| PeMS03 | 358 | 547 | September 2018–November 2018 | 26,208 |
| PeMS04 | 307 | 340 | January 2018–February 2018 | 16,992 |
| PeMS07 | 883 | 866 | May 2017–August 2017 | 28,224 |
| PeMS08 | 170 | 295 | July 2016–August 2016 | 17,856 |
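The Time Steps column can be sanity-checked directly from the time ranges: PeMS data are aggregated at 5-minute intervals, giving 288 steps per day. A minimal sketch (day counts are taken from the Time Range column of Table 1; the 5-minute interval is the standard PeMS aggregation):

```python
# One day of 5-minute aggregates contains 24 * 60 / 5 = 288 time steps.
STEPS_PER_DAY = 24 * 60 // 5  # 288

def expected_steps(num_days: int) -> int:
    """Expected number of time steps for a range spanning `num_days` full days."""
    return num_days * STEPS_PER_DAY

# PeMS03: 1 Sep - 30 Nov 2018 -> 30 + 31 + 30 = 91 days
print(expected_steps(91))  # 26208, matching Table 1
# PeMS04: 1 Jan - 28 Feb 2018 -> 31 + 28 = 59 days
print(expected_steps(59))  # 16992, matching Table 1
# PeMS08: 1 Jul - 31 Aug 2016 -> 31 + 31 = 62 days
print(expected_steps(62))  # 17856, matching Table 1
```

The same arithmetic implies PeMS07 covers 28,224 / 288 = 98 days within May–August 2017.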
Table 2. Comparison Table of Model Performance. Bold: best; underline: second best.

| Model | PeMS03 MAE | PeMS03 RMSE | PeMS03 MAPE (%) | PeMS04 MAE | PeMS04 RMSE | PeMS04 MAPE (%) | PeMS07 MAE | PeMS07 RMSE | PeMS07 MAPE (%) | PeMS08 MAE | PeMS08 RMSE | PeMS08 MAPE (%) |
|-------|------------|-------------|-----------------|------------|-------------|-----------------|------------|-------------|-----------------|------------|-------------|-----------------|
| SVR [5] | 27.40 | 26.46 | 44.51 | 28.66 | 44.59 | 19.15 | 32.97 | 50.15 | 15.43 | 23.25 | 36.15 | 14.71 |
| VAR [4] | 23.65 | 38.26 | 24.51 | 24.54 | 38.61 | 17.24 | 50.22 | 75.63 | 32.22 | 19.19 | 29.81 | 13.10 |
| DCRNN [34] | 17.99 | 30.31 | 18.34 | 21.22 | 33.44 | 14.17 | 25.22 | 38.61 | 11.82 | 16.82 | 26.36 | 10.92 |
| GraphWaveNet [20] | 19.12 | 32.77 | 18.89 | 39.66 | 31.72 | 17.29 | 26.39 | 41.50 | 11.97 | 18.28 | 30.05 | 12.15 |
| AGCRN [21] | 15.98 | 28.25 | 15.23 | 19.83 | 32.26 | 12.97 | 22.37 | 36.55 | 9.12 | 15.95 | 25.22 | 10.09 |
| STGCN [18] | 17.55 | 30.42 | 17.34 | 21.16 | 34.89 | 13.83 | 25.33 | 39.34 | 11.21 | 17.50 | 27.09 | 11.29 |
| MTGNN [35] | 15.85 | 26.23 | 15.55 | 19.08 | 31.56 | 12.96 | 20.82 | 34.09 | 9.03 | 15.40 | 24.93 | 10.17 |
| ASTGCN [19] | 17.34 | 29.56 | 17.21 | 22.93 | 35.22 | 16.56 | 24.01 | 37.87 | 10.73 | 18.25 | 28.06 | 11.64 |
| STFGNN [37] | 16.77 | 28.34 | 16.30 | 19.83 | 31.88 | 13.02 | 22.07 | 35.80 | 9.21 | 16.64 | 26.22 | 10.60 |
| GDGCN [39] | 14.66 | 24.30 | 13.94 | 18.44 | 29.79 | 12.52 | 20.15 | 33.21 | 8.50 | 14.82 | 23.87 | 9.35 |
| GMAN [36] | 16.87 | 27.92 | 18.23 | 19.14 | 31.60 | 13.19 | 20.96 | 34.10 | 9.05 | 15.31 | 24.92 | 10.13 |
| STID [38] | 15.33 | 27.40 | 16.40 | 18.29 | 29.82 | 12.49 | 19.54 | 32.82 | 8.25 | 14.20 | 23.49 | 9.28 |
| PDFormer [30] | 14.94 | 25.39 | 15.82 | 18.32 | 29.97 | 12.10 | 19.83 | 32.87 | 8.53 | 13.58 | 23.51 | 9.05 |
| DDGFormer [33] | 15.01 | 24.95 | 15.89 | 18.04 | 30.06 | 11.76 | 18.99 | 32.25 | 7.93 | 13.37 | 23.15 | 8.83 |
| STGAFormer [32] | 14.56 | 24.94 | 14.69 | 18.18 | 29.78 | 11.98 | 19.65 | 32.62 | 8.45 | 13.06 | 22.43 | 8.87 |
| MMHFormer (ours) | 14.33 | 24.80 | 14.68 | 17.99 | 29.53 | 11.64 | 18.74 | 32.10 | 7.79 | 12.94 | 22.30 | 8.77 |
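For reference, the three error metrics reported above are the standard MAE, RMSE, and MAPE. A minimal NumPy sketch (the near-zero masking in `mape` is a common convention for traffic data and may differ from the paper's exact implementation):

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean Absolute Error
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    # Root Mean Squared Error
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mape(y_true, y_pred, eps=1e-3):
    # Mean Absolute Percentage Error, computed only where the ground truth
    # is non-negligible; masking near-zero flows avoids division blow-ups.
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    mask = np.abs(y_true) > eps
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)

y = np.array([100.0, 200.0, 300.0])      # ground-truth flows (illustrative values)
yhat = np.array([110.0, 190.0, 330.0])   # predictions
print(mae(y, yhat))   # 16.666... = (10 + 10 + 30) / 3
print(mape(y, yhat))  # 8.333...% = (10% + 5% + 10%) / 3
```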
Table 3. Performance on long-range traffic flow forecasting. Bold: best; underline: second best.

| Dataset | Method | MAE@36 | RMSE@36 | MAPE@36 (%) | MAE@48 | RMSE@48 | MAPE@48 (%) | MAE@60 | RMSE@60 | MAPE@60 (%) | Avg. MAE | Avg. RMSE | Avg. MAPE (%) |
|---------|--------|--------|---------|-------------|--------|---------|-------------|--------|---------|-------------|----------|-----------|---------------|
| PeMS04 | DCRNN [34] | 23.67 | 37.62 | 16.03 | 24.47 | 38.86 | 16.81 | 25.39 | 39.92 | 18.10 | 22.79 | 36.19 | 15.70 |
| PeMS04 | STGCN [18] | 26.85 | 41.78 | 18.32 | 27.27 | 42.40 | 18.74 | 28.90 | 44.54 | 19.65 | 26.66 | 41.58 | 18.30 |
| PeMS04 | GraphWaveNet [20] | 24.25 | 39.13 | 17.94 | 24.32 | 39.29 | 18.01 | 24.98 | 39.97 | 18.48 | 24.26 | 39.20 | 17.78 |
| PeMS04 | MTGNN [35] | 23.18 | 37.12 | 15.63 | 23.78 | 37.80 | 16.17 | 25.11 | 39.16 | 17.65 | 22.37 | 35.88 | 15.25 |
| PeMS04 | AGCRN [21] | 23.03 | 36.76 | 15.72 | 23.47 | 37.65 | 16.38 | 24.93 | 39.38 | 17.87 | 22.29 | 35.58 | 15.72 |
| PeMS04 | GMAN [36] | 22.45 | 36.91 | 16.07 | 22.72 | 39.13 | 16.33 | 23.20 | 40.48 | 17.00 | 22.26 | 35.66 | 15.93 |
| PeMS04 | PDFormer [30] | 22.18 | 35.57 | 15.26 | 22.45 | 36.40 | 15.33 | 27.64 | 37.73 | 15.91 | 21.34 | 34.55 | 14.60 |
| PeMS04 | MMHFormer (ours) | 21.01 | 35.38 | 14.73 | 21.49 | 35.19 | 14.95 | 22.28 | 36.11 | 15.53 | 20.43 | 33.34 | 14.43 |
| PeMS08 | DCRNN [34] | 20.48 | 31.73 | 13.73 | 21.38 | 33.03 | 14.56 | 23.12 | 35.22 | 16.35 | 20.02 | 30.93 | 13.58 |
| PeMS08 | STGCN [18] | 27.25 | 41.28 | 17.02 | 27.63 | 41.97 | 17.27 | 28.80 | 43.55 | 18.24 | 26.83 | 40.86 | 16.77 |
| PeMS08 | GraphWaveNet [20] | 21.27 | 35.06 | 13.75 | 21.54 | 35.76 | 13.96 | 22.01 | 36.38 | 14.95 | 21.00 | 34.63 | 13.63 |
| PeMS08 | MTGNN [35] | 19.57 | 31.55 | 13.14 | 20.45 | 32.74 | 14.19 | 22.10 | 34.57 | 15.95 | 18.86 | 30.36 | 12.52 |
| PeMS08 | AGCRN [21] | 20.06 | 31.92 | 14.39 | 20.49 | 32.71 | 14.56 | 21.58 | 34.04 | 14.80 | 19.10 | 30.56 | 12.99 |
| PeMS08 | GMAN [36] | 17.69 | 30.69 | 13.75 | 18.13 | 31.20 | 14.06 | 18.95 | 32.73 | 14.52 | 17.48 | 30.14 | 13.56 |
| PeMS08 | PDFormer [30] | 17.61 | 28.98 | 12.22 | 18.24 | 29.87 | 12.80 | 19.52 | 31.40 | 14.12 | 16.93 | 28.05 | 11.83 |
| PeMS08 | MMHFormer (ours) | 16.89 | 28.72 | 11.54 | 17.29 | 29.73 | 11.96 | 17.98 | 31.18 | 12.78 | 16.26 | 27.49 | 11.11 |

Share and Cite

MDPI and ACS Style

Wu, H.; Teng, G.; Wu, H.; Qiu, Z.; Zhao, M. MMHFormer: Multi-Source and Multi-View Hierarchical Transformer for Traffic Flow Prediction. Appl. Sci. 2025, 15, 12804. https://doi.org/10.3390/app152312804


