1. Introduction
Traffic flow prediction has long been recognized as a fundamental capability supporting intelligent transportation systems (ITS) and efficient urban operation [
1]. However, its practical importance is not merely conceptual; it is quantitatively significant. According to global urban mobility reports, congestion costs in major metropolitan areas account for billions of dollars annually in lost productivity, excess fuel consumption, and environmental externalities. In many large cities, commuters lose tens to over one hundred hours per year due to congestion, corresponding to measurable economic losses at both individual and societal levels [
2]. Even marginal improvements in short-term traffic prediction accuracy have been shown to translate into measurable reductions in travel delay, improved signal coordination efficiency, and enhanced throughput at bottleneck intersections. Therefore, improving prediction reliability is directly linked to operational efficiency and economic performance.
With the rapid growth of private vehicle ownership—particularly in developing economies—urban road networks are operating increasingly close to or beyond capacity. Congestion frequently emerges at arterial corridors and critical junctions, where small disturbances can propagate nonlinearly through the network. A schematic diagram of urban traffic flow is shown in
Figure 1. In such high-sensitivity systems, accurate short-term prediction plays a pivotal role in adaptive signal control, ramp metering, congestion pricing, and dynamic route guidance. Prediction errors during peak periods may lead to suboptimal control strategies, compounding queue spillbacks and increasing accident risks.
Methodologically, traffic flow prediction research has evolved from classical time-series models such as ARIMA [3], which model traffic as a seasonal stationary process and provide interpretable parameter estimates but fail under abrupt non-stationary conditions, to deep learning approaches including RNN and LSTM architectures [4]. The latter work demonstrated that recurrent networks can learn long-term temporal dependencies in speed data from remote microwave sensors and substantially outperform statistical baselines. While these methods achieve satisfactory performance under relatively stable conditions, their robustness and generalization remain challenged by the intrinsic characteristics of urban traffic: randomness, strong nonlinearity, abrupt regime shifts, and complex spatiotemporal coupling. The diffusion convolutional recurrent framework, for example, highlighted the inadequacy of node-independent temporal models for spatially coupled networks. Moreover, Polson [5] demonstrated that deep learning models optimized purely for accuracy frequently suffer from prohibitive inference latency when scaled to city-wide sensor deployments, raising critical concerns about practical applicability.
Traffic flow prediction also interacts closely with related tasks such as travel time estimation, speed forecasting, density inference, queue dissipation analysis, and travel behavior modeling [
6]. As sensing infrastructure expands and data volumes grow, research trends increasingly emphasize multi-source data integration, cross-regional generalization, and joint optimization of accuracy and efficiency [
7,
8]. By grounding methodological innovation in measurable operational impact—such as delay reduction, throughput improvement, and congestion cost mitigation—the field can more clearly align theoretical advances with tangible societal benefits.
In this context, a review of traffic flow prediction methods has important theoretical and practical value [
9]. On the one hand, with the increasing complexity of transportation systems and the rapid expansion of data dimensions [
10], existing research has shown diversified development in terms of model structure, data utilization strategies, spatiotemporal feature modeling, and multi-task prediction. It is urgent to conduct a systematic review to clarify the research context, summarize the evolution of methods, and identify the advantages and limitations of various models. On the other hand, ITS is moving from local pilot projects to large-scale deployments, placing higher demands on the real-time performance, scalability, and adaptability of predictive models. This makes a comprehensive evaluation of the performance and applicability of existing methods crucial. By summarizing research hotspots and pointing out current challenges, we can not only provide a clear direction for subsequent research but also provide a basis for traffic management departments to rationally select predictive technologies in practical applications.
To better understand the methodological development reviewed in this paper, it is necessary to distinguish three developmental stages of traffic flow prediction research. The first generation (before 2020) mainly relied on statistical methods and shallow machine learning methods (such as ARIMA and SVM), which struggled to handle nonlinear spatiotemporal dependencies. The second generation (2020–2022) witnessed the rise of static graph neural network architectures (such as GCN and GAT) and early LSTM-based hybrid models, in which the graph structure was mostly predefined and the attention mechanism was first introduced as an auxiliary module. The third generation (2023–2025) is the focus of this review and is characterized by three key technological changes compared to previous work: (1) a shift from static graph topology to dynamically learned graph topology; (2) the adoption of federated learning as a privacy framework that can be used in production environments, rather than as a conceptual proposal; and (3) the systematic integration of Transformer-based self-attention mechanisms with graph neural network encoders to achieve large-scale parallel spatiotemporal modeling.
Table 1 summarizes representative existing reviews, which are discussed below. Liu et al. [
11] focused on traffic flow prediction technology in intelligent transportation systems, categorizing methods into three main types: statistical, machine learning, and deep learning. They analyzed the core principles, application scenarios, and advantages and disadvantages of each method, highlighting the irreplaceable advantages of deep learning in handling complex nonlinear relationships, particularly the superior prediction accuracy and generalization ability of hybrid neural networks compared to traditional methods. They also noted the challenges in model generalization across different scenarios and long-term prediction. Attioui et al. [
12], following the PRISMA 2020 guidelines, systematically reviewed the application of machine learning in traffic congestion prediction from 2010 to 2024. They selected 115 high-quality studies from 9695 records, emphasizing the dominant role of deep learning and supervised learning. They analyzed the distribution of research by road type, vehicle type, and prediction cycle, while also pointing out the insufficient application of reinforcement learning and the lack of research on rural roads, providing a current status reference for research in this field. Kong et al. [
13] focused on time series forecasting, covering multiple application areas such as traffic flow. They divided deep learning model architectures into five paradigms, summarized feature extraction methods such as dimensionality decomposition and time-frequency transformation, compiled relevant datasets, and deeply analyzed data privacy and model interpretability issues. They also provided an outlook on future directions such as representation learning and causal inference, offering a systematic framework for cross-domain time series forecasting research. Annarita et al. [
14] comprehensively reviewed the development of traffic flow forecasting technology, categorizing methods into four types: naive techniques, parametric methods, traffic simulation techniques, and nonparametric models. They analyzed the theoretical foundations and practical applications, emphasizing the role of artificial intelligence in dynamic and accurate forecasting, and pointing out that spatiotemporal modeling and real-time data fusion are the main future development trends, providing a panoramic reference for researchers and policymakers. Shahriar et al. [
15] focused on the application of deep learning algorithms and classic models in traffic forecasting. They introduced the principles of deep learning models such as LSTM and CNN, as well as classic models such as Kalman filtering and ARIMA, and compared their performance in traffic flow, speed, and congestion forecasting, providing guidance for model selection under different needs. Bernardo et al. [
16] analyzed traffic flow prediction and classification research in Europe over the past five years, elucidating the application of historical and real-time data, outlining data preprocessing techniques, comparing the effectiveness of methods such as deep learning, parametric models, and genetic programming, as well as clustering and classification methods, and clarifying the applicable scenarios for various performance evaluation indicators, filling a gap in regional research. Aristeidis et al. [
17] focused on traffic congestion prediction, comprehensively covering statistical, machine learning, deep learning, and ensemble methods. They clarified the key points for selecting short-term, medium-term, and long-term prediction models, listed key input parameters such as weather, season, and road information, and proposed a standard process for data collection, preprocessing, and model selection. They also analyzed the limitations of various methods and future directions for improvement.
Existing reviews of traffic flow prediction have made significant progress. They typically categorize prediction methods into three main types (statistical methods, machine learning, and deep learning) and analyze the core principles, application scenarios, and performance of each. They highlight the advantages of deep learning in handling complex nonlinear relationships, particularly the higher prediction accuracy and generalization ability reported for hybrid neural networks (for example, those combining LSTM and CNN components) relative to traditional methods across traffic flow, speed, and congestion prediction. Furthermore, these reviews have summarized model architecture paradigms, feature extraction methods (such as dimensionality reduction and various time-frequency transformations), and data reconstruction techniques, providing guidance for model selection across prediction horizons and application scenarios, and identifying spatiotemporal modeling and real-time data fusion as the main future development trends.
However, these reviews also have limitations: (1) The methodological classification is not systematic enough. Most reviews classify methods by technology type (statistics, machine learning, deep learning) and lack a deeper classification framework based on core innovation mechanisms. (2) Cutting-edge GNN methods receive insufficient attention. Although some reviews mention graph neural networks, there is no systematic review of their evolution path and technical paradigms in traffic prediction. (3) Hybrid methods lack a systematic summary. There is no comprehensive synthesis or comparative analysis of frameworks that combine hierarchical, heuristic optimization, and deep learning components. To address these shortcomings, this review offers a systematic classification framework, comprehensive technology coverage, detailed comparative analysis, and practice-oriented insights, providing a more complete and practical reference for traffic flow prediction research and application. This review makes the following contributions:
- (1) A two-dimensional classification framework. Existing reviews typically organize methods by technology type (statistics, machine learning, deep learning), without distinguishing core innovation mechanisms. This review categorizes GNN-based methods into four mechanism-driven paradigms (federated learning and privacy protection, dynamic adaptive graph structures, multi-graph fusion and attention mechanisms, and cross-domain integration) and summarizes hybrid deep learning into three implementation paths, providing a more structured methodological overview.
- (2) Systematic coverage of recent GNN advances (2023–2025). While some existing reviews mention graph neural networks, few systematically examine their development within traffic prediction. This review traces the progression from static to dynamic graphs, from single-graph to multi-graph architectures, and from centralized to federated models, covering literature from 2023 to 2025 that is largely absent from prior surveys.
- (3) Method selection guidance and future directions. This review provides method selection suggestions for different application scenarios (e.g., real-time prediction, privacy protection, cross-domain generalization) and discusses several emerging directions, including interpretable AI, edge computing, multimodal fusion, and reinforcement learning, offering practical reference for subsequent research.
2. Materials and Methods
This section introduces the theoretical foundations and literature selection methods that support the analysis in the rest of the paper. Regarding literature sources, we conducted Boolean searches across five major academic databases (Web of Science, Scopus, IEEE Xplore, Google Scholar, and ACM Digital Library), ultimately including over 100 high-quality original articles. In terms of the theoretical framework, this section sequentially covers the message passing mechanism of Graph Neural Networks and the layer-by-layer propagation rule of GCNs; the gating structures and long-term dependency modeling capabilities of RNNs and LSTMs; the neighborhood search paradigm and multi-objective optimization strategies of heuristic optimization algorithms; and the core advantages of attention mechanisms (especially the self-attention architecture of Transformers) in capturing the spatiotemporal dependencies of traffic flow. These theoretical modules are interconnected and together form the technical basis for the methodological analysis in the subsequent three sections, giving readers the conceptual foundation needed to understand the various hybrid prediction models.
2.1. Theoretical Foundations of Traffic Flow and Congestion Modeling
Beyond the methodological framework, understanding traffic flow prediction also requires a solid theoretical foundation in traffic flow dynamics. The core dynamic characteristic of traffic flow is phase transition behavior: near the critical density, the traffic system can abruptly transition from a free-flowing state to a congested state, exhibiting strong nonlinearity and instability. The congestion boundary method provides a rigorous analytical framework for characterizing these phase transition boundaries by explicitly defining the conditions under which traffic transitions from free flow to congestion [
18]. By linking the macroscopic flow-density relationship with the congestion propagation mechanism, this approach provides a physically interpretable framework that data-driven prediction models can leverage to constrain learned representations and ensure physically consistent outputs.
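To make the notion of a critical density concrete, the following minimal sketch uses the classical Greenshields linear speed-density relation. This is an assumption introduced purely for illustration; it is not the congestion boundary method of [18], and the parameter values `V_F` and `K_J` are hypothetical.

```python
def greenshields_flow(density, free_flow_speed, jam_density):
    """Flow (veh/h) assuming speed falls linearly with density (Greenshields)."""
    if not 0 <= density <= jam_density:
        raise ValueError("density must lie in [0, jam_density]")
    speed = free_flow_speed * (1.0 - density / jam_density)
    return density * speed


def critical_density(jam_density):
    """Density at which flow peaks; for the Greenshields model this is k_j / 2."""
    return jam_density / 2.0


def regime(density, jam_density):
    """Label a density as lying on the free-flow or congested branch."""
    return "free flow" if density < critical_density(jam_density) else "congested"


V_F = 100.0  # hypothetical free-flow speed in km/h
K_J = 120.0  # hypothetical jam density in veh/km

capacity = greenshields_flow(critical_density(K_J), V_F, K_J)
print(capacity, regime(30.0, K_J), regime(90.0, K_J))
```

Under this assumed model the same flow can occur at two different densities, one on each side of the critical density, which is one reason flow alone does not identify the traffic state.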
At the microscopic level, car-following models are crucial tools for analyzing congestion formation and propagation. Recent work on heterogeneous ring road car-following models incorporating visual angle defects and speed limit effects has further revealed that the incompleteness of driver perception in real-world traffic systems significantly affects vehicle fleet stability, accelerating congestion formation and spread [
19]. This finding provides direct theoretical guidance for introducing heterogeneity in driving behavior into prediction models, particularly for approaches that model individual vehicle interactions within graph-structured road networks.
At the macroscopic policy level, research on the synergistic optimization of road pricing and capacity expansion demonstrates that traffic congestion mitigation is not only a matter of prediction accuracy but also involves supply–demand balance and systematic allocation of policy instruments [
20]. Specifically, analyses of road price and capacity policies subject to fiscal constraints in urban settings reveal that the marginal social cost of congestion depends critically on network topology and demand elasticity—considerations that embedding predictive models into policy evaluation frameworks must account for in order to achieve a functional leap from “prediction” to “decision support.” This connection is directly relevant to the cross-domain technology integration paradigm reviewed in
Section 3.4.
Furthermore, empirical studies on congestion wave propagation in mixed-traffic environments [
21] and theoretical analyses of traffic breakdown probability at bottleneck locations [
22] collectively demonstrate that the spatiotemporal patterns of congestion are governed by deterministic physical laws as well as stochastic fluctuations, underscoring the necessity of hybrid physics-informed and data-driven modeling strategies as reviewed in
Section 3.4.
2.2. Literature Search and Selection
To ensure the reproducibility of the research results and the rigor of the methods, this review followed a structured literature search and selection process. We used Boolean search queries to systematically search five academic databases (Web of Science, Scopus, IEEE Xplore, Google Scholar, and ACM Digital Library), using keywords including “traffic flow prediction,” “graph neural network,” “GNN,” “GCN,” “deep learning,” “LSTM,” “federated learning,” “attention mechanism,” “spatiotemporal,” “dynamic graph,” “VMD,” “CEEMDAN,” and “heuristic optimization.” In addition, we manually screened the reference lists of key review papers. The search scope covered literature published between January 2020 and January 2026, with a particular focus on research published between 2023 and 2025. We also selectively included groundbreaking literature published before 2020, provided that these publications established fundamental methodologies directly relevant to the paradigm of this review. Inclusion criteria included studies that: proposed or evaluated methods for predicting traffic flow, speed, or density based on graph neural networks (GNNs) or hybrid deep learning frameworks; provided quantitative performance evaluations based on benchmark datasets or real-world datasets; and were published as peer-reviewed articles or conference papers in English. Studies focusing solely on non-traffic tasks, lacking methodological innovation, or whose full text was unavailable were excluded. The selection process consisted of three phases: title/abstract screening, full text review, and subject classification. Disagreements were resolved through discussion among co-authors. Ultimately, this review was based on 101 original studies.
2.3. Introduction to Graph Neural Networks
Graph Neural Networks (GNNs) are a specialized class of deep learning models designed to process non-Euclidean, graph-structured data [23]. Unlike traditional neural networks optimized for regular grid data, GNNs leverage the inherent structure of a graph to learn expressive representations for its nodes, edges, or the entire graph [24]. The fundamental principle of a GNN lies in the iterative application of a message passing paradigm, where the representation of each node is updated by aggregating information from its local neighborhood [25]. This mechanism allows the model to capture the complex spatial dependencies and relational context within the topological structure. Among various GNN architectures, the Graph Convolutional Network (GCN) stands out as the canonical and most influential model, successfully adapting the concept of convolution to the graph domain [26].
Figure 2 illustrates the overall architecture of a Graph Convolutional Network, showing the propagation of node features through multiple GCN layers with ReLU activation functions. The core operation of a GCN layer is the transformation of node features from layer $l$ to layer $l+1$ [27]. This transformation effectively implements a local spectral filter that computes a feature representation for each node based on the aggregated features of its immediate neighbors. This process is defined by the layer-wise propagation rule. Given the node feature matrix $H^{(l)}$ at layer $l$, the GCN propagation rule for obtaining the features $H^{(l+1)}$ at layer $l+1$ is typically formulated as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right) \tag{1}$$

In Equation (1), $H^{(l)} \in \mathbb{R}^{N \times d_l}$ represents the matrix of node features, where $N$ is the number of nodes and $d_l$ is the feature dimension at layer $l$. The matrix $W^{(l)}$ is the layer-specific trainable weight matrix. The term $\sigma(\cdot)$ denotes a non-linear activation function, such as ReLU. The critical component of the GCN is the normalized adjacency matrix $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$, where $A$ is the adjacency matrix of the graph, $\tilde{A} = A + I$ (incorporating self-loops through the identity matrix $I$), and $\tilde{D}$ is the degree matrix of $\tilde{A}$. This normalization ensures that the aggregated information from neighbors is appropriately scaled, mitigating potential feature distortion and stabilizing the learning process. By iteratively stacking multiple GCN layers, the model can efficiently learn multi-hop spatial dependencies, enabling deep feature extraction within the graph structure.
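The propagation rule described above can be sketched in a few lines. The example below assumes the standard symmetric normalization with self-loops and uses only the Python standard library so that it runs without dependencies; a practical implementation would rely on a tensor library. The three-sensor path graph, the feature values, and the weight matrix are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Return D~^(-1/2) (A + I) D~^(-1/2) for an adjacency matrix A."""
    n = len(adj)
    # Add self-loops so each node keeps its own features during aggregation.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, features, weights):
    """One GCN layer: ReLU(A_hat @ H @ W)."""
    a_hat = normalized_adjacency(adj)
    z = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in z]


# Hypothetical road segment with three sensors connected in a line: 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
H0 = [[1.0], [2.0], [3.0]]  # one feature per sensor (e.g., normalized flow)
W0 = [[1.0]]                # identity-like weight, chosen only for readability

H1 = gcn_layer(A, H0, W0)
print(H1)
```

After one layer, each sensor's feature is a degree-weighted mix of its own value and those of its direct neighbors; stacking a second layer would let sensor 0 receive information from sensor 2.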
2.4. Introduction to Deep Learning Algorithms
Deep learning is a class of machine learning methods based on multi-layered neural networks for feature learning and decision modeling [28]; Wang et al., for example, demonstrated its effectiveness for domain adaptation via latent transferability of feature components. This paradigm is loosely inspired by the hierarchical information-processing mechanisms of the human brain, and the broad cross-domain applicability of hierarchical feature extraction is illustrated by several recent studies. Wang et al. [29] showed that teacher–student adversarial augmentation strategies substantially improve generalization under domain shift in medical image segmentation; Zeng et al. [30] demonstrated that external graph neural network potentials integrated within the DeePMD-kit framework enable accurate molecular dynamics simulations with deep learning; and Yang et al. [31] established that spatiotemporal graph neural networks with double-explored architectures achieve superior multi-site intra-hour photovoltaic power forecasting. The defining characteristic of deep learning is network depth: unlike traditional shallow models with one or two layers, deep neural networks may contain hundreds of layers, enabling substantially stronger feature extraction and pattern recognition across tasks such as image recognition, speech recognition, and natural language processing (NLP) [32].
Within this family, Recurrent Neural Networks (RNNs) are specifically designed for sequential and temporal data [
33]. By introducing feedback connections, RNNs allow the current output to depend on historical inputs, enabling temporal context-based reasoning for applications including traffic flow and stock market prediction. Training is performed via Backpropagation Through Time (BPTT) [
34], which computes gradients across all time steps, supporting flexible input–output structures (one-to-many, many-to-one, many-to-many). However, unidirectional RNNs are inherently limited to past information and cannot leverage future context, restricting their performance on tasks with bidirectional dependencies [
35].
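The feedback connection described above can be illustrated with a minimal scalar recurrence. The sketch assumes the common form h_t = tanh(w_x x_t + w_h h_{t-1} + b); the weights and the input sequence are hypothetical, and no training (BPTT) is performed.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Unroll a scalar vanilla RNN and return the hidden state at every step."""
    h = h0
    states = []
    for x in inputs:
        # The previous hidden state feeds back into the current update.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# A single impulse followed by zeros: later states depend only on memory.
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
print(states)
```

The hidden state stays non-zero after the input vanishes, but it shrinks at each step. This fading influence of early inputs is the forward-pass counterpart of the long-sequence training difficulty that motivates gated architectures.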
Long Short-Term Memory (LSTM) networks are a variant of RNNs specifically designed to address the vanishing and exploding gradient problems that commonly occur in ordinary RNNs during long sequence training [36]. LSTMs achieve controlled storage and selective forgetting of historical information by adding a "cell state" to the hidden layer together with weighted "gating mechanisms." The structure is shown in Figure 3, where $C_{t-1}$ represents the cell state at the previous time step, and $x_t$ and $h_{t-1}$ are the current input and the hidden state at the previous time step, respectively. The network calculates $f_t$, $i_t$, and $o_t$ through the forget, input, and output gates, respectively, to dynamically determine the information to retain, update, and output. Each gate's activation is produced by the sigmoid function $\sigma$, whose output ranges from 0 to 1 and limits the proportion of information passed through.
Notably, LSTM uses a "forget mechanism" to prevent irrelevant historical information from interfering with current decisions. The forget gate combines $h_{t-1}$ and $x_t$ to generate an element-wise mask vector that determines which information is retained or discarded: values close to 0 indicate content to be discarded, while values close to 1 indicate content to be retained. This mechanism allows the model to handle dynamically changing sequences over long horizons, thereby enhancing its ability to handle long-dependency tasks.

$$f_t = \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \tag{2}$$

In Equation (2), the parameters $W_f$ and $b_f$ represent the weight matrix and bias vector of the forget gate, respectively, while $\sigma$ denotes the sigmoid activation function. The input gate regulates the extent to which current input information is incorporated into the cell state at the present time step, thereby managing which information requires updating. The input vector $x_t$ and previous hidden state $h_{t-1}$ are processed through the input gate and subsequently combined with candidate values produced by a tanh activation function to update the cell state. The mathematical representation of the input gate mechanism is given by:

$$i_t = \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \tag{3}$$

$$\tilde{C}_t = \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) \tag{4}$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{5}$$

In Equation (5), the cell state undergoes an update to become $C_t$, where $W_i$ corresponds to the weight matrix of the input gate. The output gate controls which portion of the current cell state information is propagated to the hidden state $h_t$. Both $h_{t-1}$ and $x_t$ initially traverse the output gate to delineate the scope of information to be output. Subsequently, through integration with the tanh activation function, a selected subset of memory information from $C_t$ is processed, ultimately determining the final hidden state output $h_t$. The mathematical expression characterizing the output gate is formulated as follows:

$$o_t = \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \tag{6}$$

$$h_t = o_t \odot \tanh\left(C_t\right) \tag{7}$$

In Equations (6) and (7), the parameter $W_o$ designates the weight matrix associated with the output gate.
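As a rough illustration of how the three gates interact, the sketch below performs a single LSTM time step, assuming the standard formulation in which every gate acts on the concatenation of the previous hidden state and the current input. The parameter dictionary, its key names, and all numeric values are hypothetical and chosen so that the effect of each gate is easy to read off.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Compute W @ vec + b for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; returns the new hidden state and the new cell state."""
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(v) for v in affine(params["W_f"], params["b_f"], z)]      # forget gate
    i_t = [sigmoid(v) for v in affine(params["W_i"], params["b_i"], z)]      # input gate
    c_hat = [math.tanh(v) for v in affine(params["W_c"], params["b_c"], z)]  # candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]     # cell update
    o_t = [sigmoid(v) for v in affine(params["W_o"], params["b_o"], z)]      # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hidden size 1, input size 1. Biases are set so the forget gate is almost
# fully open and the input gate almost fully closed (illustrative values only).
params = {
    "W_f": [[0.0, 0.0]], "b_f": [10.0],
    "W_i": [[0.0, 0.0]], "b_i": [-10.0],
    "W_c": [[0.0, 0.0]], "b_c": [0.0],
    "W_o": [[0.0, 0.0]], "b_o": [0.0],
}
h_t, c_t = lstm_step(x_t=[0.3], h_prev=[0.1], c_prev=[0.8], params=params)
print(h_t, c_t)
```

With these settings the cell state is carried forward almost unchanged, which shows how an open forget gate preserves long-term memory regardless of the current input.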
2.5. Introduction to Heuristic Optimization Algorithms
Metaheuristic algorithms are a class of general optimization strategies [
37]. Their core characteristic is independence from specific problem structure information, making them applicable to a wide range of scenarios, including combinatorial optimization and numerical function solving. These algorithms are typically based on empirical heuristics and random search, finding approximate optimal solutions through iterative methods.
Modern heuristic algorithms were proposed largely independently of one another, but their ideas often originate from natural or physical phenomena. For example, Simulated Annealing (SA) borrows the principle of thermodynamic annealing and uses a Monte Carlo mechanism for global probabilistic search; the Genetic Algorithm (GA) [38] mimics the inheritance and selection processes in biological evolution to achieve parallel global optimization; Evolutionary Programming (EP) [39] emphasizes differences in population behavior rather than genetic details; Tabu Search (TS) [40] relies on an iterative mechanism with memory to achieve stepwise optimization; and the Ant Colony Algorithm (ACA) [41] performs random search based on the cooperative behavior of real ant colonies. Furthermore, different optimization models can use fixed-length or variable-length representations for variable encoding, offering a degree of flexibility.
Although these algorithms differ in their search principles, their processes generally follow a “neighborhood search” model: starting from one or a set of initial solutions, generating candidate solutions in the neighborhood under the control of key parameters, updating the state according to deterministic or probabilistic criteria, and adjusting the search parameters according to the strategy. This cycle continues until the convergence condition is met, thus obtaining the optimal or near-optimal solution.
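The neighborhood search loop described above can be sketched with Simulated Annealing, chosen here only because it is the simplest of the listed algorithms to write down. The quadratic objective, step size, cooling rate, and iteration count are illustrative assumptions rather than recommended settings.

```python
import math
import random


def simulated_annealing(objective, x0, step=0.5, t0=1.0, cooling=0.95,
                        iters=500, seed=0):
    """Minimize a one-dimensional objective by probabilistic neighborhood search."""
    rng = random.Random(seed)
    current, current_val = x0, objective(x0)
    best, best_val = current, current_val
    temperature = t0
    for _ in range(iters):
        # Generate a candidate in the neighborhood of the current solution.
        candidate = current + rng.uniform(-step, step)
        cand_val = objective(candidate)
        delta = cand_val - current_val
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls (Metropolis criterion).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_val = candidate, cand_val
            if current_val < best_val:
                best, best_val = current, current_val
        # Adjust the search parameter according to the cooling schedule.
        temperature = max(temperature * cooling, 1e-12)
    return best, best_val


def toy_objective(x):
    """Hypothetical objective with its minimum at x = 3."""
    return (x - 3.0) ** 2


best_x, best_val = simulated_annealing(toy_objective, x0=-5.0)
print(best_x, best_val)
```

The four stages in the text map directly onto the loop: initial solution, candidate generation, probabilistic acceptance, and parameter adjustment. The other algorithms differ mainly in how candidates are generated and accepted.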
In mathematical optimization, the core of a problem lies in defining an objective function $f(x)$, where $x = (x_1, x_2, \ldots, x_n)$ denotes a set of decision variables of dimension $n$. The objective function may present itself in a tractable explicit form (such as the loss functions commonly used in regression or neural networks) or in an implicit form that cannot be explicitly described by a closed mathematical expression. For tasks involving implicit or highly complex objectives, conventional analytical methods often become infeasible, and heuristic optimization algorithms provide an effective alternative.
The optimization goal is typically to obtain a solution that maximizes or minimizes the objective value:

$$x^{*} = \arg\min_{x} f(x) \quad \text{or} \quad x^{*} = \arg\max_{x} f(x)$$
When multiple objectives
, are involved and each falls within a bounded interval, such as
, the optimization process becomes more intricate. A common strategy is to transform multi-objective optimization into a single-objective problem through normalization or aggregation. For example:
which seeks balanced outcomes by minimizing the disparity among objectives. This form of transformation enables more tractable computation while preserving the essential structure of the multi-objective problem.
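As one possible illustration of converting several objectives into a single score, the sketch below applies min-max normalization followed by a weighted sum. This is a common scalarization and is an assumption made here for illustration; it is not necessarily the aggregation used in the expression above. The candidate models, their error and latency values, and the equal weights are all hypothetical.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_sum_scores(objective_columns, weights):
    """Combine several to-be-minimized objectives into one score per candidate."""
    normalized = [min_max_normalize(col) for col in objective_columns]
    n_candidates = len(objective_columns[0])
    return [
        sum(w * col[k] for w, col in zip(weights, normalized))
        for k in range(n_candidates)
    ]


# Three hypothetical candidate models, each with a prediction error and an
# inference latency (toy numbers, not measurements).
errors = [20.0, 18.0, 25.0]
latencies = [5.0, 40.0, 2.0]

scores = weighted_sum_scores([errors, latencies], weights=[0.5, 0.5])
best_index = min(range(len(scores)), key=scores.__getitem__)
print(scores, best_index)
```

Normalizing first keeps the objective with the largest numeric range from dominating the sum; the weights then express the intended trade-off explicitly.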
In most real-world applications, the variable set x is continuous rather than discrete, making certain traditional search strategies less suitable. In contrast, heuristic algorithms are flexible and can be directly applied or extended to continuous domains, offering clear advantages for complex optimization tasks.
2.6. Introduction to Attention Mechanisms
In traffic flow prediction, the attention mechanism is widely used to address the information redundancy problem caused by large amounts of input information and multiple feature dimensions [
42]. With the continuous growth of traffic network data, traditional recurrent neural networks or long short-term memory networks are prone to gradient vanishing or information decay when processing long-term sequences or multi-node dependencies, affecting prediction accuracy [
42]. By introducing the attention mechanism, the model can focus on the traffic nodes or time periods most relevant to the prediction target from numerous input features, reducing attention to irrelevant information and thus improving prediction efficiency and accuracy [
42].
Since its introduction, the Transformer model has demonstrated superior performance in traffic flow prediction due to its unique self-attention mechanism and efficient parallel computing capabilities. Unlike traditional recurrent structures, the Transformer can effectively capture long-term sequence dependencies, simultaneously handle multi-node features of the entire traffic network, avoid the gradient vanishing problem, and model complex spatiotemporal relationships.
In the Transformer structure, the input features of each time step or traffic node can dynamically adjust their weights based on information from other nodes in the entire sequence or network, thus more accurately reflecting the spatiotemporal dependencies of traffic flow.
Figure 4 illustrates the core structure of the Transformer model used for traffic flow prediction: Input features first pass through embedding and positional encoding to integrate temporal and spatial information. The encoder stack processes the input through multiple layers of the same structure, each layer including a multi-head self-attention mechanism (to capture spatiotemporal dependencies), addition and normalization operations (residual connections and layer normalization), and a feedforward network. For multi-step traffic flow prediction tasks, the decoder stack uses previous predictions for autoregressive generation, with each layer containing masked multi-head attention (restricting attention to only historical information), multi-head attention to the encoder output (aligning historical and target features), and a feedforward network with additive normalization operations. Finally, the decoder output generates predicted values through linear projection and either a softmax or regression layer.
The core advantages of this structure are: through the attention mechanism, the model can flexibly capture the long-term temporal and spatial dependencies of traffic flow; through parallel computing and residual connections, the model training is stable and efficient; through the encoder-decoder structure, Transformer can simultaneously handle multi-node, multi-step prediction tasks, thereby achieving high-precision traffic flow prediction in complex traffic networks.
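The attention computation described above can be illustrated with a minimal single-head sketch in plain Python. It assumes identity query/key/value projections and omits positional encoding, multiple heads, and the feedforward sublayer; the function names `softmax` and `self_attention` are illustrative and are not taken from any cited model. Setting `causal=True` mimics the decoder's masked attention, which restricts each step to historical information.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, causal=False):
    """Single-head scaled dot-product self-attention.

    x: list of T feature vectors (one per time step or traffic node).
    Query, key, and value projections are the identity here
    (an assumption of this sketch; real models learn them).
    """
    d = len(x[0])
    out, weights = [], []
    for i, q in enumerate(x):
        scores = []
        for j, k in enumerate(x):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future positions
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted sum of all (visible) input vectors.
        out.append([sum(w[j] * x[j][c] for j in range(len(x)))
                    for c in range(d)])
    return out, weights
```

Each row of `weights` sums to one, so every output vector is a convex combination of the inputs it is allowed to attend to.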
2.7. Benchmark Datasets for Traffic Flow Prediction
The selection of benchmark datasets is fundamental to evaluating and comparing traffic flow prediction methods. Existing datasets vary considerably in terms of data source, spatial coverage, temporal resolution, feature richness, and scale. This subsection provides a systematic overview of the most widely adopted datasets, categorizes them by collection modality, and discusses their respective strengths and limitations to guide methodological selection.
2.7.1. Fixed-Sensor Freeway Datasets
The most extensively used datasets in the traffic prediction literature are derived from the California Performance Measurement System (PeMS), which aggregates real-time data from inductive loop detectors embedded in state freeway pavement. The PeMS family—including PeMS03, PeMS04, PeMS07, and PeMS08—records flow (vehicles per unit time), occupancy (proportion of time a sensor is occupied), and speed at 5-min intervals, covering hundreds of sensors over periods of one to several months. Due to their standardized preprocessing pipelines, clear graph topologies, and multi-attribute nature, PeMS datasets have become the de facto standard for benchmarking spatiotemporal GNN models.
Two related speed datasets, METR-LA and PEMS-BAY, were widely adopted following the DCRNN study and provide speed-only measurements at 207 and 325 sensors, respectively. PEMS-BAY is curated from PeMS data for the San Francisco Bay Area, whereas METR-LA is drawn from loop detectors on Los Angeles County highways rather than from PeMS itself. Although these datasets contain only a single feature (speed), their well-established train/validation/test splits have made them indispensable for reproducible comparison. The Loop Seattle dataset extends this category to the Pacific Northwest, enabling limited cross-regional evaluation.
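As a rough illustration of how such fixed-interval sensor series are typically turned into supervised samples, the sketch below builds sliding windows and splits them in time order. The window lengths (12 input and 12 output steps, i.e., one hour each at 5-min resolution) and the 70/10/20 ratios are assumptions chosen for illustration, and the function names are hypothetical.

```python
def make_windows(series, n_in=12, n_out=12):
    """Turn a list of per-step readings into (history, target) pairs."""
    samples = []
    for start in range(len(series) - n_in - n_out + 1):
        history = series[start:start + n_in]
        target = series[start + n_in:start + n_in + n_out]
        samples.append((history, target))
    return samples

def chronological_split(samples, train=0.7, val=0.1):
    """Split in time order (no shuffling) so the test set lies in the future."""
    n = len(samples)
    n_train = int(n * train)
    n_val = int(n * val)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```

Windows adjacent to a split boundary still overlap in time with the neighboring subset; a stricter protocol would drop those boundary windows.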
2.7.2. Urban Mobility and Trajectory Datasets
A second category comprises datasets derived from GPS-equipped floating vehicles, predominantly taxis and ride-hailing fleets. TaxiBJ aggregates taxi trajectory data in Beijing into a grid-based inflow/outflow representation, supporting crowd flow prediction tasks. SZ-Taxi covers 156 road segments in Shenzhen with 15-min resolution, offering a graph-compatible urban speed dataset. NYC-Taxi and NYC-Bike, derived from official New York City open data portals, provide zone- or grid-level demand records spanning multiple years, making them suitable for long-horizon and demand forecasting studies.
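The grid-based inflow/outflow representation can be sketched as follows. This is a simplified definition assumed for illustration (a move between consecutive trajectory points that changes grid cell counts as one outflow and one inflow); the actual preprocessing of TaxiBJ or the NYC datasets may differ, and `grid_flows` is a hypothetical helper.

```python
from collections import defaultdict

def grid_flows(trajectories, cell_size):
    """Count inflow/outflow per (time slot, grid cell).

    trajectories: list of trajectories, each a list of (t, x, y)
    points sorted by time slot t. A move that changes cell adds one
    outflow to the old cell and one inflow to the new cell, both
    recorded in the time slot of the arrival point.
    """
    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for traj in trajectories:
        for (t0, x0, y0), (t1, x1, y1) in zip(traj, traj[1:]):
            src = (int(x0 // cell_size), int(y0 // cell_size))
            dst = (int(x1 // cell_size), int(y1 // cell_size))
            if src != dst:
                outflow[(t1, src)] += 1
                inflow[(t1, dst)] += 1
    return dict(inflow), dict(outflow)
```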
The T-Drive dataset, collected from 10,357 Beijing taxis over one week, is primarily used for route inference and speed estimation rather than direct flow prediction, but it supports research on data-sparse scenarios and trajectory-based spatiotemporal modeling.
2.7.3. Large-Scale and Multi-City Datasets
As models scale toward city-wide deployment, large-scale benchmark datasets have emerged to address the limitations of small, single-city evaluations. LargeST, constructed from over 8600 PeMS sensors spanning five years (2017–2021), is designed specifically to assess the scalability of spatiotemporal models and reveals that many methods performing well on PeMS04/PeMS08 degrade substantially at larger scales. UTD19, aggregating over 23,000 sensor time series across 39 global cities, enables multi-city and cross-domain generalization experiments, directly addressing the transferability concerns raised in federated and meta-learning studies reviewed in
Section 3.1.
The Next Generation Simulation (NGSIM) dataset occupies a distinct niche, providing sub-second vehicle trajectory data from video-based tracking on California freeway segments. While its microscopic resolution makes it unsuitable for network-level flow prediction, NGSIM serves as the primary reference for car-following model validation and congestion formation analysis at the vehicle-interaction level.
2.7.4. Comparative Summary and Dataset Selection Guidance
Table 2 provides a structured comparison of the datasets reviewed above across the following dimensions: region, data source type, number of sensors or nodes, temporal coverage, sampling interval, available features, primary prediction task, and data accessibility. Several observations are noteworthy.
Geographic representativeness. The majority of widely used datasets originate from the United States (primarily California), which raises concerns about geographic representativeness. Studies relying exclusively on PeMS-derived data may overfit evaluation to the specific traffic regime, road network topology, and sensor density of California freeways, limiting the generalizability of reported conclusions to other regions or road types.
Feature dimensionality. Most datasets provide only flow, speed, or occupancy as primary features, with few datasets natively incorporating weather conditions, incident records, or land-use attributes. This structural limitation has motivated the use of data augmentation and multimodal fusion strategies reviewed in
Section 3.3, but also means that reported model performance often reflects optimistic upper bounds achievable only under complete sensor conditions.
Temporal coverage inconsistency. Temporal coverage varies from a few weeks (NGSIM, SZ-Taxi) to multiple years (NYC-Taxi, LargeST), creating significant inconsistency in the evaluation of seasonal patterns and long-term model stability. Future benchmark construction should standardize temporal coverage to include at least one full annual cycle to enable rigorous assessment of periodic and seasonal effects.
In practice, researchers should select datasets according to their methodological focus: PeMS04 and PeMS08 for standard spatiotemporal GNN benchmarking; METR-LA and PEMS-BAY for reproducible comparison with canonical baselines; LargeST and UTD19 for scalability and generalization experiments; and TaxiBJ or NYC-Taxi for urban demand forecasting and grid-based modeling. Wherever possible, multi-dataset evaluation should be adopted to substantiate the generalizability of reported improvements.
3. Traffic Flow Prediction Based on Graph Neural Network Method
This section reviews the evolution of graph neural networks in traffic flow prediction and provides a structured summary of existing research, organized by the underlying logic and core innovations of methodological development. Overall, the field has evolved continuously from static to dynamic modeling, from single-point prediction to collaborative and fusion-based prediction, and from closed modeling to cross-domain integration.
Early research primarily relied on predefined fixed topological structures to characterize road network spatial relationships. Subsequently, attention mechanisms were introduced to dynamically weight the influence of neighboring nodes, further developing into learnable adjacency matrices and adaptive graph structures, fundamentally alleviating the problem of spatial dependence changing over time. Building on this, to address cross-regional data silos and privacy compliance constraints, federated learning frameworks were introduced and deeply integrated with graph neural networks, forming a collaborative training paradigm of “data remains stationary, model moves dynamically.” Simultaneously, multi-graph fusion, robust modeling, and cross-domain technology embedding are continuously expanding the expressive boundaries of models, gradually leading traffic prediction towards a perception-decision integration model.
Based on a systematic analysis of relevant literature, existing research can be summarized into four major paradigms.
- (1) Federated Learning and Privacy Protection Methods: Solving cross-regional collaboration and data privacy protection issues through federated learning frameworks, and combining graph neural networks to improve spatiotemporal dependency modeling capabilities.
- (2) Dynamic Graph Neural Network Methods: Utilizing dynamic or adaptive graph structures to characterize the time-varying spatiotemporal relationships in transportation networks, thereby improving the model’s adaptability to dynamic traffic patterns.
- (3) Multi-Graph Fusion and Attention Mechanism Methods: Integrating multiple graph structures (such as semantic graphs and topological graphs) or employing attention mechanisms to enhance the model’s ability to capture complex spatiotemporal features.
- (4) Other Innovative Methods: Including the introduction of new technologies such as reinforcement learning, Bayesian networks, and information geometry to address specific challenges such as data uncertainty or hierarchical structure modeling.
3.1. Federated Graph Neural Network Methods
Currently, many researchers are dedicated to addressing the pressing issue of “data privacy and data silos” to meet the development needs of intelligent transportation. In real-world transportation systems, data is often scattered across different regions and institutions, making centralized sharing difficult due to privacy policies and regulations. Therefore, achieving cross-regional collaborative modeling while ensuring data remains within its domain has become a key challenge for intelligent transportation prediction. Federated learning (FL) provides a feasible and efficient technical approach to solving this problem [
58]. Its core concept is “data is not shared, but knowledge can be shared,” thus achieving a balance between privacy protection, cross-regional collaboration, and high-accuracy prediction. Specifically, as shown in
Figure 5, the framework is a multi-client graph neural network system based on federated learning, in which multiple local clients (e.g., Local Client k) train their own AFSTGCN models, each optimized with a local prediction loss and a model memory loss. The central server aggregates the model parameters of the clients, weighting each client's contribution, to obtain the global model, achieving collaborative model optimization and knowledge sharing. The diagram also includes components such as LTP (Local Training Process), Meme Model, and RND (which may represent random initialization or noise mechanism), which together constitute a hierarchical federated graph learning system with a memory mechanism.
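The server-side aggregation step can be sketched as a weighted average of client parameters, in the style of FedAvg. The notation of the original figure is not reproduced here; the function name `aggregate` and the choice of weights (for example, proportional to local sample counts) are assumptions made for illustration only.

```python
def aggregate(client_params, client_weights):
    """Weighted average of client parameter vectors (FedAvg-style).

    client_params: list of equal-length lists of floats, one per client.
    client_weights: non-negative weights (e.g., local sample counts);
    they are normalized here so that they sum to one.
    """
    total = sum(client_weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(client_params[0])
    global_params = [0.0] * dim
    for params, w in zip(client_params, client_weights):
        for i in range(dim):
            global_params[i] += (w / total) * params[i]
    return global_params
```

Only parameters cross the client boundary in this step; the raw traffic records never leave the clients, which is the property the federated setting relies on.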
The pioneering work of Feng et al. [
59] laid the foundation for this direction. Their proposed federated spatiotemporal prediction framework consists of two stages: road network partitioning and federated model training. In the partitioning stage, dynamic time warping and K-means clustering are used to perform pattern-driven sub-network decomposition of the traffic road network. In the training stage, each sub-network learns locally using a spatiotemporal graph neural network model, and knowledge distillation is used to mitigate the heterogeneity of models and tasks caused by differences in data distribution. A multi-factor weighting strategy is also designed to improve the fairness and accuracy of global aggregation. This method systematically discusses the issue of federated heterogeneity in traffic prediction scenarios for the first time, providing a reference paradigm for subsequent research. Based on the above ideas, Xia et al. [
60] further focused on the efficiency and deployability of short-term traffic prediction. They incorporated community detection methods into the federated framework, refining the local subnetwork into multiple community units, and training each unit locally based on a graph convolutional network. This approach reduces global communication overhead and the risk of data leakage, while improving the model’s flexibility and response speed in practical deployment, providing an efficient and privacy-preserving solution for real-time prediction scenarios. Wang et al. [
61] made significant breakthroughs at the structural level of the federated framework. Their proposed “Federated Graph Neural Network and Equivalent Hypergraph” framework focuses on solving the problem of missing cross-client connections. This model maps local traffic graphs to high-order supernodes and constructs a dynamically adjustable global hypergraph. Through performance feedback-driven hyperedge update mechanisms, it automatically adds or removes potential cross-client associations. This design effectively restores the cross-regional connection structure broken due to privacy isolation, thereby reconstructing a more complete global traffic space model while protecting data privacy.
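The pattern-driven partitioning attributed to Feng et al. above relies on dynamic time warping (DTW) to compare the traffic patterns of road segments before clustering. The following is a textbook DTW distance for one-dimensional sequences, given as a sketch; the cited work's exact distance definition and clustering setup are not specified in this review, and `dtw_distance` is an illustrative name.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

A pairwise DTW matrix computed this way could then feed a clustering step, so that segments with similar daily profiles end up in the same sub-network even when their peaks are slightly shifted in time.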
Furthermore, Liu et al. [
62] extended federated graph neural networks to prediction-driven decision-making scenarios, proposing a federated load balancing framework based on spatiotemporal prediction and reinforcement learning to optimize the neighbor cell relationship configuration of cellular network base stations. This research demonstrates that the federated prediction model not only possesses data security advantages but can also serve as a high-precision decision-making basis for resource optimization in complex network systems, showcasing its potential value in cross-domain applications.
A summary is provided in
Table 3. From a spatial-context perspective, federated GNN methods are most suitable for cross-jurisdictional, multi-city deployments where data sovereignty constraints structurally preclude centralized aggregation—a scenario common in metropolitan area transportation networks spanning multiple administrative boundaries. Their advantage diminishes in single-city deployments with centralized data access, and they are particularly ill-suited for sparse-sensor rural environments where even local graph construction is data-limited and the communication overhead of federated protocols adds cost without proportional benefit. Existing federated traffic prediction studies exhibit several methodological and practical limitations. Most approaches rely on static or predefined subnetwork partitioning strategies, limiting adaptability to dynamic road conditions and unexpected events. Although knowledge distillation and aggregation mechanisms improve accuracy and fairness, they rarely address model interpretability, causal inference, or the transferability assumptions underlying latent representation sharing across heterogeneous sub-networks. Theoretical foundations also remain insufficient: convergence guarantees under non-IID spatiotemporal graph data are poorly established, and the widespread adoption of FedAvg-style aggregation—originally designed for statistically homogeneous settings—lacks rigorous justification in heterogeneous traffic scenarios. Moreover, communication-efficient designs often overlook real-world deployment constraints, such as uneven edge computing capacity and system cost. Evaluation protocols are highly inconsistent across studies, with varying client numbers, data splits, and privacy budgets, making cross-study comparisons scientifically unreliable. Additionally, the robustness and generalization of cross-domain topology reconstruction under extremely sparse or anomalous data conditions require further validation. 
Future research should therefore prioritize adaptive partitioning mechanisms, theoretically grounded aggregation strategies, interpretable and causally informed modeling, and standardized federated benchmarks with explicit threat models to support reliable large-scale intelligent transportation applications.
3.2. Dynamic and Adaptive Graph Structure Methods
Currently, many researchers are directly addressing the core dynamic challenges of transportation systems, striving to overcome the limitations of traditional predefined graph structures. These studies recognize that the interactions between nodes in a road network are not static but dynamically evolve with scenarios such as rush hours and unexpected events. Therefore, their core innovation lies in introducing a data-driven mechanism, allowing the model to learn and construct spatial relationships that best reflect the current traffic conditions. Specifically, as shown in
Figure 6, the model achieves this objective through two core modules: the residual connection module and the self-attention module [
63]. The residual connection module comprises linear layers, temporal convolutional networks (TCN), and graph convolutional networks (GCN). Within this module, the TCN-a and TCN-b branches form a gated temporal convolution through the tanh and sigmoid (σ) activation functions; the gated output is then combined with the GCN output and transformed via a weight matrix. This architecture mitigates gradient vanishing through skip connections and stabilizes training [
64]. The self-attention module consists of alternating linear layers and ReLU activation functions, which compute correlation weights between sequence elements to capture dynamic global spatial dependencies. These two modules work synergistically, enabling the model to adaptively learn dynamic spatial dependencies and effectively capture complex spatiotemporal patterns in traffic data.
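The gating mechanism in the residual module can be sketched as follows, assuming the second activation is a sigmoid as in standard gated TCN designs. Which branch receives tanh and which receives the sigmoid, the kernel values, and the helper names are all assumptions of this sketch; the GCN and weight-matrix steps are omitted.

```python
import math

def conv1d_causal(x, kernel):
    """Causal 1-D convolution: output[t] uses x[t], x[t-1], ... only."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * x[t - k]
        out.append(acc)
    return out

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gated_tcn(x, kernel_a, kernel_b):
    """Gated temporal convolution: tanh(filter) * sigmoid(gate)."""
    filt = conv1d_causal(x, kernel_a)  # "TCN-a" branch (assumed filter)
    gate = conv1d_causal(x, kernel_b)  # "TCN-b" branch (assumed gate)
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filt, gate)]
```

The sigmoid branch acts as a soft switch in (0, 1) that controls how much of the tanh-transformed signal passes to the next layer.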
Among these studies, the work of Wu et al. [
65] is inspiring. They pointed out that strong correlations may exist between non-adjacent road segments, thus proposing a “spatiotemporal aggregation graph neural network.” This model not only relies on a given spatial adjacency graph but also innovatively generates a “time graph” from the spatiotemporal data itself and calculates the correlation coefficient matrix, thereby compensating for the shortcomings of a single spatial graph in expressing temporal correlations and achieving a more comprehensive enhancement of the feature relationships of the road network. Building on this, Gu et al. [
66] proposed a “dynamic correlation graph convolutional network” that goes a step further and abandons predefined graph structures entirely. This model constructs an adjacency matrix directly from the input multivariate time series, based on correlation coefficients calculated in real time. This “no-preset” approach makes the model highly adaptable, enabling it to discover hidden spatial dependencies that depend on the characteristics of each dataset, and even on the state of the same dataset at different times. To pursue even greater dynamism, Ma et al. [
67] treat the graph structure itself as a learnable, continuously evolving entity. Their “spatiotemporal evolutionary graph neural network” continuously updates its semantic adjacency matrix throughout the model’s training process. This means that the graph structure is no longer fixed after the initial setting, but can be continuously adjusted and optimized with the input of training data, so that its final form can better adapt to complex and ever-changing real traffic patterns. Another technical approach is to enhance the model’s expressive power. Hu et al. [
68] made significant improvements to the classic “Graph WaveNet” architecture, introducing a self-attention mechanism to construct an adaptive adjacency matrix. This method allows the model to not only capture spatial proximity but also fit more complex, non-local spatiotemporal dependencies between nodes. Experiments show that its MAE, MAPE, and RMSE metrics are significantly reduced. Jiang et al. [
69] proposed a “dynamic graph spatiotemporal neural network” that employs a dual-graph strategy. They simultaneously constructed a static topological graph (representing the inherent physical connections of the road network) and a dynamic information graph (representing the similarity of traffic flow over time). This design allows the model to distinguish and utilize both stable structural relationships and rapidly changing dynamic connections between nodes, thus providing a more refined characterization of the spatiotemporal properties of the traffic network. To address the impact of external unforeseen events, Ye et al. [
70] made targeted designs with their “Dynamic Multi-Graph Neural Network.” This model not only constructs multiple prior graphs to provide rich contextual information but also specifically designs a dynamic graph adjustment module, enabling the model to update the adjacency matrix based on the currently learned state at each training step. More importantly, it explicitly incorporates traffic accident event data, allowing the model to focus on and learn local traffic fluctuation patterns caused by accidents.
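Several of the methods above build the adjacency matrix from the data rather than from the road map. A minimal sketch of one such construction, using Pearson correlation over the current input window, is given below. The threshold of 0.8, the use of the absolute correlation as edge weight, and the function names are assumptions for illustration and do not reproduce any specific cited model.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0  # constant series: treat as uncorrelated
    return cov / math.sqrt(va * vb)

def correlation_adjacency(series, threshold=0.8):
    """Build a symmetric adjacency matrix from per-node time series.

    series: list of per-node readings over the current window.
    Edges are kept only when |correlation| reaches the threshold.
    """
    n = len(series)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(series[i], series[j])
            if abs(r) >= threshold:
                adj[i][j] = adj[j][i] = abs(r)
    return adj
```

Recomputing this matrix for each window yields a graph that changes with traffic conditions, but the edges reflect statistical co-movement only, not physical connectivity.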
Finally, Chen et al. [
71] focused on the dynamics of the temporal dimension. Their “time-based adaptive graph neural network” can generate different graph dependency matrices for different time steps, thereby accurately capturing the unique spatial correlation patterns of traffic flow at different times of the day.
Regarding spatial context suitability, dynamic and adaptive graph methods offer the greatest advantage in arterial-dominated urban networks with pronounced time-varying spatial dependencies—for instance, commuter corridors exhibiting strong tidal flow patterns where peak-hour connectivity structures differ substantially from off-peak configurations. In contrast, for stable low-density rural road networks where spatial dependencies are structurally fixed and sensor coverage is sparse, the high computational cost of dynamic graph construction is difficult to justify, and simpler static-graph or decomposition-based approaches are more practical. The performance gains of dynamic graph approaches are therefore strongly conditioned on the temporal variability of the target network’s spatial interaction patterns.
As shown in
Table 4, this line of research has evolved from temporal graph generation to fully dynamic construction, continuous evolution, and spatiotemporal dual-graph modeling, progressively enhancing the adaptability of graph structures to real-world traffic dynamics. However, several limitations persist. Most methods rely heavily on correlation-based relevance matrices or attention scores, conflating statistical co-movement with genuine spatial influence and thus lacking causal semantics and stable physical interpretability. The widespread use of Pearson correlation, DTW similarity, or learnable adjacency matrices optimizes predictive loss rather than structural fidelity, meaning the inferred graphs may deviate substantially from actual road-network topology, raising concerns about reproducibility and epistemological validity. Although frequent structural updates improve flexibility, they introduce high computational costs and training instability, limiting real-time deployment feasibility. External event modeling remains dependent on explicitly labeled data, making it difficult to capture implicit perturbations or unknown anomalies. Moreover, dynamic graphs are highly sensitive to data variation, and their robustness, generalization capacity, and adaptability under extremely sparse scenarios lack systematic cross-domain verification, especially when evaluation is confined to a single city or dataset. Future research should therefore balance adaptability with interpretability by incorporating causal constraints, improving structural consistency and robustness, and developing computationally efficient dynamic graph frameworks suitable for large-scale, real-world deployment.
3.3. Multi-Graph Fusion and Attention Mechanism Methods
The core of this research category lies in answering the question of “how to more fully and effectively utilize graph structures for traffic prediction.” Its fundamental motivation stems from the inherent complexity of traffic systems; a single perspective or traditional graph convolution methods are insufficient to fully characterize multi-layered spatiotemporal features. Therefore, related research generally focuses on multi-source graph information fusion and enhanced feature extraction mechanisms (especially attention mechanisms) to comprehensively improve model performance in breadth, depth, and robustness [
72]. As illustrated in
Figure 7, a typical implementation employs an attention mechanism atop a cascade of RippleGNN modules, where each RippleGNN simulates the ripple-like propagation of information through the graph, capturing deep and long-range spatiotemporal dependencies layer by layer. The attention mechanism then adaptively fuses these multi-level features to distinguish the importance of different information, with the integrated representation ultimately fed into the prediction layer to generate accurate traffic forecasts [
73]. This architecture exemplifies how combining hierarchical graph propagation with attention-based fusion effectively captures complex dynamic spatial dependencies and enhances prediction performance.
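The attention-based fusion of multi-level features can be sketched as a softmax-weighted sum over the outputs of the stacked propagation layers. The scoring vector `query` is assumed to be learned elsewhere, and `attention_fuse` is a hypothetical name; the cited architecture may score and combine levels differently.

```python
import math

def attention_fuse(level_features, query):
    """Fuse per-level feature vectors with softmax attention weights.

    level_features: list of L vectors, e.g., the outputs of L stacked
    propagation layers for one node.
    query: scoring vector of the same dimension (assumed learned).
    """
    scores = [sum(f * q for f, q in zip(feat, query))
              for feat in level_features]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(level_features[0])
    fused = [sum(w * feat[c] for w, feat in zip(weights, level_features))
             for c in range(dim)]
    return fused, weights
```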
In terms of breadth, this direction achieves more comprehensive traffic knowledge representation by constructing and fusing multiple types of graph structures. Representative studies are summarized in
Table 5. For example, Wang et al. [
74] introduced graph attention mechanisms into traffic prediction, enabling models to aggregate differentiated features based on neighborhood importance rather than simple averaging, thereby enhancing the feature discriminative power of spatial representations. Building on this, Wang et al. [
75] further proposed introducing channel attention mechanisms into spatiotemporal graph convolution, allowing the model to adaptively adjust the importance of different feature channels, thus optimizing the selective attention to key spatiotemporal patterns. Regarding robustness, research has begun to focus on data noise, missing data, and uncertainty in real-world traffic environments. Peng et al. [
76] proposed a hybrid spatiotemporal graph neural network that simultaneously constructs a static adaptive graph, a dynamic learning graph, and a semantic graph (generated by dynamic time warping and masked attention). It employs multi-scale gated temporal attention to model complex temporal dependencies, achieving leading performance on multiple public datasets and demonstrating the significant potential of multidimensional modeling strategies.
In terms of spatial context suitability, multi-graph fusion and attention-based methods are best suited to high-density urban cores where rich sensor coverage, complex overlapping spatial relationships (topological, functional, and flow-based), and dense OD demand matrices provide sufficient multi-perspective input signals to justify the increased modeling complexity. In moderately dense arterial networks, selective adoption of dual-graph strategies (combining static topology and dynamic information graphs) can provide a practical balance between expressiveness and computational efficiency. However, in sparse-sensor environments, the multi-graph paradigm is fundamentally constrained by insufficient input diversity: constructing meaningful semantic or functional graphs requires adequate sensor density, and the absence of such data can render multi-graph fusion architectures over-parameterized relative to available information, increasing the risk of overfitting.
Furthermore, some studies have expanded the spatial modeling paradigm at both theoretical and applied levels. For example, Cheng et al. [
77] emphasized simultaneous modeling of spatiotemporal dependencies and incorporated external factors such as weather and events into a joint prediction framework; Han et al. [
78] pioneered the introduction of Ollivier–Ricci curvature into traffic graph modeling, using “neighborhood-neighborhood” relationship constraints based on optimal transport theory to guide feature propagation, integrating differential geometric constraints into graph structure learning, and theoretically expanding the boundaries of spatial modeling. However, existing approaches still exhibit notable limitations. Multidimensional graph constructions and complex attention mechanisms substantially increase computational cost while offering only limited and sometimes misleading interpretability. Multi-graph fusion strategies are largely based on empirical concatenation or gating designs, lacking a unified theoretical framework to guide how heterogeneous graph signals should be integrated. As a result, key architectural choices (such as graph types, fusion order, and weighting schemes) are often determined by trial and error, introducing significant experimenter degrees of freedom and raising concerns about overfitting and weak cross-scenario generalization. Moreover, attention weight visualization is frequently used as evidence of interpretability, yet it does not reliably reflect feature importance or causal attribution, making such claims scientifically fragile, particularly in safety-critical traffic applications. Although robustness enhancements mitigate noise effects, they still depend on external labels, heuristic rules, or prior knowledge, limiting autonomous learning capacity and universality. Meanwhile, theoretically appealing methods such as Ollivier–Ricci curvature lack large-scale validation, and their optimal transport–based edge weighting entails high computational complexity that challenges real-world deployment.
Therefore, future research should pursue theoretically grounded, computationally efficient spatial graph learning frameworks that balance multidimensional expressiveness with causal interpretability and scalable practicality for large-scale intelligent transportation systems.
3.4. Cross-Domain Technology Integration Methods
This category of research covers the most exploratory directions in traffic prediction. Rather than making incremental improvements to existing models, these studies integrate graph neural networks with techniques from other fields, or design new network architectures, to address specific bottlenecks or open up new applications.
One group of studies focuses on closing the prediction and decision-making loop. Representative methods are listed in
Table 6. The work of Xing et al. [
80] is a prime example; their proposed RL-GCN model integrates graph convolutional networks, LSTM, and reinforcement learning. GCN and LSTM are responsible for sensing and predicting traffic flow, while the reinforcement learning part formulates the optimal traffic control strategy based on the prediction results, achieving a leap from “seeing” to “decision-making” and providing a blueprint for building truly intelligent traffic control systems. Yang et al. [
81] mark a milestone in this direction. They developed a “deep learning framework for integrating macroscopic traffic flow models,” the core of which is the deep integration of the cell transmission model (CTM, a classic macroscopic traffic flow model) with deep learning. The CTM mathematically describes traffic state propagation given initial and boundary conditions, while a spatiotemporal attention RNN predicts these boundary conditions. Finally, an extended Kalman filter assimilates the prediction results, ensuring compliance with the law of traffic conservation. This “theory-guided data-driven” approach is intended to keep predictions both accurate and consistent with physical laws. Regarding model efficiency and practicality, Rajagopal et al. [
82] proposed the MTH-QGNN traffic flow prediction model, which integrates hypercurvature embedding, meta-learning, quantum graph neural networks, and neural ordinary differential equations to improve the spatiotemporal modeling capabilities and cross-city adaptability of large-scale traffic networks. Experimental results show that the proposed method achieves high prediction accuracy and stability on the Los-loop and SZ-taxi datasets. An et al. [
83] proposed a spatiotemporal graph convolutional network model, IGAGCN, based on information geometry and attention mechanisms, for traffic flow prediction in urban road networks. This method addresses the prediction difficulties arising from the dynamic spatiotemporal characteristics and external environmental factors in real traffic systems. It characterizes the data distribution differences between different sensors using information geometry methods and constructs a dynamic relationship matrix using an attention mechanism, thereby more effectively capturing the spatiotemporal dependencies in traffic flow data and improving the model’s ability to express complex traffic dynamics and its predictive performance. Lv et al. [
84] proposed the TS-STNN (Tree Structure Spatiotemporal Neural Network) traffic flow prediction model, which extracts spatial information of the traffic network by constructing a spatial tree matrix with hierarchical and directional features, and combines it with GRU to model temporal dependencies. Experimental results show that this method achieves higher prediction accuracy than the baseline models in various scenarios. Another group of studies emphasizes combining physical models with data-driven approaches to enhance the theoretical rationale for prediction. Abbas et al. [
85] proposed the DFHITSSC framework, which leverages the complementary strengths of SVM and artificial neural networks through a decision-level fusion strategy enhanced by fuzzy inference.
Taken together, the cross-domain methods reviewed in this section address distinct spatial deployment contexts that earlier paradigms do not adequately serve. Lightweight architectures such as Light-ASTNN are specifically designed for resource-constrained edge deployments—roadside units, in-vehicle systems, or IoT-scale sensors—where strict memory and latency budgets make full-scale GNN stacks impractical. Tree-structure spatial neural networks are suited to hierarchically organized road networks (e.g., freeway-arterial-local road hierarchies) in which the directional and level-based structure of traffic flow provides natural tree-topological priors. Physics-integrated frameworks such as MTFD are most valuable in contexts where labeled data are scarce but physical relationships (e.g., conservation laws, fundamental diagram constraints) are well-established—including rural highway corridors and newly instrumented networks with limited training data. Finally, the meta-learning component of MTH-QGNN specifically targets the cross-city generalization problem: networks in cities without existing prediction infrastructure can leverage models pre-trained on data-rich cities, directly addressing the practical challenge of deployment in low-data spatial contexts. These context-specific advantages highlight that the choice among cross-domain methods should be guided not only by accuracy benchmarks but also by the structural, resource, and data characteristics of the target deployment environment.
4. Application of Intelligent Optimization and Hybrid Deep Learning in Traffic Flow Prediction
This type of research represents one of the most exploratory directions in traffic prediction. It moves beyond incremental improvements to existing models by integrating graph neural networks or time-series models with techniques from other fields, or by designing new network architectures to address long-standing bottlenecks and open up new application scenarios. Against this backdrop, this section systematically reviews three hybrid deep learning prediction paradigms centered on LSTM, which address the inherent non-stationarity and complexity of traffic flow data and form complementary technical paths.
The first type is the fusion of decomposition algorithms and LSTM, following a “decomposition-prediction-reconstruction” technical route. Through signal processing techniques such as VMD and CEEMDAN, the original traffic sequence is decomposed into several relatively stationary subsequences and modeled separately, thereby reducing prediction difficulty and mitigating the impact of non-stationarity. However, the multi-stage processing flow easily introduces accumulated errors, has high computational costs, and is highly sensitive to decomposition parameters (such as the number of modes), limiting its real-time deployment capabilities.
The second type is the combination of heuristic optimization algorithms and LSTM. These methods utilize metaheuristic strategies such as dung beetle optimization, whale algorithms, and particle swarm optimization to automatically search for hyperparameter configurations, reducing reliance on manual parameter tuning experience and improving model performance to some extent. However, within the black-box search framework, the true source of performance improvement is difficult to explain, and systematic horizontal comparisons of different optimization algorithms on a unified public benchmark are scarce, weakening the verifiability of the methodology.
The third category is the fusion of attention mechanisms and LSTM. By introducing structures such as multi-head self-attention, bidirectional LSTM, and Transformer, the model can adaptively focus on more predictive spatiotemporal features, demonstrating outstanding performance in long-range dependency modeling. However, attention weights are not equivalent to causal attribution, and their “interpretability” claims still require careful evaluation in safety-critical traffic management scenarios.
Overall, these three paradigms expand the application boundaries of LSTM in traffic prediction from three dimensions: signal decomposition, parameter optimization, and feature selection. Together with cutting-edge exploratory research such as graph neural networks, they constitute an important trend in the evolution of traffic prediction technology from single-model optimization to multi-technology fusion, structural innovation, and system-level intelligence.
Before proceeding, it is worth clarifying the functional rationale underlying these recurring hybrid combinations, as the same structural pairings—decomposition with LSTM, optimization with LSTM, and attention with LSTM—appear repeatedly across studies. This convergence is not coincidental but reflects three distinct functional objectives that these combinations are designed to serve. Specifically, the fusion of decomposition algorithms (e.g., VMD, CEEMDAN) with LSTM primarily targets non-stationarity reduction: by transforming a non-stationary traffic sequence into quasi-stationary sub-components, the prediction task is simplified and the risk of model misspecification is reduced. The pairing of heuristic optimization algorithms with LSTM primarily addresses search space reduction for hyperparameter configuration: rather than navigating a high-dimensional, non-convex parameter space manually, metaheuristic strategies automate this search and reduce sensitivity to initialization. The integration of attention mechanisms with LSTM primarily serves to stabilize parameter estimation under long-range dependency conditions: attention selectively weights temporal features, mitigating the gradient decay that degrades standard LSTM performance on extended sequences. Recognizing these distinct functional objectives helps explain why these combinations emerge persistently and provides a more principled basis for method selection in practice.
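To make the “decomposition-prediction-reconstruction” route concrete, the following is a minimal sketch under strong simplifying assumptions. A trailing moving average stands in for VMD or CEEMDAN, and two trivial forecasters (linear extrapolation and a seasonal-naive rule) stand in for the per-component LSTM models. All function names, the window length, and the period are hypothetical choices for illustration and are not taken from any reviewed study.

```python
from statistics import mean


def moving_average(series, window):
    """Trailing moving average, used here as a crude trend extractor."""
    trend = []
    for i in range(len(series)):
        start = max(0, i - window + 1)
        trend.append(mean(series[start:i + 1]))
    return trend


def decompose(series, window=4):
    """Split a series into trend + residual (stand-in for VMD/CEEMDAN modes)."""
    trend = moving_average(series, window)
    residual = [x - t for x, t in zip(series, trend)]
    return trend, residual


def forecast_trend(trend):
    """One-step linear extrapolation (stand-in for a trend model)."""
    return trend[-1] + (trend[-1] - trend[-2])


def forecast_residual(residual, period):
    """Seasonal-naive forecast: repeat the value observed one period ago."""
    return residual[-period]


def decompose_predict_reconstruct(series, window=4, period=4):
    """Forecast each component separately, then sum the component forecasts."""
    trend, residual = decompose(series, window)
    return forecast_trend(trend) + forecast_residual(residual, period)


flows = [120, 135, 150, 140, 125, 138, 155, 143]  # illustrative flow counts
print(decompose_predict_reconstruct(flows))
```

Because the final forecast is a sum of component forecasts, any error made on one component passes directly into the reconstructed output, which is the accumulated-error concern raised for this family of methods.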
4.1. Decomposition Algorithm Combined with LSTM Prediction
The primary functional objective of decomposition-LSTM hybrids is non-stationarity mitigation: signal processing techniques restructure the input sequence to reduce distributional complexity before modeling. Many researchers now adopt the “decomposition-prediction-reconstruction” research paradigm, which aims to decompose the non-stationary and nonlinear original traffic flow sequence into a series of relatively stationary subsequences through signal processing techniques, thereby reducing the difficulty of model learning. As illustrated in
Figure 8, a typical implementation of this idea is a hybrid time series prediction model that first applies STL decomposition to split the original data into trend, seasonal, and residual components. Then, different models are employed to capture distinct patterns in each component: LSTM models for trend data, ARIMA models for seasonal patterns, and XGBoost models for residuals. Finally, the predictions from these models are integrated to generate the overall forecast. This “decomposition-prediction-integration” strategy leverages the strengths of each model, effectively capturing linear trends, cyclical fluctuations, and complex nonlinear patterns in the traffic data, thereby improving the overall prediction accuracy and mitigating the challenges posed by non-stationarity in traffic flow sequences. As shown in
Table 7, Wang et al. [
86] proposed an IHPO-VMD-LSTM-Informer model in which an improved Hunter–Prey Optimization algorithm adaptively determines key VMD parameters while NPCA reduces feature dimensionality, thereby extracting more informative traffic indicators. Vo et al. [
87] further advanced this direction by integrating FVMD for signal decomposition, WOA for parameter optimization, and GA for model selection, enabling each decomposed component to be assigned the most suitable deep model (e.g., LSTM, BiLSTM, GRU), which significantly boosts accuracy and reduces inference time. Zhou et al. [
88] employed CEEMD combined with a novel differencing operation to stabilize traffic data and applied Bayesian optimization to search for optimal LSTM hyperparameters, achieving strong performance in highly stochastic air traffic flow. Similarly, the approaches of Zhao et al. [
89] and Dai et al. [
90] rely on improved heuristic optimizers (e.g., IDBO, enhanced bat algorithm) to refine VMD decomposition and optimize LSTM parameters, thereby generating more physically meaningful sub-sequences and improving predictive capability. In general, these methods effectively mitigate non-stationarity, enhance feature representation, and improve forecasting precision by modeling decomposed components individually. However, they still face limitations such as high computational cost and poor real-time capability, potential error accumulation during reconstruction, strong sensitivity to parameter settings (e.g., VMD mode number), and increased model complexity that reduces interpretability and hinders deployment in real industrial applications.
These shortcomings (costly decomposition and optimization, accumulated reconstruction error, sensitivity to parameters such as the number of VMD modes, and multi-stage complexity that reduces interpretability) limit generalization and hinder deployment in real-world industrial environments. Beyond these engineering concerns, deeper methodological issues persist. VMD and CEEMD implicitly assume component-level stationarity, an assumption often violated under incident-driven or abrupt non-stationary traffic conditions. The common practice of treating decomposed sub-signals as independently predictable lacks rigorous theoretical justification, as inter-component dependencies may be non-trivial, potentially introducing systematic bias when modeled separately. In addition, most decomposition-LSTM studies rely on fixed in-sample train-test splits without cross-validation or out-of-distribution evaluation, risking inflated generalization claims. The exclusive use of RMSE and MAE further obscures asymmetric error costs in traffic management, where underestimating peak flow can be significantly more consequential than overestimation. Future research should therefore develop computationally efficient, theoretically grounded, robust, and interpretable decomposition-prediction-reconstruction frameworks, accompanied by more rigorous and context-aware evaluation protocols to ensure both predictive accuracy and practical reliability.
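The point about asymmetric error costs can be illustrated with a small sketch. The metric below is one possible weighted absolute error in which under-prediction is penalized more heavily than over-prediction. The function name and the 2:1 weighting are assumptions made purely for illustration; they are not a metric proposed in the reviewed studies.

```python
def asymmetric_mae(actual, predicted, under_weight=2.0, over_weight=1.0):
    """Mean absolute error with a heavier penalty when flow is underestimated.

    The weights are hypothetical; in practice they would reflect the relative
    operational cost of missing a peak versus raising a false alarm.
    """
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        error = y - y_hat  # positive error: the model underestimated the flow
        weight = under_weight if error > 0 else over_weight
        total += weight * abs(error)
    return total / len(actual)


peak_flow = [100.0, 100.0]
print(asymmetric_mae(peak_flow, [90.0, 90.0]))    # under-prediction by 10
print(asymmetric_mae(peak_flow, [110.0, 110.0]))  # over-prediction by 10
```

Both forecasts in the usage example have the same plain MAE, yet the weighted score separates them, which is the distinction that symmetric metrics hide.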
4.2. Heuristic Optimization Algorithm Combined with LSTM Prediction
The primary functional objective of optimization-LSTM hybrids is search space reduction: heuristic algorithms replace expert-driven manual tuning by efficiently exploring the hyperparameter configuration space. These studies aim to address the reliance on expert-driven manual tuning of LSTM hyperparameters (e.g., number of layers, neurons, learning rate) by employing heuristic algorithms to automatically search for optimal configurations.
The implementation shown in Figure 9 uses the IVY optimization algorithm to automatically select hyperparameters for an LSTM model and train it. The process begins with data preprocessing and defining the key hyperparameters, such as the number of neurons in a two-layer LSTM, dropout rate, and batch size, along with their corresponding search ranges. The IVY algorithm generates an initial population of candidate hyperparameter sets, with each individual representing a unique combination. Through iterative evolution, each LSTM model is evaluated using metrics such as Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R²), thereby guiding the search toward the global optimum.
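The population-based search loop described above can be sketched as follows. Particle swarm optimization, one of the metaheuristics reviewed in this section, is used here in place of IVY, and a cheap synthetic function replaces the expensive step of training an LSTM and measuring its validation error. The search ranges, swarm coefficients, and the surrogate loss are all hypothetical.

```python
import random


def surrogate_validation_loss(units, dropout):
    """Hypothetical stand-in for 'train an LSTM, return validation RMSE'."""
    return ((units - 64.0) / 64.0) ** 2 + (dropout - 0.2) ** 2


# Assumed search ranges: (hidden units, dropout rate)
BOUNDS = [(8.0, 256.0), (0.0, 0.5)]


def clip(value, low, high):
    return max(low, min(high, value))


def particle_swarm_search(loss_fn, bounds, n_particles=12, n_iters=30, seed=0):
    rng = random.Random(seed)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds]
                 for _ in range(n_particles)]
    velocities = [[0.0] * len(bounds) for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_loss = [loss_fn(*p) for p in positions]
    best_idx = min(range(n_particles), key=personal_loss.__getitem__)
    global_best = personal_best[best_idx][:]
    global_loss = personal_loss[best_idx]
    history = [global_loss]  # best loss seen so far, per iteration

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best
                velocities[i][d] = (
                    0.7 * velocities[i][d]
                    + 1.5 * r1 * (personal_best[i][d] - positions[i][d])
                    + 1.5 * r2 * (global_best[d] - positions[i][d])
                )
                positions[i][d] = clip(positions[i][d] + velocities[i][d], lo, hi)
            loss = loss_fn(*positions[i])
            if loss < personal_loss[i]:
                personal_best[i], personal_loss[i] = positions[i][:], loss
                if loss < global_loss:
                    global_best, global_loss = positions[i][:], loss
        history.append(global_loss)
    return global_best, global_loss, history


best, best_loss, history = particle_swarm_search(surrogate_validation_loss, BOUNDS)
print(best, best_loss)
```

Note that the optimizer brings its own settings (swarm size, inertia, acceleration coefficients), which illustrates the additional parameter-setting burden that such methods introduce.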
Moreover, the workflow explicitly illustrates the internal structure of an LSTM unit, including the transmission of the input (x), cell state (c), and hidden state (h) across time steps. This demonstrates how temporal dependencies are captured and processed. Overall, such heuristic-driven automated tuning strategies effectively reduce reliance on manual expertise and enhance the predictive performance of LSTM models in time series forecasting. As shown in
Table 8, Dong et al. [
91] introduced a novel dung beetle optimizer to tune LSTM hyperparameters and achieved high precision in maritime traffic forecasting, demonstrating its effectiveness in complex optimization tasks. Jardines et al. [
92] applied LSTM to convective weather prediction in aviation, offering valuable decision support for air traffic flow management and highlighting LSTM’s potential in spatiotemporal forecasting. Guo et al. [
93] proposed the MVHS-LSTM, which dynamically selects epochs and learning rates through heuristic iteration and integrates ordinary least squares for feature selection, achieving a balance between accuracy and efficiency. Fu et al. [
94] extended an LSTM-CNN model with Bayesian inference to quantify predictive uncertainty, a critical aspect of risk management in traffic systems. Cini and Aydin [
95] developed a deep ensemble model that adaptively weights base learners according to past performance rather than simple averaging, leading to more responsive forecasting. In addition, Zhuang and Cao [
96] leveraged K-nearest neighbors for spatial filtering and BiLSTM for temporal modeling, proving effective on UK highway traffic data. Vijayalakshmi et al. [
97] utilized stacked LSTM autoencoders for weather-feature compression and combined BiLSTM and CNN to achieve accurate traffic prediction and congestion recognition in multivariate settings. Cao et al. [
98] adopted whale optimization to tune LSTM parameters and applied multi-channel graph convolution to capture spatial dependencies, improving regional forecasting accuracy. Hussain et al. [
99] demonstrated the suitability of deep networks in urban environments through a hybrid GRU–BiLSTM architecture, while Lan et al. [
100] and Wang et al. [
101] applied grey wolf optimization and particle swarm optimization, respectively, confirming the broad applicability of different heuristics in LSTM tuning. Overall, these methods automate hyperparameter search, enhance model performance and robustness, and validate a wide range of optimization algorithms. However, they also suffer from high computational overhead, risk of convergence to suboptimal solutions, new parameter-setting demands for the optimizers themselves, reduced interpretability of the tuning process, and increased engineering complexity in practical deployment. A methodologically distinct contribution to ensemble-based traffic prediction is the Improved Bayesian Combination Model with Deep Learning (IBCM-DL) proposed by Gu et al. [
102]. Unlike heuristic search strategies that optimize a single model’s hyperparameters, IBCM-DL employs a principled Bayesian weighting mechanism to combine three heterogeneous sub-predictors, namely a gated recurrent unit neural network (GRUNN), an autoregressive integrated moving average model (ARIMA), and a radial basis function neural network (RBFNN), into a unified probabilistic forecasting framework. The Bayesian combination assigns posterior weights to each sub-model based on its predictive likelihood, dynamically reflecting each model’s relative reliability under varying traffic conditions. Empirical validation on highway traffic data from Beijing demonstrates that this approach effectively overcomes the error magnification phenomenon inherent in traditional fixed-weight combination schemes, yielding superior accuracy and stability compared to individual deep learning models, classical machine learning methods, and naive ensemble averaging. From a methodological standpoint, IBCM-DL represents an important bridge between classical Bayesian model averaging and modern deep learning ensembles: it provides theoretically grounded uncertainty quantification while retaining the flexibility of data-driven sub-models. However, the framework’s performance is sensitive to the initial selection and diversity of constituent sub-models, and extending the Bayesian weighting mechanism to accommodate more complex architectures such as Transformer-based encoders or graph neural networks remains an open challenge requiring further theoretical development.
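The idea of likelihood-based posterior weighting can be sketched in generic form. The example below is a textbook-style Bayesian model averaging step with a Gaussian error likelihood; it is not the IBCM-DL algorithm itself, whose exact formulation is not reproduced here. The model names, error values, and noise scale are hypothetical.

```python
import math


def gaussian_log_likelihood(errors, sigma):
    """Log-likelihood of recent forecast errors under zero-mean Gaussian noise."""
    norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(-0.5 * (e / sigma) ** 2 - norm for e in errors)


def posterior_weights(recent_errors, priors, sigma=10.0):
    """Posterior weight of each sub-model: prior times likelihood, normalized."""
    log_post = {
        name: math.log(priors[name]) + gaussian_log_likelihood(errs, sigma)
        for name, errs in recent_errors.items()
    }
    peak = max(log_post.values())  # subtract the peak for numerical stability
    unnormalized = {name: math.exp(v - peak) for name, v in log_post.items()}
    total = sum(unnormalized.values())
    return {name: u / total for name, u in unnormalized.items()}


def combine(predictions, weights):
    """Weighted combination of the sub-model forecasts."""
    return sum(weights[name] * predictions[name] for name in predictions)


# Hypothetical recent one-step errors (veh/h) for three sub-predictors
errors = {"gru": [2.0, -3.0, 1.0], "arima": [8.0, -6.0, 7.0], "rbf": [5.0, -4.0, 6.0]}
priors = {"gru": 1 / 3, "arima": 1 / 3, "rbf": 1 / 3}
weights = posterior_weights(errors, priors)
print(weights)
print(combine({"gru": 410.0, "arima": 395.0, "rbf": 402.0}, weights))
```

Recomputing the weights over a sliding window of recent errors is what lets the combination shift toward whichever sub-model is currently more reliable, in contrast to a fixed-weight scheme.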
Scientifically, the heuristic optimization paradigm as applied to LSTM hyperparameter tuning raises fundamental concerns about validity and reproducibility. Most studies treat hyperparameter optimization as a black-box search problem, reporting the best configuration found on a specific dataset without analyzing the sensitivity of model performance to hyperparameter perturbations. This makes it unclear whether the reported gains reflect genuine improvements in model architecture or merely fortuitous configurations tailored to specific data characteristics. Additionally, the no-free-lunch theorem implies that the superiority of any particular heuristic optimizer is dataset-dependent; yet comparative evaluations across optimizers on common traffic benchmarks are conspicuously absent from the literature. The absence of statistical significance testing—such as reporting confidence intervals or conducting Wilcoxon signed-rank tests over multiple runs—further undermines the scientific credibility of performance comparisons in this subfield.
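As one way to report a confidence interval over multiple runs, the sketch below computes a percentile bootstrap interval for the mean per-run RMSE difference between a baseline and a tuned model. The percentile bootstrap is only one of several valid choices, and the run-level differences are made-up numbers used solely to exercise the code.

```python
import random
from statistics import mean


def bootstrap_ci(differences, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(differences)
    resampled_means = sorted(
        mean(rng.choices(differences, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-run values of (baseline RMSE - tuned RMSE) over 10 seeds
rmse_gain = [0.4, 0.6, 0.5, 0.7, 0.3, 0.5, 0.6, 0.4, 0.5, 0.6]
low, high = bootstrap_ci(rmse_gain)
print(low, high)
```

If the interval excludes zero, the improvement is unlikely to be an artifact of a single favorable run; if it straddles zero, the reported gain should be treated with caution.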
4.3. LSTM Combined with Attention Mechanism for Prediction
The primary functional objective of attention-LSTM hybrids is feature selection stabilization: attention mechanisms direct the model’s capacity toward temporally or spatially informative features, reducing noise sensitivity and improving long-range dependency modeling. This line of research enhances traffic flow forecasting by integrating attention mechanisms into deep learning models, enabling the model to automatically prioritize key time steps or features and thereby address long-term dependency issues.
The implementation shown in Figure 10 is a hybrid prediction model that combines Transformer and bidirectional LSTM (BiLSTM). The model takes a decomposed single IMF component as input, processed through a sliding window, and first incorporates positional encoding to inject sequential information. The core multi-head attention module then captures important temporal dependencies, with a masking mechanism to prevent future information leakage and residual connections to stabilize training. The attention-weighted features are subsequently fed into a BiLSTM layer for further sequential modeling, capturing both forward and backward dependencies. Finally, the processed features pass through a fully connected layer with dropout for integration and transformation, producing the final prediction output, which is then compared with the true values. This “attention-enhanced sequential modeling” strategy effectively leverages both global temporal correlations and deep bidirectional dependencies, improving the accuracy and robustness of traffic flow forecasting. As shown in
Table 9, a representative example is the hybrid model proposed by Aburasain [
103], where attention networks dynamically weight spatiotemporal features extracted by Bi-LSTM and CNN, allowing the model to focus on the most congestion-relevant information. Jia et al. [
104] further extended attention to multiple domains—temporal, spatial, and frequency—by incorporating Transformer-based self-attention to learn frequency-domain representations, leading to more comprehensive feature modeling. In contrast, Shuvro et al. [
105] replaced LSTM entirely with a Transformer architecture, leveraging intrinsic self-attention to parallelize the learning of long-range spatiotemporal dependencies and embedding predictions within an SDN-VANET framework for networked transportation. Song et al. [
106] developed the TransFusion model, which applies Transformer-based attention at the fusion level to dynamically integrate outputs from both TCN and LSTM, allowing the model to decide which base predictor is more reliable under varying input conditions. Overall, attention mechanisms improve model expressiveness, offer a degree of interpretability through weight visualization, and naturally support variable-length inputs. However, they also pose challenges, including high computational and memory costs (particularly for self-attention), increased training complexity due to larger architectures, limited causal interpretability because attention weights do not precisely reflect causal influence, and sensitivity to noisy data that may mislead attention allocation.
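The masking mechanism that prevents future information leakage can be shown with a minimal single-head sketch of causal scaled dot-product attention. For brevity, the learned query, key, and value projections of a real Transformer layer are omitted (the inputs are used directly), and the toy embeddings are arbitrary values chosen for illustration.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Step t attends only to steps 0..t, so no future value can leak in.
    """
    dim = len(keys[0])
    outputs, weight_rows = [], []
    for t, query in enumerate(queries):
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)  # masked: future steps are never scored
        ]
        weights = softmax(scores)
        output = [
            sum(weights[s] * values[s][j] for s in range(t + 1))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        # Pad with zeros so each row shows the weight on every time step
        weight_rows.append(weights + [0.0] * (len(keys) - t - 1))
    return outputs, weight_rows


# Toy sequence of 3 time steps with 2-dimensional embeddings
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weight_rows = causal_self_attention(x, x, x)
for row in weight_rows:
    print(row)
```

The printed weight rows are the quantities typically visualized in attention-based studies; they describe where the model looked, not which inputs causally determined the output.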
5. Discussion and Prospects
5.1. Discussion
This review systematically examines the latest advances in traffic flow prediction methods, with a particular focus on graph neural network-based approaches and hybrid deep learning frameworks. Through analysis of four main research categories, several important conclusions are drawn.
First, graph neural networks demonstrate significant advantages in capturing the inherent spatial dependencies of road networks. The evolution from static graph structures to dynamic adaptive methods represents a key advancement in time-varying traffic pattern modeling. Methods such as federated learning-based graph neural networks successfully address key challenges of data privacy and cross-regional collaboration, while dynamic graph construction techniques enable models to adapt to changing traffic conditions in real time.
Second, the fusion of multiple graph structures and attention mechanisms has proven effective in enhancing the expressive power of models. Multi-graph fusion methods can simultaneously capture different types of spatial relationships, such as physical connectivity, functional similarity, and flow correlation, thus providing a more comprehensive representation of traffic networks. Attention mechanisms further improve prediction accuracy by enabling models to selectively focus on the most relevant spatiotemporal features.
Third, hybrid methods combining decomposition algorithms, heuristic optimization, and attention mechanisms with deep learning models show promising application prospects in addressing the non-stationarity and complexity of traffic data. Decomposition-based methods effectively reduce prediction difficulty by transforming complex signals into more stable components, while optimization algorithms can automatically handle hyperparameter tuning and improve model robustness.
However, despite these advances, several challenges remain. Many state-of-the-art models suffer from high computational complexity, limiting their application in real-time traffic management systems. The “black box” nature of deep learning models raises concerns about their interpretability, which is crucial for practical deployment and decision-making. Furthermore, while most methods perform well on specific datasets, they lack sufficient validation across diverse traffic scenarios and geographical regions, raising questions about their generalization capabilities.
5.2. From Prediction to Operation: Bridging the Research-Practice Gap
A recurring limitation identified across the reviewed literature is the insufficient articulation of how prediction model outputs are translated into actionable traffic operational decisions. Most existing studies evaluate model performance exclusively in terms of predictive accuracy metrics—MAE, RMSE, and MAPE—without specifying how predicted variables are consumed by downstream control systems. This disconnect constrains the practical value of otherwise technically sophisticated models and represents a critical barrier to large-scale ITS deployment.
As illustrated in
Figure 11, we propose a conceptual six-layer framework that explicitly maps the prediction-to-operation pipeline. At the core of this linkage, three categories of predicted state variables serve as direct inputs to operational decision systems. First, short-term flow (q), speed (v), and density (k) forecasts feed directly into adaptive signal control algorithms, where predicted saturation flow determines green phase allocation and cycle length adjustment. Second, travel time and congestion index predictions drive dynamic route guidance systems, informing variable message sign content and real-time navigation re-routing recommendations. Third, origin-destination demand forecasts support higher-level decisions including congestion pricing rate adjustment, transit fleet dispatching, and access restriction enforcement. The translation from predicted variables to control actions is not direct but mediated by a decision support interface comprising three functional components: threshold-based trigger logic (activating control responses when predicted states exceed operational thresholds), multi-objective optimization (balancing throughput, delay, emissions, and equity), and uncertainty quantification (propagating prediction confidence intervals into risk-aware control strategies). This intermediate layer is largely absent from current traffic prediction research, yet it is precisely where academic models must interface with real-world traffic management center infrastructure.
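A minimal sketch of the threshold-based trigger logic, combined with a simple form of uncertainty propagation, is given below. It assumes the predictor returns a mean and a standard deviation for the flow, and it triggers on an upper predictive bound rather than on the point forecast. The thresholds, the z-value, the capacity figure, and the action names are hypothetical and do not correspond to any specific traffic management system.

```python
from dataclasses import dataclass


@dataclass
class FlowForecast:
    mean: float  # predicted flow (veh/h)
    std: float   # predictive standard deviation (veh/h)


def upper_bound(forecast, z=1.645):
    """One-sided upper bound of the forecast (about 95% under a normal assumption)."""
    return forecast.mean + z * forecast.std


def select_action(forecast, capacity, warn_ratio=0.85, act_ratio=0.95):
    """Map a probabilistic forecast to a control response via thresholds."""
    ratio = upper_bound(forecast) / capacity
    if ratio >= act_ratio:
        return "extend_green"
    if ratio >= warn_ratio:
        return "monitor"
    return "no_action"


capacity = 1800.0  # assumed approach capacity (veh/h)
print(select_action(FlowForecast(mean=1500.0, std=50.0), capacity))
print(select_action(FlowForecast(mean=1500.0, std=150.0), capacity))
```

The two calls share the same point forecast but differ in predictive spread, so a risk-aware trigger can respond differently to them, something a point-prediction interface cannot express.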
Furthermore, the framework highlights the importance of a real-time feedback loop: observed traffic responses to control actions are continuously fed back into the data input layer, enabling online model updating and closed-loop system adaptation. This feedback mechanism is essential for maintaining prediction accuracy under non-stationary traffic conditions—particularly during incidents, special events, or demand shifts—and connects directly to the federated learning and dynamic graph structure paradigms reviewed in
Section 3.1 and
Section 3.2. In summary, future traffic flow prediction research should not treat operational integration as an afterthought. Model design choices—including prediction horizon, output granularity, uncertainty representation, and computational latency—should be explicitly aligned with the requirements of target control applications. Establishing standardized prediction-to-operation interfaces would not only improve the practical deployability of advanced models but also enable more ecologically valid evaluation protocols that assess system-level performance rather than isolated predictive accuracy.
5.3. Future Research Prospects
Future research on traffic flow prediction should prioritize several concrete and technically grounded directions rather than broad conceptual aspirations.
First, model interpretability requires systematic methodological advancement rather than general calls for explainable AI. Future work should explicitly integrate data-driven architectures with established traffic flow theories (e.g., fundamental diagram models and shockwave theory) to impose physics-informed constraints on learned representations. Instead of relying solely on post-hoc attention visualization, structural interpretability should be embedded into model design and validated through controlled perturbation experiments.
Second, real-time efficiency must be addressed through measurable architectural simplification. Research should quantify the trade-off between prediction accuracy and latency by incorporating standardized runtime benchmarks. Techniques such as structured pruning, low-rank factorization, and knowledge distillation should be evaluated under realistic deployment constraints (e.g., edge computing nodes with limited memory and heterogeneous processing capacity), rather than solely reporting offline accuracy improvements.
Third, privacy-preserving collaboration needs more rigorous protocol-level analysis. Within federated learning frameworks, differential privacy budgets, secure aggregation schemes, and communication costs should be explicitly reported and compared. Future studies should define clear threat models and evaluate the performance–privacy trade-off instead of treating privacy mechanisms as add-on components.
Fourth, cross-regional generalization should be validated through cross-city transfer experiments and out-of-distribution testing. Transfer learning and meta-learning approaches must demonstrate consistent performance under heterogeneous traffic regimes rather than relying on single-dataset evaluations. Similarly, multimodal data fusion (e.g., weather, events, travel demand signals) should be assessed through ablation studies to quantify incremental contributions and avoid over-parameterized fusion architectures.
Fifth, uncertainty quantification and decision integration should move beyond point prediction. Probabilistic forecasting methods need calibration evaluation (e.g., reliability diagrams, coverage probability) and should be tested in downstream decision-making scenarios, such as congestion mitigation or signal control. Closed-loop reinforcement learning frameworks that combine prediction and control should report system-level metrics, including stability and safety, rather than isolated prediction gains.
Finally, the field would benefit from standardized benchmarks and evaluation protocols, including unified dataset splits, client partitioning strategies for federated settings, consistent reporting of computational overhead, and reproducibility checklists. Without methodological standardization, comparative conclusions remain fragile.
In summary, future progress in traffic flow prediction depends less on speculative architectural expansion and more on theoretically grounded modeling, rigorous experimental design, reproducible benchmarking, and deployment-aware evaluation.
6. Conclusions
This paper provides a comprehensive and systematic review of traffic flow prediction methods, focusing on the latest advancements in graph neural network (GNN)-based approaches and hybrid deep learning frameworks. Through detailed analysis and classification of existing research, several important conclusions are drawn. GNNs have become powerful tools for traffic flow prediction due to their inherent ability to model the spatial dependencies of road networks. Our review reveals a clear evolutionary trajectory of GNNs from static, predefined graph structures to dynamic, adaptive, and learnable graph representations. Federated learning-based GNN methods successfully address key challenges of data privacy and cross-regional collaboration, enabling knowledge sharing without centralizing sensitive traffic data. Dynamic graph construction techniques demonstrate excellent adaptability to time-varying traffic patterns, while multi-graph fusion methods effectively capture complex spatial relationships from multiple perspectives. Hybrid deep learning frameworks combining decomposition algorithms, heuristic optimization, and attention mechanisms with recurrent neural networks show great potential in addressing the inherent non-stationarity and complexity of traffic data. Decomposition-based methods effectively stabilize traffic signals by transforming them into more predictable components. Heuristic optimization algorithms can automate the highly challenging task of hyperparameter tuning, thereby improving model performance and practical applicability. Attention mechanisms enhance the expressive power of models by selectively focusing on the most relevant spatiotemporal features. However, existing methods still face numerous challenges. High computational complexity limits their real-time deployment in large-scale networks. 
The lack of interpretability in deep learning models hinders their widespread adoption in safety-critical applications. The generalization ability of models across different traffic scenarios and geographical regions has not been fully validated. While privacy-preserving technologies hold great promise, further development is needed to balance practicality and security. These challenges underscore the necessity for continuous research and innovation. Analysis of existing methods shows that no single approach dominates in all scenarios. The choice of an appropriate method depends on specific application requirements, including prediction horizons, data availability, computational resources, and privacy constraints. Practitioners should carefully weigh these trade-offs when designing traffic prediction systems. In conclusion, despite significant progress made by graph neural networks and hybrid deep learning methods in traffic flow prediction, there is still considerable room for development in this field. Future research should prioritize model interpretability, computational efficiency, robust privacy protection, and practical deployment capabilities. By addressing these challenges and leveraging emerging technologies, researchers can develop more efficient, reliable, and practical traffic prediction systems, thereby making meaningful contributions to the realization of intelligent transportation systems and sustainable urban development.