1. Introduction
With the rapid progress of urbanization, the total number of motor vehicles nationwide reached 435 million as of 2023, and the average annual number of newly registered vehicles exceeds 34 million. The rapid growth of the vehicle population has significantly exacerbated traffic congestion and accident risks. From 2015 to 2020, the number of traffic accident fatalities grew at an average annual rate of 3.35%. The peak congestion index in major cities generally exceeds 2.0, and in developed cities the congestion index during the morning and evening rush hours even reaches 3.0. Meanwhile, the frequent starts and stops caused by congestion not only reduce traffic efficiency but also trigger emotional fluctuations among drivers, further increasing the likelihood of accidents [1, 2, 3, 4]. In this context, Intelligent Transportation Systems (ITS) have emerged as a crucial solution to these challenges. Traffic flow prediction, an essential component of ITS, can offer scientific support for the management and planning of urban transportation systems. Therefore, improving the accuracy of traffic flow prediction is of paramount importance for formulating effective traffic management strategies that alleviate congestion, reduce accident probabilities, improve travel and network efficiency, and reduce energy consumption. Traditional traffic prediction methods typically focus on extracting spatiotemporal characteristics from historical traffic data to forecast future traffic conditions. However, traffic conditions are not solely determined by historical traffic data; they are also significantly influenced by external information, including weather conditions, Points of Interest (POI), road conditions, and other environmental factors [5, 6, 7, 8, 9].
The main contributions of this work are as follows: (1) We design a multi-source information fusion module that explores the external temporal relationships between traffic flow and environmental factors, as well as their inherent spatial correlations with the road network structure, through feature-level fusion and multi-graph convolution fusion, thereby enhancing the model’s ability to perceive complex scenarios. (2) We design a spatiotemporal attention module that dynamically adjusts the model’s focus on different time periods and regions through an attention mechanism, improving prediction accuracy. (3) We propose a Spatio-temporal Multi-graph Convolution Traffic Flow Prediction Model Based on Multi-source Information Fusion and Attention Enhancement (MIFA-ST-MGCN). This model integrates multi-source heterogeneous information, fusing external environmental data, regional similarity features, and traffic flow information, and incorporates the spatiotemporal attention module for feature enhancement. Traffic flow is then predicted through a spatiotemporal graph convolutional network, and the model’s performance is evaluated on real traffic flow datasets. Additionally, we design ablation experiments to validate the effectiveness of attribute data fusion, multi-graph convolution, and the spatiotemporal attention mechanism. We also conduct a perturbation analysis, in which Gaussian and Poisson noise is added to the original data to test the model’s robustness and stability.
2. Related Works
Traffic flow prediction is a crucial component of Intelligent Transportation Systems (ITS) and plays an important role in urban traffic control and development. The field has evolved through several stages. In early research, limited understanding of the problem and technical constraints led traffic flow prediction to be treated simply as a time-series statistical task, in which mathematical statistics were used to extract linear or periodic patterns from historical traffic data to predict future traffic conditions. Among these methods, the Historical Average (HA) model [10] predicts future traffic flow states from historical averages. This model is simple in principle and computationally fast, but it suffers from low prediction accuracy and is difficult to apply to complex traffic scenarios. Time series models such as the ARMA model and its variants [11, 12, 13] are based on autoregressive and moving average components. They predict from the relationship between current and historical data while modeling periodicity and trends in the data. These models capture linear dependencies well and are suitable for short-term traffic flow prediction where the data exhibit clear periodicity. They are interpretable, computationally efficient, and cost-effective, making them widely used in simpler forecasting scenarios. However, they assume that the time series is stationary and cannot handle nonlinear traffic features, making them ill-suited for dynamic changes.
With the continuous advancement of computer technology, machine learning-based models have gradually been applied to traffic forecasting. Representative algorithms include K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Bayesian Networks, among others. The K-Nearest Neighbors algorithm [14] selects the k historical samples most similar to the current state and predicts traffic flow through a weighted average. SVMs [15] use kernel functions to map low-dimensional nonlinear traffic data to high-dimensional spaces, constructing an optimal separating hyperplane to handle nonlinear features. Bayesian Networks [16] model conditional dependencies among variables by constructing a directed acyclic graph (DAG) and fit traffic flow using a Gaussian Mixture Model (GMM) with joint probability distributions, from which future traffic conditions are predicted. Compared to models based on mathematical statistics, these machine learning-based algorithms are capable of modeling more complex traffic flow features. However, their ability to capture nonlinear traffic characteristics and long-term dependencies remains limited.
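As a rough illustration of the KNN idea described above, the sketch below forecasts the next value of a series from the k historical windows most similar to the current window. The function name `knn_forecast`, the Euclidean distance, and the inverse-distance weighting are assumptions chosen for illustration; they are not taken from the cited work.

```python
import numpy as np

def knn_forecast(history, current, k=3, eps=1e-8):
    """Predict the next value from the k historical windows closest to `current`."""
    history = np.asarray(history, dtype=float)
    current = np.asarray(current, dtype=float)
    w = len(current)
    # Every window of length w that has a known successor value.
    windows = np.array([history[i:i + w] for i in range(len(history) - w)])
    targets = history[w:]  # targets[i] is the value that followed windows[i]
    dists = np.linalg.norm(windows - current, axis=1)
    nearest = np.argsort(dists)[:k]
    # Assumed weighting scheme: closer windows receive larger weights.
    weights = 1.0 / (dists[nearest] + eps)
    return float(np.sum(weights * targets[nearest]) / np.sum(weights))
```

On a strictly periodic series such as 1, 2, 3, 1, 2, 3, ..., the window (1, 2) is always followed by 3, so the weighted average of its nearest neighbors recovers that value.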
In recent years, deep learning has attracted significant attention from researchers due to its advantages in capturing nonlinear features and handling complex scenarios. Traffic flow data is a typical form of time-series data, and extracting its inherent temporal characteristics is one of the key challenges in traffic flow prediction. Recurrent Neural Networks (RNNs) have been widely applied in traffic prediction tasks due to their ability to effectively capture temporal dependencies in time-series data [
17]. However, during the backpropagation process, the gradient can vanish, causing RNNs to be influenced only by short-term memory, thereby failing to capture long-term temporal features. To address this issue, researchers have designed Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRU) to capture long-term temporal dependencies in traffic flow [
18,
19]. Although these models can capture the temporal dependencies of traffic flow, researchers have gradually recognized the importance of spatial dependencies in traffic prediction tasks. By incorporating Convolutional Neural Networks (CNNs) to extract spatial information from roads and combining them with LSTMs, the prediction accuracy can be improved [
20]. However, since CNNs are primarily designed for regular grid data, they cannot directly handle irregular topological structures, thus failing to fully capture the spatial dependencies of traffic flow. Graph Convolutional Networks (GCNs), on the other hand, can directly process irregular graph network structures, enabling a better exploration of the spatial characteristics of traffic networks [
21].
Beyond historical traffic data and spatial topology, traffic flow is also influenced by a variety of external factors. For instance, weather conditions, special events, and the spatial distribution of points of interest can all introduce disturbances to traffic patterns [22, 23, 24]. Moreover, regions with similar road network structures often exhibit comparable traffic behaviors, and interregional interactions further affect the evolution of traffic flows. Effectively integrating these external factors and capturing interregional dependencies within predictive models remains a significant challenge in contemporary traffic flow forecasting [25, 26, 27].
Notably, some recent research on fundamental traffic flow theory has provided new perspectives for understanding and predicting traffic congestion. On the one hand, studies from a traffic flow phase-transition perspective have proposed a “congestion boundary” approach to characterize the critical transformation between free-flow and congested states. This method leverages the bimodal distribution of traffic parameters (e.g., speed, density) to identify threshold values that separate free flow from congestion, thereby quantifying the critical conditions for traffic breakdown. For example, using real highway data, Lee et al. [
28] estimated a congestion boundary at approximately 66.9 km/h (speed), 22.8 vehicles per kilometer (density), and about 341 vehicles per five-minute interval (flow). Based on these thresholds, it was observed that actual roadways can experience a phase transition into congestion when flow reaches only around 72–83% of the conventional theoretical capacity. Thus, the congestion boundary method provides a theoretical foundation for pinpointing the onset of congestion and enabling effective congestion management. On the other hand, another line of research has drawn inspiration from the theory of phase transitions in simple fluids, treating traffic flow as a fluid system analogous to a gas–liquid phase transition, in order to explore the scaling laws of urban congestion [
29]. Laval et al. proposed that the fundamental diagram of traffic flow is analogous to the coexistence curve in gas–liquid phase transitions. Using this analogy, they demonstrated that urban traffic dynamics obey scaling relations characteristic of the Kardar–Parisi–Zhang (KPZ) universality class. Moreover, they found that the “costs” of congestion (such as travel delays and fuel consumption) scale superlinearly with city size (population), with growth even higher than predicted by conventional urban scaling theories [
30]. Taken together, these macro-level theoretical studies provide new insights into the underlying mechanisms of traffic congestion and offer important guidance for alleviating congestion in large cities. In addition, in the related task of travel time prediction, spatiotemporal deep learning models have achieved remarkable progress. Lee et al. combined a Gated Recurrent Unit (GRU) network with spatiotemporal analysis, proposing a model for highway travel-time prediction. This model explicitly integrates spatial dependencies among road segments and temporal dependency features within the GRU, effectively reducing the lag of predicted travel times relative to actual conditions. Experiments demonstrated that a GRU model augmented with spatiotemporal features outperforms traditional RNNs, LSTMs, as well as a GRU baseline without spatial information, achieving the highest accuracy in travel time prediction at both the segment and route levels. This finding indicates that incorporating spatial correlations into time-series predictions can significantly enhance the accuracy of travel time estimates.
In summary, advanced models such as AST-GCN and ST-GRAT have achieved notable progress in spatiotemporal traffic forecasting. AST-GCN pioneered the integration of external (exogenous) information via an attribute-enhancement module, while ST-GRAT dynamically captures road-network dependencies through a carefully designed spatiotemporal attention mechanism. Nevertheless, these methods still have limitations: the fusion strategy in AST-GCN is relatively simple and does not fully exploit diverse spatial relations; and although ST-GRAT excels in attention modeling, its ability to fuse heterogeneous multi-source exogenous data (e.g., weather, POIs) remains underexplored. To address these limitations, we propose the MIFA-ST-MGCN model, whose core innovations are as follows:
Multi-level multi-source information fusion architecture: Unlike AST-GCN’s simple attribute concatenation, our model constructs three complementary graphs—a geographic adjacency graph, a POI functional-similarity graph, and a spatial-similarity graph—to enable deep, layer-wise fusion within the graph-convolutional hierarchy.
Adaptive multi-graph fusion mechanism: We devise a learnable weighted fusion scheme that dynamically adjusts the relative contributions of different graph structures to the forecasting objective, thereby addressing the weight-allocation challenge in multi-source information fusion.
Enhanced temporal modeling capacity: Building on conventional GRU-based sequence modeling, we incorporate Transformer-style multi-head self-attention to better capture long-range dependencies, thereby mitigating GRU’s limitations in long-sequence modeling.
The comparison of the characteristics of these models is shown in Table 1.
3. Model Design
MIFA-ST-MGCN captures spatiotemporal dependencies in traffic flow by integrating multi-graph convolution, spatiotemporal graph convolution, and spatiotemporal attention. The model can be divided into two branches: temporal and spatial. In the temporal branch, the model encodes the temporal attribute information and captures temporal features at different time scales through temporal attention. In the spatial branch, the model uses multi-graph convolution to capture various types of spatial dependencies and combines it with spatial attention to better capture the influence of key road segments on traffic flow. Finally, the features from both branches are fused to generate the prediction results.
3.1. Problem Definition
The goal of traffic flow prediction is to forecast future traffic conditions based on historical states and both internal and external information. The traffic state of a road network is typically described using metrics such as traffic volume, average speed, and road occupancy.
Definition 1. Traffic Network Graph G. In the field of traffic prediction, the traffic network can be represented as a graph $G = (V, E, A)$, where $V = \{v_1, v_2, \ldots, v_N\}$ denotes the set of sensors recording traffic-related information in the network, with $N$ being the number of sensors, i.e., the number of nodes. $E$ represents the set of road segments connecting pairs of sensors, where $|E|$ is the number of road segments, i.e., the number of edges. $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix used to represent the connectivity between sensors, where $a_{ij}$ is the element at the $i$-th row and $j$-th column of the adjacency matrix, indicating the connection status between nodes $v_i$ and $v_j$. If $a_{ij} = 1$, it indicates that there is a road segment connecting nodes $v_i$ and $v_j$; if $a_{ij} = 0$, it means there is no direct road segment connecting nodes $v_i$ and $v_j$. Therefore, the adjacency matrix is a binary matrix composed of 0 and 1.
Definition 2. Traffic Flow Feature Matrix X. We use traffic speed as the primary node feature on the road-network graph, forming a matrix $X \in \mathbb{R}^{N \times T}$. $x_i^t$ denotes the traffic speed at the $i$-th sensor node at time $t$.
Definition 3. Auxiliary Information K. Auxiliary information refers to environmental factors that influence the traffic flow state. We represent the environmental factors that affect traffic conditions as node-level auxiliary features, denoted by $K = \{K_1, K_2, \ldots, K_l\}$, where $l$ is the number of categories of auxiliary features. The auxiliary feature information of category $c$ is represented as $K_c$, where $k_{c,i}^{t}$ denotes the auxiliary feature information of category $c$ at the $i$-th sensor node at time $t$.
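To make the three definitions concrete, the following sketch builds a small binary adjacency matrix from a list of sensor pairs and allocates arrays of the shapes implied by Definitions 1 to 3. The helper name `build_adjacency`, the undirected (symmetric) treatment of road segments, and the toy sizes are assumptions for illustration only.

```python
import numpy as np

def build_adjacency(num_nodes, edges):
    """Binary adjacency matrix: entry (i, j) is 1 if a road segment links sensors i and j."""
    A = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    for i, j in edges:
        A[i, j] = 1.0
        A[j, i] = 1.0  # assumption: segments are treated as undirected
    return A

N, T = 4, 12                       # 4 sensors, 12 time steps (toy values)
edges = [(0, 1), (1, 2), (2, 3)]   # a simple chain of sensors
A = build_adjacency(N, edges)

rng = np.random.default_rng(0)
X = rng.uniform(20.0, 80.0, size=(N, T))   # traffic speed per node and time step

# Auxiliary information: one N x T array per category (hypothetical categories).
K = {"weather": np.zeros((N, T)), "poi": np.zeros((N, T))}
```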
To sum up, the traffic flow prediction problem can be viewed as learning the traffic flow information for the future time period $T'$ by combining the traffic network graph $G$, the flow feature matrix $X$, and the auxiliary information $K$, through the establishment of a function $f$ that models the relationship between these components, i.e.,
$$[x^{t+1}, x^{t+2}, \ldots, x^{t+T'}] = f(G, X, K) \quad (1)$$
3.2. Overall Framework
We propose a Spatio-temporal Multi-graph Convolution Traffic Flow Prediction Model based on Multi-source Information Fusion and Attention Enhancement (MIFA-ST-MGCN). As shown in Figure 1, the model mainly consists of data preprocessing, a spatiotemporal attention module, and a spatiotemporal convolution module.
3.3. Spatio-Temporal Dependency Modeling
Traffic flow data exhibits dependencies not only in the spatial domain but also in the temporal domain. We choose a spatiotemporal graph convolutional network as the base prediction model. The Temporal Graph Convolutional Network (TGCN) [31] integrates graph convolutional networks with temporal forecasting, modeling the spatial and temporal dimensions simultaneously. This approach effectively captures the spatiotemporal features of traffic flow and addresses the spatiotemporal dependencies in traffic flow prediction.
3.3.1. Spatial Modeling: Graph Convolution Operation
Graph Convolutional Networks (GCN) capture the spatial dependencies of each road segment through neighborhood aggregation. The modeling process of GCN is illustrated in Figure 2 and can be divided into an input layer, hidden layers, and an output layer. The inputs to the GCN are twofold: the node feature matrix $X \in \mathbb{R}^{N \times F}$, which describes the traffic conditions of each road segment, and the adjacency matrix $A \in \mathbb{R}^{N \times N}$, which represents the connectivity between road segments, where $N$ is the number of road segments and $F$ is the feature dimension of each road segment. The hidden layers are the core components of the GCN, where convolution operations are defined on the graph structure to extract features. The computation for the $(l+1)$-th layer hidden state can be expressed as:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \quad (2)$$
where $\tilde{A} = A + I_N$ is the adjacency matrix with self-loops added, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $H^{(l)}$ is the output of the $l$-th layer hidden state, $W^{(l)}$ is the weight matrix of the $l$-th layer, and $\sigma(\cdot)$ is a nonlinear activation function. The output layer performs classification and regression on the hidden layer data through a fully connected layer.
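The graph convolution described above can be sketched in a few lines of NumPy. The sketch assumes the widely used symmetric normalization $\tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ and a ReLU activation; the function names are illustrative and the weights are passed in rather than trained.

```python
import numpy as np

def normalize_adjacency(A):
    """Add self-loops, then apply symmetric degree normalization."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)          # every degree is >= 1 because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One hidden layer: aggregate neighbor features, project, apply ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Two connected road segments with one feature each.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, np.array([[1.0], [3.0]]), np.array([[1.0]]))
```

With two mutually connected nodes, every entry of the normalized matrix is 0.5, so each node's output is the average of its own feature and its neighbor's feature.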
3.3.2. Temporal Modeling: Gated Recurrent Unit
Traffic flow data is a typical form of time-series data, and effectively capturing the temporal features within the data is crucial for prediction accuracy. We use Gated Recurrent Units (GRUs) to model the temporal dynamics in the data. Compared to traditional Recurrent Neural Networks (RNNs), GRU effectively mitigates the vanishing gradient problem in long time series by introducing a gating mechanism. In contrast to Long Short-Term Memory (LSTM) networks, GRU optimizes the gating mechanism, resulting in a simpler structure and higher computational efficiency.
The operation process of the GRU in handling time-series data is shown in Figure 3. At time step $t$, the GRU receives the traffic flow feature $x_t$ at the current time step and the hidden state $h_{t-1}$ from the previous time step. Through the gating mechanism, it outputs the hidden state $h_t$ at the current time step and passes it as the input hidden state to the next time step. The GRU’s gating mechanism consists of two gates: the update gate and the reset gate. The update gate $u_t$ determines how much of the past traffic flow features $h_{t-1}$ should be retained at the current time step and how much of the current input feature $x_t$ should be integrated into the new hidden state $h_t$. The reset gate $r_t$ controls how much of the past traffic flow information should be forgotten. This process can be represented by Equations (3)–(6):
$$u_t = \sigma\left(W_u [x_t, h_{t-1}] + b_u\right) \quad (3)$$
$$r_t = \sigma\left(W_r [x_t, h_{t-1}] + b_r\right) \quad (4)$$
$$c_t = \tanh\left(W_c [x_t, (r_t \odot h_{t-1})] + b_c\right) \quad (5)$$
$$h_t = u_t \odot h_{t-1} + (1 - u_t) \odot c_t \quad (6)$$
Here, $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, $c_t$ is the candidate hidden state, and $W$ and $b$ are learnable weights and biases; $B$ represents the batch size, $N$ represents the number of nodes, and $H$ refers to the number of GRU units in the layer. We utilize GCN to extract spatial features at each time step, which are then input into a GRU to capture temporal dependencies, thereby constructing a fundamental spatiotemporal feature extraction framework.
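A single GRU update can be sketched as follows. The sketch assumes the common formulation in which the update gate `u` interpolates between the previous hidden state and a candidate state `c`; the parameter names, the dictionary layout, and the random initialization are hypothetical. Inputs have shape (B, N, F) and hidden states have shape (B, N, H).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(in_dim, hidden, seed=0):
    """Random weights for the update (u), reset (r), and candidate (c) transforms."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("u", "r", "c"):
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, (in_dim + hidden, hidden))
        params[f"b_{gate}"] = np.zeros(hidden)
    return params

def gru_step(x_t, h_prev, params):
    """One GRU time step applied independently to every node in the batch."""
    xh = np.concatenate([x_t, h_prev], axis=-1)
    u = sigmoid(xh @ params["W_u"] + params["b_u"])    # update gate
    r = sigmoid(xh @ params["W_r"] + params["b_r"])    # reset gate
    xrh = np.concatenate([x_t, r * h_prev], axis=-1)
    c = np.tanh(xrh @ params["W_c"] + params["b_c"])   # candidate state
    return u * h_prev + (1.0 - u) * c                  # blend old state and candidate
```

When the update gate saturates at 1, the previous hidden state is carried forward unchanged, which is the mechanism that lets gated units preserve information over long sequences.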
3.4. Multi-Source Information Fusion Modeling
As a complex open system, the transportation system is influenced by various factors such as weather conditions and geographical location, which in turn affect traffic flow states. Single-source traffic flow data is insufficient to comprehensively capture these complex influencing factors. Therefore, this model considers different types of information and designs various data fusion strategies to enhance the model’s adaptability to external disturbances and environmental changes.
3.4.1. Graph Structure Construction
The changes in traffic conditions are closely related to the spatial correlations between regions. We model inter-regional spatial correlations by constructing spatial and functional similarity graphs.
Spatial Similarity Graph: According to Tobler’s First Law of Geography [32], spatial entities exhibit spatial autocorrelation, namely, closer entities are more strongly correlated, whereas entities farther apart tend to be more dissimilar. Based on this principle, we construct a spatial similarity graph $G_S$, where $A_S$ denotes the spatial similarity matrix. The spatial similarity between nodes $v_i$ and $v_j$, denoted as $A_S(i, j)$, is calculated as shown in Equation (7), where $d_{ij}$ represents the path length between nodes $v_i$ and $v_j$.
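One common way to turn path lengths into similarities that decay with distance is a Gaussian (radial basis) kernel, sketched below. The kernel, the bandwidth `sigma`, and its default choice are assumptions for illustration and may differ from the exact form of Equation (7); the sketch only demonstrates the principle that closer nodes receive larger similarity values.

```python
import numpy as np

def spatial_similarity(dist, sigma=None):
    """Map a matrix of pairwise path lengths to similarities in (0, 1]."""
    dist = np.asarray(dist, dtype=float)
    if sigma is None:
        # Assumed default bandwidth: mean of the positive path lengths.
        sigma = dist[dist > 0].mean()
    return np.exp(-(dist ** 2) / (sigma ** 2))

# Path lengths (e.g., in kilometers) between three sensors.
d = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 3.0],
              [4.0, 3.0, 0.0]])
S = spatial_similarity(d)
```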
Functional Similarity Graph: As the distance between regions increases, the spatial correlation between them gradually decreases. However, due to the potential similarity in the distribution of Points of Interest (POI), the traffic flow states between regions may exhibit similar patterns of change. To deeply explore the correlation between POI distribution and regional traffic flow state changes, we use the Jensen–Shannon (JS) divergence to measure the functional similarity between two nodes and define the functional similarity graph $G_F$. $A_F$ is the functional similarity matrix, and the functional similarity $A_F(i, j)$ between nodes $v_i$ and $v_j$ is calculated as shown in Equations (8)–(10), where $D_{KL}(P_i \,\|\, P_j)$ represents the KL divergence between nodes $v_i$ and $v_j$, $D_{JS}(P_i \,\|\, P_j)$ represents the JS divergence between nodes $v_i$ and $v_j$, and, in Equation (8), $P_i$ represents the POI distribution feature of node $v_i$.
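The sketch below computes a JS-divergence-based similarity between POI distributions using the standard definitions of the KL and JS divergences. The mapping from divergence to similarity (one minus the base-2 JS divergence, which lies in [0, 1]) and the function names are assumptions for illustration; each node is assumed to have at least one POI so that its counts can be normalized.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p = 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    """Symmetric JS divergence based on the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def functional_similarity(poi_counts):
    """Pairwise similarity between nodes from per-node POI category counts."""
    counts = np.asarray(poi_counts, dtype=float)
    P = counts / counts.sum(axis=1, keepdims=True)  # normalize rows to distributions
    n = P.shape[0]
    S = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Assumed mapping: similarity = 1 - JS divergence.
            S[i, j] = S[j, i] = 1.0 - js_divergence(P[i], P[j])
    return S
```

Two nodes whose POI categories do not overlap at all have a JS divergence of 1 bit and therefore a similarity of 0 under this mapping, while nodes with identical distributions have a similarity of 1.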
3.4.2. Multi Graph Convolution Fusion
The impact of spatial data of the same type on the traffic state varies across different road segments, and the influence of different types of spatial data on the traffic state of the same road segment also differs. We employ a multi-graph convolutional fusion strategy to model the spatial dependencies that shape traffic flow. The processing flow of multi-graph convolution is illustrated in Figure 4.
The multi-graph convolution fusion strategy can be represented as
$$A_{fused} = w_1 A + w_2 A_F + w_3 A_S \quad (11)$$
where $A$ represents the Geographic Adjacency Matrix, $A_F$ is the POI Function Matrix, and $A_S$ is the Spatial Similarity Matrix. As shown in Equation (11), we assign a learnable weight to each graph structure and optimize these weights end-to-end via gradient-based training. We apply a Softmax to normalize the weights so that $w_1$, $w_2$, and $w_3$ are nonnegative and sum to one, ensuring a well-balanced contribution of each component to the fused adjacency matrix. This multi-graph convolution fusion strategy allows the model to adaptively emphasize different spatial relations based on the data: for example, traffic states may at certain times be driven more by physically adjacent links, whereas in other scenarios synchronous fluctuations among functionally similar regions may dominate. Compared with models that rely on a single adjacency, multi-graph convolution can dynamically capture heterogeneous types of spatial correlation, thereby enhancing the model’s ability to characterize complex traffic patterns.
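The Softmax-normalized fusion of several graphs can be sketched as below. The raw scores (`logits`) stand in for the learnable parameters, which in practice would be updated by gradient descent; here they are fixed numbers, and the function names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fuse_graphs(adjs, logits):
    """Weighted sum of adjacency matrices with nonnegative weights summing to one."""
    w = softmax(np.asarray(logits, dtype=float))
    fused = sum(wi * Ai for wi, Ai in zip(w, adjs))
    return fused, w

# Toy matrices standing in for the geographic, POI, and spatial-similarity graphs.
I = np.eye(2)
fused, w = fuse_graphs([I, 2 * I, 3 * I], logits=[0.0, 0.0, 0.0])
```

With equal scores, each graph receives a weight of one third, so the fused matrix is the plain average of the three inputs. Larger scores shift the weight toward the corresponding graph without ever making a weight negative.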
3.4.3. Feature Level Fusion
We integrate static attributes (e.g., POI) and dynamic exogenous variables (e.g., weather) with the traffic-flow inputs via feature concatenation. Compared to other methods, the feature concatenation operation is simple, and the model does not need to consider the spatial correlation effects between different nodes, allowing the model to focus more on the processing of temporal features.
The attribute features of static attribute data are fixed and do not change over time. For example, in the case of POI data, the distribution and quantity of POI remain constant within a given time and spatial range and do not change over time. Therefore, the feature matrix $S$ of static attribute data can be represented as $S = \{s_1, s_2, \ldots, s_N\}$, where $s_i$ represents the static attribute feature of the $i$-th node, and $N$ is the total number of nodes. The significant characteristic of dynamic attribute features is that their attribute information changes over time. For example, weather conditions change dynamically at different time points, and the weather at the current time is influenced by past weather conditions and affects future weather conditions. Therefore, the feature matrix $D$ of dynamic attribute data can be represented as $D = \{D_1, D_2, \ldots, D_N\}$, and the dynamic feature information of node $i$, $D_i$, is represented as $D_i = \{d_i^{t-m}, \ldots, d_i^{t-1}, d_i^{t}\}$, where $t$ is the current time step, and $m$ is the historical time step length. The fused attribute feature matrix at time $t$ can be represented as Equation (12):
$$E^t = [X^t, S, D^{t-m:t}] \quad (12)$$
$E^t$ represents the feature concatenation matrix at time $t$, where $X^t \in \mathbb{R}^{N \times F}$ denotes the traffic flow features (e.g., speed) of all $N$ road segments at time $t$. $N$ denotes the number of nodes, and $F$ denotes the feature dimensionality.
$S \in \mathbb{R}^{N \times p}$ is the static attribute feature matrix, and $p$ is the number of static attribute types. Since static attribute information does not change over time, the static attributes are repeatedly used at different time steps.
$D^{t-m:t} \in \mathbb{R}^{N \times w(m+1)}$ is the dynamic attribute feature matrix from $t-m$ to $t$, where $w$ is the number of dynamic attribute types. Since dynamic attribute features change over time, the information from time $t-m$ to $t$ is selected as the input for time $t$ when modeling the dynamic attribute matrix. Adopting straightforward feature concatenation to fuse attribute data is simple and effective to implement. Compared with more complex fusion schemes, plain concatenation avoids introducing excessive parameters at the fusion stage, deferring importance weighting to subsequent attention modules and thereby mitigating overfitting risk.
We use POI data as a semantically explicit static descriptor and weather data as a dynamic feature of traffic flow. Specifically, for POI we take, for each road segment, the dominant POI-category code as its POI feature, apply min–max normalization to scale it to [0, 1], and replicate it across time to fill the temporal dimension to length $T$. For weather, we likewise use the weather-category code as the weather feature, min–max normalize it to [0, 1], and replicate it across space to fill the spatial dimension to $N$ segments. For traffic-flow variables, we directly apply min–max normalization to [0, 1]. Through this dimensional replication (broadcasting), all inputs are expanded to a unified shape of $N \times T$ and then fed into subsequent modules for feature extraction and computation.
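The normalization and replication steps described above map naturally onto NumPy broadcasting. In the sketch below, the per-segment POI code is replicated along the time axis, the per-time-step weather code is replicated along the node axis, and the three normalized arrays are stacked along a new last axis. The stacking layout and the function names are assumptions for illustration.

```python
import numpy as np

def min_max(x):
    """Scale values to [0, 1]; a constant input maps to all zeros."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)

def build_inputs(speed, poi_code, weather_code):
    """Return an array of shape (N, T, 3): traffic, POI, and weather channels."""
    N, T = speed.shape
    traffic = min_max(speed)
    poi = np.broadcast_to(min_max(poi_code)[:, None], (N, T))          # repeat over time
    weather = np.broadcast_to(min_max(weather_code)[None, :], (N, T))  # repeat over nodes
    return np.stack([traffic, poi, weather], axis=-1)

speed = np.array([[10.0, 20.0, 30.0, 40.0],
                  [15.0, 25.0, 35.0, 45.0],
                  [50.0, 60.0, 70.0, 80.0]])
inputs = build_inputs(speed, poi_code=[0, 1, 2], weather_code=[0, 0, 1, 2])
```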
3.5. Attention-Enhancing Mechanism
Traffic flow data contains latent features that are difficult to capture in both the temporal and spatial dimensions. In the temporal dimension, traffic flow is nonlinear and dynamically changing, influenced by various external factors. In the spatial dimension, traffic flow exhibits complex spatial interactions, where the traffic state of other regions can directly or indirectly affect the traffic flow in the local region. We introduce a spatiotemporal attention module to capture spatiotemporal dependencies in traffic-flow data. The module consists of both temporal and spatial attention mechanisms. Temporal attention dynamically adjusts the weights of different time steps based on the historical sequence, while spatial attention highlights key road segments and graph structures that have a significant impact on traffic flow. Through a spatiotemporal multi-head self-attention mechanism, the model dynamically adjusts its attention to different time periods and regions, enhancing its ability to capture complex spatiotemporal dependencies.
The overall architecture of the spatiotemporal attention module is shown in Figure 5. This module employs a multi-head self-attention mechanism to capture the spatiotemporal dependencies in traffic flow data. The multi-head self-attention mechanism projects the traffic flow feature data onto multiple attention heads, and the weights for each attention head are computed in parallel. Then, the outputs of each attention head are concatenated together through a weighted fusion. The computation process of multi-head self-attention can be expressed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
$$\mathrm{head}_i = \mathrm{Attention}(X W_i^Q, X W_i^K, X W_i^V), \quad i = 1, \ldots, h$$
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$$
In the equations, $Q$ represents the feature of the current time step or the current road segment; $K$ denotes the traffic flow features of the historical time steps or other road segments; $V$ represents the weighted features of each time step or road segment; $h$ is the number of attention heads; $X$ is the feature matrix. When processing temporal features, $X \in \mathbb{R}^{B \times T \times D}$, where $B$ is the batch size, $T$ is the number of time steps, and $D$ is the feature dimension. When using the attention mechanism to capture dynamic correlations between nodes, $X \in \mathbb{R}^{B \times N \times D}$, where $N$ is the number of nodes, i.e., the number of road segments.
$W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are learnable weight matrices, and $d_k$ is the dimension of the Query and Key matrices.
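A compact NumPy sketch of scaled dot-product multi-head self-attention is given below. It assumes the standard Transformer-style formulation, a feature dimension divisible by the number of heads, and projection matrices supplied by the caller; the function names are illustrative. The same function serves temporal attention (sequence axis = time steps) and spatial attention (sequence axis = nodes).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X has shape (B, L, D), where L is the number of time steps or nodes."""
    B, L, D = X.shape
    d_k = D // num_heads  # assumption: D is divisible by num_heads

    def split(M):
        # (B, L, D) -> (B, heads, L, d_k)
        return M.reshape(B, L, num_heads, d_k).transpose(0, 2, 1, 3)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)   # (B, heads, L, L)
    weights = softmax(scores, axis=-1)                     # each row sums to one
    heads = weights @ V                                    # (B, heads, L, d_k)
    concat = heads.transpose(0, 2, 1, 3).reshape(B, L, D)  # concatenate the heads
    return concat @ W_o, weights
```

The attention weights form an L by L matrix per head, which is why the cost of this step grows quadratically with the number of time steps or nodes.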
3.6. Spatio-Temporal Multi Graph Convolution Model Based on Multi-Source Information Fusion and Attention Enhancement (MIFA-ST-MGCN)
Based on spatiotemporal graph convolutional networks, we have designed a spatiotemporal multi-graph convolution model with multi-source information fusion and attention enhancement to address the complex spatial dependencies and temporal dependencies in traffic flow prediction. As shown in Figure 6, the attribute feature fusion module concatenates the dynamic and static attribute matrices at each time step into the feature matrix $X$ to expand its feature dimension. The expanded feature matrix is denoted as $E$, and then, through temporal attention, the weights of each time step are adjusted to obtain $E'$. The spatial feature fusion module adjusts the importance of different graphs and road segments using spatial attention mechanisms and multi-graph convolution, resulting in the fused adjacency matrix. The enhanced feature matrix $E'$ and the fused adjacency matrix $A_{fused}$ are then input into the model $f$ to obtain the final prediction result $\hat{Y}$:
$$\hat{Y} = f(E', A_{fused})$$
We construct the model by combining a Graph Convolutional Network (GCN) with Gated Recurrent Units (GRU) to capture spatiotemporal dependencies in traffic-flow data. Specifically, the fused adjacency matrix and the enhanced feature matrix are first input into the GCN module, where multiple layers of graph convolution and non-linear activation are applied to extract spatial representations of each road segment at different time steps. These representations are subsequently fed into the GRU, which utilizes update and reset gates to regulate the preservation and adaptation of traffic information over time, thereby capturing temporal dynamics in the traffic flow.
The primary objective of traffic flow forecasting is to minimize prediction errors, ensuring that the predicted values closely approximate the actual observations. Accordingly, when designing the loss function, it is essential to reduce the prediction error to enhance the model’s forecasting accuracy:
$$loss = \| Y - \hat{Y} \|_2 + \lambda L_{reg}$$
Specifically, $\| Y - \hat{Y} \|_2$ denotes the L2 loss, which quantifies the discrepancy between the predicted and actual values, where $\hat{Y}$ represents the predicted traffic flow of the road segment and $Y$ denotes the corresponding ground truth. The term $L_{reg}$ corresponds to L2 regularization, which is introduced to mitigate overfitting. Here, $\lambda$ is the regularization coefficient that controls the strength of the regularization.
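The loss described above, an L2 data term plus a weighted L2 penalty on the model weights, can be sketched as follows. Taking the regularization term as the sum of squared weights is an assumption, as are the function name and the default value of `lam`.

```python
import numpy as np

def loss_fn(y_true, y_pred, weights, lam=1e-3):
    """L2 prediction error plus lam times the sum of squared model weights."""
    data_term = np.linalg.norm((y_true - y_pred).ravel())
    reg_term = sum(np.sum(W ** 2) for W in weights)  # assumed form of the L2 penalty
    return data_term + lam * reg_term

y = np.array([[3.0, 0.0]])
y_hat = np.array([[0.0, 4.0]])
value = loss_fn(y, y_hat, weights=[np.ones((2, 2))], lam=0.5)
```

In this toy case the error vector is (3, -4), whose L2 norm is 5, and the four unit weights contribute a penalty of 4, so the loss is 5 + 0.5 × 4 = 7.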
5. Conclusions
We propose a spatio-temporal multi-graph convolution traffic flow prediction model, termed MIFA-ST-MGCN, which integrates multi-source information fusion and attention enhancement to improve prediction accuracy and robustness. By combining Temporal Graph Convolutional Networks (TGCN), multi-source information fusion, and spatio-temporal attention mechanisms, we design a predictive framework capable of effectively capturing spatio-temporal dependencies in traffic flow while accounting for external environmental factors. Comparative experimental results demonstrate that MIFA-ST-MGCN outperforms existing baseline models across multiple evaluation metrics, validating its effectiveness in complex traffic scenarios. Ablation studies further confirm the contributions of feature fusion, multi-graph convolution, and spatio-temporal attention, while perturbation experiments show that the model maintains strong adaptability under noisy and uncertain conditions.
Despite the strong predictive performance demonstrated by the proposed model, several limitations remain. First, the model’s scalability is constrained by the computational complexity inherent in its key components. The spatio-temporal multi-head self-attention mechanism, while effective, incurs a computational cost of $O(N^2 d)$ for spatial attention and $O(T^2 d)$ for temporal attention, where $N$ is the number of nodes, $T$ is the sequence length, and $d$ is the feature dimension. This quadratic dependency on $N$ and $T$ can become a bottleneck when applied to large-scale metropolitan networks with thousands of nodes. Future work will explore more efficient attention mechanisms, such as linearized or sparse attention, to mitigate this issue. Second, the model’s performance is sensitive to certain hyperparameters, such as the number of graph convolution layers, the number of attention heads, and the dimensionality of GRU hidden states. Although we conducted extensive parameter tuning for this study, a comprehensive sensitivity analysis was not included. The depth of the multi-graph convolution module, in particular, requires careful design; preliminary ablation experiments (varying layers from 1 to 3) indicated that while a 2-layer structure offered the best trade-off, deeper architectures risk over-smoothing and increased computational overhead without commensurate gains in accuracy. Third, the model’s reliance on multi-source data (e.g., POI, weather), while beneficial for accuracy, introduces practical deployment challenges in data-scarce environments. The requirement for high-quality, synchronized external data may limit the model’s applicability in regions where such data are incomplete or unavailable. Future iterations could investigate semi-supervised or self-supervised learning strategies to reduce dependency on extensive labeled and auxiliary data.
Lastly, the current framework assumes a static graph topology, which restricts its ability to adapt to dynamic network changes caused by traffic incidents or temporary road closures. Integrating dynamic graph construction techniques or temporal graph networks could enhance the model’s responsiveness to real-time structural variations.
In summary, the proposed MIFA-ST-MGCN model provides an effective solution to the traffic flow forecasting problem, carrying both theoretical significance and practical value. With the increasing complexity of urban transportation systems and the growing volume of data, the model can be further optimized and extended in future work to better adapt to more dynamic and complex traffic environments.