Article

Traffic Flow Prediction in Complex Transportation Networks via a Spatiotemporal Causal–Trend Network

1 School of Automation, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 Jiangsu Provincial University Key Laboratory of Vehicle-Road Multimodal Perception and Control, Wuxi University, Wuxi 214105, China
3 Traffic Management Research Institute of the Ministry of Public Security, Wuxi 214151, China
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(3), 443; https://doi.org/10.3390/math14030443
Submission received: 25 December 2025 / Revised: 22 January 2026 / Accepted: 24 January 2026 / Published: 27 January 2026
(This article belongs to the Special Issue Advanced Machine Learning Research in Complex System)

Abstract

Traffic systems are quintessential complex systems, characterized by nonlinear interactions, multiscale dynamics, and emergent spatiotemporal patterns over complex networks. These properties make traffic prediction highly challenging, as it requires jointly modeling stable global topology and time-varying local dependencies. Existing graph neural networks often rely on predefined or static learnable graphs, overlooking hidden dynamic structures, while most RNN- or CNN-based approaches struggle with long-range temporal dependencies. This paper proposes a Spatiotemporal Causal–Trend Network (SCTN) tailored to complex transportation networks. First, we introduce a dual-path adaptive graph learning scheme: a static graph that captures global, topology-aligned dependencies of the complex network, and a dynamic graph that adapts to localized, time-varying interactions. Second, we design a Gated Temporal Attention Module (GTAM) with a causal–trend attention mechanism that integrates 1D and causal convolutions to reinforce temporal causality and local trend awareness while maintaining long-range attention. Extensive experiments on two real-world PeMS traffic flow datasets demonstrate that SCTN consistently achieves superior accuracy compared to strong baselines, reducing prediction errors by 3.5–4.5% relative to the best-performing existing methods and highlighting its effectiveness for modeling the intrinsic complexity of urban traffic systems.

1. Introduction

The rapid expansion of modern cities and the corresponding proliferation of vehicles have introduced pressing societal challenges, including traffic congestion, an increase in traffic accidents, and heightened environmental pollution [1,2,3,4]. As a core component of Intelligent Transportation Systems (ITS), traffic flow prediction offers a theoretical foundation for mitigating these issues by alleviating road congestion, reducing traffic incidents, and enhancing overall transportation efficiency [5,6,7,8]. Furthermore, accurate and timely traffic forecasting provides crucial information for both travelers and authorities: it enables individuals to perform more effective route planning and empowers transportation departments to formulate proactive traffic management strategies, thereby improving public safety [9,10,11,12].
In pursuit of higher prediction accuracy, numerous methodologies have been explored by the research community. Broadly, traffic flow forecasting methods can be categorized into three main paradigms: statistical learning, traditional machine learning, and deep learning. Statistical approaches, such as the Autoregressive Integrated Moving Average (ARIMA) model [13] and Kalman filtering [14], are particularly effective for modeling stationary or slowly varying traffic patterns. However, these models exhibit significant limitations when confronted with abrupt changes in traffic flow. Their high computational overhead and resulting prediction latency render them ill-suited for capturing the dynamic and stochastic nature of real-world traffic conditions.
Traditional machine learning methods, such as Support Vector Machines (SVM) [15], Support Vector Regression (SVR) [16], Bayesian models [17], and k-Nearest Neighbor (KNN) [18], can capture non-linearities in time-series data. However, these approaches often rely on intricate, manual feature engineering and exhibit poor generalization capabilities. Even recent attempts to fuse visual quantified features from multi-source sensors [19] require labor-intensive feature extraction and specialized hardware configurations, limiting their deployment scalability. In contrast, deep learning methods—including Recurrent Neural Networks (RNNs) [20], Gated Recurrent Units (GRUs) [21], Long Short-Term Memory (LSTM) networks [22], and Convolutional Neural Networks (CNNs) [23]—possess powerful feature learning and non-linear extraction abilities, making them well-suited for discovering latent patterns in traffic data. While these models have improved prediction accuracy, they are inherently limited in their capacity to automatically capture the complex spatial characteristics of urban road networks. Recent transformer-based architectures like the Synchronous Spatiotemporal Graph Transformer [24,25] and PDFormer [26] unify graph structures with self-attention to model global interactions; however, they still depend on predefined static adjacency matrices and lack explicit temporal causality constraints. In summary, a common limitation across many of these cutting-edge models is their reliance on progressive message-passing or recurrent mechanisms to capture spatiotemporal correlations. This stepwise process can lead to the loss of critical information at each propagation step, limiting the model’s overall efficacy.
While existing studies have made progress in spatial dependency modeling through graph neural networks and temporal pattern learning via attention mechanisms, three critical gaps remain unresolved:
(1) Static–Dynamic Spatial Decoupling: Current graph-based methods typically rely on predefined static graphs (e.g., road connectivity) or learn dynamic graphs separately, failing to jointly model both global topological stability and localized traffic-induced spatial dynamics.
(2) Temporal Causality Neglect: Most attention-based temporal modules treat all historical time steps equally, ignoring the fundamental causal constraint that future traffic states cannot influence past observations.
(3) Information Degradation: The sequential message-passing paradigm in spatiotemporal models causes progressive information loss during feature propagation, which is particularly detrimental for long-range forecasting.
Our SCTN framework addresses these gaps through three key innovations: First, the co-evolution of static and dynamic graphs enables simultaneous learning of invariant road network properties and time-sensitive traffic interactions. Second, the causal-trend attention mechanism introduces temporal causality constraints through a masked attention structure while capturing localized trend patterns via differential operators. Third, the gated fusion architecture eliminates the need for sequential information passing, thereby preventing the characteristic information degradation of existing approaches. This integrated solution represents the first attempt to unify causal temporal modeling with adaptive spatial graph co-learning in traffic prediction.
To address the aforementioned challenges and limitations, we propose a novel framework, the Spatiotemporal Causal-Trend Network (SCTN). The SCTN architecture is composed of three main components: a graph learning layer, an adaptive graph convolution layer, and a gated temporal attention module. The primary contributions of this paper are summarized as follows:
  • We propose an adaptive graph learning layer that requires no prior knowledge to jointly model the spatial dependencies of the traffic network. This is achieved by constructing both a static and a dynamic graph: the static graph captures the stable, global spatial topology, while the dynamic graph focuses on capturing localized spatial dynamics that vary with time and traffic conditions.
  • We design a Gated Temporal Attention Module (GTAM) that integrates a novel causal-trend attention mechanism. This module not only effectively captures long-range temporal dependencies but also, through its specialized attention mechanism, enables the model to precisely extract causal relationships and local trend information from the time series data.
  • We conduct extensive experiments on multiple real-world traffic datasets. The results validate the effectiveness of our proposed method, demonstrating that the SCTN model achieves superior and more robust prediction performance compared to current state-of-the-art baseline models.
The remainder of this paper is organized as follows: Section 2 formulates the problem and details the proposed SCTN framework, including the dual-path graph learning and Causal-Trend Attention mechanism. Section 3 presents the experimental setup, datasets, and comparative results. Section 4 discusses the model’s advantages, limitations, and future directions. Finally, Section 5 concludes the work. In addition, the variables involved in this paper are listed in Table 1.

2. Materials and Methods

2.1. Problem Formulation

In this work, we formulate traffic flow prediction as a spatiotemporal sequence forecasting problem. The underlying road network is modeled as a directed graph $G(V, A)$, where $V$ is the set of $N = |V|$ nodes, typically representing sensors, and $A \in \mathbb{R}^{N \times N}$ is a weighted adjacency matrix representing the spatial correlations between nodes. The diagonal elements of $A$ are conventionally initialized to 1 to preserve self-loop connections, thereby retaining each node's intrinsic features. At any given time step $t$, the traffic data observed across all nodes forms a feature matrix $X_t \in \mathbb{R}^{N \times C}$, where $C$ is the number of features per node.

2.2. The SCTN Framework

Figure 1 illustrates the overall architecture of our proposed Spatiotemporal Causal-Trend Network (SCTN). The framework is designed to capture both global and local spatial patterns via a graph learning layer, which simultaneously learns a static graph representing stable, network-wide relationships and a dynamic graph capturing localized, time-varying dynamics. For the temporal dimension, we introduce a Gated Temporal Attention Module (GTAM), which comprises two parallel attention layers. The core of this module is a novel causal-trend attention mechanism designed to identify and model long-range temporal dependencies. Following the temporal module, an adaptive graph convolution layer aggregates spatial information through two independent graph convolution branches, guided respectively by the learned static and dynamic graphs. To ensure robust information propagation and mitigate the vanishing gradient problem, the network incorporates residual connections between layers. Furthermore, skip connections are employed to directly link intermediate layers to the final output module, facilitating a strong flow of information throughout the model.

2.2.1. Static Graph Learning

The static graph learning layer is designed to learn a data-driven, static adaptive adjacency matrix $A_s$ that captures global spatial correlations in traffic data without relying on a predefined graph. We construct $A_s$ from learnable node embeddings, following embedding-based adaptive graph learning in prior work [6], as formulated in Equations (1)–(3):
M_1 = \tanh(E_1 \theta_1)  (1)
M_2 = \tanh(E_2 \theta_2)  (2)
A_s = \mathrm{SoftMax}(\mathrm{ReLU}(M_1 M_2^T))  (3)
In these formulas, $E_1$ and $E_2$ are randomly initialized node embeddings whose parameters are learned during training, and $\theta_1$ and $\theta_2$ are model parameters. The ReLU activation zeroes out negative scores, pruning weak or spurious connections between nodes, and the SoftMax function normalizes each row of the adjacency matrix.
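The three-step construction above can be sketched in a few lines of NumPy. This is a toy illustration: the randomly initialized `E1`, `E2`, `theta1`, and `theta2` stand in for parameters that the paper learns end-to-end, and the 5-node network and embedding width are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def static_adjacency(E1, E2, theta1, theta2):
    """Static adaptive adjacency A_s from node embeddings (Equations (1)-(3)):
    M1 = tanh(E1 θ1), M2 = tanh(E2 θ2), A_s = SoftMax(ReLU(M1 M2^T))."""
    M1 = np.tanh(E1 @ theta1)
    M2 = np.tanh(E2 @ theta2)
    scores = np.maximum(M1 @ M2.T, 0.0)   # ReLU prunes negative correlations
    return softmax(scores, axis=1)        # each row becomes a probability vector

rng = np.random.default_rng(0)
N, d = 5, 8                                # 5 nodes, embedding width 8 (illustrative)
A_s = static_adjacency(rng.normal(size=(N, d)), rng.normal(size=(N, d)),
                       rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

Because of the row-wise SoftMax, each row of `A_s` sums to 1 and all entries are non-negative, which is what makes it usable directly as a normalized adjacency in the graph convolution.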
Given that the sparsity and smoothness of the graph structure have a significant impact on prediction performance, we introduce an additional graph regularization loss term to explicitly enforce these properties. For a given node feature matrix X F = ( x 1 , x 2 , , x N ) R N × D and the learned global adjacency matrix A , this regularization term is formulated as:
L_G = \frac{\alpha}{N^2} \sum_{i,j=1}^{N} A_{i,j} \| x_i - x_j \|^2 + \beta \| A \|_F^2  (4)
Here, $\alpha \frac{1}{N^2}$ acts as a normalization factor: $\alpha$ is a hyperparameter controlling the weight of the smoothness term, and $N^2$ scales the summation by the total number of node pairs. The first term, $\sum_{i,j}^{N} A_{i,j} \| x_i - x_j \|^2$, is the graph Laplacian regularizer, which penalizes large feature differences between connected nodes and thereby enforces smooth traffic-signal propagation across adjacent sensors. The second term introduces the hyperparameter $\beta$ to balance the sparsity penalty, where $\| A \|_F^2$ is the squared Frobenius norm of $A$ (the sum of its squared elements); it prevents the trivial all-zero adjacency matrix and controls graph sparsity.
The design of this regularizer is grounded in the graph signal smoothness assumption, which posits that traffic signals should vary smoothly across adjacent nodes. Minimizing the first term encourages connected sensors to have similar features, while the second term explicitly controls graph sparsity. During training, L G is integrated into each gradient update step and computed dynamically with respect to the output features at each layer.
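A minimal NumPy sketch of this regularizer, using as defaults the α and β values reported later in the experimental settings (the function itself is a direct transcription of Equation (4), not the paper's training code):

```python
import numpy as np

def graph_reg_loss(A, X, alpha=1e-3, beta=1e-4):
    """Graph regularizer of Equation (4): Laplacian smoothness term
    plus a squared-Frobenius-norm sparsity penalty."""
    N = A.shape[0]
    # Pairwise squared distances ||x_i - x_j||^2 for all node pairs, shape (N, N).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    smooth = alpha / (N ** 2) * (A * sq).sum()   # penalizes rough signals on edges
    sparsity = beta * (A ** 2).sum()             # ||A||_F^2 keeps A from collapsing
    return smooth + sparsity
```

Note that the smoothness term vanishes when connected nodes share identical features, while the sparsity term vanishes only for the all-zero graph, so the two terms pull the learned adjacency in opposite directions.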

2.2.2. Dynamic Graph Learning

To overcome the limitations of a static graph structure in capturing the time-varying dependencies inherent in spatiotemporal traffic data, we introduce a dynamic graph that captures real-time, localized correlations between nodes and adaptively updates node relationships at each time step. The implementation is based on a self-attention mechanism, whose core idea is to compute spatial correlation scores between nodes [18]. Specifically, given the dynamic node features $X_t$, the dynamic spatial adjacency matrix is defined as follows:
A_d = \mathrm{SoftMax}\left(\frac{X_t X_t^T}{\sqrt{d_{model}}}\right) \in \mathbb{R}^{N \times N}  (5)
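Equation (5) is a single scaled dot-product followed by a row-wise SoftMax. A hedged NumPy sketch (toy shapes, and no learned query/key projections, which a fuller self-attention variant would add):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_adjacency(X_t, d_model):
    """Dynamic adjacency of Equation (5): scaled node-feature similarity,
    recomputed at every time step from the current features X_t (N, d_model)."""
    scores = X_t @ X_t.T / np.sqrt(d_model)   # pairwise similarity, scaled
    return softmax(scores, axis=1)            # row-normalized, shape (N, N)
```

Because `A_d` is recomputed from `X_t` at each step, it reweights neighbours on the fly, which is exactly the localized, time-varying behaviour the static graph cannot express.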

2.2.3. Adaptive Graph Convolution Module

To effectively process non-grid or unstructured data, Graph Convolutional Networks are widely used to extract high-order node feature representations through neighborhood information aggregation. In a Graph Convolutional Network with k layers, the iterative formula for information propagation at the l -th layer can be expressed as:
H^{(l)} = \hat{A} H^{(l-1)} W^{(l)}  (6)
In the formula, $H^{(l)} \in \mathbb{R}^{N \times d_l}$ denotes the node features output at the $l$-th layer, $W^{(l)} \in \mathbb{R}^{d_{l-1} \times d_l}$ is the layer's weight matrix, and $\hat{A}$ is the normalized adjacency matrix.
However, the deep extension of graph convolutions faces a common problem: deep networks are prone to the over-smoothing phenomenon of feature homogenization, while shallow networks have the limitation of an insufficient information propagation range. This indicates that the receptive field of the nodes needs to be chosen carefully according to specific application requirements. Based on this, this paper designs an adaptive attention mechanism (Figure 2), which can independently adjust the effective neighborhood size for each node. This scheme is different from a simple concatenation of multi-layer features [ H ( 0 ) , H ( 1 ) , , H ( k ) ] ; instead, it achieves a more effective balance between local information and global propagation by assigning differentiated attention to neighbors of different diffusion depths, thereby generating more discriminative node features. Its working mechanism is described by the following formulas:
H^{(0)} = \mathrm{MLP}(X) \in \mathbb{R}^{N \times D}
H^{(l)} = \alpha H^{(0)} + (1 - \alpha)\, \hat{A} H^{(l-1)} \in \mathbb{R}^{N \times D}
P = \mathrm{stack}(H^{(0)}, H^{(1)}, \ldots, H^{(k)}) \in \mathbb{R}^{N \times (k+1) \times D}
S = \mathrm{reshape}(\sigma(P W)) \in \mathbb{R}^{N \times 1 \times (k+1)}
Z = \mathrm{squeeze}(S P) \in \mathbb{R}^{N \times D}  (7)
In these formulas, $H^{(0)}$ is the feature matrix obtained by applying a multilayer perceptron to the initial node features $X$, $W$ is a trainable parameter matrix, and $S$ holds the attention score assigned to each propagation depth. Here $\sigma$ is an activation function, and $\alpha \in [0, 1]$ is a hyperparameter that retains a fraction of the initial features at every propagation step, mitigating over-smoothing.
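The five steps above can be sketched as follows. This is illustrative only: a single `tanh` layer stands in for the MLP, `sigmoid` stands in for σ, and all weights are random rather than learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_gcn(X, A_hat, W_mlp, W_att, k=3, alpha=0.1):
    """Depth-attentive propagation of Equation (7): propagate k steps with an
    initial-residual term, stack all depths, then weight each depth per node."""
    H = np.tanh(X @ W_mlp)                      # H^(0) = MLP(X); tanh as a stand-in MLP
    layers = [H]
    for _ in range(k):                          # H^(l) = α H^(0) + (1-α) Â H^(l-1)
        H = alpha * layers[0] + (1 - alpha) * (A_hat @ H)
        layers.append(H)
    P = np.stack(layers, axis=1)                # (N, k+1, D): all diffusion depths
    S = sigmoid(P @ W_att)                      # (N, k+1, 1): score per node per depth
    S = S.transpose(0, 2, 1)                    # (N, 1, k+1)
    Z = np.squeeze(S @ P, axis=1)               # (N, D): depth-weighted features
    return Z
```

The per-node scores in `S` are what let each node pick its own effective neighbourhood size, instead of a single fixed receptive field for the whole graph.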
To investigate the interplay between global spatial adaptivity and local dynamics, the model adopts a dual-path graph convolutional structure that feeds the static and dynamic graph topologies into separate adaptive graph convolution layers. Specifically, the normalized adjacency matrix $\hat{A}$ in the propagation rule above is instantiated with the two learned adjacency matrices, $A_s$ and $A_d$, one per branch. The final output is generated by fusing the results of the two convolutional paths, as formulated below:
Z = Z_{static} + Z_{dynamic}  (8)

2.2.4. Gated Temporal Attention Module

To capture long-term temporal dependencies, the model incorporates a temporal attention module. Its core mechanism consists of two parallel temporal attention layers: (1) one layer uses the hyperbolic tangent (tanh) activation function as an information filter, and (2) the other utilizes the sigmoid activation function to act as a gate, controlling the information flow from the current time step to subsequent modules. This design aims to capture dynamic temporal trends. Building on this, to further integrate information from different representation subspaces, the model employs a multi-head self-attention mechanism [27]. Its computation process is defined as follows:
\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V  (9)
In the above formula, $Q$, $K$, and $V$ denote the query, key, and value matrices, respectively, and $d_k$ is the key dimension.
In this self-attention mechanism, the queries, keys, and values are all derived from the same input sequence, i.e., Q = K = V. The computation process is as follows: first, these three matrices (Q, K, V) are independently projected into multiple distinct representation subspaces via linear transformations. Subsequently, an attention function is computed in parallel within each subspace. Finally, the outputs from all subspaces are concatenated and integrated through a final linear projection. The final output of this process can be formally defined as:
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \ldots, head_h) W^o  (10)
head_j = \mathrm{Attention}(Q W_j^Q, K W_j^K, V W_j^V)  (11)
In the above formulas, $W^o$ is the output projection matrix. This mechanism enables the model to focus its attention on critical information and to efficiently model the global interactions among elements of a sequence, forming an adaptive receptive field that is not constrained by locality.
To address the issue that standard attention mechanisms struggle to capture causality and local trends in time-series data, this paper proposes a Causal-Trend Attention mechanism. While conventional self-attention is prone to mismatches due to numerical similarity, our proposed method reconstructs the projection process by building upon one-dimensional (1D) convolution and causal convolution, thereby enhancing the model’s ability to capture temporal locality and causal structure. Specifically, 1D convolution is applied to the queries and keys to extract local contextual features, while a causal convolution is applied to the values to prevent information leakage from future time steps. The computation of the Causal-Trend Attention is defined as follows:
\mathrm{CTAttention}(Q, K, V) = \mathrm{Concat}(head_1, \ldots, head_h) W^o  (12)
head_j = \mathrm{Attention}(Q \Phi_j^Q, K \Phi_j^K, V \Psi_j^V)  (13)
In the above formulas, $\Phi_j^Q$ and $\Phi_j^K$ are the 1D convolutional kernels, and $\Psi_j^V$ is the causal convolutional kernel.
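A single-head sketch of the idea in NumPy. Assumptions beyond the text: each head uses a depthwise 1D kernel shared across channels, and a lower-triangular causality mask is applied to the attention scores, in line with the "masked attention structure" mentioned in Section 1; the paper's actual multi-head, learned-kernel implementation is richer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv1d(x, kernel, causal=False):
    """Depthwise 1D convolution over the time axis of x (T, d), kernel (k,).
    causal=True left-pads so the output at step t sees only steps <= t;
    otherwise symmetric padding keeps the window centred (local context)."""
    k = len(kernel)
    pad = (k - 1, 0) if causal else ((k - 1) // 2, k // 2)
    xp = np.pad(x, (pad, (0, 0)))
    return np.stack([(xp[t:t + k] * kernel[:, None]).sum(0)
                     for t in range(x.shape[0])])

def ct_attention(X, kernel_qk, kernel_v):
    """Single-head causal-trend attention: 1D convs form local-context Q/K,
    a causal conv forms V, and a triangular mask blocks future positions."""
    T, d = X.shape
    Q = conv1d(X, kernel_qk)                 # local trend features for queries
    K = conv1d(X, kernel_qk)                 # local trend features for keys
    V = conv1d(X, kernel_v, causal=True)     # values never peek at the future
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(np.tri(T, dtype=bool), scores, -np.inf)  # causal mask
    return softmax(scores, axis=1) @ V
```

The smoothing convolutions on Q and K make attention scores reflect short-term trends rather than pointwise value coincidences, which is the "mismatch due to numerical similarity" the text refers to.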

2.2.5. Spatiotemporal Embedding

Although the proposed model can capture spatial and temporal dynamics separately, it does not yet explicitly model spatiotemporal heterogeneity or the inherent spatiotemporal order of the signals. To address this, this paper introduces a spatiotemporal positional embedding to enhance the model's overall capability for modeling spatiotemporal correlations. Specifically, two learnable embedding matrices, a temporal embedding $TE \in \mathbb{R}^{T \times C}$ and a spatial embedding $SE \in \mathbb{R}^{N \times C}$, are constructed for the traffic signal sequence $X_G \in \mathbb{R}^{N \times T \times C}$. During training, these embedding matrices learn to encode additional spatiotemporal contextual information. Finally, they are added to the input traffic signal sequence via broadcasting, forming an enhanced representation that simultaneously encodes the original signal and its spatiotemporal coordinates:
X_G^{emb} = X_G + TE + SE  (14)
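The broadcast addition is straightforward; a toy NumPy sketch with arbitrary shapes (randomly initialized `TE` and `SE` stand in for the learnable embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, C = 4, 12, 2                      # nodes, time steps, channels (illustrative)
X_G = rng.normal(size=(N, T, C))        # traffic signal sequence
TE = rng.normal(size=(T, C))            # learnable temporal embedding
SE = rng.normal(size=(N, C))            # learnable spatial embedding

# Broadcast-add: TE is shared across nodes, SE across time steps (Equation (14)).
X_emb = X_G + TE[None, :, :] + SE[:, None, :]
```

Every node shares the same temporal offsets and every time step shares the same spatial offsets, so the model can distinguish "where" and "when" without any change to the signal tensor's shape.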

2.2.6. Loss Function

Unlike most existing methods, this paper aims to simultaneously learn a task-driven graph structure and optimize the core model parameters by jointly optimizing a hybrid objective function. This function integrates a graph regularization term with a prediction loss term. The hybrid loss function is defined as follows:
L(Y, \hat{Y}) = L_G + \mathrm{L1loss}(Y, \hat{Y})  (15)

3. Results

3.1. Datasets

We evaluated the performance of SCTN on two public traffic datasets, PEMS04 and PEMS08 [28]. These datasets contain traffic flow data collected from highways in California, aggregated into 5-min intervals, which yields 12 time steps per hour; detailed information on the datasets is given in Table 2. We use 12 historical time steps (1 h) to predict the traffic flow for the next 12 time steps (1 h). Furthermore, we adopted the same data preprocessing procedures as STSGCN, which include: (1) missing-value handling via linear interpolation for short gaps (<3 time steps) and forward-fill for longer sequences; (2) outlier removal using a 5-sigma threshold based on historical means at each sensor; (3) noise reduction through a moving-average filter with window size 3; and (4) Z-score normalization using statistics computed from the training set. This pipeline ensures data quality while preserving the underlying traffic dynamics.
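As one concrete piece of this pipeline, step (4) can be sketched as follows. This is a simplification, assuming scalar statistics over the whole training split, whereas a practical pipeline may normalize per sensor or per channel; the key point is that only training-set statistics are used, so no test information leaks into preprocessing.

```python
import numpy as np

def zscore_fit_transform(train, other):
    """Z-score normalisation with statistics fit on the training split only,
    then applied unchanged to validation/test data (step (4) above)."""
    mu, sigma = train.mean(), train.std()
    return (train - mu) / sigma, (other - mu) / sigma, (mu, sigma)
```

Predictions are later de-normalized with the same `(mu, sigma)` pair before computing the evaluation metrics.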

3.2. Experimental Settings

We split the datasets into training, validation, and test sets using a 6:2:2 ratio. Following Equation (7), the parameter k for the graph convolution and diffusion steps was set to 3. The hidden state dimension was set to 64, and the dimension of the node embeddings was set to 32. We used 8 attention heads, each with dimension 8. The graph regularization coefficients α and β in Equation (4) were set to 0.001 and 0.0001, respectively. Additional hyperparameters include: batch size of 64, maximum training epochs of 200, early stopping patience of 15 epochs, weight decay of 0.0001, and dropout rate of 0.3. Gradient clipping with a maximum norm of 5 was applied to ensure stable training. The model was trained using the Adam optimizer with an initial learning rate of 0.001, which was reduced by 0.5 using a reduce-on-plateau scheduler when validation loss did not improve for 5 consecutive epochs. We selected Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE) to evaluate the model’s performance. The formulas for these evaluation metrics are as follows:
\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| Y_t - \hat{Y}_t \right|  (16)
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( Y_t - \hat{Y}_t \right)^2}  (17)
\mathrm{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \left| \frac{Y_t - \hat{Y}_t}{Y_t} \right|  (18)
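These three metrics translate directly into NumPy (note that this MAPE is returned as a fraction; multiply by 100 for a percentage, which is how such results are usually tabulated):

```python
import numpy as np

def mae(y, yhat):
    # Mean Absolute Error, Equation (16).
    return np.abs(y - yhat).mean()

def rmse(y, yhat):
    # Root Mean Squared Error, Equation (17).
    return np.sqrt(((y - yhat) ** 2).mean())

def mape(y, yhat):
    # Mean Absolute Percentage Error, Equation (18), as a fraction.
    return (np.abs(y - yhat) / np.abs(y)).mean()

y = np.array([100.0, 120.0, 80.0])
yhat = np.array([110.0, 115.0, 85.0])
```

RMSE squares the residuals before averaging, so it penalizes large peak-hour errors more heavily than MAE, which is why the two metrics can rank models differently.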

3.3. Baseline Models

SVR [16]: A classic regression model that uses Support Vector Machines to perform the prediction task.
DCRNN [27]: A model that integrates diffusion graph convolutions into a gated recurrent unit (GRU) to capture spatiotemporal dependencies.
FC-LSTM [28]: An encoder-decoder prediction model that employs a recurrent neural network with fully connected LSTM hidden units.
STGCN [29]: A model that employs graph convolutional layers and causal convolutional layers to model spatiotemporal dependencies.
ASTGCN [30]: A model that introduces a spatiotemporal attention mechanism for traffic prediction. It integrates three distinct components to model the periodicity of traffic data.
STSGCN [31]: A model that directly captures correlations by constructing and operating on local spatiotemporal graphs.
PGCN [32]: Progressively updates the adjacency graph from the current input (even at test time) and combines it with dilated causal temporal convolutions for traffic forecasting.
PDFormer [26]: A Transformer that models dynamic short-/long-range spatial dependencies via masked attention and explicitly accounts for propagation delays.
AGCRN [33]: A model that captures node-specific spatiotemporal dynamics through a learnable, adaptive graph structure.
DPSTGC [25]: A spatiotemporal Transformer that builds delay-aware directed graphs and learns dynamic correlations, using delay-aware attention for better interpretability.
STFGNN [34]: A model that constructs the graph using the Dynamic Time Warping (DTW) algorithm to explore both local and global spatial correlations.

3.4. Analysis of Experimental Results

Figure 3 presents SCTN’s error curves and relative improvements over different forecasting horizons on PEMS04 and PEMS08. Table 3 summarizes the overall performance. SCTN achieves competitive and consistently strong accuracy on both datasets.
Classical baselines (SVR) and purely temporal models (FC-LSTM) perform the worst, because they fail to explicitly model spatial interactions on the road network. Once spatial dependencies are introduced, graph-based approaches substantially reduce errors. For example, DCRNN and STGCN outperform SVR/FC-LSTM by a large margin on both datasets, demonstrating the necessity of incorporating spatial structure for traffic forecasting.
Among GNN-style methods, models using pre-defined/static graphs (DCRNN, STGCN, ASTGCN, STSGCN) are generally less effective than adaptive/dynamic graph learning approaches. AGCRN improves over several static-graph baselines by learning node-specific spatial relations, while PGCN further enhances performance by progressively updating the adjacency graph conditioned on current inputs, which better matches the non-stationary and time-varying nature of traffic correlations. This is reflected by PGCN’s clear gains over many conventional GNN baselines (e.g., PEMS08 RMSE 25.19 vs. STGCN 26.71).
Transformer-based models (PDFormer and DPSTGC) deliver very strong accuracy, benefiting from attention-based long-range dependency modeling and explicit delay-aware mechanisms. In Table 3, PDFormer achieves the best RMSE/MAE on both datasets (PEMS04: RMSE 29.98, MAE 18.35; PEMS08: RMSE 24.18, MAE 14.98), and DPSTGC also performs competitively (PEMS08 MAPE 9.76). However, these Transformer paradigms typically come with higher computational overhead (more parameters, slower training, and longer inference latency), which may limit their practicality in resource-constrained or real-time deployment scenarios.
Overall, SCTN provides a favorable balance between accuracy and efficiency. Compared with strong baselines, SCTN achieves the lowest MAPE on both datasets while maintaining near-optimal RMSE/MAE (close to the best Transformer results). The improvements can be attributed to: (1) the static-dynamic graph learning module that captures both stable structural relations and time-varying correlations; and (2) the proposed Causal-Trend Attention, which adaptively models temporal dynamics across horizons and alleviates error accumulation in longer-term forecasting. These results suggest that effectively combining dynamic spatial structure modeling with causal trend-aware temporal learning is critical for accurate and practical traffic prediction.

3.5. Ablation Study

To further investigate the impact of different constituent modules on the model’s performance, we conducted an ablation study on the PEMS04 and PEMS08 datasets. The model variants, each with a specific module removed, are denoted as follows:
w/o GLoss: SCTN without the graph regularization loss.
w/o Emb: SCTN without the spatial and temporal embeddings.
w/o DyGra: SCTN without the dynamic graph learning layer. In this variant, only the static graph learning layer is used to adaptively model spatial correlations.
w/o Gating: SCTN without the gating mechanism in the temporal attention module. The output of the temporal attention layer is directly passed to the next module without information filtering.
w/o CT-Att: SCTN without the Causal-Trend Attention mechanism. It is replaced by the conventional multi-head self-attention, which does not consider local trends.
As illustrated in Figure 4, removing the graph regularization loss (GLoss) leads to a significant degradation in performance. This is because the graph loss function optimizes the adaptive traffic graph structure, thereby facilitating information propagation across the graph. Without it, the learned adaptive matrix fails to effectively reflect the global spatial correlations of the traffic network. This result also indirectly validates that global spatial dependencies have a substantial impact on prediction performance. After removing the dynamic graph learning layer (w/o DyGra), the model’s predictive performance progressively deteriorates across the 12 prediction steps, an effect that is particularly evident in the RMSE on PEMS04 and the MAE on PEMS08. The reason is that long-range spatial dependencies are highly dynamic, and a single global graph structure struggles to perceive fine-grained, local spatial information. The dynamic graph in SCTN overcomes this limitation by capturing these locally varying spatial correlations. The SCTN variant without the Causal-Trend Attention mechanism (w/o CT-Att) performs considerably worse than the full model. This indicates that modeling causality and local trends in time series yields superior predictive performance compared to the conventional multi-head self-attention mechanism. Furthermore, the spatiotemporal embeddings and the gating mechanism are also crucial, as they enhance the prediction accuracy at each step of the forecast horizon.

3.6. Visualization Analysis

We conducted a visualization analysis of the prediction results on the PEMS04 and PEMS08 datasets. Figure 5 displays a comparison between the predicted values from different models and the ground truth values over a one-week period. From a macroscopic perspective, our proposed SCTN model effectively learns the traffic flow patterns of the real-world network, and its predictions closely follow the trends observed in the ground truth data.
To further demonstrate the adaptive capability of our dynamic graph learning mechanism, Figure 6 visualizes the evolution of the dynamic adjacency matrix A d at four representative time slices (00:00, 06:00, 12:00, and 18:00) on the PEMS04 dataset. The heatmaps reveal distinct spatial dependency patterns: sparse connections during midnight free-flow conditions, strong localized clusters capturing morning rush-hour commuter corridors, moderately dispersed connections at midday, and dense long-range dependencies during evening peak congestion. These results validate that A d successfully captures time-varying spatial correlations driven by real traffic conditions.

3.7. Cost Experiment

On the PEMS08 dataset, as shown in Figure 7, we compare STGCN, ASTGCN, STSGCN, STFGNN, PDFormer, DPSTGC, and SCTN in terms of parameter size, training/inference time, and RMSE. The results show that SCTN offers a better overall balance between cost and performance: it has 392 k parameters, which is notably smaller than DPSTGC (569 k) and PDFormer (541 k), and also more compact than STFGNN (481 k). In terms of accuracy, SCTN achieves an RMSE of 24.32, improving over STGCN (26.71) by 8.9% and outperforming most baselines, including STFGNN (26.22). For efficiency, SCTN takes 5.64 s/epoch for training and 0.40 s/epoch for inference; it trains much faster than PDFormer (12.47 s/epoch) and DPSTGC (9.43 s/epoch), and it is the fastest at inference among all compared models (e.g., 0.89 s/epoch for STSGCN and 1.22 s/epoch for PDFormer). Although PDFormer achieves the best RMSE (24.18), it comes with substantially higher training and inference costs; overall, SCTN provides a more practical trade-off when both accuracy and deployment efficiency matter.

4. Discussion

The comparative results show that SCTN consistently outperforms representative baselines on two typical highway datasets, while also achieving a more favorable accuracy–efficiency trade-off compared with recent, stronger models. Unlike methods that rely on a fixed or a single adaptive graph, SCTN’s dual-path “static–dynamic” graph learning captures both stable global topology and time-varying local dependencies: the static graph anchors the structural prior and global consistency, while the dynamic graph updates inter-node similarity via self-attention to respond to incidents and localized congestion propagation. Compared with DCRNN (diffusion convolution + GRU) and STGCN (GCN + temporal convolution, whose long-range modeling is constrained by kernel size), SCTN’s Gated Temporal Attention builds an adaptive temporal receptive field and uses gating to suppress noise propagation. Relative to ASTGCN (explicit spatiotemporal attention for periodicity), STSGCN (localized spatiotemporal graph correlations), and STFGNN (graph construction via DTW), SCTN further benefits from task-driven joint graph learning with smoothness/sparsity regularization and from causal–trend attention, which avoids future leakage and enhances local trends and directional temporal consistency via 1D convolutions on queries/keys and causal convolution on values. The comparison with AGCRN (node-adaptive graph learning) indicates that combining a stable global graph with an adaptive local graph yields more stable long-horizon forecasting. Moreover, against newer attention/Transformer-style baselines such as PDFormer and DPSTGC, SCTN delivers competitive or near-best accuracy with substantially lower training and inference latency, making it more suitable for real-time deployment where throughput and memory budgets are critical.
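To make the causal–trend mechanism discussed above concrete, the sketch below implements a single attention head in which queries, keys, and values are each passed through a depthwise 1D convolution before a causally masked attention step. The uniform kernel, the kernel size, and the choice to make the query/key convolutions causal as well (so that the no-future-leakage property holds end to end) are simplifying assumptions for illustration, not SCTN's exact configuration.

```python
import numpy as np

def causal_conv1d(x, w):
    """Depthwise 1D convolution with left padding: output at t sees x[:t+1] only.
    x: (T, d) sequence, w: (k, d) per-channel kernel."""
    k = w.shape[0]
    xp = np.vstack([np.zeros((k - 1, x.shape[1])), x])
    return np.stack([(xp[t:t + k] * w).sum(axis=0) for t in range(x.shape[0])])

def causal_trend_attention(x, k=3):
    """Single-head sketch: trend-smoothed Q/K, causally filtered V, causal mask."""
    T, d = x.shape
    w = np.full((k, d), 1.0 / k)            # uniform smoothing kernel (assumption)
    q = causal_conv1d(x, w)                 # local-trend queries
    kk = causal_conv1d(x, w)                # local-trend keys
    v = causal_conv1d(x, w)                 # causally convolved values
    scores = q @ kk.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), 1)
    scores[mask] = -np.inf                  # forbid attending to future steps
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ v
```

Because every operation here is causal, perturbing the input at time t cannot change any output before t, which is exactly the leakage-free property the discussion emphasizes.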
These findings align with our working hypotheses: spatial dependencies in traffic networks possess both stability and variability and benefit from task-driven joint graph learning; temporal modeling requires both long-range dependency capture and local trend emphasis, with explicit causal direction to avoid information leakage and mismatches. Moreover, SCTN’s advantage is most evident in RMSE, indicating better robustness to peaks and abrupt changes; visual analyses show that predicted weekly rhythms and peak–valley structures closely track ground truth. On the other hand, there are limitations: dual-path graph learning and multi-head attention increase computational and memory costs, raising scalability challenges for very large networks or high-frequency data; the learned adaptive graphs, although regularized for smoothness and sparsity, may encode “functional” relations that are not directly mappable to physical road links, complicating interpretability; robustness under severe sensor missingness, faults, or sparse coverage requires further evaluation; and the causal-trend attention emphasizes predictive causal direction and temporal consistency rather than formal causal inference, since exogenous confounders (weather, events, work zones) are not explicitly modeled.
Future work will focus on three directions: (1) incorporating exogenous factors (weather, incidents, events, holidays, construction) and coupling SCTN with interpretability and causal discovery to yield causally grounded, human-in-the-loop insights; (2) enhancing robustness via uncertainty-aware forecasting (e.g., quantile losses, ensembles, Bayesian attention) and developing online/continual learning with streaming adaptive graphs and attention recalibration to handle concept drift; and (3) improving scalability and generalization through sparse/dynamic attention, low-rank graph parameterization, and model distillation, alongside cross-city transfer, multi-task learning (flow/speed/occupancy), and extensions to multiplex or hypergraph structures. Although the latest Transformer models may retain slight accuracy advantages in certain scenarios, their computational costs limit large-scale deployment in real-time traffic systems, whereas SCTN is designed precisely to address this deployment bottleneck. Broader benchmarks against recent models will be conducted to validate this design philosophy. Collectively, these directions aim to make SCTN more accurate, robust, interpretable, and deployable at city scale.

5. Conclusions

To address the limitation of existing traffic prediction methods, which often focus on global spatial dependencies while overlooking local dynamics and temporal causality, this paper proposes a novel graph neural network model, SCTN. The model captures both global adaptive spatial relationships and local dynamic patterns by employing a static adaptive graph and a dynamic graph, respectively. Furthermore, it introduces a Causal-Trend Attention mechanism to strengthen the modeling of causality and local trends within the temporal dimension. Experiments conducted on two real-world traffic datasets validate the superiority of the proposed SCTN model over state-of-the-art baseline methods.

Author Contributions

Conceptualization, L.Z. and L.S.; methodology, L.Z. and X.X.; software, X.F.; validation, X.F., Y.F. and H.W.; formal analysis, X.X. and C.W.; investigation, X.F. and C.W.; resources, L.S. and H.W.; data curation, X.F. and Y.F.; writing—original draft preparation, X.F.; writing—review and editing, L.S., L.Z. and X.X.; visualization, X.F. and C.W.; supervision, L.S. and L.Z.; project administration, L.S.; funding acquisition, L.Z. and L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the “Taihu Light” Science and Technology Project of Wuxi (Grant No. K20231021), the National Natural Science Foundation of China (Grant No. 42305158), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (Grant No. 23KJB170025), and the Jiangsu Provincial Postgraduate Research and Practice Innovation Program (Grant No. SJCX25_0484). The APC was funded by the “Taihu Light” Science and Technology Project of Wuxi (Grant No. K20231021).

Data Availability Statement

The original data presented in the study are openly available at https://github.com/a996bird/SCTN/tree/main (accessed on 23 January 2026).

Acknowledgments

The authors thank the Wuxi University School of Internet of Things Engineering and Nanjing University of Information Science and Technology for administrative and technical support. We also acknowledge the Caltrans PeMS program for providing public traffic data.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Lu, J.; Li, B.; Li, H.; Al-Barakani, A. Expansion of city scale, traffic modes, traffic congestion, and air pollution. Cities 2021, 108, 102974. [Google Scholar] [CrossRef]
  2. Pojani, D.; Stead, D. Sustainable urban transport in the developing world: Beyond megacities. Sustainability 2015, 7, 7784–7805. [Google Scholar] [CrossRef]
  3. Zhu, L.; Yu, F.R.; Wang, Y.; Ning, B.; Tang, T. Big data analytics in intelligent transportation systems: A survey. IEEE Trans. Intell. Transp. Syst. 2018, 20, 383–398. [Google Scholar] [CrossRef]
  4. Yang, S. On feature selection for traffic congestion prediction. Transp. Res. Part C Emerg. Technol. 2013, 26, 160–169. [Google Scholar] [CrossRef]
  5. Musa, A.A.; Malami, S.I.; Alanazi, F.; Ounaies, W.; Alshammari, M.; Haruna, S.I. Sustainable traffic management for smart cities using internet-of-things-oriented intelligent transportation systems (ITS): Challenges and recommendations. Sustainability 2023, 15, 9859. [Google Scholar] [CrossRef]
  6. Liu, R.; Shin, S.Y. A Review of Traffic Flow Prediction Methods in Intelligent Transportation System Construction. Appl. Sci. 2025, 15, 3866. [Google Scholar] [CrossRef]
  7. Liu, H.; Dong, Z.; Jiang, R.; Deng, J.; Deng, J.; Chen, Q.; Song, X. Spatio-temporal adaptive embedding makes vanilla transformer sota for traffic forecasting. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 4125–4129. [Google Scholar]
  8. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar] [CrossRef]
  9. Liu, B.; Tang, X.; Cheng, J.; Shi, P. Traffic flow combination forecasting method based on improved LSTM and ARIMA. Int. J. Embed. Syst. 2020, 12, 22–30. [Google Scholar] [CrossRef]
  10. Ting, C.-C.; Wu, K.-T.; Lin, H.-T.C.; Lin, S. MixModel: A Hybrid TimesNet–Informer Architecture with 11-Dimensional Time Features for Enhanced Traffic Flow Forecasting. Mathematics 2025, 13, 3191. [Google Scholar] [CrossRef]
  11. Long, W.; Xiao, Z.; Wang, D.; Jiang, H.; Chen, J.; Li, Y.; Alazab, M. Unified spatial-temporal neighbor attention network for dynamic traffic prediction. IEEE Trans. Veh. Technol. 2022, 72, 1515–1529. [Google Scholar] [CrossRef]
  12. Chen, J.; Zhang, S.; Xu, W. Scalable Prediction of Heterogeneous Traffic Flow with Enhanced Non-Periodic Feature Modeling. Expert Syst. Appl. 2025, 294, 128847. [Google Scholar] [CrossRef]
  13. Lana, I.; Del Ser, J.; Velez, M.; Vlahogianni, E.I. Road traffic forecasting: Recent advances and new challenges. IEEE Intell. Transp. Syst. Mag. 2018, 10, 93–109. [Google Scholar] [CrossRef]
  14. Qi, P.; Pan, C.; Xu, X.; Wang, J.; Liang, J.; Zhou, W. A review of dynamic traffic flow prediction methods for global energy-efficient route planning. Sensors 2025, 25, 5560. [Google Scholar] [CrossRef]
  15. Toan, T.D.; Truong, V.H. Support vector machine for short-term traffic flow prediction and improvement of its model training using nearest neighbor approach. Transp. Res. Rec. 2021, 2675, 362–373. [Google Scholar] [CrossRef]
  16. Luo, X.; Li, D.; Zhang, S. Traffic flow prediction during the holidays based on DFT and SVR. J. Sens. 2019, 2019, 6461450. [Google Scholar] [CrossRef]
  17. Emami, A.; Sarvi, M.; Asadi Bagloee, S. Using Kalman filter algorithm for short-term traffic flow prediction in a connected vehicle environment. J. Mod. Transp. 2019, 27, 222–232. [Google Scholar] [CrossRef]
  18. Luo, X.; Li, D.; Yang, Y.; Zhang, S. Spatiotemporal traffic flow prediction with KNN and LSTM. J. Adv. Transp. 2019, 2019, 4145353. [Google Scholar] [CrossRef]
  19. Wang, Q.; Chen, J.; Song, Y.; Li, X.; Xu, W. Fusing visual quantified features for heterogeneous traffic flow prediction. Promet-Traffic Transp. 2024, 36, 1068–1077. [Google Scholar] [CrossRef]
  20. Lu, S.; Zhang, Q.; Chen, G.; Seng, D. A combined method for short-term traffic flow prediction based on recurrent neural network. Alex. Eng. J. 2021, 60, 87–94. [Google Scholar] [CrossRef]
  21. Sun, S.; Zhang, C.; Yu, G. A Bayesian network approach to traffic flow forecasting. IEEE Trans. Intell. Transp. Syst. 2006, 7, 124–132. [Google Scholar] [CrossRef]
  22. Shu, W.; Cai, K.; Xiong, N.N. A short-term traffic flow prediction model based on an improved gate recurrent unit neural network. IEEE Trans. Intell. Transp. Syst. 2021, 23, 16654–16665. [Google Scholar] [CrossRef]
  23. Zhang, W.; Yu, Y.; Qi, Y.; Shu, F.; Wang, Y. Short-term traffic flow prediction based on spatio-temporal analysis and CNN deep learning. Transp. A Transp. Sci. 2019, 15, 1688–1711. [Google Scholar] [CrossRef]
  24. Wang, T.; Chen, J.; Lü, J.; Liu, K.; Zhu, A.; Snoussi, H.; Zhang, B. Synchronous spatiotemporal graph transformer: A new framework for traffic data prediction. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 10589–10599. [Google Scholar] [CrossRef]
  25. Yingran, Z.; Chao, L.; Rui, S. Enhancing Traffic Flow Forecasting with Delay Propagation: Adaptive Graph Convolution Networks for Spatio-Temporal Data. IEEE Trans. Intell. Transp. Syst. 2025, 26, 650–660. [Google Scholar] [CrossRef]
  26. Jiang, J.; Han, C.; Zhao, W.X.; Wang, J. Pdformer: Propagation delay-aware dynamic long-range transformer for traffic flow prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 4365–4373. [Google Scholar] [CrossRef]
  27. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv 2017. [Google Scholar] [CrossRef]
  28. Fu, R.; Zhang, Z.; Li, L. Using LSTM and GRU neural network methods for traffic flow prediction. In Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC); IEEE: Piscataway, NJ, USA, 2016; pp. 324–328. [Google Scholar] [CrossRef]
  29. Yu, B.; Yin, H.; Zhu, Z. Spatio–temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 3634–3640. [Google Scholar] [CrossRef]
  30. Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 922–929. [Google Scholar] [CrossRef]
  31. Song, C.; Lin, Y.; Guo, S.; Wan, H. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 914–921. [Google Scholar] [CrossRef]
  32. Shin, Y.; Yoonjin, Y. PGCN: Progressive graph convolutional networks for spatial–temporal traffic forecasting. IEEE Trans. Intell. Transp. Syst. 2024, 25, 7633–7644. [Google Scholar] [CrossRef]
  33. Bai, L.; Yao, L.; Li, C.; Wang, X.; Wang, C. Adaptive graph convolutional recurrent network for traffic forecasting. Adv. Neural Inf. Process. Syst. 2020, 33, 17804–17815. [Google Scholar] [CrossRef]
  34. Li, M.; Zhu, Z. Spatial-temporal fusion graph neural networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 4189–4196. [Google Scholar] [CrossRef]
Figure 1. The overall framework of SCTN. The architecture consists of four key components: (1) Graph Learning Layer, which simultaneously learns a static graph (capturing stable spatial dependencies) and a dynamic graph (capturing time-varying local correlations); (2) Spatiotemporal Embedding Layer, which injects position and periodic information; (3) Gated Temporal Attention Module (GTAM), which processes temporal dependencies through dual-path causal-trend attention; (4) Adaptive Graph Convolution Module, which aggregates spatial information via two independent branches guided by static and dynamic graphs, with residual and skip connections ensuring stable gradient flow.
Figure 2. Gated Temporal Attention Module. The module employs an adaptive attention mechanism over multi-layer graph convolutions to dynamically select the effective receptive field for each node. Instead of simply concatenating features from different propagation depths, it learns attention scores for neighbors at different diffusion layers, thereby balancing local information and global propagation. This allows the model to generate more discriminative node representations on non-grid, unstructured data.
Figure 3. Visualization of Error Metrics and Reduction Percentages Across Forecasting Horizons on Datasets.
Figure 4. Ablation study on the PEMS04/08 dataset. (a) MAE results of ablation experiments on PEMS04; (b) RMSE results of ablation experiments on PEMS04; (c) MAE results of ablation experiments on PEMS08; (d) RMSE results of ablation experiments on PEMS08.
Figure 5. Visualization of traffic flow predictions on the PEMS04/08 dataset. (a) Visualization comparison of prediction results on PEMS04; (b) Visualization comparison of prediction results on PEMS08.
Figure 6. Dynamic graph A_d evolution at different time slices on the PEMS04 dataset.
Figure 7. Cost Experiment (PEMS08).
Table 1. Notations and explanations.
| Notation | Description | Traffic Characteristic |
|---|---|---|
| G(V, A) | Directed graph representation | V: set of N traffic sensors/detectors; A: spatial correlation matrix between road segments |
| N = \|V\| | Number of nodes | Total count of deployed traffic sensors in the network |
| A_s | Static adaptive adjacency matrix | Learned static spatial dependencies between sensors based on global traffic patterns |
| A_d | Dynamic adjacency matrix | Time-varying correlations between sensors at each step |
| A, Â | (Normalized) adjacency matrix | Final graph topology for information propagation |
| X, X_t | Node feature matrix | Traffic measurements at all sensors at time t |
| x_i, x_j | Feature vector of node i/j | Multi-dimensional traffic state at a specific sensor location |
| H^(l), H^(0) | Node features at layer l; initial transformed features | Encoded traffic representations after l-hop spatial aggregation; MLP-projected traffic features before graph convolution |
| W | Attention parameter matrix | Weights for adaptive layer-wise importance scoring |
| k | Number of diffusion steps | Maximum hop distance for spatial information propagation |
| TE, SE | Temporal embedding, spatial embedding | Learnable time-of-day/week patterns for periodic traffic trends; learnable sensor-specific geographic/road context |
| Y, Ŷ | Ground truth and predicted values | Traffic flow measurements (vehicles per time interval) |
Table 2. Statistics of Datasets.
| Dataset | PeMS04 | PeMS08 |
|---|---|---|
| Nodes | 307 | 170 |
| Edges | 340 | 295 |
| Samples | 16,992 | 17,856 |
| Traffic pattern | Medium-scale urban area | Suburban with fluctuations |
| Missing rate | 3.18% | 0.69% |
| Aggregation interval | 5 min | 5 min |
| Data types | Flow, Speed, Occupancy | Flow, Speed, Occupancy |
| Location | Bay Area, San Francisco, CA, USA | San Bernardino County, CA, USA |
Table 3. Performance comparison of different methods on the PEMS04 and PEMS08 datasets.
| Method | PeMS04 RMSE | PeMS04 MAE | PeMS04 MAPE (%) | PeMS08 RMSE | PeMS08 MAE | PeMS08 MAPE (%) |
|---|---|---|---|---|---|---|
| SVR | 44.56 | 28.70 | 19.20 | 36.16 | 23.25 | 14.64 |
| FC-LSTM | 41.59 | 27.14 | 18.20 | 34.06 | 22.20 | 14.20 |
| DCRNN | 38.12 | 24.70 | 17.12 | 27.83 | 17.86 | 11.45 |
| STGCN | 35.55 | 22.70 | 14.59 | 26.71 | 18.02 | 11.40 |
| ASTGCN | 35.22 | 22.93 | 16.56 | 28.16 | 18.61 | 13.08 |
| STSGCN | 33.65 | 21.19 | 13.90 | 26.80 | 17.13 | 10.96 |
| PGCN | 32.02 | 20.00 | 13.96 | 25.19 | 15.26 | 10.02 |
| PDFormer | 29.98 | 18.35 | 12.26 | 24.18 | 14.98 | 9.89 |
| AGCRN | 32.30 | 19.83 | 12.97 | 25.22 | 15.95 | 10.09 |
| DPSTGC | 30.99 | 19.07 | 12.52 | 24.81 | 15.16 | 9.76 |
| STFGNN | 31.88 | 19.64 | 12.69 | 26.22 | 16.64 | 10.60 |
| SCTN | 30.75 | 18.87 | 12.23 | 24.32 | 15.36 | 9.79 |

Share and Cite

MDPI and ACS Style

Feng, X.; Sheng, L.; Zhu, L.; Feng, Y.; Wei, C.; Xiao, X.; Wang, H. Traffic Flow Prediction in Complex Transportation Networks via a Spatiotemporal Causal–Trend Network. Mathematics 2026, 14, 443. https://doi.org/10.3390/math14030443


