1. Introduction
Accurate traffic flow prediction plays a crucial role in the management and optimization of transportation systems. It is not only the key basis for traffic planning, scheduling, and control but also an important prerequisite for improving traffic operation efficiency, relieving congestion, and ensuring travel safety. With the rapid development of data collection technology in the transportation field, a large amount of multi-dimensional traffic data can be obtained in real time, which provides a rich information foundation for building more accurate traffic flow prediction models.
Traffic flow sequence modeling, an effective tool for processing such multidimensional traffic data, is widely used in traffic flow prediction. Traffic flow data is essentially a multivariate time series, where each dimension corresponds to a specific univariate time series, such as traffic flow, vehicle speed, or congestion index. The core task of traffic flow prediction is to use historical observations to uncover the inherent patterns in the data and accurately estimate future values. Compared with traditional univariate time-series prediction [1], the significant advantage of multivariate traffic flow prediction lies in its ability to account for the mutual influence of multiple variables in traffic data, treating these variables as equally important features and feeding them into the prediction model together. This multivariate fusion enables the model to capture the complex dynamic characteristics of the transportation system more comprehensively, thereby providing more reliable support for downstream transportation decision-making.
In practical applications, the core task of traffic flow prediction is to accurately capture the complex dependencies in traffic flow data [2,3,4]. Temporal dependence, one of the key features of traffic flow data [5], reflects how the traffic state has evolved at different times and locations. However, focusing solely on cross-time dependencies is far from sufficient. Cross-dimensional dependencies among traffic variables [6] also matter: information from related sequences in other dimensions can improve the prediction of a given sequence. For example, in complex urban transportation networks, the mutual influence between intersections forms a spatial dependency in traffic flow data. Specifically, the traffic flow at upstream intersections directly affects the input flow at downstream intersections [7], while the traffic conditions at downstream intersections also affect upstream intersections through feedback mechanisms. Previous neural models capture cross-dimensional dependencies explicitly by preserving dimension information in the latent feature space and using techniques such as Convolutional Neural Networks (CNNs) [8] or Graph Neural Networks (GNNs) [9] to explore the dependencies between dimensions. Through these methods, a model can grasp the cross-dimensional dependencies in traffic flow data more comprehensively, thereby improving the accuracy and reliability of traffic flow prediction.
A key challenge in traffic flow prediction lies in capturing the dual dependencies of temporal evolution and cross-dimensional (spatial/multi-variable) interactions within the data [10]. The latter, such as the mutual influence between upstream and downstream intersections or correlations between traffic flow and speed, is as critical to prediction accuracy as the former. While attention-based models have become mainstream due to their superior long-range temporal modeling ability, their design deviates from the characteristics of multi-dimensional traffic data, leading to notable limitations: GMAN [
11] and ST-WA [
12] flatten spatial and multi-variable dimensions into single feature vectors at each time step, erasing explicit dimension-wise structures and resulting in superficial cross-dimensional modeling; iTransformer [
13] enables cross-variable attention via dimension inversion but lacks dedicated designs for traffic-specific spatial dependencies; efficiency-optimized models like Informer [
14], Autoformer [
5], and Conformer [
15] prioritize reducing complexity or enhancing long-term temporal capture yet either neglect cross-dimensional dependencies entirely or model them implicitly; and even Crossformer [
16] and Scaleformer [
17], which focus on periodic/scale-aware patterns, still rely on dimension flattening that loses fine-grained cross-dimensional relationships. To address these limitations, this study proposes TSAformer, a novel Transformer-based model that retains explicit spatial/multi-variable structural information via a multi-dimensional input embedding layer, explicitly models dual dependencies in a hierarchical way using a custom Two-Stage Attention (TSA) module, and leverages a hierarchical encoder–decoder structure—enabling efficient and comprehensive capture of both critical dependency types to overcome the shortcomings of existing approaches. The complete framework is illustrated in
Figure 1.
The main contributions of this study are summarized as follows:
We systematically identify limitations of existing Transformer-based traffic flow prediction models: these models typically compress monitoring station time-series into single vectors to model only temporal patterns, failing to account for cross-station spatial correlations and comprehensive multi-dimensional feature integration, which restricts prediction accuracy.
We propose the TSAformer architecture: integrated with a multi-dimensional input embedding layer fusing traffic flow, temporal, and spatial features; a TSA module for dual capture of temporal and spatial dependencies; and a hierarchical encoder–decoder structure, it enables explicit modeling of cross-dimensional dependencies in transportation networks within the Transformer framework.
We validated TSAformer’s outstanding performance via extensive experiments: on multiple real-world datasets covering urban road networks and highways, TSAformer outperforms state-of-the-art deep learning models across core metrics, setting new records in 36 out of 58 key scenarios and ranking top two in 51 scenarios, demonstrating strong practical application value for intelligent transportation systems.
2. Related Work
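We categorize traffic-flow forecasting methods into four families: (i) statistical and machine learning models, (ii) graph-based methods, (iii) attention-based methods, and (iv) differential equation-based methods. Attention-based methods were discussed in Section 1; the remaining three families are reviewed below.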
2.1. Statistical and Machine Learning Models
Early work emphasized lightweight baselines. The Historical Average (HA) [
18] extrapolates by averaging past observations from similar time slots, performing adequately only when the process is close to stationary. ARIMA [
19] extends linear autoregression with moving-average errors, but its linearity hampers performance on real-world nonlinear signals. Multivariate approaches such as vector autoregression (VAR) [
20] jointly model interdependent series and capture cross-variable lag effects yet still rely on linear dynamics. Other popular choices—linear regression (LR) [
21], support vector regression (SVR) [
22], and gradient-boosted trees such as XGBoost—can be competitive with careful feature engineering, yet they struggle to natively encode complex spatiotemporal dependencies.
2.2. Graph-Based Models
There are many graph-based methods: GWNET [
23] learns an adaptive adjacency to uncover latent node-to-node influence. DCRNN [
24] couples diffusion graph convolutions for spatial propagation with a recurrent backbone for temporal evolution. STGCN [
25] replaces recurrence with gated temporal convolutions atop GCN layers. STFGNN [
26] fuses multiple spatial and temporal graphs to capture hidden correlations. AGCRN [
27] introduces data-adaptive graph construction and parameter sharing to automatically infer inter-series dependencies. STSGCN [
28] performs synchronous spatiotemporal graph convolution within localized windows to strengthen short-term spatiotemporal coupling. GCRNN [
29] uses graph convolution and RNN to model regional water demand time series. DSTAGNN [
30] adopts improved attention and multi-scale convolution to capture road network dynamic spatiotemporal correlations. HGCN-MA [
31] uses hierarchical structure and multi-scale attention to capture urban multi-granularity spatiotemporal dependencies. STAN [
32] employs edge-gated GIN and adaptive temporal convolution for critical phenomena forecasting. RL-GCN [
33] combines graph convolution, LSTM, and reinforcement learning to predict urban traffic flow with superior performance.
2.3. Differential Equation-Based Models
Continuous-time formulations have gained traction as means of representing smooth dynamics and irregular sampling. STGODE [
34] leverages neural ODEs to model coupled spatiotemporal evolution, and STGNCDE [
35] employs paired neural controlled differential equations to describe both temporal trajectories and spatial propagation. Overall, while classical baselines provide simplicity and interpretability, graph, attention, and differential-equation models better match the nonlinear and non-stationary nature of traffic flow.
3. Methodology
Traffic flow prediction aims to forecast the future values of a multivariate time series, $x_{T+1:T+\tau} \in \mathbb{R}^{\tau \times D}$, from its historical observations $x_{1:T} \in \mathbb{R}^{T \times D}$, where $\tau$ and $T$ denote the number of future and past time steps, respectively, and $D$ denotes the number of dimensions.
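The forecasting task above is typically turned into supervised samples with a sliding window. The following minimal sketch shows one way to do this; the function name and the toy series are hypothetical, and the 12-step history and 12-step target mirror the setting reported in Section 4.2.

```python
def make_windows(series, past_steps, future_steps):
    """Split a series (one row of values per time step) into (history, target) pairs."""
    samples = []
    last_start = len(series) - past_steps - future_steps
    for start in range(last_start + 1):
        history = series[start:start + past_steps]
        target = series[start + past_steps:start + past_steps + future_steps]
        samples.append((history, target))
    return samples


# Toy series: 30 time steps, 2 dimensions (hypothetical values).
series = [[float(t), 2.0 * t] for t in range(30)]
samples = make_windows(series, past_steps=12, future_steps=12)
print(len(samples), samples[0][1][0])
```

Each sample pairs a block of past rows with the block of rows that immediately follows it, so consecutive samples overlap by all but one time step.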
3.1. Input Embedding: Preserving Multimodal Spatiotemporal Semantics
Traffic flow data is inherently multimodal, combining dynamic measurements (e.g., volume and speed) with static or slowly varying contextual features such as time-of-day, day-of-week, and sensor-specific spatial characteristics. To enable the model to learn from this rich structure, we designed a comprehensive input embedding layer that explicitly encodes each modality while preserving their distinct semantics.
Let the raw traffic observation matrix be denoted as $X \in \mathbb{R}^{T \times N}$, where $T$ is the number of historical time steps and $N$ is the number of sensor nodes (or spatial dimensions). We first project $X$ into a high-dimensional latent space using a three-layer fully connected network with nonlinear activation:
$$E_f = W_3\,\sigma\big(W_2\,\sigma(W_1 X + b_1) + b_2\big) + b_3,$$
where $W_1$, $b_1$, $W_2$, $b_2$, $W_3$, and $b_3$ are learnable parameters, and $\sigma(\cdot)$ denotes the ReLU activation function. This transformation extracts nonlinear patterns from raw traffic values and maps them into a $d$-dimensional feature space.
To capture periodic temporal behaviors, we introduce two learnable embedding matrices:
$W_{dow}$, with one row for each of the 7 days, encodes the day of the week (Monday to Sunday) into a dense vector representation $E_{dow}$. This allows the model to distinguish weekday commuting patterns from weekend leisure travel.
$W_{tod}$, with one row for each of the 288 daily slots (5 min intervals), encodes time-of-day into $E_{tod}$, enabling the model to recognize rush hours, off-peak periods, and diurnal cycles. We adopt the 5 min resolution because (i) in real-world freeway monitoring pipelines (e.g., Caltrans PeMS), detector measurements are commonly aggregated and reported in 5 min summaries for analysis and operations [36] and (ii) many established baselines and evaluation protocols on these benchmarks follow the same setting (5 min, 288 steps/day), ensuring consistency and fair comparison across methods [24].
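To make the two periodic indices concrete, the sketch below derives a time-of-day slot (0 to 287 at 5 min resolution) and a day-of-week index from a timestamp. The function name is hypothetical, and the convention Monday = 0 follows Python's `datetime.weekday()`; the indexing convention used in the original implementation is not stated and may differ.

```python
from datetime import datetime

STEPS_PER_DAY = 288  # 24 h * 60 min / 5 min


def time_indices(timestamp, interval_minutes=5):
    """Return (time-of-day slot, day-of-week index) used to look up the embeddings."""
    tod = (timestamp.hour * 60 + timestamp.minute) // interval_minutes
    dow = timestamp.weekday()  # Monday = 0, ..., Sunday = 6
    return tod, dow


tod, dow = time_indices(datetime(2018, 1, 1, 8, 30))  # a Monday morning
print(tod, dow)
```

The two integers serve as row indices into the time-of-day and day-of-week embedding tables, so all observations that share a slot share the same learned vector.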
Crucially, we also embed node-specific characteristics and relative temporal positions using a learnable tensor $E_{pn}$, with one entry $E_{pn}[t, n]$ for each time step $t$ and sensor $n$. Unlike absolute timestamps, this tensor captures relative intervals between observations, a key factor in modeling non-stationary traffic dynamics. For example, a 10 min gap during the morning peak may behave very differently from the same gap at midnight. Each element thus encodes both the identity of sensor $n$ at time step $t$ and its temporal context relative to neighboring observations.
Finally, we concatenate all four embeddings (traffic feature $E_f$, day-of-week $E_{dow}$, time-of-day $E_{tod}$, and position-node $E_{pn}$) along the feature dimension:
$$Z = \big[\,E_f \,\|\, E_{dow} \,\|\, E_{tod} \,\|\, E_{pn}\,\big],$$
where $\|$ denotes concatenation. This fused representation $Z$ serves as the input to the subsequent spatiotemporal modeling blocks. By preserving modality-specific structure and avoiding premature fusion, our embedding layer provides a rich, interpretable foundation for capturing complex spatiotemporal interactions, a critical advantage over models that flatten or compress multimodal inputs too early.
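The concatenation step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: all embedding widths (4, 2, 2, 3) are hypothetical, the tables are randomly initialised stand-ins for learnable parameters, `feature_emb` stands in for the output of the fully connected network, and all sensors are assumed to share one timestamp per time step.

```python
import random


def make_table(rows, width, rng):
    """Randomly initialised lookup table standing in for a learnable embedding."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(rows)]


def fuse_embeddings(feature_emb, tod_idx, dow_idx, tod_table, dow_table, pos_node_emb):
    """Concatenate the four embeddings along the feature axis for every (t, n)."""
    fused = []
    for t, row in enumerate(feature_emb):
        fused_row = []
        for n, feature_vec in enumerate(row):
            fused_row.append(
                feature_vec
                + tod_table[tod_idx[t]]   # time-of-day vector for step t
                + dow_table[dow_idx[t]]   # day-of-week vector for step t
                + pos_node_emb[t][n]      # joint position-node vector
            )
        fused.append(fused_row)
    return fused


rng = random.Random(0)
T, N = 3, 2                                                # toy sizes
feature_emb = [make_table(N, 4, rng) for _ in range(T)]    # T x N x 4
pos_node_emb = [make_table(N, 3, rng) for _ in range(T)]   # T x N x 3
tod_table = make_table(288, 2, rng)                        # 288 slots x 2
dow_table = make_table(7, 2, rng)                          # 7 days x 2
tod_idx, dow_idx = [102, 103, 104], [0, 0, 0]

fused = fuse_embeddings(feature_emb, tod_idx, dow_idx, tod_table, dow_table, pos_node_emb)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

Because the four parts are concatenated rather than summed, each modality occupies its own slice of the fused vector and can be inspected separately.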
For notational convenience, we denote in the following sections.
3.2. Two-Stage Attention (TSA): Modeling Time and Space Separately Yet Jointly
Traffic forecasting requires modeling dependencies along two distinct axes: time (how traffic evolves at a given location) and space (how locations influence each other). Unlike images, where height and width are symmetric, time and space in traffic data are semantically asymmetric and must be treated differently. Moreover, applying standard self-attention directly to the full spatiotemporal tensor, i.e., to all $T \times N$ positions at once, would incur a prohibitive computational cost of $O(T^2 N^2)$, which scales poorly for large networks.
To address this, we propose the Two-Stage Attention (TSA) layer, a lightweight yet expressive module that decomposes spatiotemporal modeling into two sequential stages: (1) intra-dimensional temporal attention and (2) inter-dimensional spatial routing. This design ensures efficiency while preserving modeling capacity.
3.2.1. Stage 1: Cross-Time Attention
Given an input tensor $Z \in \mathbb{R}^{L \times D \times d_{model}}$, where $L$ is the number of time segments and $D$ is the number of dimensions, we first apply multi-head self-attention independently along the time axis for each dimension $d$:
$$Z^{time}_{:,d} = \mathrm{MSA}^{time}\big(Z_{:,d},\, Z_{:,d},\, Z_{:,d}\big),$$
where $\mathrm{MSA}(Q, K, V)$ denotes multi-head self-attention, and all dimensions share the same attention weights to encourage generalization. This stage captures long-range temporal dependencies, such as morning rush hour patterns repeating daily, within each sensor’s time series. The computational complexity is $O(DL^2)$, which remains manageable even for long sequences.
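The cross-time stage can be illustrated with a deliberately simplified sketch: single-head scaled dot-product attention without learned projections, applied along the time axis separately for each dimension. The simplifications (one head, no projection matrices, no normalization or MLP) and all function names are assumptions made for brevity, not the authors' implementation.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equally sized vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def cross_time_stage(z):
    """z[i][d] is the vector of time segment i and dimension d.

    Attention runs over i separately for every d, so dimensions never mix here.
    """
    num_segments, num_dims = len(z), len(z[0])
    out = [[None] * num_dims for _ in range(num_segments)]
    for d in range(num_dims):
        series = [z[i][d] for i in range(num_segments)]
        attended = attention(series, series, series)
        for i in range(num_segments):
            # Residual connection; normalization and the MLP are omitted.
            out[i][d] = [a + b for a, b in zip(series[i], attended[i])]
    return out


# Toy input: 3 time segments, 2 dimensions, width 2 (hypothetical values).
z = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.2, 0.8]],
    [[1.0, 1.0], [0.9, 0.1]],
]
out = cross_time_stage(z)
print(len(out), len(out[0]), len(out[0][0]))
```

Changing the values of one dimension leaves the output of every other dimension untouched, which is exactly why a second, cross-dimension stage is needed.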
3.2.2. Stage 2: Cross-Dimension Routing
Modeling spatial interactions naively via full self-attention across $D$ dimensions would cost $O(D^2)$ per time segment, which becomes infeasible for city-scale sensor networks (large $D$). Instead, we introduce a learnable routing mechanism that mediates information exchange between dimensions without pairwise computation.
Key Design: Learnable Position-Node Embedding Tensor ($E_{pn}$): Before detailing the routing process, we clarify the role of the learnable tensor $E_{pn}$ introduced in our input embedding. This tensor serves as a joint spatiotemporal position encoding. Unlike standard positional embeddings that treat time and node indices independently, $E_{pn}$ directly models the coupled representation of “when” and “where.” Concretely, each entry $E_{pn}[t, n]$ encodes:
Node-specific characteristics: the intrinsic features of sensor n (e.g., its geographic role, lane type, or nearby points of interest).
Relative temporal context: the position of time step t within the observed sequence, capturing its order and interval-based relationships (e.g., whether it belongs to the start, middle, or end of a peak period).
This design allows the model to distinguish, for example, that a sensor near a school exhibits different traffic patterns during morning drop-off and afternoon pickup according to the relative position of each time step within the observed sequence, even if the absolute clock times differ. The tensor is initialized randomly and optimized end-to-end, enabling it to learn data-driven spatiotemporal inductive biases.
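Routing Mechanism: For each time segment $i$, we define a set of $c$ ($c \ll D$) learnable router vectors $R_{i,:} \in \mathbb{R}^{c \times d_{model}}$. These routers first aggregate information from all $D$ dimensions, leveraging the rich spatiotemporal context provided by the position-node embedding tensor $E_{pn}$:
$$B_{i,:} = \mathrm{MSA}^{dim}_1\big(R_{i,:},\, Z^{time}_{i,:},\, Z^{time}_{i,:}\big),$$
where $Z^{time}$ denotes the output of Stage 1. The routers then distribute the aggregated context back to each dimension:
$$\bar{Z}^{dim}_{i,:} = \mathrm{MSA}^{dim}_2\big(Z^{time}_{i,:},\, B_{i,:},\, B_{i,:}\big).$$
Finally, we apply residual connections and MLP:
$$\hat{Z}^{dim} = Z^{time} + \bar{Z}^{dim}, \qquad Z^{dim} = \hat{Z}^{dim} + \mathrm{MLP}\big(\hat{Z}^{dim}\big).$$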
This two-step routing reduces complexity from $O(D^2 L)$ to $O(cDL) = O(DL)$ (since $c$ is small and fixed) while still enabling all-to-all spatial communication. The router acts as a bottleneck that forces the model to compress and redistribute the most relevant cross-dimensional signals, much as traffic control centers aggregate and broadcast congestion alerts.
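The aggregate-then-distribute pattern can be sketched for a single time segment as follows. As a simplification, the sketch uses single-head attention without learned projections and omits the MLP; the router values are fixed toy numbers rather than trained parameters, the position-node embedding is not modeled, and all names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equally sized vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def route_segment(z_i, routers_i):
    """z_i: D vectors of one time segment; routers_i: c router vectors."""
    # Step 1: the c routers query all D dimensions (c x D attention scores).
    buffer = attention(routers_i, z_i, z_i)
    # Step 2: each dimension queries the c buffered messages (D x c scores).
    received = attention(z_i, buffer, buffer)
    # Residual connection; the MLP is omitted.
    return [[a + b for a, b in zip(x, r)] for x, r in zip(z_i, received)]


# Toy segment with D = 4 dimensions and c = 2 routers (hypothetical values).
z_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
routers_i = [[0.1, 0.2], [-0.3, 0.4]]
out = route_segment(z_i, routers_i)
print(len(out), len(out[0]))
```

With a single router every dimension receives the same aggregated message, which makes the bottleneck effect visible; adding routers lets different dimensions read different mixtures of the aggregated context.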
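Combining both stages, the full TSA layer is defined as
$$\mathrm{TSA}(Z) = Z^{dim},$$
i.e., the cross-time stage applied to $Z$ followed by the cross-dimension routing stage.
The overall complexity is $O(DL^2 + DL) = O(DL^2)$, dominated by the temporal stage, a favorable trade-off for traffic forecasting, where temporal patterns are typically longer-range and more structured than spatial ones.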
3.3. Hierarchical Encoder–Decoder Structure: Capturing Multi-Scale Dynamics
Traffic systems exhibit dynamics at multiple temporal scales: short-term fluctuations (seconds to minutes), mid-term patterns (rush hours), and long-term trends (daily/weekly cycles). To capture this hierarchy, we designed a hierarchical encoder–decoder (HED) architecture that progressively coarsens the temporal resolution in the encoder and refines predictions across scales in the decoder.
3.3.1. Encoder: Coarsening for Multi-Scale Abstraction
The encoder consists of N stacked layers, each designed to capture traffic patterns at progressively broader time horizons. The first layer takes the embedded input, denoted $Z^{enc,0}$. Each subsequent layer performs two complementary operations:
Segment Merging: Building Temporal Hierarchies: To mimic how traffic operators view data from minute-level readings to hourly trends, we merge adjacent time segments. This operation reduces sequence length while preserving essential information through a learnable projection:
$$\hat{Z}^{enc,l}_{i,d} = M\,\big[\,Z^{enc,l-1}_{2i-1,d} \cdot Z^{enc,l-1}_{2i,d}\,\big],$$
where $M \in \mathbb{R}^{d_{model} \times 2d_{model}}$ is a learnable projection matrix, and $[\cdot]$ denotes concatenation. If $L_{l-1}$, the number of segments in layer $l-1$, is odd, we zero-pad the last segment. Conceptually, this merges fine-scale fluctuations (e.g., rapid changes at a 5 min level) into coarser, smoothed representations (e.g., 10 min aggregates), enabling the model to focus on longer-horizon patterns.
TSA Refinement: Extracting Scale-Specific Dependencies: After merging, the representation is processed by a TSA layer to model both temporal evolution and spatial interactions at that specific scale:
$$Z^{enc,l} = \mathrm{TSA}\big(\hat{Z}^{enc,l}\big).$$
This allows each layer to specialize: lower layers capture short-term, high-frequency variations (e.g., sudden congestion caused by an accident), while higher layers focus on stable, long-term periodicities (e.g., daily commute peaks).
This process yields a pyramid of representations $\{Z^{enc,0}, Z^{enc,1}, \dots, Z^{enc,N}\}$, where higher layers capture coarser, longer-range patterns (e.g., daily trends), and lower layers retain fine-grained details (e.g., minute-by-minute fluctuations). The encoder thus builds a multi-resolution understanding of traffic dynamics, analogous to viewing the same road network through different temporal “zoom levels”.
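The segment-merging step of the encoder can be sketched as below. The projection matrix is learnable in the model; here a fixed averaging matrix is used as a stand-in so that the result is easy to verify by hand. The function name and toy values are hypothetical.

```python
def merge_segments(z, weight):
    """Merge adjacent time segments pairwise.

    z[i][d] is the vector (width w) of segment i and dimension d;
    weight is a w x 2w projection applied to each concatenated pair.
    """
    num_dims, width = len(z[0]), len(z[0][0])
    if len(z) % 2 == 1:
        # Odd number of segments: append one zero segment as padding.
        z = z + [[[0.0] * width for _ in range(num_dims)]]
    merged = []
    for i in range(0, len(z), 2):
        row = []
        for d in range(num_dims):
            pair = z[i][d] + z[i + 1][d]  # concatenation, length 2w
            row.append([sum(w * p for w, p in zip(w_row, pair)) for w_row in weight])
        merged.append(row)
    return merged


# Stand-in projection: averages the two halves of the concatenated pair.
weight = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
# Toy input: 3 segments, 1 dimension, width 2 (hypothetical values).
z = [[[2.0, 4.0]], [[4.0, 8.0]], [[6.0, 2.0]]]
merged = merge_segments(z, weight)
print(merged)
```

Each application halves the number of segments (rounding up), so stacking several layers produces the progressively coarser sequence lengths described above.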
3.3.2. Decoder: Multi-Scale Prediction Fusion
The decoder mirrors the encoder’s hierarchy but operates in reverse, starting from the coarsest (most abstract) scale and gradually reintroducing fine details to form accurate predictions. This design ensures that long-term trends guide the overall forecast, while short-term adjustments refine local accuracy.
Initial Context: Seeding with Learnable Future Positions: At the coarsest layer (
), we initialize the decoder with learnable position embeddings
(where
is the prediction horizon), which represent a learnable “prototype” of future temporal patterns:
These embeddings are optimized to capture common periodic structures in traffic, providing a structured prior for generation.
Cross-Scale Attention: Integrating Multi-Resolution Context: For each finer layer
, the decoder first refines its current representation using TSA then selectively queries the corresponding encoder layer at the same scale via cross-attention:
This mechanism allows the decoder to "look back" at the encoded history at the appropriate temporal granularity—for example, when predicting the next 30 min, it can refer to both recent 5 min details (from lower encoder layers) and broader hourly trends (from higher layers).
Scale-Specific Prediction: Specialized Contribution at Each Level: Each decoder layer
l produces a partial prediction that captures dynamics specific to its scale:
where
is the predicted segment for dimension
d at scale
l. Intuitively, coarse layers contribute smooth, trend-like components, while fine layers add high-resolution adjustments.
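Final Aggregation: Synthesizing the Complete Forecast: Predictions from all scales are summed to produce the final output, integrating multi-scale insights:
$$\hat{x}_{T+1:T+\tau} = \sum_{l} \hat{x}^{(l)}_{T+1:T+\tau},$$
where $\hat{x}^{(l)}_{T+1:T+\tau}$ denotes the partial prediction produced by decoder layer $l$ for the $\tau$ future steps that follow the $T$ historical steps.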
This additive fusion ensures that the model can simultaneously account for both macroscopic patterns and microscopic variations.
This multi-scale design allows TSAformer to leverage both local details and global trends. For example, coarse layers can predict overall congestion levels while fine layers adjust for lane-specific incidents. The hierarchical structure also improves training stability by providing intermediate supervision signals at multiple resolutions.
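The additive fusion of scale-specific predictions amounts to an element-wise sum. The sketch below uses two hypothetical partial forecasts, a smooth coarse component and a small fine-scale correction; the values are made up for illustration.

```python
def fuse_scales(per_scale_predictions):
    """Element-wise sum of per-scale forecasts, each shaped [horizon][num_dims]."""
    horizon = len(per_scale_predictions[0])
    num_dims = len(per_scale_predictions[0][0])
    return [
        [sum(pred[t][d] for pred in per_scale_predictions) for d in range(num_dims)]
        for t in range(horizon)
    ]


# Hypothetical partial predictions for 3 future steps and 2 sensors.
coarse = [[100.0, 80.0], [100.0, 80.0], [100.0, 80.0]]  # trend-like component
fine = [[-4.0, 2.0], [1.5, 0.0], [3.0, -2.5]]           # short-term adjustment
forecast = fuse_scales([coarse, fine])
print(forecast)
```

Because the components are summed, each layer only needs to explain what the other scales leave unexplained.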
4. Experiments
In this section, we first describe the datasets, evaluation metrics, baselines, and training setup. We then report the overall performance of our model, analyze the impact of key hyperparameters on performance, and present ablation studies to quantify the contribution of each component.
4.1. Datasets
To validate our proposed approach, we conducted experiments on three widely adopted real-world datasets: PeMS04, PeMS07, and PeMS08. These datasets comprise traffic flow measurements gathered from sensor networks deployed across California’s highway and freeway infrastructure. The temporal resolution of all datasets is set at 5 min intervals, providing fine-grained traffic pattern information.
4.2. Evaluation
Our evaluation framework employs three standard performance metrics commonly utilized in traffic flow forecasting research: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). The prediction framework is configured to use historical traffic data spanning 60 min (equivalent to 12 temporal intervals) as input to generate forecasts for the following 60 min period.
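For reference, the three metrics can be computed as in the following sketch. Skipping zero-valued targets in MAPE is an assumption made here to avoid division by zero; the masking rule used in the reported experiments is not stated. The toy values are hypothetical.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """MAPE in percent; targets equal to zero are skipped (an assumption)."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs(p - t) / abs(t) for t, p in pairs) / len(pairs)


y_true = [100.0, 200.0, 50.0, 0.0]
y_pred = [110.0, 190.0, 45.0, 5.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

In this toy example, the same absolute error of 5 contributes 10% to MAPE at a flow of 50 but would contribute only 2.5% at a flow of 200, which illustrates why MAPE is sensitive to low-flow periods.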
4.3. Baselines
To evaluate our model, we selected ten baselines grouped into three categories: classical statistical and machine learning models, multivariate time-series forecasting models, and traffic-flow forecasting models.
Classical statistical and machine learning models.
VAR [
20]: A classical multivariate linear model that captures inter-variable lagged dependencies via vector autoregression.
SVR [
22]: A kernel-based regression approach with
-insensitive loss, effective for nonlinear mappings from historical inputs to targets.
Multivariate time-series forecasting models.
PatchTST [
37]: Treats sub-sequences as patches and models variable-wise independence in attention to improve efficiency and accuracy.
iTransformer [
13]: An inverted Transformer that embeds the entire series of each variable as a token, so that self-attention across these variable tokens captures cross-variable correlations.
Traffic-flow forecasting models.
DCRNN [
24]: Diffusion convolution integrated with recurrent units to learn directional spatiotemporal dependencies on sensor graphs.
STGCN [
25]: Stacks spatial graph convolutions with gated temporal convolutions to jointly model space–time dynamics.
STSGCN [
28]: Uses spatiotemporal synchronous graph convolution blocks to capture localized spatiotemporal correlations.
STFGNN [
26]: Fuses multiple spatial and temporal graphs within a GNN to model hidden spatiotemporal relations.
STGODE [
34]: Formulates node dynamics with neural ordinary differential equations to model continuous-time spatiotemporal evolution.
AGCRN [
27]: Adaptive graph convolutional recurrent network with node-adaptive parameters for personalized spatial and temporal modeling.
4.4. Implementation Details
All experiments were performed on a computational platform featuring an NVIDIA Tesla T4 GPU (24 GB; NVIDIA Corporation, Santa Clara, CA, USA) operating on Ubuntu 20.04. The model architecture was developed using PyTorch version 1.10.1 within a Python 3.9.7 environment. The experimental data was partitioned following a 7:1:2 split ratio for training, validation, and testing phases, respectively.
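A 7:1:2 partition can be sketched as follows. The sketch assumes a chronological split (earliest samples for training, latest for testing), which is common for time-series benchmarks but is not stated explicitly above; the function name is hypothetical.

```python
def chronological_split(samples, ratios=(7, 1, 2)):
    """Split time-ordered samples into train/validation/test without shuffling."""
    total = sum(ratios)
    n = len(samples)
    train_end = n * ratios[0] // total
    val_end = train_end + n * ratios[1] // total
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]


samples = list(range(100))  # stand-in for 100 time-ordered samples
train, val, test = chronological_split(samples)
print(len(train), len(val), len(test))
```

Keeping the order intact ensures that the model is always evaluated on data recorded after everything it was trained on.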
Model optimization involved systematic hyperparameter exploration across multiple dimensions: hidden dimension sizes were examined within the range {64, 128, 256}, and both encoder and decoder layer counts were varied among {1, 2, 3} configurations, while the multi-head attention mechanism utilized 4 attention heads consistently. The merge window parameter was configured to 2. To prevent overfitting during training, a dropout probability of 0.2 was incorporated.
The optimal model configuration was selected based on validation-set performance metrics. Training optimization utilized the AdamW algorithm with an initial learning rate of . The training process employed mini-batches of size 16 across 100 maximum epochs. An early termination mechanism was implemented with a patience threshold of 5 epochs to prevent overfitting and ensure efficient training convergence.
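The early termination rule can be sketched as a small helper that tracks the best validation loss and stops after five consecutive epochs without improvement. The class name and the validation losses are hypothetical; only the patience value of 5 comes from the setup above.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses, one per epoch.
val_losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.55, 3.8, 3.9, 3.4]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In practice the model weights from the best epoch are usually saved and restored when the stop is triggered.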
4.5. Prediction Results
The comparison results of our model and the baselines are presented in
Table 1. In the table, the best results are highlighted in bold, and the second-best are underlined. Across the nine metrics, our model attains the best results on eight metrics and remains competitive on the remaining metric. Moreover,
Figure 2 provides a qualitative case study on PeMS08 (stations 56 and 106), where our forecasts almost overlap with the ground truth over time, indicating that the prediction errors are extremely small and the model can faithfully track both trend changes and short-term fluctuations.
PeMS04: Our model achieves the best MAE (19.51) and RMSE (29.28), improving over the next-best results by 1.61% and 8.13%, respectively.
PeMS07: Our model achieves state-of-the-art performance on all three metrics: MAE 20.27 (4.79% improvement), MAPE 8.64 (3.68%), and RMSE 32.32 (3.35%), relative to the corresponding second-best results.
PeMS08: Our model obtains the best MAE 15.39 and RMSE 23.31, with improvements of 3.51% and 7.57%, respectively. It also achieves the best MAPE 10.03, slightly outperforming the second-best baseline.
The consistent gains in
Table 1 mainly stem from TSAformer’s explicit cross-dimensional dependency modeling and its multi-scale temporal abstraction, which are both weakly handled in many baselines. First, classical statistical/ML methods (e.g., VAR and SVR) are limited by linearity or heavy feature engineering and thus cannot robustly capture the strong nonlinearity and non-stationarity of traffic dynamics (e.g., abrupt congestion onset and dissipation). Second, time-series Transformers that emphasize temporal modeling (e.g., PatchTST and iTransformer) often reduce the spatial dimension to either independent channels or implicitly mixed features, which weakens the ability to represent propagation effects across sensors (upstream-to-downstream influence) and network-wide coupling. Third, compared with graph-based models (e.g., DCRNN, STGCN, STFGNN, and AGCRN), TSAformer avoids strong reliance on a pre-defined or locally constrained graph structure. While graph convolutions are effective for local diffusion, they may suffer from limited receptive fields (or oversmoothing when stacked deeply).
In contrast, TSAformer preserves the tensor structure along time and dimension and employs Two-Stage Attention to decouple these asymmetric axes: (i) time-axis self-attention captures long-range temporal regularities within each node, and (ii) routing-based cross-dimension interaction enables efficient all-to-all communication without quadratic cost, allowing the model to aggregate global context and redistribute it to each node adaptively. Finally, the hierarchical encoder–decoder structure further boosts performance by modeling traffic at multiple temporal resolutions: coarser levels summarize macroscopic trends, while finer levels correct short-term fluctuations, improving robustness for multi-step forecasting horizons. We also note that MAPE can be affected disproportionately by low-flow periods (small denominators), which may explain occasional cases where improvements in MAE/RMSE do not translate to the best MAPE on a specific dataset; nevertheless, TSAformer remains consistently strong across metrics and datasets, demonstrating both effectiveness and scalability.
4.6. Effect of Model Capacity (Width, Depth, and Router Number)
We conducted a grid search over the model width (dimension of model) and depth (number of layers) on PeMS04/PeMS07/PeMS08. Across all three datasets and all three metrics (MAE, MAPE, and RMSE), we observe a clear capacity trend: as width and depth increase, our model attains better accuracy, and the best results typically occur at the largest capacity, indicating that performance has not yet saturated.
As shown in
Figure 3 (left), from the smallest configuration to the best configuration within our search space, the three metrics decrease as follows. For PeMS04, MAE decreases by 14.64%, MAPE decreases by 24.66%, and RMSE decreases by 12.17%. For PeMS07, MAE decreases by 9.57%, MAPE decreases by 12.67%, and RMSE decreases by 5.63%. For PeMS08, MAE decreases by 24.00%, MAPE decreases by 26.29%, and RMSE decreases by 21.33%. These findings suggest that further scaling of width and depth is likely to deliver additional gains.
Figure 3 (right) reports the effect of the router number
while keeping other hyperparameters fixed. We observe that
yields the best overall performance (MAE
, MAPE
, RMSE
), whereas both smaller and larger router numbers lead to slightly worse accuracy. This suggests a trade-off between routing diversity and optimization efficiency: too few routers may limit the model’s ability to capture heterogeneous traffic patterns, while too many routers can introduce redundancy, fragment the aggregated information across routers, and make training less stable. Overall, a moderate router number provides sufficient specialization without sacrificing optimization efficiency, and we used
as the default setting in experiments.
4.7. Ablation Study
We conducted an ablation study with two simplified variants: w/o Dec, which maps the encoder’s hidden states to predictions using a linear layer, and w/o Dec & Emb, which further removes the learnable spatiotemporal embeddings. As shown in
Table 2, removing the decoder consistently degrades performance, and removing the embeddings on top of that leads to a further decline. These results indicate that both components are effective and complementary, and the full model (Ours) achieves the best scores, validating the necessity of the decoder and the learnable spatiotemporal embeddings.
4.8. Computational Efficiency
Let $L$ be the sequence length, $D$ the number of nodes (dimensions), $c$ the number of routers ($c \ll D$), and $d$ the hidden size. We apply self-attention along time for each node, yielding $O(DL^2 d)$. Routers aggregate from $D$ nodes and then broadcast back, costing $O(cDLd)$. The TSA attention cost is therefore $O(DL^2 d + cDLd)$, which reduces to $O(DL^2 + cDL)$ when $d$ is treated as a constant. In practical traffic forecasting, $D$ is typically much larger than $L$ (i.e., $D \gg L$), so replacing full cross-dimension attention with routing-based interaction substantially improves scalability.
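The scaling argument can be checked with a back-of-the-envelope count of query-key score evaluations, ignoring the hidden-size factor. The sizes below (12 time steps, 300 nodes, 4 routers) are hypothetical and chosen only to illustrate the order of magnitude.

```python
def full_attention_scores(num_steps, num_nodes):
    """Scores needed when every (time, node) token attends to every other token."""
    tokens = num_steps * num_nodes
    return tokens * tokens


def tsa_scores(num_steps, num_nodes, num_routers):
    """Scores needed by the two-stage scheme."""
    cross_time = num_nodes * num_steps ** 2              # per-node temporal attention
    routing = 2 * num_routers * num_nodes * num_steps    # aggregate + distribute
    return cross_time + routing


L_steps, D_nodes, c_routers = 12, 300, 4  # hypothetical sizes
full = full_attention_scores(L_steps, D_nodes)
two_stage = tsa_scores(L_steps, D_nodes, c_routers)
print(full, two_stage, full / two_stage)
```

For these sizes the two-stage scheme needs 72,000 score evaluations against 12,960,000 for full attention, a 180-fold reduction; the gap widens as the number of nodes grows because the full cost is quadratic in the number of nodes while the two-stage cost is linear.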
We compared the computational efficiency of our method with several representative models, namely, DCRNN [
24], STGCN [
25], ASTGCN [
38], and AGCRN [
27], in terms of the number of trainable parameters and the training time per epoch (
Table 3). Despite achieving strong forecasting accuracy, our model maintains a competitive model size (around 0.5 M parameters) and moderate training cost (about 40 s per epoch). These results indicate that our approach attains improved predictive performance while preserving competitive computational efficiency, making it suitable for practical deployment with limited computational budgets.
5. Conclusions
In this work, we present TSAformer, a novel Transformer-based architecture tailored for multivariate traffic flow forecasting. Unlike conventional sequence modeling approaches that collapse spatiotemporal structure into flattened representations, TSAformer explicitly preserves and jointly models the dual axes of time and dimension (i.e., sensor node or road segment) throughout the prediction pipeline. This design enables the model to capture both temporal evolution patterns and spatial interaction mechanisms—two fundamental drivers of traffic dynamics.
At the foundation of TSAformer lies a multimodal input embedding layer that encodes not only raw traffic measurements but also contextual features such as time-of-day, day-of-week, and node-specific positional-temporal characteristics. This rich embedding ensures that the model is sensitive to both periodic behaviors (e.g., rush hours and weekend effects) and sensor-specific non-stationarities (e.g., intersection topology and lane capacity), providing a semantically grounded input representation.
To efficiently model dependencies across time and space, we introduce the Two-Stage Attention mechanism. In the first stage, temporal self-attention operates independently along each dimension to capture long-range sequential patterns. In the second stage, a lightweight routing-based cross-dimension attention mediates spatial interactions without incurring quadratic complexity, making the model scalable to large-scale sensor networks. This decoupled-yet-coordinated attention design strikes a favorable balance between expressiveness and efficiency.
Built upon TSA, our hierarchical encoder–decoder structure further enhances predictive capability by modeling traffic dynamics across multiple temporal scales. The encoder progressively coarsens temporal resolution to extract macroscopic trends, while the decoder refines predictions from coarse to fine, fusing multi-scale signals through cross-attention and residual aggregation. This enables TSAformer to simultaneously capture long-term congestion patterns and short-term incident-induced fluctuations.
Extensive experiments on three real-world traffic datasets—spanning urban arterials, freeways, and varying spatial scales—demonstrate that TSAformer consistently outperforms state-of-the-art baselines in both short-term and long-term forecasting settings. Notably, it achieves top performance in 36 out of 58 critical evaluation scenarios, including peak-hour prediction and event-driven congestion forecasting, validating its robustness and practical utility.
Despite these advances, we acknowledge several limitations that should be addressed in future work:
Road topology is not explicitly encoded: The model mainly learns spatial relations implicitly from data, which may be less faithful to physical connectivity and less robust to topology changes; integrating graph priors or sparse graph-based attention could improve efficiency and interpretability.
Lack of external context: Weather, incidents, events, and control signals are not modeled, which can degrade performance under abnormal conditions; multimodal context fusion is a clear next step.
Scalability at city scale remains challenging: On thousand-node networks, attention/routing can still be costly in memory and latency; hierarchical partitioning and sparse/linear attention could enable real-time deployment.
Robustness evaluation is limited: While we study sensitivity to model capacity (width/depth) and router number, further tests with reduced data, noisy/missing inputs, and different forecasting horizons are needed to better assess practical stability.
Temporal order sensitivity may be insufficient: Self-attention can under-emphasize strict temporal causality; adding stronger positional reinforcement or causal convolutional priors may improve long-horizon stability.
TSAformer provides a principled, scalable, and effective framework for spatiotemporal traffic forecasting. We hope this work inspires further research into structure-aware sequence modeling for transportation intelligence.