Article

TSAformer: A Traffic Flow Prediction Model Based on Cross-Dimensional Dependency Capture

College of Electrical and Control Engineering, North China University of Technology, Beijing 100144, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(1), 231; https://doi.org/10.3390/electronics15010231
Submission received: 20 November 2025 / Revised: 22 December 2025 / Accepted: 22 December 2025 / Published: 4 January 2026
(This article belongs to the Special Issue Artificial Intelligence for Traffic Understanding and Control)

Abstract

Accurate multivariate traffic flow forecasting is critical for intelligent transportation systems yet remains challenging due to the complex interplay of temporal dynamics and spatial interactions. While Transformer-based models have shown promise in capturing long-range temporal dependencies, most existing approaches compress multidimensional observations into flattened sequences—thereby neglecting explicit modeling of cross-dimensional (i.e., spatial or inter-variable) relationships, which are essential for capturing traffic propagation, network-wide congestion, and node-specific behaviors. To address this limitation, we propose TSAformer, a novel Transformer architecture that explicitly preserves and jointly models time and dimension as dual structural axes. TSAformer begins with a multimodal input embedding layer that encodes raw traffic values alongside temporal context (time-of-day and day-of-week) and node-specific positional features, ensuring rich semantic representation. The core of TSAformer is the Two-Stage Attention (TSA) module, which first models intra-dimensional temporal evolution via time-axis self-attention then captures inter-dimensional spatial interactions through a lightweight routing mechanism—avoiding quadratic complexity while enabling all-to-all cross-node communication. Built upon TSA, a hierarchical encoder–decoder (HED) structure further enhances forecasting by modeling traffic patterns across multiple temporal scales, from fine-grained fluctuations to macroscopic trends, and fusing predictions via cross-scale attention. Extensive experiments on three real-world traffic datasets—including urban road networks and highway systems—demonstrate that TSAformer consistently outperforms state-of-the-art baselines across short-term and long-term forecasting horizons. Notably, it achieves top-ranked performance in 36 out of 58 critical evaluation scenarios, including peak-hour and event-driven congestion prediction. 
By explicitly modeling both temporal and dimensional dependencies without structural compromise, TSAformer provides a scalable, interpretable, and high-performance solution for spatiotemporal traffic forecasting.

1. Introduction

In the management and optimization of transportation systems, accurate traffic flow prediction plays a crucial role. It is not only the key basis for traffic planning, scheduling, and control but also an important prerequisite for improving traffic operation efficiency, relieving congestion, and ensuring travel safety. With the rapid development of data collection technology in the transportation field, a large amount of multi-dimensional traffic data can be obtained in real time, which provides a rich information foundation for building more accurate traffic flow prediction models.
Traffic flow sequence modeling, as an effective tool for processing such multidimensional traffic data, is widely used in traffic flow prediction. Traffic flow data is essentially a multi-dimensional time series, where each dimension corresponds to a specific univariate time series, such as traffic flow, vehicle speed, or congestion index. The core task of traffic flow prediction is to fully utilize historical observations, explore the inherent patterns in the data, and accurately estimate future values. Compared with traditional univariate time series prediction [1], the significant advantage of multivariate traffic flow prediction lies in its ability to comprehensively consider the mutual influence of multiple variables in traffic data, treating these variables as equally important features and inputting them into the prediction model together. This multivariate fusion enables the model to more comprehensively capture the complex dynamic characteristics of the transportation system, thereby providing more reliable support for downstream transportation decision-making.
In practical application scenarios, the core task of traffic flow prediction is to accurately capture the complex dependencies in traffic flow data [2,3,4]. Time dependence, as one of the key features of traffic flow data [5], reflects the historical evolution trajectory of the traffic status in different time and spatial dimensions. However, focusing solely on cross-temporal dependencies is far from sufficient. Cross-dimensional dependencies of traffic variables [6] also matter: information from related sequences in other dimensions can improve the prediction for a given dimension. Taking traffic flow prediction as an example, in complex urban transportation networks, the mutual influence between intersections forms a spatial dependency relationship in traffic flow data. Specifically, the traffic flow at upstream intersections directly affects the input flow at downstream intersections [7], while the traffic conditions at downstream intersections also have an impact on upstream intersections through feedback mechanisms. Some previous neural models explicitly capture cross-dimensional dependencies by preserving dimension information in the latent feature space and using techniques such as Convolutional Neural Networks (CNNs) [8] or Graph Neural Networks (GNNs) [9] to explore the dependencies between dimensions. Through these methods, the model can more comprehensively and accurately grasp the cross-dimensional dependencies in traffic flow data, thereby improving the accuracy and reliability of traffic flow prediction.
A key challenge in traffic flow prediction lies in capturing the complex dual dependencies of temporal evolution and cross-dimensional (spatial/multi-variable) interactions within data [10]—the latter, such as mutual influence between upstream and downstream intersections or correlations between traffic flow and speed, is equally critical to prediction accuracy as the former. While attention-based models have become mainstream due to their superior long-range temporal modeling ability, their design deviates from multi-dimensional traffic data characteristics, leading to notable limitations: GMAN [11] and ST-WA [12] flatten spatial and multi-variable dimensions into single feature vectors at each time step, erasing explicit dimension-wise structures and resulting in superficial cross-dimensional modeling; iTransformer [13] enables cross-variable attention via dimension inversion but lacks dedicated designs for traffic-specific spatial dependencies; efficiency-optimized models like Informer [14], Autoformer [5], and Conformer [15] prioritize reducing complexity or enhancing long-term temporal capture yet either neglect cross-dimensional dependencies entirely or model them implicitly; and even Crossformer [16] and Scaleformer [17], which focus on periodic/scale-aware patterns, still rely on dimension flattening that loses fine-grained cross-dimensional relationships. To address these limitations, this study proposes TSAformer, a novel Transformer-based model that retains explicit spatial/multi-variable structural information via a multi-dimensional input embedding layer, explicitly models dual dependencies in a hierarchical way using a custom Two-Stage Attention (TSA) module, and leverages a hierarchical encoder–decoder structure—enabling efficient and comprehensive capture of both critical dependency types to overcome the shortcomings of existing approaches. The complete framework is illustrated in Figure 1.
The main contributions of this study are summarized as follows:
  • We systematically identify limitations of existing Transformer-based traffic flow prediction models: these models typically compress monitoring station time-series into single vectors to model only temporal patterns, failing to account for cross-station spatial correlations and comprehensive multi-dimensional feature integration, which restricts prediction accuracy.
  • We propose the TSAformer architecture: integrated with a multi-dimensional input embedding layer fusing traffic flow, temporal, and spatial features; a TSA module for dual capture of temporal and spatial dependencies; and a hierarchical encoder–decoder structure, it enables explicit modeling of cross-dimensional dependencies in transportation networks within the Transformer framework.
  • We validated TSAformer’s outstanding performance via extensive experiments: on multiple real-world datasets covering urban road networks and highways, TSAformer outperforms state-of-the-art deep learning models across core metrics, setting new records in 36 out of 58 key scenarios and ranking top two in 51 scenarios, demonstrating strong practical application value for intelligent transportation systems.

2. Related Work

We categorize traffic-flow forecasting into four families: (i) statistical and machine learning models, (ii) graph-based methods, (iii) attention-based methods, and (iv) differential equation-based methods.

2.1. Statistical and Machine Learning Models

Early work emphasized lightweight baselines. The Historical Average (HA) [18] extrapolates by averaging past observations from similar time slots, performing adequately only when the process is close to stationary. ARIMA [19] extends linear autoregression with moving-average errors, but its linearity hampers performance on real-world nonlinear signals. Multivariate approaches such as vector autoregression (VAR) [20] jointly model interdependent series and capture cross-variable lag effects yet still rely on linear dynamics. Other popular choices—linear regression (LR) [21], support vector regression (SVR) [22], and gradient-boosted trees such as XGBoost—can be competitive with careful feature engineering, yet they struggle to natively encode complex spatiotemporal dependencies.
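As a rough illustration of the Historical Average idea described above, the following sketch averages past values that fall in the same daily slot. The function name `historical_average`, the parameter `steps_per_day`, and the toy series are illustrative assumptions, not taken from [18].

```python
import numpy as np

def historical_average(series, steps_per_day, horizon):
    """Forecast the next `horizon` steps as the mean of past values in the same daily slot."""
    n = len(series)
    slots = np.arange(n) % steps_per_day
    # One mean per daily slot; assumes the series covers at least one full day.
    slot_mean = np.array([series[slots == s].mean() for s in range(steps_per_day)])
    future_slots = (n + np.arange(horizon)) % steps_per_day
    return slot_mean[future_slots]

# Toy series: 3 days, 4 slots per day (hypothetical values).
series = np.array([10, 20, 30, 40,
                   12, 22, 32, 42,
                   14, 24, 34, 44], dtype=float)
ha_forecast = historical_average(series, steps_per_day=4, horizon=2)
print(ha_forecast)
```

Because the forecast depends only on the slot index, it cannot react to an incident that happened a few minutes ago, which is the stationarity limitation noted above.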

2.2. Graph-Based Models

There are many graph-based methods. GWNET [23] learns an adaptive adjacency to uncover latent node-to-node influence. DCRNN [24] couples diffusion graph convolutions for spatial propagation with a recurrent backbone for temporal evolution. STGCN [25] replaces recurrence with gated temporal convolutions atop GCN layers. STFGNN [26] fuses multiple spatial and temporal graphs to capture hidden correlations. AGCRN [27] introduces data-adaptive graph construction and parameter sharing to automatically infer inter-series dependencies. STSGCN [28] performs synchronous spatiotemporal graph convolution within localized windows to strengthen short-term spatiotemporal coupling. GCRNN [29] uses graph convolution and RNN to model regional water demand time series. DSTAGNN [30] adopts improved attention and multi-scale convolution to capture road network dynamic spatiotemporal correlations. HGCN-MA [31] uses hierarchical structure and multi-scale attention to capture urban multi-granularity spatiotemporal dependencies. STAN [32] employs edge-gated GIN and adaptive temporal convolution for critical phenomena forecasting. RL-GCN [33] combines graph convolution, LSTM, and reinforcement learning to predict urban traffic flow.

2.3. Differential Equation-Based Models

Continuous-time formulations have gained traction as means of representing smooth dynamics and irregular sampling. STGODE [34] leverages neural ODEs to model coupled spatiotemporal evolution, and STGNCDE [35] employs paired neural controlled differential equations to describe both temporal trajectories and spatial propagation. Overall, while classical baselines provide simplicity and interpretability, graph, attention, and differential-equation models better match the nonlinear and non-stationary nature of traffic flow.

3. Methodology

In traffic flow sequence prediction, the goal is to forecast the future values $X_{T+1:T+\tau} \in \mathbb{R}^{\tau \times D}$ from the historical data $X_{1:T} \in \mathbb{R}^{T \times D}$, where $T$ and $\tau$ are the numbers of past and future time steps, respectively, and $D > 1$ is the number of dimensions.
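A minimal sketch of how this formulation is commonly turned into training samples, assuming a sliding window over an array of shape `(num_steps, D)`. The function name `make_windows` and the toy series are illustrative and not from the paper.

```python
import numpy as np

def make_windows(series, T, tau):
    """Slice a (num_steps, D) series into (history, future) pairs.

    history[k] has shape (T, D); future[k] has shape (tau, D).
    """
    num_steps = series.shape[0]
    n = num_steps - T - tau + 1
    if n <= 0:
        raise ValueError("series is too short for the requested T and tau")
    history = np.stack([series[k:k + T] for k in range(n)])
    future = np.stack([series[k + T:k + T + tau] for k in range(n)])
    return history, future

# Toy data: 30 time steps, D = 3 dimensions (hypothetical values).
series = np.arange(30 * 3, dtype=float).reshape(30, 3)
X_hist, X_fut = make_windows(series, T=12, tau=12)
print(X_hist.shape, X_fut.shape)
```

With 5 min data, `T=12` and `tau=12` correspond to one hour of history and a one-hour forecast.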

3.1. Input Embedding: Preserving Multimodal Spatiotemporal Semantics

Traffic flow data is inherently multimodal, combining dynamic measurements (e.g., volume and speed) with static or slowly varying contextual features such as time-of-day, day-of-week, and sensor-specific spatial characteristics. To enable the model to learn from this rich structure, we designed a comprehensive input embedding layer that explicitly encodes each modality while preserving their distinct semantics.
Let the raw traffic observation matrix be denoted as $X \in \mathbb{R}^{T_h \times N}$, where $T_h$ is the number of historical time steps and $N$ is the number of sensor nodes (or spatial dimensions). We first project $X$ into a high-dimensional latent space using a three-layer fully connected network with nonlinear activation:
$$X_{\text{data}} = \sigma\big(\sigma(X W_1 + b_1)\, W_2 + b_2\big)\, W_3 + b_3,$$
where $W_1 \in \mathbb{R}^{1 \times d}$, $b_1 \in \mathbb{R}^{d}$, $W_2 \in \mathbb{R}^{d \times 2d}$, $b_2 \in \mathbb{R}^{2d}$, $W_3 \in \mathbb{R}^{2d \times d}$, $b_3 \in \mathbb{R}^{d}$ are learnable parameters, and $\sigma(\cdot)$ denotes the ReLU activation function. This transformation extracts nonlinear patterns from raw traffic values and maps them into a $d$-dimensional feature space.
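The value projection can be sketched in NumPy as follows. Since the first weight matrix has shape 1 × d, this sketch assumes each scalar reading is given a trailing singleton axis before the first layer. The random weights, the sizes, and the name `embed_values` are illustrative only.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def embed_values(X, params):
    """Map raw values (T_h, N) to features (T_h, N, d) with a 3-layer MLP."""
    W1, b1, W2, b2, W3, b3 = params
    h = X[..., None]              # (T_h, N, 1): one scalar per sensor and step
    h = relu(h @ W1 + b1)         # (T_h, N, d)
    h = relu(h @ W2 + b2)         # (T_h, N, 2d)
    return h @ W3 + b3            # (T_h, N, d), no activation on the last layer

rng = np.random.default_rng(0)
T_h, N, d = 12, 5, 8
params = (
    rng.normal(size=(1, d)), np.zeros(d),
    rng.normal(size=(d, 2 * d)), np.zeros(2 * d),
    rng.normal(size=(2 * d, d)), np.zeros(d),
)
X_raw = rng.uniform(0, 300, size=(T_h, N))   # hypothetical flow readings
X_data = embed_values(X_raw, params)
print(X_data.shape)
```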
To capture periodic temporal behaviors, we introduce two learnable embedding matrices:
$E_{\text{day}} \in \mathbb{R}^{7 \times d}$ encodes the day of the week (Monday to Sunday) into a dense vector representation $X_{\text{day}}$. This allows the model to distinguish weekday commuting patterns from weekend leisure travel.
$E_{\text{time}} \in \mathbb{R}^{288 \times d}$ encodes time-of-day (assuming 5 min intervals, 288 per day) into $X_{\text{time}}$, enabling the model to recognize rush hours, off-peak periods, and diurnal cycles. We adopt the 5 min resolution because (i) in real-world freeway monitoring pipelines (e.g., Caltrans PeMS), detector measurements are commonly aggregated and reported in 5 min summaries for analysis and operations [36] and (ii) many established baselines and evaluation protocols on these benchmarks follow the same setting (5 min, 288 steps/day), ensuring consistency and fair comparison across methods [24].
Crucially, we also embed node-specific characteristics and relative temporal positions using a learnable tensor $E_{pn} \in \mathbb{R}^{T_h \times N \times d}$. Unlike absolute timestamps, this tensor captures relative intervals between observations, a key factor in modeling non-stationary traffic dynamics. For example, a 10 min gap during morning peak may behave very differently from the same gap at midnight. Each element $E_{pn}[t, n, :]$ thus encodes both the identity of sensor $n$ at time step $t$ and its temporal context relative to neighboring observations.
Finally, we concatenate all four embeddings along the feature dimension:
$$X_{\text{emb}} = \text{Concat}\big(X_{\text{data}}, X_{\text{day}}, X_{\text{time}}, E_{pn}\big) \in \mathbb{R}^{T_h \times N \times d_h},$$
where $d_h = 4d$. This fused representation $X_{\text{emb}}$ serves as the input to the subsequent spatiotemporal modeling blocks. By preserving modality-specific structure and avoiding premature fusion, our embedding layer provides a rich, interpretable foundation for capturing complex spatiotemporal interactions, a critical advantage over models that flatten or compress multimodal inputs too early.
For notational convenience, we write $X$ for $X_{\text{emb}}$ in the following sections.
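The lookup-and-concatenate step can be sketched as below. This is a shape-level illustration under assumptions: the tables are filled with random values instead of learned ones, the calendar indices `day_idx` and `time_idx` are hypothetical, and the day and time vectors are assumed to be shared across all nodes at a given step.

```python
import numpy as np

rng = np.random.default_rng(0)
T_h, N, d = 12, 5, 8

E_day = rng.normal(size=(7, d))          # day-of-week table
E_time = rng.normal(size=(288, d))       # time-of-day table (5 min slots)
E_pn = rng.normal(size=(T_h, N, d))      # joint position-node tensor

# Hypothetical calendar indices for the T_h input steps.
day_idx = np.full(T_h, 2)                        # all steps on the same weekday
time_idx = (96 + np.arange(T_h)) % 288           # slot 96 is 08:00 at 5 min resolution

X_data = rng.normal(size=(T_h, N, d))    # stand-in for the value embedding

# Look up one vector per step, then share it across all N nodes.
X_day = np.broadcast_to(E_day[day_idx][:, None, :], (T_h, N, d))
X_time = np.broadcast_to(E_time[time_idx][:, None, :], (T_h, N, d))

X_emb = np.concatenate([X_data, X_day, X_time, E_pn], axis=-1)
print(X_emb.shape)   # (T_h, N, 4 * d)
```

Concatenation keeps each modality in its own slice of the feature axis, so later layers can still tell the four sources apart.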

3.2. Two-Stage Attention (TSA): Modeling Time and Space Separately Yet Jointly

Traffic forecasting requires modeling dependencies along two distinct axes: time (how traffic evolves at a given location) and space (how locations influence each other). Unlike images, where height and width are symmetric, time and space in traffic data are semantically asymmetric and must be treated differently. Moreover, applying standard self-attention directly to the full spatiotemporal tensor $X \in \mathbb{R}^{T_h \times N \times d_h}$ would incur prohibitive computational cost, $O(N^2 T_h^2)$, which scales poorly for large networks.
To address this, we propose the Two-Stage Attention (TSA) layer, a lightweight yet expressive module that decomposes spatiotemporal modeling into two sequential stages: (1) intra-dimensional temporal attention and (2) inter-dimensional spatial routing. This design ensures efficiency while preserving modeling capacity.

3.2.1. Stage 1: Cross-Time Attention

Given an input tensor $Z \in \mathbb{R}^{L \times D \times d_{\text{model}}}$, where $L$ is the number of time segments and $D$ is the number of dimensions, we first apply multi-head self-attention independently along the time axis for each dimension $d$:
$$\hat{Z}^{\text{time}}_{:,d} = \text{LayerNorm}\big(Z_{:,d} + \text{MSA}^{\text{time}}(Z_{:,d}, Z_{:,d}, Z_{:,d})\big),$$
$$Z^{\text{time}} = \text{LayerNorm}\big(\hat{Z}^{\text{time}} + \text{MLP}(\hat{Z}^{\text{time}})\big),$$
where $\text{MSA}^{\text{time}}$ denotes multi-head self-attention, and all dimensions share the same attention weights to encourage generalization. This stage captures long-range temporal dependencies, such as morning rush hour patterns repeating daily, within each sensor’s time series. The computational complexity is $O(D L^2)$, which remains manageable even for long sequences.
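A simplified NumPy sketch of this stage is given below. It is not the authors' implementation: it uses a single attention head, a LayerNorm without learnable scale and shift, and random weights, all of which are simplifying assumptions. The point it illustrates is that attention runs over the $L$ axis separately for each dimension while the projection matrices are shared.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    """Scaled dot-product attention over the second-to-last axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cross_time_stage(Z, Wq, Wk, Wv, W1, W2):
    """Z: (L, D, d_model). Attention runs over L separately for each dimension."""
    Zd = np.swapaxes(Z, 0, 1)                       # (D, L, d_model)
    attn = attention(Zd @ Wq, Zd @ Wk, Zd @ Wv)     # projections shared by all D
    Z_hat = layer_norm(Zd + attn)                   # residual + LayerNorm
    mlp = np.maximum(Z_hat @ W1, 0.0) @ W2          # two-layer MLP with ReLU
    return np.swapaxes(layer_norm(Z_hat + mlp), 0, 1)   # back to (L, D, d_model)

rng = np.random.default_rng(0)
L, D, dm = 6, 4, 8
Z = rng.normal(size=(L, D, dm))
Wq, Wk, Wv = (rng.normal(size=(dm, dm)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(dm, 2 * dm)) * 0.1
W2 = rng.normal(size=(2 * dm, dm)) * 0.1
Z_time = cross_time_stage(Z, Wq, Wk, Wv, W1, W2)
print(Z_time.shape)
```

The score matrix has shape `(D, L, L)`, which is where the $O(D L^2)$ cost comes from.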

3.2.2. Stage 2: Cross-Dimension Routing

Modeling spatial interactions naively via full self-attention across $D$ dimensions would cost $O(D^2 L)$, which becomes infeasible for city-scale sensor networks ($D > 1000$). Instead, we introduce a learnable routing mechanism that mediates information exchange between dimensions without pairwise computation.
Key Design: Learnable Position-Node Embedding Tensor ($E_{pn}$): Before detailing the routing process, we clarify the role of the learnable tensor $E_{pn} \in \mathbb{R}^{T_h \times N \times d}$ introduced in our input embedding. This tensor serves as a joint spatiotemporal position encoding. Unlike standard positional embeddings that treat time and node indices independently, $E_{pn}$ directly models the coupled representation of “when” and “where.” Concretely, each entry $E_{pn}[t, n, :]$ encodes
  • Node-specific characteristics: the intrinsic features of sensor n (e.g., its geographic role, lane type, or nearby points of interest).
  • Relative temporal context: the position of time step t within the observed sequence, capturing its order and interval-based relationships (e.g., whether it belongs to the start, middle, or end of a peak period).
This design allows the model to distinguish, for example, that a sensor near a school exhibits different traffic patterns during morning drop-off (relative time $t_{\text{morning}}$) versus afternoon pickup ($t_{\text{afternoon}}$), even if the absolute clock times differ. The tensor is initialized randomly and optimized end-to-end, enabling it to learn data-driven spatiotemporal inductive biases.
Routing Mechanism: For each time segment $i$, we define a set of $c$ ($c \ll D$) learnable router vectors $R_{i,:} \in \mathbb{R}^{c \times d_{\text{model}}}$. These routers first aggregate information from all $D$ dimensions, leveraging the rich spatiotemporal context provided by $E_{pn}$:
$$B_{i,:} = \text{MSA}^{\text{dim}}_{1}\big(R_{i,:}, Z^{\text{time}}_{i,:}, Z^{\text{time}}_{i,:}\big),$$
then distribute the aggregated context back to each dimension:
$$\bar{Z}^{\text{dim}}_{i,:} = \text{MSA}^{\text{dim}}_{2}\big(Z^{\text{time}}_{i,:}, B_{i,:}, B_{i,:}\big).$$
Finally, we apply residual connections and MLP:
$$\hat{Z}^{\text{dim}} = \text{LayerNorm}\big(Z^{\text{time}} + \bar{Z}^{\text{dim}}\big),$$
$$Z^{\text{dim}} = \text{LayerNorm}\big(\hat{Z}^{\text{dim}} + \text{MLP}(\hat{Z}^{\text{dim}})\big).$$
This two-step routing reduces complexity from $O(D^2 L)$ to $O(c D L) = O(D L)$ (since $c$ is small and fixed) while still enabling all-to-all spatial communication. The router acts as a bottleneck that forces the model to compress and redistribute the most relevant cross-dimensional signals, mimicking how traffic control centers aggregate and broadcast congestion alerts.
Combining both stages, the full TSA layer is defined as
$$Y = \text{TSA}(Z) = Z^{\text{dim}}.$$
The overall complexity is $O(D L^2 + D L) = O(D L^2)$, dominated by the temporal stage, a favorable trade-off for traffic forecasting, where temporal patterns are typically longer-range and more structured than spatial ones.
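The gather-then-distribute routing of Stage 2 can be sketched as follows. To keep the example short, it omits learned query/key/value projections, multiple heads, and the residual, LayerNorm, and MLP steps, so it should be read as an illustration of the information flow and its cost, not as the paper's layer. The router values are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cross_dimension_routing(Z_time, R):
    """Z_time: (L, D, d_model); R: (L, c, d_model) router vectors.

    Returns the routed messages with the same shape as Z_time.
    """
    B = attention(R, Z_time, Z_time)      # (L, c, d_model): routers gather from D dims
    return attention(Z_time, B, B)        # (L, D, d_model): dims read from c routers

rng = np.random.default_rng(0)
L, D, c, dm = 6, 50, 4, 8
Z_time = rng.normal(size=(L, D, dm))
R = rng.normal(size=(L, c, dm))
Z_bar = cross_dimension_routing(Z_time, R)
print(Z_bar.shape)

# Score-matrix entries per segment: 2*c*D with routers versus D*D for full attention.
print(2 * c * D, D * D)
```

Every dimension still receives information from every other dimension within the same segment, but only through the `c` router slots.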

3.3. Hierarchical Encoder–Decoder Structure: Capturing Multi-Scale Dynamics

Traffic systems exhibit dynamics at multiple temporal scales: short-term fluctuations (seconds to minutes), mid-term patterns (rush hours), and long-term trends (daily/weekly cycles). To capture this hierarchy, we designed a hierarchical encoder–decoder (HED) architecture that progressively coarsens the temporal resolution in the encoder and refines predictions across scales in the decoder.

3.3.1. Encoder: Coarsening for Multi-Scale Abstraction

The encoder consists of $N$ stacked layers, each designed to capture traffic patterns at progressively broader time horizons. The first layer takes the embedded input $Z^{\text{enc},0} = X$. Each subsequent layer $l > 0$ performs two complementary operations:
Segment Merging: Building Temporal Hierarchies: To mimic how traffic operators view data from minute-level readings to hourly trends, we merge adjacent time segments. This operation reduces sequence length while preserving essential information through a learnable projection:
$$\hat{Z}^{\text{enc},l}_{i,d} = M\,\big[Z^{\text{enc},l-1}_{2i-1,d};\; Z^{\text{enc},l-1}_{2i,d}\big], \quad 1 \le i \le L_{l-1}/2,$$
where $M \in \mathbb{R}^{d_{\text{model}} \times 2 d_{\text{model}}}$ is a learnable projection matrix, and $[\cdot\,;\cdot]$ denotes concatenation. If $L_{l-1}$ is odd, we zero-pad the last segment. Conceptually, this merges fine-scale fluctuations (e.g., rapid changes at a 5 min level) into coarser, smoothed representations (e.g., 10 min aggregates), enabling the model to focus on longer-horizon patterns.
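A minimal sketch of segment merging, assuming that zero-padding means appending one all-zero segment when the number of segments is odd. The function name `merge_segments` and the random projection are illustrative.

```python
import numpy as np

def merge_segments(Z, M):
    """Halve the segment axis of Z (L, D, d_model) using M (d_model, 2*d_model)."""
    L, D, dm = Z.shape
    if L % 2 == 1:
        # Append a zero segment so the last segment has a partner.
        Z = np.concatenate([Z, np.zeros((1, D, dm))], axis=0)
    # Pair segments (0, 1), (2, 3), ... along the feature axis.
    pairs = np.concatenate([Z[0::2], Z[1::2]], axis=-1)   # (ceil(L/2), D, 2*d_model)
    return pairs @ M.T                                    # (ceil(L/2), D, d_model)

rng = np.random.default_rng(0)
L, D, dm = 5, 3, 4
Z = rng.normal(size=(L, D, dm))
M = rng.normal(size=(dm, 2 * dm))
Z_merged = merge_segments(Z, M)
print(Z_merged.shape)
```

The code uses 0-based indices, so the pair `(2i-1, 2i)` in the formula corresponds to rows `(2i-2, 2i-1)` of the array.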
TSA Refinement: Extracting Scale-Specific Dependencies: After merging, the representation is processed by a TSA layer to model both temporal evolution and spatial interactions at that specific scale:
$$Z^{\text{enc},l} = \text{TSA}\big(\hat{Z}^{\text{enc},l}\big).$$
This allows each layer to specialize: lower layers capture short-term, high-frequency variations (e.g., sudden congestion caused by an accident), while higher layers focus on stable, long-term periodicities (e.g., daily commute peaks).
This process yields a pyramid of representations $Z^{\text{enc},0}, Z^{\text{enc},1}, \ldots, Z^{\text{enc},N}$, where higher layers capture coarser, longer-range patterns (e.g., daily trends), and lower layers retain fine-grained details (e.g., minute-by-minute fluctuations). The encoder thus builds a multi-resolution understanding of traffic dynamics, analogous to viewing the same road network through different temporal “zoom levels”.

3.3.2. Decoder: Multi-Scale Prediction Fusion

The decoder mirrors the encoder’s hierarchy but operates in reverse, starting from the coarsest (most abstract) scale and gradually reintroducing fine details to form accurate predictions. This design ensures that long-term trends guide the overall forecast, while short-term adjustments refine local accuracy.
Initial Context: Seeding with Learnable Future Positions: At the coarsest layer ($l = 0$), we initialize the decoder with learnable position embeddings $E^{(\text{dec})} \in \mathbb{R}^{(\tau / L_{\text{seg}}) \times D \times d_{\text{model}}}$ (where $\tau$ is the prediction horizon), which represent a learnable “prototype” of future temporal patterns:
$$\tilde{Z}^{\text{dec},0} = \text{TSA}\big(E^{(\text{dec})}\big).$$
These embeddings are optimized to capture common periodic structures in traffic, providing a structured prior for generation.
Cross-Scale Attention: Integrating Multi-Resolution Context: For each finer layer $l > 0$, the decoder first refines its current representation using TSA and then selectively queries the corresponding encoder layer at the same scale via cross-attention:
$$\tilde{Z}^{\text{dec},l} = \text{TSA}\big(Z^{\text{dec},l-1}\big),$$
$$\bar{Z}^{\text{dec},l}_{:,d} = \text{MSA}\big(\tilde{Z}^{\text{dec},l}_{:,d}, Z^{\text{enc},l}_{:,d}, Z^{\text{enc},l}_{:,d}\big),$$
$$Z^{\text{dec},l} = \text{LayerNorm}\big(\tilde{Z}^{\text{dec},l} + \bar{Z}^{\text{dec},l} + \text{MLP}(\cdot)\big).$$
This mechanism allows the decoder to "look back" at the encoded history at the appropriate temporal granularity—for example, when predicting the next 30 min, it can refer to both recent 5 min details (from lower encoder layers) and broader hourly trends (from higher layers).
Scale-Specific Prediction: Specialized Contribution at Each Level: Each decoder layer l produces a partial prediction that captures dynamics specific to its scale:
$$x^{(s),l}_{i,d} = W^{l} Z^{\text{dec},l}_{i,d}, \quad W^{l} \in \mathbb{R}^{L_{\text{seg}} \times d_{\text{model}}},$$
where $x^{(s),l}_{i,d} \in \mathbb{R}^{L_{\text{seg}}}$ is the predicted segment for dimension $d$ at scale $l$. Intuitively, coarse layers contribute smooth, trend-like components, while fine layers add high-resolution adjustments.
Final Aggregation: Synthesizing the Complete Forecast: Predictions from all scales are summed to produce the final output, integrating multi-scale insights:
$$x^{\text{pred}}_{T+1:T+\tau} = \sum_{l=0}^{N} x^{\text{pred},l}_{T+1:T+\tau}.$$
This additive fusion ensures that the model can simultaneously account for both macroscopic patterns and microscopic variations.
This multi-scale design allows TSAformer to leverage both local details and global trends. For example, coarse layers can predict overall congestion levels while fine layers adjust for lane-specific incidents. The hierarchical structure also improves training stability by providing intermediate supervision signals at multiple resolutions.
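The per-scale projection and additive fusion can be sketched as below. The sketch assumes every decoder layer outputs the same number of segments, $\tau / L_{\text{seg}}$, and uses random arrays in place of real decoder states and learned projections. The helper name `scale_prediction` is illustrative.

```python
import numpy as np

def scale_prediction(Z_dec, W):
    """Z_dec: (n_seg, D, d_model), W: (L_seg, d_model) -> forecast (n_seg * L_seg, D)."""
    seg = Z_dec @ W.T                         # (n_seg, D, L_seg)
    n_seg, D, L_seg = seg.shape
    # Lay the segments end to end along the time axis.
    return seg.transpose(0, 2, 1).reshape(n_seg * L_seg, D)

rng = np.random.default_rng(0)
tau, L_seg, D, dm, n_layers = 12, 4, 3, 8, 3
n_seg = tau // L_seg

# One decoder output and one projection per layer (random stand-ins).
Z_dec = [rng.normal(size=(n_seg, D, dm)) for _ in range(n_layers)]
W = [rng.normal(size=(L_seg, dm)) for _ in range(n_layers)]

partials = [scale_prediction(z, w) for z, w in zip(Z_dec, W)]
x_pred = np.sum(partials, axis=0)             # (tau, D)
print(x_pred.shape)
```

Because the fusion is a plain sum, each layer's contribution can be inspected separately by looking at the entries of `partials`.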

4. Experiments

In this section, we first describe the datasets, evaluation metrics, baselines, and training setup. We then report the overall performance of our model, analyze the impact of key hyperparameters on performance, and present ablation studies to quantify the contribution of each component.

4.1. Datasets

To validate our proposed approach, we conducted experiments on three widely adopted real-world datasets: PeMS04, PeMS07, and PeMS08. These datasets comprise traffic flow measurements gathered from sensor networks deployed across California’s highway and freeway infrastructure. The temporal resolution of all datasets is set at 5 min intervals, providing fine-grained traffic pattern information.

4.2. Evaluation

Our evaluation framework employs three standard performance metrics commonly utilized in traffic flow forecasting research: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). The prediction framework is configured to use historical traffic data spanning 60 min (equivalent to 12 temporal intervals) as input to generate forecasts for the following 60 min period.
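For reference, the three metrics can be computed as in the sketch below. Skipping zero-valued targets in MAPE is an assumption made here to avoid division by zero; the paper does not state how such entries are handled. The sample arrays are made up.

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred, eps=1e-8):
    """MAPE in percent; entries with |y_true| <= eps are skipped (an assumption)."""
    mask = np.abs(y_true) > eps
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100.0)

# Hypothetical flows and forecasts.
y_true = np.array([100.0, 200.0, 0.0, 50.0])
y_pred = np.array([110.0, 190.0, 5.0, 45.0])
print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```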

4.3. Baselines

To evaluate our model, we selected ten baselines grouped into three categories: classical statistical and machine learning models, multivariate time-series forecasting models, and traffic-flow forecasting models.
Classical statistical and machine learning models.
  • VAR [20]: A classical multivariate linear model that captures inter-variable lagged dependencies via vector autoregression.
  • SVR [22]: A kernel-based regression approach with $\epsilon$-insensitive loss, effective for nonlinear mappings from historical inputs to targets.
Multivariate time-series forecasting models.
  • PatchTST [37]: Treats sub-sequences as patches and models variable-wise independence in attention to improve efficiency and accuracy.
  • iTransformer [13]: An inverted Transformer that embeds the whole series of each variable as a token (instead of treating each time point as a token) to capture cross-variable correlations via self-attention.
Traffic-flow forecasting models.
  • DCRNN [24]: Diffusion convolution integrated with recurrent units to learn directional spatiotemporal dependencies on sensor graphs.
  • STGCN [25]: Stacks spatial graph convolutions with gated temporal convolutions to jointly model space–time dynamics.
  • STSGCN [28]: Uses spatiotemporal synchronous graph convolution blocks to capture localized spatiotemporal correlations.
  • STFGNN [26]: Fuses multiple spatial and temporal graphs within a GNN to model hidden spatiotemporal relations.
  • STGODE [34]: Formulates node dynamics with neural ordinary differential equations to model continuous-time spatiotemporal evolution.
  • AGCRN [27]: Adaptive graph convolutional recurrent network with node-adaptive parameters for personalized spatial and temporal modeling.

4.4. Implementation Details

All experiments were performed on a computational platform featuring an NVIDIA Tesla T4 GPU (24 GB; NVIDIA Corporation, Santa Clara, CA, USA) operating on Ubuntu 20.04. The model architecture was developed using PyTorch version 1.10.1 within a Python 3.9.7 environment. The experimental data was partitioned following a 7:1:2 split ratio for training, validation, and testing phases, respectively.
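A 7:1:2 partition can be sketched as follows. The sketch assumes the split is chronological (earliest data for training, latest for testing), which is common for these benchmarks but is not stated explicitly above. The function name and the toy array are illustrative.

```python
import numpy as np

def chronological_split(data, ratios=(7, 1, 2)):
    """Split along the time axis into train/val/test without shuffling."""
    n = len(data)
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

data = np.arange(1000)          # stand-in for 1000 time-ordered samples
train, val, test = chronological_split(data)
print(len(train), len(val), len(test))
```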
Model optimization involved systematic hyperparameter exploration across multiple dimensions: hidden dimension sizes were examined within the range {64, 128, 256}, and both encoder and decoder layer counts were varied among {1, 2, 3} configurations, while the multi-head attention mechanism utilized 4 attention heads consistently. The merge window parameter was configured to 2. To prevent overfitting during training, a dropout probability of 0.2 was incorporated.
The optimal model configuration was selected based on validation-set performance metrics. Training optimization utilized the AdamW algorithm with an initial learning rate of $1 \times 10^{-4}$. The training process employed mini-batches of size 16 across 100 maximum epochs. An early termination mechanism was implemented with a patience threshold of 5 epochs to prevent overfitting and ensure efficient training convergence.
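The early termination rule can be sketched with a small helper class. The class name `EarlyStopping` and the list of validation losses are hypothetical; only the patience of 5 epochs comes from the setup above.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation losses, one per epoch.
losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.55, 3.8, 3.9, 3.4]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In this toy run the best loss is reached at epoch 2, and the loop stops five epochs later without ever seeing the final value in the list.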

4.5. Prediction Results

The comparison results of our model and the baselines are presented in Table 1. In the table, the best results are highlighted in bold, and the second-best are underlined. Across the nine metrics, our model attains the best results on eight metrics and remains competitive on the remaining metric. Moreover, Figure 2 provides a qualitative case study on PeMS08 (stations 56 and 106), where our forecasts almost overlap with the ground truth over time, indicating that the prediction errors are extremely small and the model can faithfully track both trend changes and short-term fluctuations.
PeMS04: Our model achieves the best MAE (19.51) and RMSE (29.28), improving over the next-best results by 1.61% and 8.13%, respectively.
PeMS07: Our model achieves state-of-the-art performance on all three metrics, MAE 20.27 (improvement 4.79%), MAPE 8.64 (3.68%), and RMSE 32.32 (3.35%), relative to the corresponding second-best results.
PeMS08: Our model obtains the best MAE 15.39 and RMSE 23.31, with improvements of 3.51% and 7.57%, respectively. It also achieves the best MAPE 10.03, slightly outperforming the second-best baseline.
The consistent gains in Table 1 mainly stem from TSAformer’s explicit cross-dimensional dependency modeling and its multi-scale temporal abstraction, which are both weakly handled in many baselines. First, classical statistical/ML methods (e.g., VAR and SVR) are limited by linearity or heavy feature engineering and thus cannot robustly capture the strong nonlinearity and non-stationarity of traffic dynamics (e.g., abrupt congestion onset and dissipation). Second, time-series Transformers that emphasize temporal modeling (e.g., PatchTST and iTransformer) often reduce the spatial dimension to either independent channels or implicitly mixed features, which weakens the ability to represent propagation effects across sensors (upstream-to-downstream influence) and network-wide coupling. Third, compared with graph-based models (e.g., DCRNN, STGCN, STFGNN, and AGCRN), TSAformer avoids strong reliance on a pre-defined or locally constrained graph structure. While graph convolutions are effective for local diffusion, they may suffer from limited receptive fields (or oversmoothing when stacked deeply).
In contrast, TSAformer preserves the tensor structure along time and dimension and employs Two-Stage Attention to decouple these asymmetric axes: (i) time-axis self-attention captures long-range temporal regularities within each node, and (ii) routing-based cross-dimension interaction enables efficient all-to-all communication without quadratic cost, allowing the model to aggregate global context and redistribute it to each node adaptively. Finally, the hierarchical encoder–decoder structure further boosts performance by modeling traffic at multiple temporal resolutions: coarser levels summarize macroscopic trends, while finer levels correct short-term fluctuations, improving robustness for multi-step forecasting horizons. We also note that MAPE can be affected disproportionately by low-flow periods (small denominators), which may explain occasional cases where improvements in MAE/RMSE do not translate to the best MAPE on a specific dataset; nevertheless, TSAformer remains consistently strong across metrics and datasets, demonstrating both effectiveness and scalability.
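To illustrate the remark about MAPE and low-flow periods, the sketch below evaluates the three metrics on two synthetic series that share the same absolute error. The flow values are made up for illustration, and the MAPE definition assumes that no ground-truth value is zero (benchmark code often masks zeros, which is not shown here).

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no zero targets."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic flows (illustrative only): every prediction is off by exactly 5 vehicles.
high_flow = [200.0, 220.0, 240.0]
low_flow = [10.0, 12.0, 8.0]
high_pred = [v + 5.0 for v in high_flow]
low_pred = [v + 5.0 for v in low_flow]

for name, truth, pred in (("high flow", high_flow, high_pred),
                          ("low flow", low_flow, low_pred)):
    print(f"{name}: MAE={mae(truth, pred):.2f} "
          f"RMSE={rmse(truth, pred):.2f} MAPE={mape(truth, pred):.2f}%")
```

MAE and RMSE are 5 in both cases, while MAPE rises from roughly 2.3% on the high-flow series to roughly 51% on the low-flow series, so a model can rank first on MAE/RMSE and still lose on MAPE if its errors concentrate in low-flow periods.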

4.6. Effect of Model Capacity (Width, Depth, and Router Number)

We conducted a grid search over the model width (model dimension) $\{64, 128, 256\}$ and depth (number of layers) $\{1, 2, 3\}$ on PeMS04/PeMS07/PeMS08. Across all three datasets and all three metrics (MAE, MAPE, and RMSE), we observe a clear capacity trend: as width and depth increase, our model attains better accuracy, and the best results typically occur at the largest capacity, indicating that performance has not yet saturated.
As shown in Figure 3 (left), from the smallest configuration to the best configuration within our search space, the three metrics decrease as follows. For PeMS04, MAE decreases by 14.64%, MAPE decreases by 24.66%, and RMSE decreases by 12.17%. For PeMS07, MAE decreases by 9.57%, MAPE decreases by 12.67%, and RMSE decreases by 5.63%. For PeMS08, MAE decreases by 24.00%, MAPE decreases by 26.29%, and RMSE decreases by 21.33%. These findings suggest that further scaling of width and depth is likely to deliver additional gains.
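A grid search of this kind can be organized as in the sketch below. The helper names are hypothetical, and `toy_evaluate` is a placeholder objective that merely decreases with capacity; it is not a measured result and does not reproduce the numbers in Figure 3. In practice it would be replaced by a routine that trains the model and returns a validation error.

```python
from itertools import product

WIDTHS = (64, 128, 256)   # model dimension
DEPTHS = (1, 2, 3)        # number of layers


def grid_search(evaluate, widths=WIDTHS, depths=DEPTHS):
    """Evaluate every (width, depth) pair and return the best pair and all scores."""
    results = {(w, d): evaluate(w, d) for w, d in product(widths, depths)}
    best = min(results, key=results.get)   # lower error is better
    return best, results


def relative_decrease(smallest_config_error, best_error):
    """Percentage decrease from the smallest configuration to the best one."""
    return (smallest_config_error - best_error) / smallest_config_error * 100.0


def toy_evaluate(width, depth):
    """Placeholder objective (synthetic, for illustration only)."""
    return 30.0 - 0.01 * width - 1.0 * depth


best, results = grid_search(toy_evaluate)
smallest = results[(min(WIDTHS), min(DEPTHS))]
print("best configuration:", best)
print(f"relative decrease: {relative_decrease(smallest, results[best]):.2f}%")
```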
Figure 3 (right) reports the effect of the router number $c \in \{5, 10, 15\}$ with all other hyperparameters fixed. We observe that $c = 10$ yields the best overall performance (MAE = 15.39, MAPE = 10.03%, RMSE = 23.31; these values coincide with the PeMS08 results in Table 1), whereas both smaller and larger router numbers lead to slightly worse accuracy. This suggests a trade-off between routing diversity and optimization efficiency: too few routers may limit the model’s ability to capture heterogeneous traffic patterns, while too many routers can introduce redundancy, fragment router utilization, and make training less stable. Overall, a moderate router number provides sufficient specialization without sacrificing optimization efficiency, and we used $c = 10$ as the default setting in experiments.

4.7. Ablation Study

We conducted an ablation study with two simplified variants: w/o Dec, which maps the encoder’s hidden states to predictions using a linear layer, and w/o Dec & Emb, which further removes the learnable spatiotemporal embeddings. As shown in Table 2, removing the decoder consistently degrades performance, and removing the embeddings on top of that leads to a further decline. These results indicate that both components are effective and complementary, and the full model (Ours) achieves the best scores, validating the necessity of the decoder and the learnable spatiotemporal embeddings.
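The size of each ablation effect can be read off Table 2 as the relative error increase over the full model. The sketch below performs that calculation; the variable and function names are illustrative.

```python
# Values from Table 2, ordered as (w/o Dec & Emb, w/o Dec, full model).
table2 = {
    "PeMS04": {"MAE": (22.44, 20.47, 19.51),
               "MAPE": (16.60, 15.63, 13.91),
               "RMSE": (32.94, 30.30, 29.28)},
    "PeMS07": {"MAE": (24.84, 21.76, 20.27),
               "MAPE": (11.40, 9.91, 8.64),
               "RMSE": (36.99, 33.38, 32.32)},
    "PeMS08": {"MAE": (18.12, 16.35, 15.39),
               "MAPE": (11.82, 11.10, 10.03),
               "RMSE": (26.65, 24.29, 23.31)},
}


def increase_over_full(variant: float, full: float) -> float:
    """Percentage by which an ablated variant's error exceeds the full model's."""
    return (variant - full) / full * 100.0


ablation_deltas = {}
for dataset, metrics in table2.items():
    for metric, (no_dec_emb, no_dec, full) in metrics.items():
        ablation_deltas[(dataset, metric)] = (
            increase_over_full(no_dec, full),
            increase_over_full(no_dec_emb, full),
        )
        d_dec, d_both = ablation_deltas[(dataset, metric)]
        print(f"{dataset} {metric}: w/o Dec +{d_dec:.2f}%, w/o Dec & Emb +{d_both:.2f}%")
```

For example, on PeMS04 removing the decoder raises MAE by about 4.9%, and in every row the variant without both decoder and embeddings is worse than the variant without the decoder alone.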

4.8. Computational Efficiency

Let $L$ be the sequence length, $D$ the number of nodes (dimensions), $c$ the number of routers ($c \ll D$), and $d$ the hidden size. We apply self-attention along time for each node, yielding $O(DL^2 d)$. Routers aggregate from the $D$ nodes and then broadcast back, costing $O(LcDd)$. The TSA attention cost is therefore $O(DL^2 d + LcDd)$, which reduces to $O(DL^2 d)$ when $c$ is a small constant. In practical traffic forecasting, $D$ is typically much larger than $L$ (i.e., $D \gg L$), so replacing full cross-dimension attention, $O(LD^2 d)$, with routing-based interaction, $O(LcDd)$, substantially improves scalability.
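The cost argument can be made concrete with a small pure-Python sketch of the cross-dimension stage at a single time step. This is a simplified illustration under stated assumptions, not the authors' implementation: learned query/key/value projections, multiple heads, normalization, and feed-forward layers are omitted, the router vectors are random rather than learned, and the sizes `D`, `c`, and `d` are arbitrary illustrative values. The helper `attend` also returns how many query-key scores it computed, so the two strategies can be compared directly.

```python
import math
import random


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Returns the attended outputs and the number of query-key scores computed.
    """
    d_k = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        top = max(scores)                               # stabilize the softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs, len(queries) * len(keys)


def routed_cross_dimension(nodes, routers):
    """Routers gather from all nodes, then every node reads back from the routers."""
    gathered, n_gather = attend(routers, nodes, nodes)           # c x D scores
    updated, n_broadcast = attend(nodes, gathered, gathered)     # D x c scores
    return updated, n_gather + n_broadcast


def full_cross_dimension(nodes):
    """Direct all-to-all attention between nodes (D x D scores)."""
    return attend(nodes, nodes, nodes)


rng = random.Random(0)
D, c, d = 170, 10, 4   # illustrative sizes only
nodes = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(D)]
routers = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(c)]

updated, routed_scores = routed_cross_dimension(nodes, routers)
_, full_scores = full_cross_dimension(nodes)
print("scores per time step, routed:", routed_scores, "full:", full_scores)
```

Each score costs on the order of $d$ multiply-adds and the stage is repeated for all $L$ time steps, so the counts $2cD$ and $D^2$ per step correspond to the $O(LcDd)$ and $O(LD^2 d)$ terms above.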
We compared the computational efficiency of our method with several representative models, namely, DCRNN [24], STGCN [25], ASTGCN [38], and AGCRN [27], in terms of the number of trainable parameters and the training time per epoch on PeMS04 (Table 3). Our model has 512,384 parameters (about 0.5 M), which is fewer than AGCRN (748,810) but more than DCRNN, STGCN, and ASTGCN, and it trains in 40.72 s per epoch, which is faster than ASTGCN (49.47 s) but slower than the other three baselines. These results indicate that our approach attains improved predictive performance at a moderate computational cost, making it suitable for practical deployment with limited computational budgets.

5. Conclusions

In this work, we present TSAformer, a novel Transformer-based architecture tailored for multivariate traffic flow forecasting. Unlike conventional sequence modeling approaches that collapse spatiotemporal structure into flattened representations, TSAformer explicitly preserves and jointly models the dual axes of time and dimension (i.e., sensor node or road segment) throughout the prediction pipeline. This design enables the model to capture both temporal evolution patterns and spatial interaction mechanisms—two fundamental drivers of traffic dynamics.
At the foundation of TSAformer lies a multimodal input embedding layer that encodes not only raw traffic measurements but also contextual features such as time-of-day, day-of-week, and node-specific positional-temporal characteristics. This rich embedding ensures that the model is sensitive to both periodic behaviors (e.g., rush hours and weekend effects) and sensor-specific non-stationarities (e.g., intersection topology and lane capacity), providing a semantically grounded input representation.
To efficiently model dependencies across time and space, we introduce the Two-Stage Attention mechanism. In the first stage, temporal self-attention operates independently along each dimension to capture long-range sequential patterns. In the second stage, a lightweight routing-based cross-dimension attention mediates spatial interactions without incurring quadratic complexity in the number of nodes, making the model scalable to large-scale sensor networks. This decoupled-yet-coordinated attention design strikes a favorable balance between expressiveness and efficiency.
Built upon TSA, our hierarchical encoder–decoder structure further enhances predictive capability by modeling traffic dynamics across multiple temporal scales. The encoder progressively coarsens temporal resolution to extract macroscopic trends, while the decoder refines predictions from coarse to fine, fusing multi-scale signals through cross-attention and residual aggregation. This enables TSAformer to simultaneously capture long-term congestion patterns and short-term incident-induced fluctuations.
Extensive experiments on three real-world traffic datasets—spanning urban arterials, freeways, and varying spatial scales—demonstrate that TSAformer consistently outperforms state-of-the-art baselines in both short-term and long-term forecasting settings. Notably, it achieves top performance in 36 out of 58 critical evaluation scenarios, including peak-hour prediction and event-driven congestion forecasting, validating its robustness and practical utility.
Despite these advances, we acknowledge several limitations that should be addressed in future work:
  • Road topology is not explicitly encoded: The model mainly learns spatial relations implicitly from data, which may be less faithful to physical connectivity and less robust to topology changes; integrating graph priors or sparse graph-based attention could improve efficiency and interpretability.
  • Lack of external context: Weather, incidents, events, and control signals are not modeled, which can degrade performance under abnormal conditions; multimodal context fusion is a clear next step.
  • Scalability at city scale remains challenging: On thousand-node networks, attention/routing can still be costly in memory and latency; hierarchical partitioning and sparse/linear attention could enable real-time deployment.
  • Robustness evaluation is limited: While we study sensitivity to model capacity (width/depth) and router number, further tests with reduced data, noisy/missing inputs, and different forecasting horizons are needed to better assess practical stability.
  • Temporal order sensitivity may be insufficient: Self-attention can under-emphasize strict temporal causality; adding stronger positional reinforcement or causal convolutional priors may improve long-horizon stability.
TSAformer provides a principled, scalable, and effective framework for spatiotemporal traffic forecasting. We hope this work inspires further research into structure-aware sequence modeling for transportation intelligence.

Author Contributions

Conceptualization, H.L.; Methodology, H.L. and X.C.; Software, H.L. and X.C.; Validation, H.L. and X.C.; Formal analysis, H.L.; Investigation, H.L.; Resources, H.L.; Data curation, H.L.; Writing—original draft, H.L.; Writing—review & editing, H.L.; Supervision, W.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets used in this study are publicly available from the Zenodo repository: https://doi.org/10.5281/zenodo.7816008.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Khan, D.R.; Patankar, A.B.; Khan, A. An experimental comparison of classic statistical techniques on univariate time series forecasting. Procedia Comput. Sci. 2024, 235, 2730–2740.
  2. Qin, Y.; Song, D.; Chen, H.; Cheng, W.; Jiang, G.; Cottrell, G. A dual-stage attention-based recurrent neural network for time series prediction. arXiv 2017, arXiv:1704.02971.
  3. Rangapuram, S.S.; Seeger, M.W.; Gasthaus, J.; Stella, L.; Wang, Y.; Januschowski, T. Deep state space models for time series forecasting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31.
  4. Li, L.; Zhang, J.; Yan, J.; Jin, Y.; Zhang, Y.; Duan, Y.; Tian, G. Synergetic learning of heterogeneous temporal sequences for multi-horizon probabilistic forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 8420–8428.
  5. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430.
  6. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286.
  7. Liu, X.; Wu, J.; Huang, J.; Zhang, J.; Chen, B.Y.; Chen, A. Spatial-interaction network analysis of built environmental influence on daily public transport demand. J. Transp. Geogr. 2021, 92, 102991.
  8. Lai, G.; Chang, W.C.; Yang, Y.; Liu, H. Modeling long-and short-term temporal patterns with deep neural networks. In Proceedings of the 41st International ACM SIGIR Conference On Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018.
  9. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 753–763.
  10. Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; Wang, F.Y. Traffic flow prediction with big data: A deep learning approach. IEEE Trans. Intell. Transp. Syst. 2014, 16, 865–873.
  11. Zheng, C.; Fan, X.; Wang, C.; Qi, J. Gman: A graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1234–1241.
  12. Cirstea, R.G.; Yang, B.; Guo, C.; Kieu, T.; Pan, S. Towards spatio-temporal aware traffic time series forecasting. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 2900–2913.
  13. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. itransformer: Inverted transformers are effective for time series forecasting. arXiv 2023, arXiv:2310.06625.
  14. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 11106–11115.
  15. Li, Y.; Lu, X.; Xiong, H.; Tang, J.; Su, J.; Jin, B.; Dou, D. Towards long-term time-series forecasting: Feature, pattern, and distribution. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE), Anaheim, CA, USA, 3–7 April 2023; pp. 1611–1624.
  16. Zhang, Y.; Yan, J. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  17. Kim, T.; Kim, J.; Tae, Y.; Park, C.; Choi, J.H.; Choo, J. Reversible instance normalization for accurate time-series forecasting against distribution shift. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
  18. Smith, B.L. Forecasting Freeway Traffic Flow for Intelligent Transportation Systems Application; University of Virginia: Charlottesville, VA, USA, 1995.
  19. Kumar, S.V.; Vanajakshi, L. Short-term traffic flow prediction using seasonal ARIMA model with limited input data. Eur. Transp. Res. Rev. 2015, 7, 21.
  20. Sims, C.A. Macroeconomics and reality. Econom. J. Econom. Soc. 1980, 48, 1–48.
  21. Li, D. Predicting short-term traffic flow in urban based on multivariate linear regression model. J. Intell. Fuzzy Syst. 2020, 39, 1417–1427.
  22. Wu, C.H.; Ho, J.M.; Lee, D.T. Travel-time prediction with support vector regression. IEEE Trans. Intell. Transp. Syst. 2004, 5, 276–281.
  23. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph wavenet for deep spatial-temporal graph modeling. arXiv 2019, arXiv:1906.00121.
  24. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv 2017, arXiv:1707.01926.
  25. Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv 2017, arXiv:1709.04875.
  26. Li, M.; Zhu, Z. Spatial-temporal fusion graph neural networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 4189–4196.
  27. Bai, L.; Yao, L.; Li, C.; Wang, X.; Wang, C. Adaptive graph convolutional recurrent network for traffic forecasting. Adv. Neural Inf. Process. Syst. 2020, 33, 17804–17815.
  28. Song, C.; Lin, Y.; Guo, S.; Wan, H. Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 914–921.
  29. Zanfei, A.; Brentan, B.M.; Menapace, A.; Righetti, M.; Herrera, M. Graph convolutional recurrent neural networks for water demand forecasting. Water Resour. Res. 2022, 58, e2022WR032299.
  30. Lan, S.; Ma, Y.; Huang, W.; Wang, W.; Yang, H.; Li, P. Dstagnn: Dynamic spatial-temporal aware graph neural network for traffic flow forecasting. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 11906–11917.
  31. Lian, Q.; Sun, W.; Dong, W. Hierarchical Spatial-Temporal Neural Network with Attention Mechanism for Traffic Flow Forecasting. Appl. Sci. 2023, 13, 9729.
  32. Gao, J.; Sharma, R.; Qian, C.; Glass, L.M.; Spaeder, J.; Romberg, J.; Sun, J.; Xiao, C. STAN: Spatio-temporal attention network for pandemic prediction using real-world evidence. J. Am. Med. Inform. Assoc. 2021, 28, 733–743.
  33. Xing, H.; Chen, A.; Zhang, X. RL-GCN: Traffic flow prediction based on graph convolution and reinforcement learning for smart cities. Displays 2023, 80, 102513.
  34. Fang, Z.; Long, Q.; Song, G.; Xie, K. Spatial-temporal graph ode networks for traffic flow forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 364–373.
  35. Choi, J.; Choi, H.; Hwang, J.; Park, N. Graph neural controlled differential equations for traffic forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 6367–6374.
  36. Chen, C.; Petty, K.; Skabardonis, A.; Varaiya, P.; Jia, Z. Freeway performance measurement system: Mining loop detector data. Transp. Res. Rec. 2001, 1748, 96–102.
  37. Nie, Y. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. arXiv 2022, arXiv:2211.14730.
  38. Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 922–929.
Figure 1. Overall architecture of TSAformer. TSAformer begins by transforming raw traffic sequences into rich spatiotemporal embeddings that fuse flow values and time-of-day, day-of-week, and node-specific positional features. These embeddings are then processed by the core Two-Stage Attention (TSA) module, which decouples the modeling of temporal evolution and spatial interactions into two efficient, hierarchical stages to capture asymmetric dependencies. Finally, a hierarchical encoder–decoder structure progressively abstracts multi-scale temporal patterns in the encoder and refines predictions through cross-scale fusion in the decoder, enabling accurate and comprehensive multi-step traffic forecasting.
Figure 2. Forecasting results for stations 56 and 106 in PeMS08.
Figure 3. Model capacity ablation. (Left): Heatmaps of MAE/MAPE/RMSE for different model widths and depths on PeMS04/PeMS07/PeMS08, where larger width/depth consistently improves accuracy. (Right): Performance with different router numbers ($c \in \{5, 10, 15\}$), showing the best results at $c = 10$ and slightly worse performance when using fewer or more routers.
Table 1. Prediction results on PeMS04/PeMS07/PeMS08 with three metrics (MAE, MAPE, and RMSE; lower is better). The best scores are in bold, and the second-best are underlined. Our model attains the best performance on 8/9 metrics.
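| Dataset | Metric | VAR | SVR | PatchTST | iTransformer | DCRNN | STGCN | STSGCN | STFGNN | STGODE | AGCRN | TSAformer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PeMS04 | MAE | 23.75 | 28.66 | 22.30 | 22.54 | 22.74 | 21.76 | 21.19 | 19.83 | 20.85 | 19.83 | 19.51 |
| PeMS04 | MAPE (%) | 18.09 | 19.15 | 16.33 | 16.17 | 14.75 | 13.87 | 13.88 | 13.02 | 13.78 | 12.97 | 13.91 |
| PeMS04 | RMSE | 36.66 | 44.59 | 33.68 | 35.21 | 36.58 | 34.77 | 33.65 | 31.87 | 32.83 | 32.26 | 29.28 |
| PeMS07 | MAE | 101.20 | 32.97 | 23.96 | 24.59 | 23.63 | 22.90 | 24.26 | 22.07 | 22.98 | 21.29 | 20.27 |
| PeMS07 | MAPE (%) | 39.69 | 15.43 | 13.51 | 11.10 | 12.28 | 11.98 | 10.20 | 9.21 | 10.14 | 8.97 | 8.64 |
| PeMS07 | RMSE | 155.14 | 50.15 | 34.41 | 37.81 | 36.51 | 33.44 | 39.03 | 35.81 | 36.19 | 35.12 | 32.32 |
| PeMS08 | MAE | 22.32 | 23.25 | 19.11 | 20.05 | 18.19 | 17.84 | 17.13 | 16.64 | 16.82 | 15.95 | 15.39 |
| PeMS08 | MAPE (%) | 14.47 | 14.71 | 21.64 | 12.26 | 11.24 | 11.21 | 10.96 | 10.55 | 10.62 | 10.09 | 10.03 |
| PeMS08 | RMSE | 33.83 | 36.15 | 25.79 | 31.90 | 28.18 | 27.12 | 26.79 | 26.21 | 26.24 | 25.22 | 23.31 |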
Table 2. Ablation study on PeMS04/PeMS07/PeMS08: progressively removing components (decoder and embedding). Metrics: MAE/MAPE/RMSE (lower is better). The best is bold; the second-best is underlined.
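| Dataset | Metric | w/o Dec & Emb | w/o Dec | TSAformer |
|---|---|---|---|---|
| PeMS04 | MAE | 22.44 | 20.47 | 19.51 |
| PeMS04 | MAPE (%) | 16.60 | 15.63 | 13.91 |
| PeMS04 | RMSE | 32.94 | 30.30 | 29.28 |
| PeMS07 | MAE | 24.84 | 21.76 | 20.27 |
| PeMS07 | MAPE (%) | 11.40 | 9.91 | 8.64 |
| PeMS07 | RMSE | 36.99 | 33.38 | 32.32 |
| PeMS08 | MAE | 18.12 | 16.35 | 15.39 |
| PeMS08 | MAPE (%) | 11.82 | 11.10 | 10.03 |
| PeMS08 | RMSE | 26.65 | 24.29 | 23.31 |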
Table 3. Model size (#Parameters, i.e., total number of trainable parameters) and training time per epoch on PeMS04.
| | DCRNN | STGCN | ASTGCN | AGCRN | Ours |
|---|---|---|---|---|---|
| # Parameters | 149,057 | 211,596 | 450,031 | 748,810 | 512,384 |
| Training time per epoch | 36.39 s | 16.36 s | 49.47 s | 35.56 s | 40.72 s |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lv, H.; Chen, X.; Xiu, W. TSAformer: A Traffic Flow Prediction Model Based on Cross-Dimensional Dependency Capture. Electronics 2026, 15, 231. https://doi.org/10.3390/electronics15010231
