Applied Sciences
  • Article
  • Open Access

26 September 2025

Patch-Based Transformer–Graph Framework (PTSTG) for Traffic Forecasting in Transportation Systems

Moscow Technical University of Communication and Informatics, 111024 Moscow, Russia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Computer Vision of Edge AI on Automobile

Abstract

Accurate traffic forecasting underpins intelligent transportation systems. We present PTSTG, a compact spatio-temporal forecaster that couples a patch-based Transformer encoder with a data-driven adaptive adjacency and lightweight node-graph blocks. The temporal module tokenizes multivariate series into fixed-length patches to capture short- and long-range patterns in a single pass, while the graph module refines node embeddings via learned inter-node aggregation. A horizon-specific head emits all steps simultaneously. On standard benchmarks (METR-LA, PEMS-BAY) and the LargeST (SD) split with horizons {3, 6, 12} steps ({15, 30, 60} minutes), PTSTG delivers competitive point-estimate results relative to recent temporal graph models. On METR-LA/PEMS-BAY, it remains close to strong baselines (e.g., DCRNN) without surpassing them; on LargeST, it attains favorable average RMSE/MAE while trailing the strongest hybrids on some horizons. The design preserves a compact footprint and single-pass, multi-horizon inference, and offers clear capacity-driven headroom without architectural changes.

1. Introduction

Accurate traffic forecasting has emerged as a fundamental element of intelligent transportation systems (ITS), contributing to congestion mitigation, improvement of travel time reliability, facilitation of navigation services, and advancement of smart city mobility planning. The growing accessibility of traffic sensors and extensive urban mobility data has attracted considerable interest in data-driven forecasting methodologies from both academic and industrial sectors. Deep learning approaches have become a dominant paradigm for modeling nonlinear spatio-temporal dynamics in traffic data, offering end-to-end learning of patterns that are difficult to encode heuristically.
Early sequence models, including LSTMs [1] and GRUs [2], focused on temporal dependencies at each sensor location. While these methods achieved notable gains for short horizons, their reliance on step-by-step recurrence and limited receptive fields complicates learning over extended look-backs. In parallel, graph neural networks (GNNs) such as STGCN [3] and DCRNN [4] embedded the road-network structure via graph convolutions and diffusion operators, effectively capturing localized spatial propagation of traffic states across adjacent segments. However, these models typically assume a fixed, topology-driven adjacency, which may under-represent correlations that arise from behavioral or operational factors (e.g., coordinated signals, detours, and demand shifts).
Transformer-based architectures [5] introduced global self-attention over sequences and have recently become compelling for long-horizon forecasting. Models like Informer [6], Autoformer [7], and PatchTST [8] leverage sparsity, decomposition, or patching to scale attention and improve generalization on complex time series. In particular, the patching mechanism in PatchTST tokenizes a time series into short overlapping fragments, preserving local temporal semantics while shortening the attention span and stabilizing optimization on long contexts. Despite these advances on the temporal axis, most Transformer applications in traffic forecasting process multivariate sensor streams either independently or with weak inductive bias for spatial structure, leaving on the table systematic gains achievable from explicitly modeling inter-sensor interactions.
Two practical gaps are therefore central. First, static or manually specified graphs are not sufficient to reflect the evolving and context-dependent coupling between locations; correlations can strengthen or weaken with time-of-day, incidents, and route substitution effects. Second, purely temporal Transformers—although strong at capturing periodicities and long-range trends—do not by themselves enforce relational constraints that are natural in transportation networks. Bridging these gaps calls for a lightweight hybrid that preserves the optimization benefits of patch-level temporal encoding and learns spatial relations directly from data, without hard-wiring the graph.
We address this by proposing PTSTG (PatchTST-Graph), a compact spatio-temporal forecaster that couples a patch-based Transformer encoder with an adaptive, data-driven adjacency and a small number of node-level graph refinement blocks. The temporal module unfolds each sensor’s history into overlapping patches and encodes them with a Transformer encoder; averaging over patch tokens yields a stable node embedding that summarizes both local fluctuations and long-range periodic structure. On top of these embeddings, PTSTG learns an adjacency matrix parameterized by latent node embeddings and normalized row-wise; this mechanism induces data-driven inter-node aggregation that is not restricted to geographic proximity. A lightweight stack of graph blocks then propagates information across nodes with residual connections and feed-forward refinement, and a compact prediction head emits all horizons in a single pass.
This design is driven by operational constraints typical for ITS deployments. First, efficiency: the patching strategy reduces attention length and memory footprint, while single-pass multi-horizon decoding avoids iterative rollouts and error accumulation. Second, scalability: the adaptive adjacency is learned end-to-end and can be sparsified at inference (e.g., top-k neighbors per row) to keep compute and memory subquadratic in the number of sensors. Third, robustness: by not depending exclusively on a fixed road topology, the model can capture functional relations between locations that are distant in the graph yet tightly coupled in demand patterns, which is often observed under re-routing and peak-hour regimes. Finally, interpretability: the learned adjacency admits qualitative inspection (e.g., corridors that co-congest), complementing attention profiles over temporal patches.
Our contributions are as follows:
  • Hybrid yet compact architecture. We integrate patch-based temporal encoding with a minimal graph refinement stack over a learned adjacency, yielding a single-pass, multi-horizon forecaster that is straightforward to train and deploy.
  • Adaptive, data-driven spatial coupling. We employ a learnable adjacency mechanism that infers inter-node relations from data rather than relying on predefined topology, allowing PTSTG to account for dynamic correlations induced by routing, signal coordination, and demand shifts.
  • Competitive accuracy with favorable efficiency. On standard traffic benchmarks, PTSTG targets strong accuracy at short and medium horizons while maintaining a compact parameterization. Under moderate capacity scaling (width and graph depth), the approach narrows the long-horizon gap, indicating clear headroom without architectural complexity.
  • Practical modeling bias. Patching preserves short-term temporal structure and stabilizes optimization on long sequences; learned inter-node aggregation captures spatial effects that purely temporal models overlook. Together, these biases form a pragmatic recipe for traffic forecasting systems.
We explicitly do not position PTSTG as an accuracy-dominant model on all benchmarks. Instead, our contribution is a compact and scalable hybrid:
  • Single-pass multi-horizon decoding.
  • Patch-level temporal encoding that shortens the attention span and stabilizes optimization.
  • Lightweight data-driven spatial refinement.
The experiments aim to show competitive accuracy together with deployment-friendly efficiency.
In summary, PTSTG operationalizes a simple principle: keep temporal modeling efficient through patch-level attention and let the spatial structure be discovered from data. This makes PTSTG a compact, efficiency-oriented alternative with a clear scaling path to large sensor networks via sparse inter-node mixing, rather than a claim of established superiority on ultra-large deployments.

3. PTSTG: PatchTST-Graph for Spatio-Temporal Forecasting

PTSTG (PatchTST-Graph) is a modular spatio–temporal forecaster that combines patch-level temporal encoding with adaptive, data-driven spatial aggregation. The model ingests a fixed look-back window for each node and returns the full multi-step horizon in a single forward pass. It isolates temporal pattern extraction from spatial coupling, which simplifies analysis, preserves efficiency, and scales to large sensor networks.
  • Problem statement.
Given historical observations $X \in \mathbb{R}^{B \times T_{\mathrm{in}} \times N \times C_{\mathrm{in}}}$ and targets $Y \in \mathbb{R}^{B \times H \times N \times C_{\mathrm{out}}}$, with batch size $B$, look-back $T_{\mathrm{in}}$, number of nodes $N$, input/output channel dimensions $C_{\mathrm{in}}, C_{\mathrm{out}}$, and horizon $H$, the goal is to learn a mapping $\Phi_\theta$ such that
$$\hat{Y} = \Phi_\theta(X) \in \mathbb{R}^{B \times H \times N \times C_{\mathrm{out}}}.$$
We adopt per-node, per-channel standardization on the training set and apply the task loss outside the architecture. Symbols used below: $d$ (embedding width), $P$ (patch length), $s$ (stride), $L = (T_{\mathrm{in}} - P)/s + 1$ (number of patches), $r$ (graph rank), and $L_g$ (number of graph blocks).
  • Handling exogenous variables.
PTSTG is agnostic to the presence of exogenous covariates and supports three complementary types:
(i) Node-wise time-varying covariates $Z \in \mathbb{R}^{B \times T_{\mathrm{in}} \times N \times C_{\mathrm{exo}}}$ (e.g., per-node weather or incident flags). These can be concatenated channel-wise with $X$:
$$X = \mathrm{Concat}_{\mathrm{channels}}(X, Z) \in \mathbb{R}^{B \times T_{\mathrm{in}} \times N \times (C_{\mathrm{in}} + C_{\mathrm{exo}})},$$
and processed by the same patching and Transformer encoder.
(ii) Global time-varying covariates $G \in \mathbb{R}^{B \times T_{\mathrm{in}} \times C_g}$ (e.g., city-level weather, calendar/event indicators). We project $G$ via a small MLP and broadcast-add to patch tokens before the encoder:
$$\tilde{E}_i = E_i + \phi(G), \qquad \phi: \mathbb{R}^{C_g} \to \mathbb{R}^{d}.$$
(iii) Node-wise static features $S \in \mathbb{R}^{N \times C_s}$ (e.g., functional class, lanes). We embed $S$ and fuse it into node embeddings after temporal pooling:
$$h_i \leftarrow h_i + \psi(S_i), \qquad \psi: \mathbb{R}^{C_s} \to \mathbb{R}^{d}.$$
Beyond feature concatenation, exogenous variables can condition the learned adjacency. Concretely, let $E_1, E_2$ be the node-embedding matrices used in (8). We apply feature-wise linear modulation (FiLM) or gating using global or node-wise covariates:
$$\hat{E}_1 = \gamma_1 E_1 + \beta_1, \qquad \hat{E}_2 = \gamma_2 E_2 + \beta_2, \qquad (\gamma_\cdot, \beta_\cdot) = f_{\mathrm{cond}}(G, S),$$
then form $A$ as in (9) but with $\hat{E}_1, \hat{E}_2$. This allows $A$ to adapt to weather/events without changing the overall PTSTG design. Calendar embeddings (time-of-day, day-of-week, holidays) can be included as part of $G$ or $Z$.
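For concreteness, the sketch below implements this FiLM-style conditioning in PyTorch; the class name FiLMGraphConditioner, the mean-pooling of G over time, and the two-layer f_cond are our illustrative assumptions and need not match the released implementation.

import torch
import torch.nn as nn

class FiLMGraphConditioner(nn.Module):
    """Illustrative sketch: FiLM modulation of the node-embedding factors
    E1, E2 that parameterize the learned adjacency. Global covariates G are
    pooled over time and mapped to per-factor (gamma, beta) pairs."""

    def __init__(self, num_nodes: int, rank: int, c_global: int):
        super().__init__()
        self.E1 = nn.Parameter(torch.randn(num_nodes, rank) * 0.1)
        self.E2 = nn.Parameter(torch.randn(num_nodes, rank) * 0.1)
        # f_cond: global covariates -> (gamma1, beta1, gamma2, beta2), each of width `rank`
        self.f_cond = nn.Sequential(
            nn.Linear(c_global, 4 * rank), nn.GELU(), nn.Linear(4 * rank, 4 * rank)
        )

    def forward(self, G: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # G: (B, T_in, C_g) global time-varying covariates; summarize over time.
        g = G.mean(dim=1)                                               # (B, C_g)
        gamma1, beta1, gamma2, beta2 = self.f_cond(g).chunk(4, dim=-1)  # each (B, r)
        # FiLM with broadcasting over nodes: (B, 1, r) * (N, r) + (B, 1, r) -> (B, N, r)
        E1_hat = gamma1.unsqueeze(1) * self.E1 + beta1.unsqueeze(1)
        E2_hat = gamma2.unsqueeze(1) * self.E2 + beta2.unsqueeze(1)
        S = torch.relu(E1_hat @ E2_hat.transpose(1, 2))                 # (B, N, N) affinity
        S = 0.5 * (S + S.transpose(1, 2))                               # symmetrize
        return torch.softmax(S / tau, dim=-1)                           # row-stochastic A per sample

# Example: A = FiLMGraphConditioner(num_nodes=207, rank=16, c_global=8)(torch.randn(32, 36, 8))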

3.1. Patch-Based Temporal Encoder

The temporal module follows the PatchTST principle [8]. For node $i$ with sequence $x_i \in \mathbb{R}^{T_{\mathrm{in}} \times C_{\mathrm{in}}}$, we extract $L$ overlapping fragments of length $P$ and stride $s$,
$$U_i = \mathcal{P}(x_i; P, s) \in \mathbb{R}^{L \times (P \cdot C_{\mathrm{in}})}.$$
Each fragment is linearly projected to $\mathbb{R}^d$, positional encodings are added, and a stack of pre-norm Transformer encoder layers is applied,
$$E_i = U_i W_p + b_p \in \mathbb{R}^{L \times d}, \qquad Z_i = \mathrm{TransformerEncoder}\big(\mathrm{LN}(E_i + \mathrm{PE})\big) \in \mathbb{R}^{L \times d}.$$
A node-level representation is obtained by pooling over patch tokens,
$$h_i = \frac{1}{L} \sum_{\ell=1}^{L} Z_{i,\ell} \in \mathbb{R}^{d},$$
which preserves the short-range temporal structure captured within patches while summarizing long-range patterns. Stacking over nodes yields
$$H^{(0)} = \mathrm{Stack}(h_1, \ldots, h_N) \in \mathbb{R}^{B \times N \times d}.$$
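A minimal PyTorch sketch of this encoder follows; the module name PatchTemporalEncoder, learned positional embeddings, and the use of torch.Tensor.unfold are our illustrative choices, not necessarily those of the released code.

import torch
import torch.nn as nn

class PatchTemporalEncoder(nn.Module):
    """Illustrative sketch of Section 3.1: unfold each node's series into L
    overlapping patches, project to width d, encode with a pre-norm
    Transformer, and mean-pool patch tokens into one node embedding."""

    def __init__(self, c_in: int, patch_len: int, stride: int, d_model: int,
                 n_heads: int = 4, n_layers: int = 3, max_patches: int = 64):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.proj = nn.Linear(patch_len * c_in, d_model)                  # W_p, b_p
        self.pos = nn.Parameter(torch.zeros(1, max_patches, d_model))     # positional encoding (learned here)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T_in, N, C_in) -> per-node sequences (B*N, T_in, C_in)
        B, T, N, C = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        # Unfold into L = (T_in - P)/s + 1 patches, each flattened to length P*C_in.
        u = x.unfold(dimension=1, size=self.patch_len, step=self.stride)  # (B*N, L, C, P)
        u = u.permute(0, 1, 3, 2).reshape(B * N, -1, self.patch_len * C)  # (B*N, L, P*C)
        e = self.proj(u) + self.pos[:, : u.size(1)]                       # patch tokens (B*N, L, d)
        z = self.encoder(self.norm(e))                                    # (B*N, L, d)
        h = z.mean(dim=1)                                                 # pool over patch tokens
        return h.reshape(B, N, -1)                                        # H^(0): (B, N, d)

# Example: H0 = PatchTemporalEncoder(1, patch_len=12, stride=6, d_model=64)(torch.randn(8, 36, 207, 1))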

3.2. Adaptive Graph Learning

To model spatial dependencies without a fixed topology, PTSTG learns a row-stochastic adjacency from data. Two trainable node-embedding matrices $E_1, E_2 \in \mathbb{R}^{N \times r}$ parameterize a nonnegative affinity,
$$S = \mathrm{ReLU}(E_1 E_2^{\top}) \in \mathbb{R}^{N \times N}, \qquad \tilde{S} = \tfrac{1}{2}(S + S^{\top}),$$
and a row-wise softmax with optional temperature $\tau > 0$ produces
$$A = \mathrm{softmax}_{\mathrm{row}}\!\big(\tilde{S}/\tau\big) \in \mathbb{R}^{N \times N}, \qquad A\mathbf{1} = \mathbf{1}.$$
This mechanism follows adaptive-graph practice in spatio–temporal GNNs [16,18]. For scalability, A may be sparsified to top-k neighbors per row with re-normalization; this is an inference/training-time acceleration and does not alter the architecture.
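A direct PyTorch rendering of the affinity and row-softmax above is given below as an illustrative sketch (the class name and initialization scale are ours).

import torch
import torch.nn as nn

class AdaptiveAdjacency(nn.Module):
    """Illustrative sketch of Section 3.2: two rank-r factors define a
    nonnegative affinity S = ReLU(E1 E2^T); S is symmetrized and a row-wise
    softmax with temperature tau yields a row-stochastic adjacency A."""

    def __init__(self, num_nodes: int, rank: int, tau: float = 1.0):
        super().__init__()
        self.E1 = nn.Parameter(torch.randn(num_nodes, rank) * 0.1)
        self.E2 = nn.Parameter(torch.randn(num_nodes, rank) * 0.1)
        self.tau = tau

    def forward(self) -> torch.Tensor:
        S = torch.relu(self.E1 @ self.E2.T)         # (N, N) nonnegative affinity
        S = 0.5 * (S + S.T)                         # symmetrize
        return torch.softmax(S / self.tau, dim=-1)  # row-wise softmax: each row sums to 1

# Example: A = AdaptiveAdjacency(num_nodes=207, rank=16)()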

3.3. Node Graph Block

Let $H^{(\ell)} \in \mathbb{R}^{B \times N \times d}$ be node embeddings at layer $\ell$. Each block performs linear cross-node mixing via $A$, residual pre-norm, and a position-wise MLP:
$$\hat{H} = H^{(\ell)} + (A H^{(\ell)}) W, \qquad W \in \mathbb{R}^{d \times d},$$
$$\tilde{H} = \mathrm{LN}(\hat{H}),$$
$$H^{(\ell+1)} = \tilde{H} + \mathrm{MLP}(\tilde{H}),$$
where $\mathrm{MLP}(x) = \mathrm{Drop}\big(W_2\,\mathrm{GELU}(\mathrm{Drop}(W_1 x))\big)$ with $W_1 \in \mathbb{R}^{d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times d}$. The product $(A H^{(\ell)})$ is evaluated with batch broadcasting: for each $b$, $(A H^{(\ell)})_b = A H^{(\ell)}_b$. No attention is used across nodes beyond this mixing, keeping spatial refinement computationally light.
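The corresponding block can be written compactly in PyTorch, as in the illustrative sketch below (the class name and the exact dropout placement are our choices).

import torch
import torch.nn as nn

class NodeGraphBlock(nn.Module):
    """Illustrative sketch of Section 3.3: cross-node mixing through A with a
    residual connection, LayerNorm, and a position-wise MLP with a second
    residual connection."""

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.mix = nn.Linear(d_model, d_model, bias=False)     # W in (A H) W
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(4 * d_model, d_model), nn.Dropout(dropout),
        )

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # H: (B, N, d); A: (N, N) row-stochastic. einsum broadcasts A over the batch.
        H_hat = H + self.mix(torch.einsum("mn,bnd->bmd", A, H))  # residual cross-node mixing
        H_tilde = self.norm(H_hat)
        return H_tilde + self.mlp(H_tilde)                       # residual feed-forward refinement

# Example: H1 = NodeGraphBlock(64)(torch.randn(8, 207, 64), torch.softmax(torch.randn(207, 207), dim=-1))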

3.4. Prediction Head

After $L_g$ graph blocks, node embeddings are normalized and mapped to the full horizon,
$$\bar{H} = \mathrm{LN}\big(H^{(L_g)}\big), \qquad Y = \mathrm{MLP}_{\mathrm{head}}(\bar{H}) \in \mathbb{R}^{B \times N \times (H \cdot C_{\mathrm{out}})},$$
and reshaped to
$$\hat{Y} = \mathrm{Reshape}(Y) \in \mathbb{R}^{B \times H \times N \times C_{\mathrm{out}}}.$$
Single-pass decoding avoids auto-regressive rollouts and associated error accumulation. The overall pipeline is illustrated in Figure 1.
Figure 1. PTSTG overview: patch-level temporal encoding, learned adjacency A, lightweight node-graph refinement, and single-pass multi-horizon head.
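For completeness, an illustrative sketch of the horizon head is shown below (the two-layer MLP width and the class name are our assumptions).

import torch
import torch.nn as nn

class HorizonHead(nn.Module):
    """Illustrative sketch of Section 3.4: normalize the final node embeddings,
    map them to all H steps at once, and reshape to (B, H, N, C_out); no
    auto-regressive rollout is involved."""

    def __init__(self, d_model: int, horizon: int, c_out: int = 1):
        super().__init__()
        self.horizon, self.c_out = horizon, c_out
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Sequential(
            nn.Linear(d_model, 2 * d_model), nn.GELU(),
            nn.Linear(2 * d_model, horizon * c_out),
        )

    def forward(self, H_final: torch.Tensor) -> torch.Tensor:
        B, N, _ = H_final.shape
        y = self.head(self.norm(H_final))               # (B, N, H * C_out)
        y = y.reshape(B, N, self.horizon, self.c_out)
        return y.permute(0, 2, 1, 3)                    # (B, H, N, C_out)

# Example: y_hat = HorizonHead(d_model=64, horizon=12)(torch.randn(8, 207, 64))   # (8, 12, 207, 1)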

3.5. Algorithmic Description

The overall training and inference procedure of PTSTG is summarized in Algorithm 1, which outlines the patch-based temporal encoding, adaptive adjacency learning, and graph refinement loop. To highlight the differences between our method and a strong temporal baseline, a direct side-by-side comparison of PatchTST and PTSTG is provided in Table 1.
Table 1. Side-by-side comparison of PatchTST and the proposed PTSTG.
  • Notation. We use $P$ (patch length), $s$ (stride), $L$ (number of patches), $d$ (width), $r$ (graph rank), and $L_g$ (graph depth). Softmax is row-wise unless specified.
Algorithm 1 PTSTG forward pass and training loop
Input: Training set $\mathcal{D} = \{(X, Y)\}$; patch length $P$; stride $s$; embedding size $d$; number of graph layers $L_g$; optimizer Adam; loss $\mathcal{L}$.
Output: Multi-horizon prediction $\hat{Y}$.
1: Initialize Transformer encoder, projection layers, graph embeddings $E_1, E_2$, graph-block weights $\{W\}$, and head.
2: for each minibatch $(X, Y)$ do
3:   Temporal encoding: for each node $i$, $U_i \leftarrow \mathcal{P}(X[:, :, i, :]; P, s)$; $Z_i \leftarrow \mathrm{TransformerEncoder}(\mathrm{LN}(U_i W_p + b_p + \mathrm{PE}))$; $h_i \leftarrow \frac{1}{L}\sum_{\ell} Z_{i,\ell}$.
4:   $H^{(0)} \leftarrow \mathrm{Stack}(h_1, \ldots, h_N)$.
5:   Adaptive adjacency: $S \leftarrow \mathrm{ReLU}(E_1 E_2^{\top})$; $\tilde{S} \leftarrow \tfrac{1}{2}(S + S^{\top})$; $A \leftarrow \mathrm{softmax}_{\mathrm{row}}(\tilde{S})$.
6:   Graph refinement: for $\ell = 0, \ldots, L_g - 1$: $\hat{H} \leftarrow H^{(\ell)} + (A H^{(\ell)}) W$; $\tilde{H} \leftarrow \mathrm{LN}(\hat{H})$; $H^{(\ell+1)} \leftarrow \tilde{H} + \mathrm{MLP}(\tilde{H})$.
7:   Prediction: $\hat{Y} \leftarrow \mathrm{Reshape}\big(\mathrm{MLP}_{\mathrm{head}}(\mathrm{LN}(H^{(L_g)}))\big)$.
8:   Compute $J = \mathcal{L}(\hat{Y}, Y)$ and update parameters.
9: end for
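The PyTorch sketch below mirrors Algorithm 1 end to end, reusing the illustrative modules sketched in Sections 3.1, 3.2, 3.3 and 3.4; the class PTSTGSketch, the default hyperparameters, and the clipping value are our assumptions rather than the released implementation.

import torch
import torch.nn as nn

class PTSTGSketch(nn.Module):
    """Illustrative assembly of Algorithm 1 from the sketches above
    (PatchTemporalEncoder, AdaptiveAdjacency, NodeGraphBlock, HorizonHead)."""

    def __init__(self, num_nodes, c_in, c_out, horizon, patch_len=12, stride=6,
                 d_model=64, graph_rank=16, graph_layers=2):
        super().__init__()
        self.encoder = PatchTemporalEncoder(c_in, patch_len, stride, d_model)
        self.adj = AdaptiveAdjacency(num_nodes, graph_rank)
        self.blocks = nn.ModuleList([NodeGraphBlock(d_model) for _ in range(graph_layers)])
        self.head = HorizonHead(d_model, horizon, c_out)

    def forward(self, X):                   # X: (B, T_in, N, C_in)
        H = self.encoder(X)                 # temporal encoding -> H^(0): (B, N, d)
        A = self.adj()                      # learned row-stochastic adjacency (N, N)
        for block in self.blocks:           # graph refinement, L_g blocks
            H = block(H, A)
        return self.head(H)                 # Y_hat: (B, H, N, C_out)

def train_epoch(model, loader, optimizer, loss_fn=nn.L1Loss()):
    """One training epoch of the direct multi-horizon objective (MAE shown)."""
    model.train()
    for X, Y in loader:                     # X: (B, T_in, N, C_in), Y: (B, H, N, C_out)
        optimizer.zero_grad()
        loss = loss_fn(model(X), Y)
        loss.backward()
        nn.utils.clip_grad_value_(model.parameters(), 5.0)   # gradient clipping by value (Section 4.4)
        optimizer.step()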

3.6. Complexity and Practical Considerations

  • Computational complexity.
Let $L = (T_{\mathrm{in}} - P)/s + 1$ be the number of patch tokens. One temporal layer costs $O(B N L^2 d)$ for self-attention and $O(B N L d^2)$ for the feed-forward sublayer; the dominant activation memory is $O(B N L^2 d)$. Learning the adaptive graph requires an $E_1 E_2^{\top}$ product of cost $O(N^2 r)$. Each node-graph block costs $O(B N^2 d)$ for $(A H)$ and $O(B N d^2)$ for its MLP. With row-wise top-$k$ sparsification, the spatial term reduces to $O(B N k d)$ and storage to $O(N k)$. Empirical efficiency curves (latency, throughput, and peak memory vs. $N$) will be included in the artifact.
  • Training objective.
We use a direct multi-horizon objective,
$$\mathcal{L}(\hat{Y}, Y) = \sum_{t=1}^{H} w_t \,\big\| \hat{Y}(:, t, :, :) - Y(:, t, :, :) \big\|_p, \qquad p \in \{1, 2\},$$
with nonnegative weights $w_t$ (uniform unless stated). Optimization choices (optimizer, schedules, and regularization) are orthogonal to the architecture and specified in the experimental section.
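A direct reading of this objective in PyTorch, averaging over batch, nodes, and channels before the weighted sum over horizon steps, is sketched below (the per-step averaging convention is ours).

import torch

def multi_horizon_loss(y_hat, y, weights=None, p=1):
    """Illustrative sketch of the multi-horizon objective: per-step L1 (p=1)
    or squared L2 (p=2) error over (B, H, N, C) tensors, weighted by
    nonnegative w_t (uniform when `weights` is None)."""
    err = (y_hat - y).abs() if p == 1 else (y_hat - y) ** 2
    per_step = err.mean(dim=(0, 2, 3))            # (H,) average error at each horizon step
    if weights is None:
        weights = torch.ones_like(per_step)       # uniform w_t
    return (weights * per_step).sum()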
  • Properties of the learned adjacency.
Row-stochasticity: $A\mathbf{1} = \mathbf{1}$ by construction in (9). Asymmetry: the row-softmax generally breaks symmetry; if needed, symmetric normalization $D^{-1/2} A D^{-1/2}$ or Sinkhorn projections can be applied, or mutual top-$k$ selection can enforce symmetric sparsity. Stability: row-stochastic mixing corresponds to convex combinations of neighbor embeddings and helps numerical stability.
  • Implementation notes.
We use pre-norm Transformers with GELU; linear layers are initialized with truncated normal; sinusoidal positional encodings are added before the encoder. Mean pooling over patch tokens is the default; attention pooling is a drop-in replacement under the same interface. For large N, we recommend top-k sparsification of A at train and test time with row re-normalization.
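The recommended top-k sparsification with row re-normalization amounts to a few lines, sketched here for reference (the function name is ours).

import torch

def topk_row_sparsify(A, k):
    """Illustrative sketch: keep the k largest entries in each row of the dense
    row-stochastic A, zero the rest, and rescale rows to sum to 1. The spatial
    mixing cost then drops from O(N^2 d) to O(N k d)."""
    values, indices = torch.topk(A, k, dim=-1)                   # per-row top-k weights
    A_sparse = torch.zeros_like(A).scatter_(-1, indices, values)
    return A_sparse / A_sparse.sum(dim=-1, keepdim=True).clamp_min(1e-12)

# Example: A_k = topk_row_sparsify(A, k=16)   # rows of A_k still sum to 1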
  • Exogenous variables.
PTSTG readily accommodates exogenous covariates without architectural changes. Node-wise time-varying features $Z \in \mathbb{R}^{B \times T_{\mathrm{in}} \times N \times C_{\mathrm{exo}}}$ (e.g., per-sensor weather or incident flags) can be concatenated with $X$ along the channel dimension and encoded by the same patch encoder. Global covariates $G \in \mathbb{R}^{B \times T_{\mathrm{in}} \times C_g}$ (e.g., city-level weather, calendar/event indicators) are projected with a small MLP and broadcast-added to patch tokens; static node attributes $S \in \mathbb{R}^{N \times C_s}$ are embedded and added to $H^{(0)}$. Optionally, these covariates can condition the learned adjacency by FiLM-style modulation of the node-embedding matrices $(E_1, E_2)$ that construct $A$, enabling context-aware spatial coupling.
  • Limitations.
Without sparsification, ( A H ) scales quadratically in N. Mean pooling may discard fine-grained temporal cues for highly nonstationary series. The learned A is input-agnostic in this formulation; time-varying graphs and multi-scale patching are compatible extensions but are outside the core design.
  • Dynamic (input-conditioned) adjacency.
In the current form, the learned adjacency $A$ is input-agnostic (static across samples), which trades flexibility for stability and efficiency. This may limit the ability to capture regime-dependent spatial couplings (e.g., incidents, rush hours). A natural extension is an input-conditioned graph: produce $A_t$ from the current window by modulating the node-embedding factors $(E_1, E_2)$ with a small hypernetwork $f_\phi$ that summarizes the temporal encoder outputs $\{h_i\}$:
$$A_t = \mathrm{softmax}_{\mathrm{row}}\!\Big(\mathrm{ReLU}(\tilde{E}_1 \tilde{E}_2^{\top})/\tau\Big).$$
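One illustrative way to realize this extension is sketched below; the additive modulation, the class name, and the single linear hypernetwork are our assumptions about one possible instantiation, not part of the current model.

import torch
import torch.nn as nn

class DynamicAdjacency(nn.Module):
    """Illustrative sketch of an input-conditioned adjacency: a small
    hypernetwork f_phi maps the window's temporal embeddings {h_i} to per-node
    shifts of the static factors E1, E2 before the usual ReLU + row softmax."""

    def __init__(self, num_nodes: int, rank: int, d_model: int, tau: float = 1.0):
        super().__init__()
        self.E1 = nn.Parameter(torch.randn(num_nodes, rank) * 0.1)
        self.E2 = nn.Parameter(torch.randn(num_nodes, rank) * 0.1)
        self.f_phi = nn.Linear(d_model, 2 * rank)   # per-node shifts for E1 and E2
        self.tau = tau

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (B, N, d) temporal encoder outputs for the current window.
        d1, d2 = self.f_phi(H).chunk(2, dim=-1)       # each (B, N, r)
        E1_t, E2_t = self.E1 + d1, self.E2 + d2       # broadcast static factors over the batch
        S = torch.relu(E1_t @ E2_t.transpose(1, 2))   # (B, N, N) sample-specific affinity
        return torch.softmax(S / self.tau, dim=-1)    # A_t, row-stochastic per sample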

4. Experiments

We evaluate PTSTG on widely used traffic speed benchmarks against strong temporal, graph-based, and hybrid baselines. We report accuracy at short/medium horizons and analyze the effect of patch-based temporal encoding and adaptive adjacency, along with efficiency considerations.

4.1. Datasets

We follow the established protocol of Li et al. [4] and Wu et al. [16].
METR–LA. Traffic speed recorded every 5 min by N = 207 loop detectors in Los Angeles, spanning March–June 2012. Sequences are time-aligned; missing values are imputed by linear interpolation within short gaps and masked when gaps exceed a preset threshold.
PEMS–BAY. Traffic speed from N = 325 sensors in the Bay Area, sampled every 5 min between January–May 2017. Preprocessing mirrors METR–LA (alignment, short-gap interpolation, masking of long gaps).
LargeST. A large-scale benchmark suite of urban road networks across multiple cities, with tens of thousands of sensors per city and 5-min sampling. It provides standardized train/validation/test splits and graph construction (e.g., geo-kNN) for consistent comparison at scale. We adopt the official splits and evaluation protocol.
All series are standardized (z-score) per node and channel using training statistics only. We use the conventional chronological split of 70%/10%/20% into train/validation/test for METR–LA and PEMS–BAY; for LargeST we use the official splits. Unless stated otherwise, we report horizons {3, 6, 12} steps (15, 30, 60 min).
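The per-node, per-channel z-score (fit on the training split only and inverted before computing metrics) can be implemented as in the short sketch below.

import torch

def fit_zscore(train):
    """Illustrative sketch: per-node, per-channel statistics from the training
    split only; `train` is a tensor of shape (T, N, C)."""
    mean = train.mean(dim=0, keepdim=True)           # (1, N, C)
    std = train.std(dim=0, keepdim=True) + 1e-8      # guard against constant channels
    return mean, std

def apply_zscore(x, mean, std):
    return (x - mean) / std

def invert_zscore(x_norm, mean, std):
    # Applied to predictions before evaluating MAE/RMSE/MAPE on real speeds.
    return x_norm * std + mean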

4.2. Evaluation Metrics

We report mean absolute error (MAE) and root mean square error (RMSE). Let $\{\hat{y}_{t,n}, y_{t,n}\}$ denote predictions and ground truth for time $t$ and node $n$, aggregated over all test timestamps $\mathcal{T}$ and nodes $\mathcal{N}$ ($M = |\mathcal{T}|\,|\mathcal{N}|$):
$$\mathrm{MAE} = \frac{1}{M} \sum_{t \in \mathcal{T}} \sum_{n \in \mathcal{N}} \big| \hat{y}_{t,n} - y_{t,n} \big|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{t \in \mathcal{T}} \sum_{n \in \mathcal{N}} \big( \hat{y}_{t,n} - y_{t,n} \big)^2 }.$$
For mean absolute percentage error (MAPE) we clamp the denominator to avoid division by near-zero values (we use $\varepsilon = 1.0$ in speed units for our runs):
$$\mathrm{MAPE} = 100 \times \frac{1}{M} \sum_{t \in \mathcal{T}} \sum_{n \in \mathcal{N}} \frac{\big| \hat{y}_{t,n} - y_{t,n} \big|}{\max(|y_{t,n}|, \varepsilon)}.$$
Because reported baselines often differ in ε and missing-value handling, MAE/RMSE are prioritized for cross-paper comparability, and MAPE is provided for reference only.
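For reference, the metrics as used in our comparisons can be computed as in this short sketch (the function name is ours; the epsilon clamp matches the protocol above).

import torch

def point_metrics(y_hat, y, eps=1.0):
    """Illustrative sketch: MAE, RMSE, and MAPE on de-normalized speeds, with
    the MAPE denominator clamped at eps to avoid division by near-zero values."""
    err = y_hat - y
    mae = err.abs().mean()
    rmse = (err ** 2).mean().sqrt()
    mape = 100.0 * (err.abs() / y.abs().clamp_min(eps)).mean()
    return mae.item(), rmse.item(), mape.item()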

4.3. Baselines

We compare against representative methods spanning temporal-only, graph-only, and hybrid spatio–temporal modeling paradigms. When official code is available, we use it with recommended settings; otherwise, we re-implement according to the papers and tune on the validation set.
Classical/temporal baselines. Historical Average (HA), ARIMA (with Kalman filtering), Vector Auto-Regression (VAR), Support Vector Regression (SVR), Feedforward Neural Network (FNN), and FC-LSTM.
Graph spatio–temporal baselines. STGCN [3], DCRNN [4], Graph WaveNet (GWNET) [16], AGCRN [18], GMAN [19], DGCRN [27].
Hybrid/attention baselines. STTN [22], ASTGNN [24], D²STGNN [30], and a channel-independent PatchTST variant [8] as a strong temporal encoder without explicit graph coupling.
This set covers temporal Transformers, fixed-graph and adaptive/dynamic-graph paradigms. On METR–LA and PEMS–BAY we report a representative subset with stable, widely reused baselines to avoid confounding implementation details; a broader set of strong hybrids is included on the large-scale LargeST benchmark. We do not claim strict SOTA across all datasets; instead, we position PTSTG as a compact, competitive alternative with clear headroom under capacity scaling.

4.4. PTSTG Training Setup

Optimization and regularization. We use the Adam optimizer with cosine learning-rate decay and linear warm-up, early stopping on validation MSE, gradient clipping by value, weight decay, and dropout in the feed-forward sublayers. Unless noted, results are single-run point estimates obtained with a fixed random seed (2023). We do not conduct statistical significance tests in this version.
Input/output. A fixed look-back window T in is fed per node; the model outputs the full horizon H in a single pass. Per-node/channel z-score standardization is fit on training data and inverted at evaluation.
Model configuration. PTSTG comprises a channel-independent patch-based Transformer encoder, a learnable low-rank adaptive adjacency with row-wise temperature $\tau = 1.0$, and a lightweight stack of node graph blocks. Unless stated otherwise, we do not sparsify $A$ during training; at inference we optionally apply top-$k$ row-wise sparsification with re-normalization (we report $k$ when used). The multi-horizon loss in Equation (15) uses uniform weights $w_t = 1$.

4.5. Main Results

  • METR–LA.
PTSTG provides competitive MAE/RMSE/MAPE across 15/30/60 min without outperforming the strongest diffusion-based baseline (DCRNN) in our point-estimate runs (Table 2). The gap remains moderate at 15–30 min and increases at 60 min.
Table 2. Performance on METR–LA at 15/30/60 min (lower is better). Baseline values are taken verbatim from the literature (e.g., [4,16]); we did not rerun these baselines. PTSTG values are our single-run point estimates (seed = 2023). Minor discrepancies across publications may stem from preprocessing choices and MAPE conventions ($\varepsilon$). Best numbers in bold.
  • PEMS–BAY.
A similar pattern holds: PTSTG is competitive among recent graph–temporal models but does not surpass the strongest baseline (Table 3). Short-horizon differences are small, while the one-hour horizon exhibits a larger margin.
Table 3. Performance on PEMS–BAY at 15/30/60 min (lower is better). Baseline values are taken verbatim from the literature (e.g., [4,16]); we did not rerun these baselines. PTSTG values are our single-run point estimates (seed = 2023). Due to heterogeneous MAPE definitions in prior work, MAE/RMSE should be taken as the primary basis for comparison. Best numbers in bold.
  • LargeST.
On a large-scale city network, PTSTG achieves favorable average RMSE/MAE and a compact parameterization (Table 4). It remains behind the top hybrid (e.g., D²STGNN) on some individual horizons, but attains competitive average performance and efficiency.
Table 4. Results on the LargeST benchmark (SD) with horizons {3, 6, 12} (lower is better). Baseline values are taken verbatim from official LargeST reports and follow-up papers (e.g., [31] for D²STGNN); we did not rerun these baselines. PTSTG values are our single-run point estimates (seed = 2023). Numbers can vary across city coverage, graph construction, and MAPE denominator clamping. Best numbers in bold.
Table 5 summarizes the computational footprint of PTSTG on a single consumer GPU.
Table 5. PTSTG computational footprint on a single consumer GPU. Hardware: NVIDIA GeForce RTX 4060 Ti (8 GB), FP32; PyTorch v2.8.0 setup as in Section 4.7. Input: batch $B = 32$, Seq_len = 36, Horizon = 12. FLOPs are per forward pass at $B = 1$; latency/throughput (THPT) are per mini-batch ($B = 32$).

4.6. Ablations and Efficiency

We qualitatively assess three design choices: (i) patch tokens vs. step-wise tokens, (ii) learned vs. fixed-distance adjacency, and (iii) graph depth $L_g$. In our experiments, patching stabilized optimization and improved short/medium-range accuracy; learned adjacency consistently outperformed a distance kNN graph; and increasing $L_g$ benefited 30–60 min horizons. Numeric ablations (no-patch vs. patch; learned-A vs. distance-A vs. none; sweeps over $L_g$ and graph rank; sensitivity to top-$k$) and efficiency profiles (latency/throughput/VRAM at $N \in \{200, 1\mathrm{k}, 10\mathrm{k}\}$) will be released with the public artifact to ensure exact reproducibility without space constraints.

4.7. Implementation Details

PTSTG is implemented in PyTorch. We standardize each node/channel with z-scores computed on the training split and invert the transform at evaluation. Models are trained with Adam optimizer, cosine learning-rate decay with linear warm-up, early stopping on validation MSE, gradient clipping by value, dropout in the feed-forward sublayers, and weight decay. The model consumes a fixed look-back window ( Seq _ len ) and outputs the full horizon (Horizon) in a single pass. All experiments use a 5-min sampling step. Exact training/optimization and architectural hyperparameters per dataset are given in Table 6 and Table 7.
Table 6. Training and optimization hyperparameters for PTSTG. Seq_len is the look-back window in 5-min steps; Horizon is the prediction length in steps.
Table 7. Architectural hyperparameters of PTSTG. d model is the embedding width; n_heads/n_layers refer to the PatchTST encoder; patch_len/stride defines temporal tokenization; graph_rank parameterizes the low-rank adaptive adjacency; graph_layers is the number of node-graph blocks.
  • Reporting choices and protocol clarifications.
MAPE is computed on de-normalized speeds with a denominator clamp ε = 1.0 (speed unit) to avoid division by near-zero values. Row-wise softmax temperature for the learned adjacency is τ = 1 . Unless stated otherwise, no top-k sparsification is applied to A during training or evaluation in the reported numbers (i.e., k is not used). Missing values are linearly interpolated within short gaps and masked when gaps exceed a predefined threshold, following prior work.
  • Statistical reporting.
All reported numbers are single-run point estimates with a fixed random seed (2023) under a unified training/evaluation protocol. While multi-seed reporting (mean ± std) is preferable, it was beyond our computing budget for this manuscript. To facilitate replication, we release training scripts and configs that support multi-seed runs out-of-the-box; future work will include aggregated statistics.
  • Reproducibility and artifact availability.
All code and configuration files needed to reproduce our experiments are publicly available at https://github.com/mtuciru/PTSTG (accessed on 23 September 2025).

Societal and Deployment Considerations

Fairness across road types and regions. Model errors can vary systematically across functional classes (e.g., freeways vs. arterials) and neighborhoods. To avoid reinforcing inequities (e.g., systematically worse travel-time estimates on minor roads), we recommend stratified evaluation by road class and district, reporting per-stratum MAE/RMSE and gap metrics, and rebalancing training where harmful gaps are observed.
Robustness to sensor failures and missing data. Real-world networks exhibit outages, drift, and asynchronous timestamps. Although PTSTG tolerates short gaps via interpolation and masking, deployment should include robustness tests that randomly mask nodes/intervals at inference time, impute with model-based or graph-aware methods, and maintain fallbacks (historical profiles, neighbor-based heuristics) when coverage drops.
Distribution shift and incidents. The learned, input-agnostic adjacency may encode spurious correlations and under-react to regime shifts (incidents, events, weather). Mitigations include regularization and sparsification of A, periodic retraining with recent data, input-conditioned or time-varying graphs, explicit exogenous covariates, and online monitoring of calibration and error drift.
Interpretability and accountability. While row-stochastic A is amenable to inspection, links need not correspond to physical roads. We advocate publishing qualitative audits (e.g., top-k neighbors per node with geographic overlays) and documenting changes to A over time to support operator review.

5. Conclusions

We introduced PTSTG, a compact hybrid forecaster that combines a PatchTST-style, channel-independent temporal encoder with a learned row-stochastic adjacency and a small stack of node-graph refinement blocks. The model performs single-pass, multi-horizon decoding and cleanly separates temporal tokenization from spatial aggregation. Across METR–LA, PEMS–BAY, and LargeST (SD), PTSTG delivers competitive accuracy while emphasizing efficiency (short-attention over patch tokens, no auto-regressive rollout), compactness, and scalability to large sensor sets. Capacity sweeps indicate predictable gains with modest increases in embedding width and graph depth, suggesting headroom without changing the architecture.
Our contribution is positioned primarily in terms of practical trade-offs rather than universal accuracy dominance: PTSTG offers a lightweight temporal path, a data-driven spatial coupling that avoids fixed topologies, and a restrained parameter budget that accelerates training and reduces memory usage. Limitations remain. Dense inter-node mixing scales quadratically with the number of sensors (mitigated by top-k sparsification). The learned adjacency is currently input-agnostic and may under-react to sudden regime shifts. Mean pooling over patch tokens can lose fine-grained temporal detail in highly non-stationary segments.
Regarding future work vis-à-vis existing literature, several themes are related but distinct in scope. Multi-scale temporal modeling has been explored via decomposition/frequency designs (e.g., Autoformer, FEDformer, ETSformer, Crossformer) [7,9,10,12]; our aim is a PatchTST-native multi-scale tokenization (short/long patches in parallel) that reuses the same lightweight learned graph, with scale-specific pooling and a minimal cross-scale gating head to preserve the model’s simplicity. For dynamic spatial relations, prior dynamic/adaptive graphs exist (e.g., Graph WaveNet, AGCRN, DGCRN) [16,18,27]; our extension will keep the row-stochastic parameterization but modulate it with low-cost, input-conditioned factors (e.g., a small hypernetwork) to maintain efficiency. Finally, we plan to incorporate exogenous variables (weather, calendar/events, incidents) through additive or cross-attention side paths that feed both the temporal encoder and the graph learner, and to provide uncertainty estimates via lightweight distributional heads or ensembling. Together, these steps target improved robustness and interpretability while retaining PTSTG’s core advantages in efficiency and deployability for intelligent transportation systems.

Author Contributions

Conceptualization, M.G.; methodology, M.G. and G.M.; software, G.M.; validation, G.M. and M.G.; formal analysis, G.M.; investigation, G.M.; resources, M.G.; data curation, G.M.; writing—original draft preparation, G.M.; writing—review and editing, M.G.; visualization, G.M.; supervision (scientific advising), M.G.; project administration, M.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

All datasets used in this study are publicly available: METR–LA and PEMS–BAY from their original maintainers (as distributed with the DCRNN release), and LargeST (San Diego, “SD”) from the dataset authors’ official repository. We do not redistribute third-party data; our repository provides scripts to download and prepare them. Code and exact run configurations to reproduce the reported experiments are available at https://github.com/mtuciru/PTSTG (accessed on 23 September 2025). No new proprietary data were created.

Acknowledgments

The authors thank colleagues at the Moscow Technical University of Communication and Informatics for helpful discussions and feedback during this work. We also acknowledge the open-source communities behind PyTorch and related libraries that made this research possible.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
ITS: Intelligent Transportation Systems
GNN: Graph Neural Network
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
GRU: Gated Recurrent Unit
STGCN: Spatio-Temporal Graph Convolutional Network
DCRNN: Diffusion Convolutional Recurrent Neural Network
STSGCN: Spatial-Temporal Synchronous Graph Convolutional Network
GWNet: Graph WaveNet
AGCRN: Adaptive Graph Convolutional Recurrent Network
ASTGNN: Attention-based Spatio-Temporal Graph Neural Network
STAEformer: Spatio-Temporal Adaptive Embedding Transformer
STGormer: Spatio-Temporal Graph Transformer
DGCRN: Dynamic Graph Convolutional Recurrent Network
D²STGNN: Decoupled Dynamic Spatial-Temporal Graph Neural Network
Informer: Long-Sequence Transformer with ProbSparse Attention
Autoformer: Decomposition Transformer with Auto-Correlation
FEDformer: Frequency Enhanced Decomposed Transformer
PatchTST: Patch-based Time-Series Transformer
Crossformer: Transformer Utilizing Cross-Dimension Dependency
ETSformer: Exponential Smoothing Transformer

References

  1. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  2. Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
  3. Yu, B.; Yin, H.; Zhu, Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-2018), Stockholm, Sweden, 13–19 July 2018; pp. 3634–3640. [Google Scholar] [CrossRef]
  4. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. arXiv 2017, arXiv:1707.01926. [Google Scholar]
  5. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  6. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, J.; Xiong, Z.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
  7. Wu, H.; Xu, J.; Wang, J.; Long, M.; Jiang, J.; Wang, C. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Online, 6–14 December 2021; Volume 34, pp. 22419–22430. [Google Scholar]
  8. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-Term Forecasting with Transformers. arXiv 2022, arXiv:2211.14730. [Google Scholar]
  9. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-Term Series Forecasting. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286. [Google Scholar]
  10. Zhang, Y.; Yan, J. Crossformer: Transformer Utilizing Cross-Dimension Dependency for Multivariate Time Series Forecasting. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  11. Liu, D.; Zhou, H.; Wu, J.; Deng, W.; Chen, W.; Zhang, C.; Ma, Z.; Sun, L. Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25 April 2022. [Google Scholar]
  12. Woo, G.; Liu, C.; Sahoo, D.; Kumar, A.; Hoi, S.C. ETSformer: Exponential Smoothing Transformers for Time-Series Forecasting. arXiv 2022, arXiv:2202.01381. [Google Scholar] [CrossRef]
  13. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q.; Yu, Y.; Wang, L. Are Transformers Effective for Time Series Forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 14800–14808. [Google Scholar]
  14. Yu, C.; Lin, H.; Dong, W.; Fang, S.; Yuan, Q.; Yang, C. TripChain2RecDeepSurv: A Novel Framework to Predict Transit Users’ Lifecycle Behavior Status Transitions for User Management. Transp. Res. Part C Emerg. Technol. 2024, 167, 104818. [Google Scholar] [CrossRef]
  15. Song, C.; Lin, Y.; Guo, S.; Wan, H. Spatial-Temporal Synchronous Graph Convolutional Networks: A New Framework for Spatial-Temporal Network Data Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 914–921. [Google Scholar]
  16. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Graph WaveNet for Deep Spatial-Temporal Graph Modeling. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 10–16 August 2019; pp. 1907–1913. [Google Scholar]
  17. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual, 6–10 July 2020; pp. 753–763. [Google Scholar]
  18. Bai, L.; Yao, L.; Li, C.; Wang, X.; Wang, C. Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting. In Proceedings of the Annual Conference on Neural Information Processing Systems 2020 (NeurIPS 2020), Virtual, 6–12 December 2020; Volume 33, pp. 17804–17815. [Google Scholar]
  19. Zheng, C.; Fan, X.; Wang, C.; Qi, J. GMAN: A Graph Multi-Attention Network for Traffic Prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1234–1241. [Google Scholar]
  20. Cao, D.; Wang, Y.; Li, J.; Zhou, H.; Li, L. Spectral Temporal Graph Neural Network for Multivariate Time-Series Forecasting. In Proceedings of the Annual Conference on Neural Information Processing Systems 2020 (NeurIPS 2020), Virtual, 6–12 December 2020; Volume 33, pp. 17766–17778. [Google Scholar]
  21. Grigsby, J.; Wang, Z.; Nguyen, N.H.; Qi, Y. Long-Range Transformers for Dynamic Spatiotemporal Forecasting. arXiv 2021, arXiv:2109.12218. [Google Scholar]
  22. Xu, K.; Liu, X.; Zheng, H.; Guan, H. Spatial-Temporal Transformer Networks for Traffic Flow Forecasting. arXiv 2020, arXiv:2001.02908. [Google Scholar]
  23. Feng, A.; Tassiulas, L. Adaptive Graph Spatial-Temporal Transformer Network for Traffic Flow Forecasting. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM), Atlanta, GA, USA, 17–21 October 2022; pp. 3933–3937. [Google Scholar]
  24. Guo, S.; Lin, Y.; Wan, H.; Li, X.; Cong, G. Attention Based Spatio-Temporal Graph Neural Network for Traffic Forecasting. IEEE Trans. Knowl. Data Eng. 2021, 34, 5415–5428. [Google Scholar] [CrossRef]
  25. Liu, H.; Dong, Z.; Jiang, R.; Deng, J.; Chen, Q.; Song, X. STAEformer: Spatio-Temporal Adaptive Embedding Makes Vanilla Transformer SOTA for Traffic Forecasting. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023. [Google Scholar]
  26. Zhou, J.; Liu, E.; Chen, W.; Zhong, S.; Liang, Y. Navigating Spatio-Temporal Heterogeneity: A Graph Transformer Approach for Traffic Forecasting. arXiv 2024, arXiv:2408.10822. [Google Scholar] [CrossRef]
  27. Li, F.; Wu, J.; Xu, J.; Long, M.; He, D. Dynamic Graph Convolutional Recurrent Network for Traffic Prediction: Benchmark and Solution. ACM Trans. Knowl. Discov. Data 2023, 17, 1–21. [Google Scholar] [CrossRef]
  28. Shang, C.; Chen, J.; Bi, J. Discrete Graph Structure Learning for Forecasting Multiple Time Series. arXiv 2021, arXiv:2101.06861. [Google Scholar] [CrossRef]
  29. Li, M.; Zhu, Z. Spatial-Temporal Fusion Graph Neural Networks for Traffic Flow Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 4189–4196. [Google Scholar]
  30. Shao, Z.; Zhang, Z.; Wei, W.; Wang, F.; Xu, Y.; Cao, X.; Jensen, C.S. Decoupled Dynamic Spatial-Temporal Graph Neural Network for Traffic Forecasting. Proc. Vldb Endow. (PVLDB) 2022, 15, 2733–2746. [Google Scholar] [CrossRef]
  31. Fang, Y.; Liang, Y.; Hui, B.; Shao, Z.; Deng, L.; Liu, X.; Jiang, X.; Zheng, K. Efficient Large-Scale Traffic Forecasting with Transformers: A Spatial Data Management Perspective. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Toronto, ON, Canada, 3–7 August 2025. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
