ST-MAFNet: Spatio-Temporal Multi-Scale Adaptive Fusion Network for Traffic Forecasting

Guo, Feng; Wang, Xunhuang; Zou, Fumin; Zou, Lei; Fang, Tao; Wu, Xueming; Jiang, Haocai; Weng, Jianqing

doi:10.3390/ai7060217


by Feng Guo ^1,2,*, Xunhuang Wang ^1,2, Fumin Zou ^1,2, Lei Zou ^1,2, Tao Fang ^1,2, Xueming Wu ^1,2, Haocai Jiang ^1,2 and Jianqing Weng ^1,2

¹ Fujian Key Laboratory of Automotive Electronics and Electric Drive, Fujian University of Technology, Fuzhou 350118, China
² Renewable Energy Technology Research Institute, Fujian University of Technology, Ningde 352101, China
* Author to whom correspondence should be addressed.

AI 2026, 7(6), 217; https://doi.org/10.3390/ai7060217

Submission received: 5 May 2026 / Revised: 3 June 2026 / Accepted: 6 June 2026 / Published: 12 June 2026

(This article belongs to the Section AI Systems: Theory and Applications)


Abstract

Accurate traffic flow prediction is fundamental to Intelligent Transportation Systems (ITSs) and critical for transportation management and logistics. Despite advances in spatio-temporal prediction methods, existing approaches suffer from two key limitations: (i) multi-scale fusion methods inadequately capture hierarchical constraints between cross-scale features, and (ii) models rely on single spatio-temporal views, neglecting multi-source relationship complementarity. To address these issues, we propose ST-MAFNet, a spatio-temporal multi-scale adaptive fusion network comprising three key components: a Cross-Scale Hierarchical Anchoring strategy (CSHA) that anchors short-term predictions with multi-scale temporal patterns to mitigate noise; a Dual Spatial Perception Module (DSPM) that learns node heterogeneity and dynamic correlations through node embeddings and adaptive graph attention; and a Spatio-Temporal Adaptive Fusion Module (STAFM) that captures time-varying connectivity by integrating multi-scale temporal features with multi-source spatial relationships. Experiments on four real-world datasets demonstrate that ST-MAFNet is particularly effective for short-term traffic forecasting. Compared with the best previously reported MAE results, ST-MAFNet reduces MAE by 2.95%, 1.43%, 1.25%, and 0.37% on PEMS03, PEMS04, PEMS07, and PEMS08, respectively, and achieves the best or second-best performance on most evaluation metrics.

Keywords:

adaptive fusion; graph learning; multi-scale spatio-temporal representation; time-varying dependencies; traffic forecasting


1. Introduction

Traffic flow forecasting plays a critical role in infrastructure optimization, public transportation management, and emergency response operations [1,2,3]. As urbanization accelerates worldwide, growing traffic congestion has made accurate and timely traffic flow prediction increasingly essential for route planning, congestion mitigation, and decision support.

Early research treated traffic flow prediction as a pure time series regression problem, focusing on temporal representation extraction with limited adoption of graph learning methods [4,5,6,7]. Although these methods predict traffic data by formulating temporal trends, they overlook the inherent complex spatial dependencies in road networks, which limits their prediction performance.

As researchers have increasingly recognized the importance of spatial dimensions, graph learning methods have gained prominence in the field of traffic flow prediction. Graph learning methods mainly include two forms: static graph learning and dynamic graph learning. Static graph learning [8,9] provides spatial structure priors for models, while dynamic graph learning [3,10,11,12] captures dynamic spatial correlations based on temporal pattern variations. Subsequently, methods like STGAT [13] and STJGCN [14] integrate spatial and temporal dimensions within a unified framework to capture dynamic spatio-temporal dependencies. Compared with traditional approaches, these integrated spatio-temporal frameworks consistently demonstrate superior predictive performance.

In recent years, pre-training methods [15,16] have provided new research paradigms across diverse domains. Currently, methods that combine masked pre-training with self-supervised learning [15] to encourage models to learn long-term spatio-temporal contexts and apply them to downstream traffic flow prediction have significantly improved prediction performance. However, due to the discrepancy between pre-training objectives and downstream tasks, these methods still face challenges in practical applications. Moreover, these methods require substantial data and computational resources, making it difficult to achieve expected advantages in data-scarce scenarios.

Despite significant progress, existing methods face two key limitations:

(1): Neglecting complementary multi-source spatio-temporal dependencies. Existing models predominantly rely on predefined spatial structures and periodic temporal patterns, while overlooking the complementary dynamics from multiple spatio-temporal sources. Traffic flow exhibits directional propagation characteristics and implicit dependencies between non-adjacent nodes with similar behavioral patterns, which cannot be adequately captured by static topological structures alone. As shown in Figure 1, the flow consistency across spatially distant nodes underscores the necessity of integrating multi-source spatio-temporal information.
(2): Overlooking anchor-refinement interactions between temporal scales. Traffic flow demonstrates multi-scale temporal dynamics, with short-term variations occurring within longer-term patterns at different time scales. Existing approaches rely on stacked dilated convolutions to learn multi-scale dependencies, but fail to capture the hierarchical structure underlying these temporal relationships. As illustrated in Figure 2, multi-scale temporal patterns (e.g., peak-hour trends) anchor short-term predictions, while short-term fluctuations refine local variations within this framework.

To further support the motivation in Figure 1 and Figure 2, we quantitatively analyze long-range node similarity in the PEMS datasets. Using Pearson correlations computed from traffic-flow sequences with a high-similarity threshold of 0.8, we find that 98.96%, 99.64%, 99.85%, and 99.53% of highly correlated sensor pairs in PEMS03, PEMS04, PEMS07, and PEMS08, respectively, are not directly connected in the physical road graph. This finding indicates that traffic similarity is not limited to adjacent sensors and motivates adaptive spatial dependency modeling. The ablation results further show that removing CSHA increases the average MAE from 14.16 to 23.58 on PEMS03, from 17.88 to 31.59 on PEMS04, and from 13.41 to 18.84 on PEMS08, confirming that cross-scale anchoring provides measurable gains beyond simple multi-scale feature extraction.
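The long-range similarity statistic described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the use of unordered sensor pairs, the inclusive threshold, and the rule that an edge in either direction counts as "directly connected" are all assumptions, since the text only specifies Pearson correlation with a 0.8 threshold.

```python
import numpy as np

def share_of_nonadjacent_similar_pairs(flow, adj, threshold=0.8):
    """Fraction of highly correlated sensor pairs that share no physical edge.

    flow: (T, N) array of traffic-flow readings, one column per sensor.
    adj:  (N, N) physical adjacency matrix; a nonzero entry means an edge.
    Assumptions (not from the paper): unordered pairs, `>= threshold`,
    and an edge in either direction counts as a connection.
    """
    corr = np.corrcoef(flow.T)                      # (N, N) Pearson correlations
    n = corr.shape[0]
    upper = np.triu_indices(n, k=1)                 # each unordered pair once
    similar = corr[upper] >= threshold
    connected = ((adj != 0) | (adj.T != 0))[upper]
    n_similar = int(similar.sum())
    if n_similar == 0:
        return float("nan")                         # no highly correlated pairs
    return float((similar & ~connected).sum() / n_similar)

# Toy data: sensor 2 is a scaled copy of sensor 0, sensor 1 is unrelated.
toy_flow = np.array([[1.0, 4.0, 2.0],
                     [2.0, 1.0, 4.0],
                     [3.0, 3.0, 6.0],
                     [4.0, 2.0, 8.0]])
toy_adj = np.zeros((3, 3))
toy_adj[0, 1] = 1.0                                 # only sensors 0 and 1 are linked
toy_share = share_of_nonadjacent_similar_pairs(toy_flow, toy_adj)
```

In the toy example, the only highly correlated pair (sensors 0 and 2) has no physical edge, so the share is 1.0. This mirrors, at miniature scale, the pattern the PEMS percentages report.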

To address these limitations, we propose a spatio-temporal multi-scale adaptive fusion network (ST-MAFNet). The main contributions are summarized as follows:

A Cross-Scale Hierarchical Anchoring (CSHA) strategy is proposed for progressively transferring multi-scale temporal patterns to short-term prediction layers, effectively mitigating noise and drift in short-term predictions.
A Dual Spatial Perception Module (DSPM) is designed to capture spatial dependencies from both node representation and graph structure perspectives, encoding time-varying spatial dependencies across various temporal contexts.
A Spatio-Temporal Adaptive Fusion Module (STAFM) is introduced to dynamically integrate multi-source characteristics, capturing complex dependencies across spatial and temporal dimensions.
Extensive experiments on four real-world datasets (PEMS03, PEMS04, PEMS07, and PEMS08) demonstrate that ST-MAFNet is especially effective for short-term forecasting and achieves the best or second-best results on most metrics.

The remainder of this paper is organized as follows. Section 2 reviews related work on traffic flow forecasting. Section 3 formulates the problem. Section 4 presents the proposed ST-MAFNet framework. Section 5 evaluates model performance through extensive experiments. Section 6 concludes the paper and discusses future directions.

2. Related Work

2.1. Time Series Forecasting

Benefiting from rapid advances in computational power, time series forecasting has transitioned from early statistical methods to research paradigms dominated by deep learning. ARIMA [5] and its variants [6] capture linear relationships in multivariate time series. While these methods offer high interpretability and ease of training, their reliance on linear assumptions limits their ability to identify complex nonlinear traffic patterns.

Machine learning methods demonstrate superior prediction performance compared to statistical approaches, owing to their ability to learn nonlinear relationships. Representative methods include KNN [17], SVR [18], and Bayesian networks [19], which leverage learnable nonlinear architectures to capture complex nonlinear dependencies. However, these methods struggle to capture deep-level complex traffic dynamics, particularly in highly nonlinear scenarios.

RNN variants such as LSTM and GRU have been widely adopted in traffic prediction [20,21], owing to their ability to mitigate gradient vanishing and capture long-term temporal dependencies through gating mechanisms. Nevertheless, these methods still face error accumulation issues in ultra-long sequences and multi-step predictions. In recent years, Transformer-based architectures (e.g., Informer [22]) have demonstrated significant advantages in learning nonlinear dependencies. These models capture global dependencies through attention mechanisms but incur high computational costs. Their performance is highly dependent on dataset characteristics, making it difficult to guarantee universal advantages. PDFormer [23] clusters traffic patterns using DTW [24] and extracts propagation delay features through attention mechanisms, providing a new interpretable modeling approach for spatio-temporal forecasting.

Recent studies such as DLinear [25] and BasicTS [26] challenge the assumption that complex models outperform simple models. On datasets lacking stable periodicity or containing significant noise, simple models may outperform complex ones. Conversely, on datasets with clear periodicity, complex models could better encode temporal patterns. In summary, there is no one-size-fits-all solution for time series forecasting; rather, the key lies in whether the inductive bias imposed on the model aligns with the temporal structure of the data.

2.2. Multi-Scale Temporal Modeling

Traffic flow prediction is increasingly shifting towards multi-scale modeling approaches from traditional single-scale methods. Early methods adopt single-scale modeling, such as DCRNN [8] using fixed-step GRU, STGCN [27] employing single-kernel temporal convolutions, and ASTGCN [28] based on fixed-window attention mechanisms. These methods rely on single-scale temporal encoders, which struggle to simultaneously capture short-term fluctuations and long-term periodic patterns. In practice, traffic flow exhibits multi-scale temporal dependencies: short-term fluctuations reflect immediate states, while long-term patterns reveal periodic properties. Therefore, multi-scale temporal modeling becomes crucial for identifying these complex dependencies.

To address this limitation, multi-scale parallel extraction approaches have been developed. Models like TCN [29] and GWNet [9] extract multi-scale features in parallel through convolution kernels of different scales, while InceptionTime [30] captures patterns across different time spans through multi-branch structures. However, these methods extract features at each scale independently, lacking cross-scale interaction mechanisms and thus failing to fully capture complementary information across different temporal granularities.

Unlike parallel extraction methods, long-sequence modeling approaches, including Autoformer [31] and iTransformer [32], construct temporal pyramids through downsampling to achieve hierarchical predictions from coarse to fine granularities. While effective in capturing multi-scale patterns, these methods suffer from unidirectional information flow that prevents effective establishment of anchoring mechanisms to short-term predictions, rendering predictions susceptible to noise interference. Consequently, devising effective cross-scale hierarchical anchoring mechanisms is imperative for enhancing short-term prediction performance.

2.3. Spatio-Temporal Graph Neural Networks

In spatio-temporal forecasting tasks such as traffic flow prediction, network traffic prediction and environmental prediction, spatial correlation information can reveal underlying patterns beyond temporal periodicity, driving the development of Spatio-temporal Graph Neural Networks (STGNNs). They formulate forecasting as a graph learning task, where sensor locations are represented as nodes and road connectivity between sensors is represented as edges. By leveraging these graph structures to represent spatial correlations and integrating temporal dependencies, STGNNs effectively capture complex dynamic spatio-temporal dependencies.

Extensive research has demonstrated the critical role of spatial structures in traffic prediction. DCRNN [8] employs diffusion convolutions to represent directed traffic dependencies, while GWNet [9] captures multi-scale spatial dependencies through stacked graph convolutions. These methods are suitable for scenarios with stable topological structures (e.g., road networks). While these methods capture inherent physical structures to improve prediction accuracy and ensure predictions comply with actual traffic propagation constraints, predefined graph structures lack flexibility in adapting to dynamic network changes or revealing implicit relationships.

To address this issue, AGCRN [33] introduces an adaptive graph convolution module to infer spatial correlations through dynamic graph structure learning rather than relying on fixed road network topology. MTGNN [34] and StemGNN [35] further enrich spatial relationship modeling through multi-hop neighbor exploration and frequency-domain transformations. Unlike the above methods, STAEformer [36] implicitly captures spatial dependencies between nodes through learnable node embeddings combined with Transformer self-attention mechanisms.

Dynamic correlation modeling methods focus on learning time-varying spatial dependencies but neglect node heterogeneity features, resulting in different types of nodes being treated equally. Node embedding methods represent node characteristics through static embeddings, lacking the ability to model dynamic spatial correlations and struggling to adapt to traffic propagation patterns across different time periods. In real-world traffic flow patterns, node heterogeneity and structural topology provide complementary static spatial information, while dynamic spatial correlations capture time-varying traffic propagation patterns.

To clarify the relationship between existing methods and ST-MAFNet, Table 1 summarizes representative methods, their limitations, the corresponding ST-MAFNet modules, and the expected benefits. Compared with prior studies, ST-MAFNet goes beyond independently extracting multi-scale temporal features or learning adaptive spatial representations. Specifically, CSHA introduces coarse-to-fine anchor-refinement interactions across temporal scales, DSPM jointly models node heterogeneity and dynamic spatial correlations, and STAFM integrates temporal embeddings, adaptive spatial features, and bidirectional graph propagation features within a unified fusion process.

3. Problem Statement

Consider a traffic network represented as a graph $G = (V, E, A)$, where $V = \{v_{1}, \dots, v_{N}\}$ is the set of N nodes (sensors), $E$ is the set of edges, and $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix. At time step t, traffic observations are denoted as $X_{t} \in \mathbb{R}^{N \times C}$, where C is the number of features (e.g., flow, time of day, and day of week).

The traffic flow forecasting problem aims to learn a function $f_{\theta}$ that maps historical observations $X_{t-P+1:t} = [X_{t-P+1}, \dots, X_{t-1}, X_{t}] \in \mathbb{R}^{P \times N \times C}$ to future predictions $Y = [Y_{t+1}, Y_{t+2}, \dots, Y_{t+P'}] \in \mathbb{R}^{P' \times N \times C_{o}}$:

$$Y = f_{\theta}(X_{t-P+1:t}; G) \tag{1}$$

where P and $P'$ denote the lengths of historical and future windows, respectively, $C_{o}$ denotes the number of target features, and $\theta$ represents the learnable parameters. In this study, the input includes traffic flow, time-of-day, and day-of-week features, whereas the prediction target is traffic flow only; thus, $C_{o} = 1$.
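To make the tensor shapes in Equation (1) concrete, the following sketch slices a raw series into input and target windows. It is an illustration under stated assumptions: the helper name `make_windows`, the horizon `P_out = 12`, and the convention that channel 0 holds traffic flow are hypothetical and not taken from the paper.

```python
import numpy as np

def make_windows(series, P=12, P_out=12, target_channel=0):
    """Slice a (T, N, C) series into forecasting samples.

    Returns X with shape (M, P, N, C) and Y with shape (M, P_out, N, 1),
    where M = T - P - P_out + 1. Keeping only `target_channel` in Y
    corresponds to C_o = 1 (flow-only targets); the channel index is assumed.
    """
    T = series.shape[0]
    M = T - P - P_out + 1
    if M <= 0:
        raise ValueError("series is shorter than one input-plus-target window")
    X = np.stack([series[i:i + P] for i in range(M)])
    Y = np.stack([
        series[i + P:i + P + P_out, :, target_channel:target_channel + 1]
        for i in range(M)
    ])
    return X, Y

# Toy series: 30 time steps, 2 sensors, 3 features (flow, time of day, day of week).
toy_series = np.arange(30 * 2 * 3, dtype=float).reshape(30, 2, 3)
X_win, Y_win = make_windows(toy_series)
```

Each sample pairs P historical steps with the next P_out steps, so the first target step of sample i is the observation immediately after its last input step.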

4. Methodology

4.1. Framework Overview

Figure 3 illustrates the overall architecture of the proposed ST-MAFNet, which consists of three core modules: Multi-Scale Temporal Encoder (MST-Encoder), Dual Spatial Perception Module (DSPM), and Spatio-Temporal Adaptive Fusion Module (STAFM).

Step 1: Multi-scale temporal decomposition. Given the input spatio-temporal sequence $X_{t-P+1:t}$, the MST-Encoder leverages S cascaded temporal convolutional layers with varying receptive fields to decompose the input into multi-granularity temporal representations $H_{F} = \{H_{F}^{(1)}, H_{F}^{(2)}, \dots, H_{F}^{(S)}\}$, where S denotes the number of multi-scale temporal patterns extracted. These hierarchical features establish the foundation for cross-scale hierarchical anchoring.

Step 2: Multi-source spatial perception. Traffic flow exhibits both invariant topological constraints and time-varying spatial dependencies. To capture these complementary characteristics, DSPM operates through dual pathways: (i) static spatial encoding projects predefined graph structures into node embeddings to preserve topological priors; (ii) dynamic spatial learning employs learnable node embeddings with adaptive graph attention to infer implicit correlations and node heterogeneity.

Step 3: Cross-scale hierarchical fusion. STAFM integrates multi-scale temporal representations with multi-source spatial embeddings through an adaptive fusion strategy. The cross-scale hierarchical anchoring mechanism enables coarse-grained patterns to progressively constrain and refine fine-grained predictions. The fused representations are subsequently decoded to generate future traffic state predictions.

4.2. MST-Encoder

The historical spatio-temporal data $X_{t-P+1:t}$ is fed into the multi-scale temporal encoder, which employs a cascaded temporal convolutional network architecture to progressively extract multi-scale temporal features from fine-grained to coarse-grained representations, as illustrated in Figure 3a. Specifically, we initialize $X^{(0)} = X_{t-P+1:t}$ and extract multi-scale features $X^{(s)} \in \mathbb{R}^{p_{s} \times N \times C}$ layer by layer through S cascaded temporal convolutional layers:

$$X^{(s)} = \mathrm{Conv2D}_{1 \times k}(\mathrm{TCNs}(X^{(s-1)}) + X^{(s-1)}) \tag{2}$$

where $k = \lfloor P / S \rfloor + 1$ denotes the temporal convolution kernel size, $p_{s} = P - s(k - 1)$ represents the output time steps at each layer, and $s = 1, 2, \dots, S$. To make residual addition well-defined, the TCN and residual branches are aligned before addition: both use the same hidden dimension; temporal padding is applied before convolution, and the residual branch is cropped to the output temporal length. The residual connections ensure effective information propagation across layers. Shallow-layer features preserve fine-grained short-term fluctuation patterns, while deep-layer features capture coarse-grained multi-scale patterns. At each layer, the temporal dimension is encoded, yielding the node representations at that scale $H_{F}^{(s)} \in \mathbb{R}^{N \times d_{mst}}$:

$$H_{F}^{(s)} = \mathrm{TemporalEncoding}(X^{(s)}) \tag{3}$$

where $d_{mst}$ denotes the feature dimension for representing both fine- and coarse-grained patterns. The TemporalEncoding operation takes the last temporal state of each scale-specific TCN output and permutes it into node-wise representations. With B denoting the batch size, it converts $X^{(s)} \in \mathbb{R}^{B \times d_{mst} \times N \times p_{s}}$ into $H_{F}^{(s)} \in \mathbb{R}^{B \times N \times d_{mst}}$, giving each node a fixed-size temporal representation at the corresponding scale. Under the implemented setting $P = 12$, $S = 4$, $d_{mst} = 64$, and $k = 4$, the cascaded MST-Encoder contains 65,792 trainable parameters. A parallel progressive-depth alternative that uses independent branches with depths $1, 2, \dots, S$ to generate the same scale-specific receptive fields contains 164,480 trainable temporal-encoder parameters. Thus, the cascaded design reduces the temporal-encoder parameter count by 60.0% while enabling information sharing across scales through residual connections. Finally, the resulting multi-scale feature set $H_{F}$ serves as input for cross-scale hierarchical anchoring and dual spatial perception.
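The reported parameter counts can be reproduced with a short calculation. The sketch below assumes that each cascaded layer contributes one $1 \times k$ convolution with $d_{mst}$ input and output channels plus a bias term. This assumption is ours rather than a stated implementation detail, but it matches both reported totals exactly. The helper `temporal_encoding` illustrates the last-state-and-permute step described for Equation (3); its name is hypothetical.

```python
import numpy as np

def conv_params(c_in, c_out, k, bias=True):
    """Trainable parameters of a 1 x k convolution: weights plus optional bias."""
    return c_in * c_out * k + (c_out if bias else 0)

P, S, d_mst = 12, 4, 64
k = P // S + 1                                  # kernel size from k = floor(P / S) + 1
per_layer = conv_params(d_mst, d_mst, k)        # one conv layer (assumed layout)

cascaded = S * per_layer                        # S shared layers, one per scale
parallel = sum(range(1, S + 1)) * per_layer     # branches of depth 1..S, no sharing
reduction = 1.0 - cascaded / parallel

def temporal_encoding(x):
    """(B, d_mst, N, p_s) -> (B, N, d_mst): keep the last time step, then permute."""
    return np.transpose(x[..., -1], (0, 2, 1))

# Hypothetical batch of 2 samples, 5 nodes, 9 remaining time steps.
H_F_example = temporal_encoding(np.zeros((2, d_mst, 5, 9)))
```

Under this layout the cascaded encoder has 4 layers and the parallel alternative has 1 + 2 + 3 + 4 = 10, which gives the 60.0% reduction directly as 1 - 4/10.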

4.3. DSPM

Traffic flow forecasting faces two major spatial modeling challenges: (i) Node heterogeneity: different road segments exhibit distinct traffic patterns due to variations in functional positioning, road hierarchy, and surrounding environments. (ii) Dynamic spatial correlations: the mutual influence of traffic flow between nodes is not static but evolves dynamically with time and traffic conditions. To comprehensively capture these two types of spatial dependencies, we propose the Dual Spatial Perception Module (DSPM), as illustrated in Figure 4.

We learn a trainable embedding vector $E_{node} \in \mathbb{R}^{N \times d_{node}}$ [36] for each node to capture node heterogeneity. Subsequently, this embedding is concatenated with the multi-scale feature set $H_{F}$ to provide personalized spatio-temporal hidden pattern representations $X_{hid}^{(s)} \in \mathbb{R}^{N \times d_{hid}}$ for each node:

$$X_{hid}^{(s)} = \mathrm{Concat}[H_{F}^{(s)}, E_{node}] \tag{4}$$

where $d_{node}$ denotes the dimension of the node embedding vector, and $d_{hid} = d_{node} + d_{mst}$ represents the feature dimension of the hidden representation.

We introduce learnable node embedding matrices $E_{1}, E_{2} \in \mathbb{R}^{N \times d_{adp}}$ to construct the spatial adaptive correlation matrix $A_{adp} \in \mathbb{R}^{N \times N}$:

$$A_{adp} = \mathrm{Softmax}_{row}(\mathrm{ReLU}(E_{1} E_{2}^{T})) \tag{5}$$

where $d_{adp}$ denotes the dimension of the learnable node embedding matrices. The softmax operation is applied row-wise; thus, each row represents the adaptive dependency distribution from one source node to all candidate target nodes. Because $E_{1}$ and $E_{2}$ are independent learnable embedding matrices, $A_{adp}$ is not constrained to be symmetric and is interpreted as a directed adaptive dependency prior. After training, $A_{adp}$ is fixed during inference, whereas dynamic spatial correlations are produced by input-dependent attention weights conditioned on the current traffic representations.
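Equation (5) can be sketched in a few lines of NumPy. The sizes below and the random initialization are hypothetical; in the model, $E_{1}$ and $E_{2}$ would be trained parameters rather than random draws.

```python
import numpy as np

def adaptive_adjacency(E1, E2):
    """Row-wise softmax over ReLU(E1 @ E2.T), as in Equation (5)."""
    scores = np.maximum(E1 @ E2.T, 0.0)                    # ReLU on pairwise scores
    scores = scores - scores.max(axis=1, keepdims=True)    # stabilise the exponent
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)    # each row sums to 1

rng = np.random.default_rng(0)
N, d_adp = 5, 8                      # hypothetical sizes
E1 = rng.normal(size=(N, d_adp))     # stands in for a learned source embedding
E2 = rng.normal(size=(N, d_adp))     # stands in for a learned target embedding
A_adp = adaptive_adjacency(E1, E2)
```

Because the ReLU clips negative scores to zero before the softmax, every entry of $A_{adp}$ stays strictly positive. Using two separate embedding matrices is what allows the result to be asymmetric.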

Traffic patterns exhibit regular fluctuations over time. Under specific conditions, these patterns often follow similar trends at certain intervals. To capture such dynamic trends, we feed $X_{hid}^{(s)}$ through a spatial attention module to capture complex spatial dependencies, yielding the query, key, and value matrices for spatial attention as follows:

$$Q^{(s)} = X_{hid}^{(s)} W_{Q}, \quad K^{(s)} = X_{hid}^{(s)} W_{K}, \quad V^{(s)} = X_{hid}^{(s)} W_{V} \tag{6}$$

where $W_{Q}, W_{K}, W_{V} \in \mathbb{R}^{d_{hid} \times d_{hid}}$ are learnable spatial weight matrices. In the multi-head implementation, these full projection matrices are split into head-specific subspaces, which is equivalent to using separate projection matrices for each head followed by concatenation and output projection. To leverage the adaptive graph structure, we integrate it into the multi-head attention mechanism:

$$\mathrm{Attn}_{dspm}^{(s)} = \mathrm{Softmax}\left(\frac{Q^{(s)} {K^{(s)}}^{T}}{\sqrt{d_{hid}}} + \log(A_{adp} + \varphi)\right) \cdot V^{(s)} \tag{7}$$

$$H_{dspm}^{(s)} = \mathrm{SelfAttention}(\{\mathrm{Attn}_{dspm_{i}}^{(s)}\}_{i=1}^{head}) \tag{8}$$

where $head$ denotes the number of attention heads, $\{\mathrm{Attn}_{dspm_{i}}^{(s)}\}_{i=1}^{head}$ represents the set of attention outputs from all heads, and $\varphi$ is a small positive constant for numerical stability. The softmax in Equation (7) is applied over the key-node dimension. $\mathrm{Attn}_{dspm}^{(s)}$ captures rich implicit spatial features, while $H_{dspm}^{(s)}$ encodes spatial dependencies under different traffic patterns.
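A single-head, unbatched sketch of Equations (4), (6), and (7) is given below. It omits the head split and output projection of Equation (8), and all sizes, weight scales, and the function name `prior_biased_attention` are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prior_biased_attention(X_hid, W_q, W_k, W_v, A_adp, phi=1e-6):
    """Scaled dot-product attention with log(A_adp + phi) added to the logits."""
    Q, K, V = X_hid @ W_q, X_hid @ W_k, X_hid @ W_v            # Equation (6)
    d_hid = X_hid.shape[-1]
    logits = Q @ K.T / np.sqrt(d_hid) + np.log(A_adp + phi)    # Equation (7)
    weights = softmax(logits, axis=-1)                         # over key nodes
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d_mst, d_node = 6, 4, 3                          # hypothetical sizes
H_F_s = rng.normal(size=(N, d_mst))                 # scale-s temporal features
E_node = rng.normal(size=(N, d_node))               # node embeddings
X_hid = np.concatenate([H_F_s, E_node], axis=-1)    # Equation (4)
d_hid = d_mst + d_node
W_q, W_k, W_v = (rng.normal(size=(d_hid, d_hid)) * 0.1 for _ in range(3))
A_adp = softmax(rng.normal(size=(N, N)), axis=-1)   # stand-in for a learned prior
attn_out, attn_weights = prior_biased_attention(X_hid, W_q, W_k, W_v, A_adp)
```

Adding the log of the prior inside the softmax is equivalent to multiplying the unnormalized attention weights by $A_{adp} + \varphi$. Node pairs that the learned prior considers unrelated are therefore strongly down-weighted, while the input-dependent term can still reshape the distribution.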

4.4. STAFM

To effectively integrate multi-scale spatio-temporal features, we propose the Spatio-Temporal Adaptive Fusion Module (STAFM), as illustrated in Figure 3b. This module achieves accurate traffic flow forecasting through multi-view spatio-temporal modeling and adaptive fusion mechanisms.

Through the aforementioned DSPM design, the model automatically learns implicit spatial correlation dependencies between nodes from the data, denoted as $H_{dspm} = \{H_{dspm}^{(1)}, H_{dspm}^{(2)}, \dots, H_{dspm}^{(S)}\}$. Meanwhile, the periodic features contained in historical data are encoded into features $H_{time} = \{H_{time}^{(1)}, H_{time}^{(2)}, \dots, H_{time}^{(S)}\}$, with each $H_{time}^{(s)} \in \mathbb{R}^{N \times d_{temp}}$, through Temporal Embedding (TE) to capture the periodic patterns of traffic flow, where $d_{temp}$ denotes the feature dimension for characterizing periodicity.

Furthermore, considering the stable structure of road networks and the bidirectional diffusion characteristics of traffic flow, we introduce $H_{forward}^{(s)} \in \mathbb{R}^{N \times d_{fwd}}$ and $H_{backward}^{(s)} \in \mathbb{R}^{N \times d_{bwd}}$ to characterize the heterogeneous propagation characteristics of traffic flow:

$$H_{forward}^{(s)} = \mathrm{FCs}(\mathrm{Concat}[H_{F}^{(s)}, \mathrm{FGPNet}(A_{forward})]) \tag{9}$$

$$H_{backward}^{(s)} = \mathrm{FCs}(\mathrm{Concat}[H_{F}^{(s)}, \mathrm{BGPNet}(A_{backward})]) \tag{10}$$

where $d_{fwd}$ and $d_{bwd}$ denote the feature dimensions of the forward and backward propagation representations, respectively. $A_{forward}$ and $A_{backward}$ are the forward and backward transition matrices obtained through row normalization, respectively [8]. FGPNet and BGPNet are graph mapping networks composed of two-layer MLPs, which map the adjacency matrix to the node feature space, as shown in Figure 5a,b. Similarly, the multi-layer bidirectional diffusion features can be represented as two sets: $H_{forward} = \{H_{forward}^{(1)}, H_{forward}^{(2)}, \dots, H_{forward}^{(S)}\}$ and $H_{backward} = \{H_{backward}^{(1)}, H_{backward}^{(2)}, \dots, H_{backward}^{(S)}\}$.
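The construction in Equations (9) and (10) can be sketched as below. Two points are assumptions rather than details confirmed by the text: the backward matrix is taken as the row-normalized transpose of the adjacency (the usual diffusion-convolution convention), and the graph mapping network treats each row of the transition matrix as that node's input vector. A single linear layer stands in for FCs, and all sizes are hypothetical.

```python
import numpy as np

def transition_matrices(A):
    """Row-normalise A (forward) and A.T (backward); all-zero rows stay zero."""
    A = np.asarray(A, dtype=float)

    def row_normalize(M):
        deg = M.sum(axis=1, keepdims=True)
        return np.divide(M, deg, out=np.zeros_like(M), where=deg > 0)

    return row_normalize(A), row_normalize(A.T)

def mlp2(x, W1, b1, W2, b2):
    """Two-layer MLP with ReLU, standing in for FGPNet / BGPNet."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def propagation_features(H_F_s, A_norm, gp_params, fc_params):
    """Concatenate temporal features with mapped graph rows, then project."""
    G = mlp2(A_norm, *gp_params)                    # (N, d_g) node-wise graph features
    W, b = fc_params
    return np.concatenate([H_F_s, G], axis=-1) @ W + b

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])                           # toy directed road graph
A_fwd, A_bwd = transition_matrices(A)

N, d_mst, d_g, d_fwd = 3, 4, 5, 6                   # hypothetical sizes
H_F_s = rng.normal(size=(N, d_mst))
gp_params = (rng.normal(size=(N, 8)), np.zeros(8), rng.normal(size=(8, d_g)), np.zeros(d_g))
fc_params = (rng.normal(size=(d_mst + d_g, d_fwd)), np.zeros(d_fwd))
H_forward_s = propagation_features(H_F_s, A_fwd, gp_params, fc_params)
```

Rows of `A_fwd` describe where flow leaving a sensor goes, and rows of `A_bwd` describe where flow arriving at a sensor comes from. This is why the two branches can carry different propagation information on a directed road graph.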

After the aforementioned spatio-temporal modeling, we treat $H_{time}$, $H_{dspm}$, $H_{forward}$, and $H_{backward}$ equally as features for the traffic forecasting task, and design a simple yet effective spatio-temporal fusion model to integrate these features. Inspired by the Feature Pyramid Network (FPN) [37] from the image detection domain, which has been widely adopted in YOLO variants [38,39], we design the Cross-Scale Hierarchical Anchoring (CSHA) strategy to capture spatio-temporal dependencies at different granularities in traffic flow data.

This strategy employs a coarse-to-fine recursive fusion mechanism, using the intermediate latent representations from coarse-grained scales as anchors, which are progressively propagated to fine-grained scales, achieving hierarchical aggregation of multi-scale information. Specifically, for each finer-grained scale $s < S$, in addition to fusing multi-source features at the current scale, we also introduce the latent representation $H^{(s+1)}$ from the previously processed, coarser scale as anchor information to guide feature learning at the current scale. Here, $H^{(s+1)}$ is not an output-space traffic-flow prediction supervised by an auxiliary loss; rather, it is a latent anchor representation generated by the fusion block at the coarser temporal scale. The final traffic-flow prediction is produced only after all scale-specific representations are concatenated and passed through the decoder. This top-down information propagation mechanism enables the model to preserve global spatio-temporal patterns while progressively capturing finer-grained local dynamic characteristics.

Given $F^{(s)} \in \mathbb{R}^{N \times d_{fusion}}$ as input, the feature fusion at the s-th layer can be formulated as

$$F^{(s)} = [H_{time}^{(s)}, H_{forward}^{(s)}, H_{backward}^{(s)}, H_{dspm}^{(s)}] \tag{11}$$

$$H^{(s)} = \begin{cases} \mathrm{FCs}^{(s)}(F^{(s)}), & \text{if } s = S \\ \mathrm{FCs}^{(s)}([F^{(s)}, H^{(s+1)}]), & \text{otherwise} \end{cases} \tag{12}$$

where

$$d_{fusion} = \begin{cases} d_{temp} + d_{fwd} + d_{bwd} + d_{hid}, & \text{if } s = S \\ d_{temp} + d_{fwd} + d_{bwd} + d_{hid} + d_{fusion}, & \text{otherwise} \end{cases}$$

represents the feature dimension of the fusion output at each layer.

The above design achieves cross-scale information flow, enabling coarse-scale global patterns to effectively guide fine-scale local predictions. Finally, we learn the optimal fusion weights for hierarchical features through stacked MLPs, as shown in Figure 3c, to achieve adaptive fusion. The fused features are then fed into a regression layer composed of fully connected (FC) layers to generate the final traffic-flow prediction $Y \in \mathbb{R}^{P' \times N \times C_{o}}$:

$$Y = \mathrm{FC}(\mathrm{MLPs}(\mathrm{Concat}(H^{(1)}, \dots, H^{(S)}))) \tag{13}$$

where $C_{o} = 1$ in our experiments because the model predicts traffic flow only. $Y$ captures both long-term spatio-temporal patterns and short-term abrupt changes, which enables effective modeling of stable trends and dynamic variations.

STAFM employs the aforementioned CSHA strategy, whose complete algorithm is presented in Algorithm 1. First, the algorithm initializes the multi-scale temporal features $H_{F}$ via Equation (3), along with the temporal embeddings $H_{time}$ via TE, forward propagation features $H_{forward}$ via Equation (9), backward propagation features $H_{backward}$ via Equation (10), and adaptive spatial features $H_{dspm}$ via Equation (8). Then, the algorithm performs coarse-to-fine hierarchical fusion through iterative processing from scale $s = S$ to $s = 1$. At the coarsest scale ($s = S$), multi-source features are directly fused through Equations (11) and (12) to obtain the initial latent anchor representation $H^{(S)}$. For finer scales ($s < S$), the representation from the previous coarser scale, $H^{(s+1)}$, is incorporated as anchor information to progressively refine the latent representations. Finally, all scale-specific representations are concatenated and passed through MLPs and FC layers via Equation (13) to generate the final forecast $Y$.

Algorithm 1 Cross-Scale Hierarchical Anchoring (CSHA)

Require: Multi-scale temporal features $H_{F} = \{H_{F}^{(1)}, \dots, H_{F}^{(S)}\}$, temporal embeddings $H_{time} = \{H_{time}^{(1)}, \dots, H_{time}^{(S)}\}$, forward adjacency matrix $A_{forward}$, backward adjacency matrix $A_{backward}$
Ensure: Final traffic-flow prediction $Y \in \mathbb{R}^{P' \times N \times C_{o}}$

1: Initialize: Compute forward propagation features $H_{forward}^{(s)}$ via Equation (9) for $s = 1, \dots, S$
2: Initialize: Compute backward propagation features $H_{backward}^{(s)}$ via Equation (10) for $s = 1, \dots, S$
3: Initialize: Compute adaptive spatial features $H_{dspm}^{(s)}$ via Equation (8) for $s = 1, \dots, S$
4: Initialize: $H^{(S + 1)} \leftarrow \text{None}$ {No anchor for the coarsest scale}
5: for $s = S$ down to 1 do
6:     Concatenate multi-source features: $F^{(s)} \leftarrow [H_{time}^{(s)}, H_{forward}^{(s)}, H_{backward}^{(s)}, H_{dspm}^{(s)}]$ via Equation (11)
7:     if $s = S$ then
8:         $H^{(s)} \leftarrow \mathrm{FCs}^{(s)} (F^{(s)})$ via Equation (12)
9:     else
10:        $H^{(s)} \leftarrow \mathrm{FCs}^{(s)} ([F^{(s)}, H^{(s + 1)}])$ via Equation (12)
11:    end if
12: end for
13: Concatenate all scale predictions: $H_{all} \leftarrow \mathrm{Concat} (H^{(1)}, H^{(2)}, \dots, H^{(S)})$
14: Generate final prediction: $Y \leftarrow \mathrm{FC} (\mathrm{MLPs} (H_{all}))$ via Equation (13)
15: return $Y$
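
The control flow of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: feature tensors are reduced to flat lists of floats, `fuse` is a hypothetical stand-in for the learned $\mathrm{FCs}^{(s)}$ layers of Equation (12), and the final MLPs and FC head of Equation (13) is omitted. The names `csha_forward`, `mean_fuse`, and `toy_sources` are introduced here for illustration only.

```python
from typing import Callable, Dict, List, Optional, Sequence

Vector = List[float]


def csha_forward(
    sources: Dict[int, Sequence[Vector]],
    fuse: Callable[[int, Vector], Vector],
) -> Vector:
    """Coarse-to-fine anchoring over scales S, S-1, ..., 1.

    sources[s] holds the per-scale feature vectors (time, forward,
    backward, dspm) that Algorithm 1 concatenates into F^(s).
    fuse(s, x) is a hypothetical stand-in for the learned FCs^(s).
    """
    num_scales = max(sources)
    latent: Dict[int, Vector] = {}
    anchor: Optional[Vector] = None  # no anchor at the coarsest scale

    for s in range(num_scales, 0, -1):
        # Line 6: concatenate the multi-source features of scale s.
        fused_input: Vector = [v for feat in sources[s] for v in feat]
        # Lines 7-11: append the coarser representation when it exists.
        if anchor is not None:
            fused_input = fused_input + anchor
        latent[s] = fuse(s, fused_input)
        anchor = latent[s]

    # Line 13: concatenate H^(1), ..., H^(S) for the prediction head.
    return [v for s in range(1, num_scales + 1) for v in latent[s]]


def mean_fuse(s: int, x: Vector) -> Vector:
    """Toy fusion that collapses the input to its mean (illustration only)."""
    return [sum(x) / len(x)]


# Toy inputs with S = 3 scales and four one-dimensional sources per scale.
toy_sources = {
    1: [[1.0], [1.0], [1.0], [1.0]],
    2: [[2.0], [2.0], [2.0], [2.0]],
    3: [[4.0], [4.0], [4.0], [4.0]],
}
h_all = csha_forward(toy_sources, mean_fuse)
```

With the toy mean fusion, the coarsest scale is computed from its own features alone, while each finer scale is pulled toward the representation of the scale above it, which is the anchoring effect the algorithm describes.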

5. Experiment Implementation

In this section, we conduct extensive experiments on four real-world datasets to answer five research questions.

RQ1: How does the proposed ST-MAFNet perform compared to baselines?
RQ2: How do different components affect the performance of ST-MAFNet?
RQ3: How does the efficiency of ST-MAFNet compare to the baselines?
RQ4: How do hyperparameters affect the performance of ST-MAFNet?
RQ5: How robust is ST-MAFNet?

5.1. Experimental Settings

5.1.1. Datasets

We conducted experiments on four widely used public large-scale datasets, each containing tens of thousands of time steps and hundreds of sensors. PEMS03, PEMS04, PEMS07, and PEMS08 were released by Song et al. [40]. These four datasets originate from four distinct regions in California, with a uniform sampling rate of 5 min. All data are collected from the Caltrans Performance Measurement System (PeMS) [41], and the detailed statistics are presented in Table 2.

We follow the standard preprocessing protocol widely adopted in prior traffic forecasting studies. Specifically, the raw traffic observations are organized as multivariate time series over sensor graphs, where each node corresponds to a traffic sensor and each timestamp records the observed traffic state. These datasets differ substantially in network scale, temporal span, and spatial sparsity, making them suitable benchmarks for evaluating the generalization ability of spatio-temporal forecasting models. By conducting experiments on all four datasets, we can more reliably assess the effectiveness of ST-MAFNet under diverse real-world traffic conditions.

5.1.2. Implementation Details

We partitioned each dataset into training, validation, and test sets chronologically using a 6:2:2 ratio. The raw traffic-flow observations are normalized using statistics computed from the training set, and the same scaler is applied to the validation and test sets to avoid information leakage. Missing values are treated as null values in the masked loss function, with the null value set to 0.0. Traffic flow is used as the prediction target, whereas time of day and day of week serve only as auxiliary input features. All experiments use the past 12 consecutive time steps to predict the next 12 time steps. Unless otherwise specified, the random seed is set to 1, CuDNN deterministic mode is enabled, and each reported result is obtained from the checkpoint with the best validation MAE.
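
The chronological 6:2:2 split and training-set normalization described above can be sketched as follows. The text does not name the scaler, so z-score standardization is an assumption here, and the function names and the toy series are hypothetical; the point is only that the statistics come from the training split and are reused unchanged for validation and test.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def chronological_split(
    series: Sequence[float], ratios: Tuple[int, int, int] = (6, 2, 2)
) -> Tuple[List[float], List[float], List[float]]:
    """Split a series into train/val/test in time order, without shuffling."""
    total = sum(ratios)
    n = len(series)
    train_end = n * ratios[0] // total
    val_end = n * (ratios[0] + ratios[1]) // total
    return (
        list(series[:train_end]),
        list(series[train_end:val_end]),
        list(series[val_end:]),
    )


def fit_scaler(train: Sequence[float]) -> Tuple[float, float]:
    """Return mean and standard deviation of the training split only."""
    mean = fmean(train)
    std = pstdev(train)
    return mean, (std if std > 0 else 1.0)


def transform(values: Sequence[float], mean: float, std: float) -> List[float]:
    """Apply the training-set statistics to any split (assumed z-score)."""
    return [(v - mean) / std for v in values]


flow = [float(v) for v in range(10)]  # toy series, not PeMS data
train, val, test = chronological_split(flow)
mean, std = fit_scaler(train)
val_scaled = transform(val, mean, std)
```

Because the validation and test values never enter `fit_scaler`, no information from later time steps leaks into the normalization.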

All experiments were conducted on a machine equipped with an NVIDIA GeForce RTX 3090 GPU. We implemented ST-MAFNet on OpenEuler 24.03 with PyTorch 2.4.1 and Python 3.8.20. We use the Adam optimizer [42] with a batch size of 32, weight decay of $1.0 \times 10^{-5}$, gradient clipping with a maximum norm of 3.0, and a MultiStepLR scheduler with milestones at epochs 1, 18, 36, 54, and 72 and a decay factor of 0.5. Table 3 summarizes the main implementation details.
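
To make the learning-rate schedule concrete, the sketch below evaluates the multi-step decay rule in closed form: the rate is multiplied by the decay factor once for every milestone already reached. The milestones and decay factor are those reported above; the initial learning rate is not stated in this section, so `BASE_LR = 1.0` is a placeholder that turns the output into a multiplier, and `lr_at_epoch` is a hypothetical helper name.

```python
from bisect import bisect_right
from typing import Sequence

MILESTONES = (1, 18, 36, 54, 72)  # epochs reported in the text
GAMMA = 0.5  # decay factor reported in the text


def lr_at_epoch(
    base_lr: float,
    epoch: int,
    milestones: Sequence[int] = MILESTONES,
    gamma: float = GAMMA,
) -> float:
    """Learning rate after `epoch` scheduler steps under multi-step decay."""
    # bisect_right counts the milestones that are <= epoch.
    return base_lr * gamma ** bisect_right(milestones, epoch)


BASE_LR = 1.0  # placeholder: the initial learning rate is not given here
schedule = {e: lr_at_epoch(BASE_LR, e) for e in (0, 1, 17, 18, 36, 54, 72)}
```

Under these settings the rate is halved after the first epoch and again at epochs 18, 36, 54, and 72, so the final multiplier is $0.5^{5} = 0.03125$ of the initial value.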

5.1.3. Performance Evaluation

To evaluate the performance of ST-MAFNet against baseline methods, we employ three widely used evaluation metrics: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE), defined as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} \left| y_{i} - \hat{y}_{i} \right| \tag{14}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2}} \tag{15}$$

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i = 1}^{n} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right| \tag{16}$$

where $\hat{y}_{i}$ denotes the predicted value, $y_{i}$ denotes the ground truth, and $n$ is the number of evaluated samples.
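
A minimal sketch of Equations (14)–(16) in plain Python is given below. Skipping entries whose ground truth equals the null value 0.0 mirrors the masked loss described in Section 5.1.2; applying the same mask at evaluation time is an assumption made here, and it also keeps the MAPE denominator non-zero. The function name and the toy values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def masked_metrics(
    y_true: Sequence[float], y_pred: Sequence[float], null_value: float = 0.0
) -> Tuple[float, float, float]:
    """Return (MAE, RMSE, MAPE in percent), skipping null ground truth."""
    pairs = [(y, p) for y, p in zip(y_true, y_pred) if y != null_value]
    if not pairs:
        raise ValueError("no valid observations to evaluate")
    n = len(pairs)
    mae = sum(abs(y - p) for y, p in pairs) / n
    rmse = math.sqrt(sum((y - p) ** 2 for y, p in pairs) / n)
    mape = 100.0 * sum(abs((y - p) / y) for y, p in pairs) / n
    return mae, rmse, mape


# Toy values; the third entry is a missing observation and is ignored.
mae, rmse, mape = masked_metrics(
    [100.0, 200.0, 0.0, 50.0], [110.0, 190.0, 5.0, 50.0]
)
```

In this toy case the three valid pairs give an MAE of about 6.67, an RMSE of about 8.16, and a MAPE of 5%, which shows how RMSE weights the larger errors more heavily than MAE.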

5.2. Experimental Results

5.2.1. Performance Comparison (RQ1)

As shown in Table 4, ST-MAFNet performs strongly across all four datasets. To ensure a fair comparison, the baselines use the default settings outlined in their original papers, consistent with established practice in the literature. The “Rel. change” row reports the relative error change of ST-MAFNet against the best baseline for each metric, computed as $(E_{best} - E_{ours}) / E_{best} \times 100\%$, where $E_{best}$ denotes the lowest error among baseline methods and $E_{ours}$ denotes the error of ST-MAFNet. Thus, positive values indicate lower errors achieved by ST-MAFNet, whereas negative values indicate higher errors than the best baseline. In terms of MAE, ST-MAFNet improves upon the previous best results by 2.95% on PEMS03 (14.59 to 14.16), 1.43% on PEMS04 (18.14 to 17.88), 1.25% on PEMS07 (19.14 to 18.90), and 0.37% on PEMS08 (13.46 to 13.41). For MAPE on PEMS08, ST-MAFNet reduces the error from 8.88% to 8.63%, corresponding to a relative improvement of 2.82%.
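
The relative-change figures can be reproduced directly from the error pairs quoted in this paragraph. The short sketch below applies the formula to those values; the function and variable names are introduced here for illustration.

```python
def relative_change(e_best: float, e_ours: float) -> float:
    """Relative error change in percent; positive means ST-MAFNet is better."""
    return (e_best - e_ours) / e_best * 100.0


# (best baseline MAE, ST-MAFNet MAE) pairs quoted in the text
mae_pairs = {
    "PEMS03": (14.59, 14.16),
    "PEMS04": (18.14, 17.88),
    "PEMS07": (19.14, 18.90),
    "PEMS08": (13.46, 13.41),
}
mae_changes = {
    name: round(relative_change(best, ours), 2)
    for name, (best, ours) in mae_pairs.items()
}
# MAPE on PEMS08, in percentage points as quoted in the text
mape_change_pems08 = round(relative_change(8.88, 8.63), 2)
```

Rounded to two decimals, the computed values match the reported 2.95%, 1.43%, 1.25%, and 0.37% for MAE and 2.82% for MAPE on PEMS08.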

Traditional methods (e.g., HA and VAR [43]) perform poorly due to their idealized assumptions about data. LSTM [44] outperforms these classical methods in modeling nonlinearity but neglects critical spatial dependencies. Two classical spatio-temporal graph convolutional networks, DCRNN [8] and GWNet [9], effectively integrate GCNs with sequential models and maintain competitive forecasting performance. Building upon this foundation, AGCRN [33], MTGNN [34], and GTS [45] further improve performance by designing more sophisticated graph architectures. STNorm [46] applies spatio-temporal normalization to traffic scenarios, extracting high-frequency and low-frequency components from sequences. STGODE [47] models spatio-temporal dynamic dependencies through tensorized ordinary differential equations, addressing the limitation of separate spatio-temporal dependency modeling in traditional methods. DLinear [25] challenges the notion that complex models always outperform simple linear ones, but does not support spatial correlation modeling. STID [48] effectively captures unique spatio-temporal patterns through spatio-temporal identifier encoding. PDFormer [23] models traffic propagation delays, capturing complex dependencies in spatio-temporal data. STAEformer [36] introduces spatio-temporal adaptive embeddings combined with spatio-temporal encoders to capture traffic variation patterns across space and time. STWave [49] incorporates wavelet transforms into graph attention networks, modeling dynamic spatial correlations. HimNet [50] learns spatio-temporal heterogeneity through heterogeneous information meta-parameters.

Although ST-MAFNet achieves the best MAE on all four datasets, it does not dominate every metric. For example, on PEMS04, its MAPE of 12.14% is slightly higher than those of STAEformer (11.98%), PDFormer (12.00%), and HimNet (12.00%). On PEMS08, its RMSE of 23.26 is slightly higher than that of HimNet (23.22). These results indicate that ST-MAFNet is most advantageous for MAE-oriented and short-term traffic forecasting, rather than universally superior across all metrics.

To further evaluate the performance of ST-MAFNet across different prediction horizons, Table 5 extends the baseline comparison by introducing additional models including GMAN [51], MegaCRN [52], StemGNN [35], and WaveNet [53] on top of the models presented in Table 4. As shown in Table 5, ST-MAFNet demonstrates strong performance at short-term prediction horizons but exhibits relatively weaker performance at longer horizons across all three datasets (PEMS03, PEMS04, and PEMS08). This performance pattern aligns with the design philosophy of our CSHA module, which leverages multi-scale long-term traffic patterns as anchors to enhance short-term prediction accuracy. The degradation at longer horizons is mainly associated with the increasing uncertainty of future traffic states, which weakens the reliability of short-term anchors. Under abrupt traffic changes, coarse-scale anchors may also over-constrain fine-grained variations rather than adapt to rapid local fluctuations. These observations suggest that the current temporal receptive field, which is primarily optimized for the 12-step forecasting setting, may need to be extended with stronger long-range dependency modeling or external factors to achieve more stable long-horizon prediction.

5.2.2. Ablation Study (RQ2)

To evaluate the effectiveness of each module in ST-MAFNet, we conduct ablation experiments on three datasets (PEMS03, PEMS04, and PEMS08) using ST-MAFNet and its variants, as follows:

w/o $E_{node}$: Remove adaptive node embeddings to verify the positive effect of node heterogeneity on spatial dependencies.
w/o $H_{dspm}$: Remove the DSPM module to examine whether the synergy between node heterogeneity and dynamic spatial dependencies improves prediction performance.
w/o CSHA: Remove the CSHA strategy to assess whether cross-scale hierarchical anchoring achieves the expected impact.

As shown in Figure 6, the designed Dual Spatial Perception Module (DSPM) and Cross-Scale Hierarchical Anchoring (CSHA) strategy significantly improve overall performance. DSPM captures node heterogeneity and dynamic spatial correlations, mitigating the inductive bias of static graph structures and adaptively capturing time-varying spatial dependencies. Meanwhile, the CSHA mechanism employs a coarse-to-fine information flow strategy, where coarse-scale layers extract global patterns and serve as anchoring information for fine-scale layers, significantly enhancing ST-MAFNet’s performance on short-term traffic forecasting tasks.

Overall, these components leverage the complementary advantages of temporal and spatial features, enabling the model to demonstrate competitive performance compared to existing methods. Notably, CSHA yields greater performance gains compared to other modules, confirming the effectiveness of anchoring short-term predictions with long-term dependencies. Compared with the w/o CSHA variant, the full ST-MAFNet reduces average MAE by 39.95%, 42.45%, and 28.82% on PEMS03, PEMS04, and PEMS08, respectively. The corresponding MAPE reductions are 68.51%, 24.74%, and 37.64%, and the RMSE reductions are 31.32%, 33.59%, and 18.67%. These results confirm that the coarse-to-fine anchoring mechanism is more effective than removing anchor-refinement interactions from multi-scale fusion.

5.2.3. Efficiency Analysis (RQ3)

We compare the efficiency of ST-MAFNet with that of the baseline methods by examining the average training time per epoch alongside the corresponding prediction performance. The batch size is uniformly set to 32. The results are shown in Figure 7, where “Train” denotes the average training time per epoch (in seconds) and the prediction performance is measured by MAE, RMSE, and MAPE.

5.2.4. Parameter Study (RQ4)

Figure 8 presents the results under different parameter settings. For the parameter $S$, which represents the number of latent multi-scale features, performance varies across datasets and metrics. We search $S$ in $\{2, 3, \dots, 12\}$ and select the final value using only validation MAE; the test set is not used in this process. In the main experiments, we set $S = 4$ for all datasets to keep the architecture consistent and reduce excessive dataset-specific tuning. On both the PEMS04 and PEMS08 datasets, the model achieves its lowest MAE, the primary evaluation metric in traffic forecasting, at $S = 4$. For PEMS04, while $S = 12$ achieves slightly better MAPE (12.02% vs. 12.14%) and RMSE (29.68 vs. 29.76), $S = 4$ provides the best balance with the lowest MAE (17.88). For PEMS08, $S = 4$ simultaneously achieves the best MAE (13.41) and MAPE (8.63%), while $S = 2$ shows marginally better RMSE (23.23 vs. 23.26). These observations indicate that $S = 4$ strikes a good balance between model capacity and generalization: too few temporal patterns are insufficient to encode multi-scale information, while too many may introduce redundancy and lead to overfitting.

5.2.5. Robustness Testing (RQ5)

As shown in Figure 9, the prediction error distributions of each model are visualized through box plots on both the PEMS04 and PEMS08 datasets, where the prediction error is computed as the average MAE across 12 forecasting horizons. Notably, ST-MAFNet exhibits a concentrated error distribution centered around lower values, indicating that it achieves lower average errors and demonstrates more robust performance. This reduced error variability translates to more consistent predictions, enhancing ST-MAFNet’s reliability in real-world traffic forecasting tasks.

Figure 10 presents the Taylor diagram, providing an overview of model performance. The angular position represents the correlation between model predictions and observed data, while the radial distance indicates the standard deviation of model outputs. It can be observed that, on both the PEMS04 and PEMS08 datasets, ST-MAFNet achieves the highest correlation coefficient and a standard deviation closest to the observed data. This indicates that the model accurately fits and predicts the data, demonstrating strong generalization capability.
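
The two quantities plotted in a Taylor diagram can be computed as in the sketch below, which returns the Pearson correlation, the standard deviation of the predictions (the radial coordinate), and the polar angle $\arccos(r)$. This is an illustrative computation on toy values, not the procedure used to produce Figure 10; the function name is hypothetical and population standard deviations are assumed.

```python
import math
from statistics import fmean, pstdev
from typing import Sequence, Tuple


def taylor_statistics(
    observed: Sequence[float], predicted: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (correlation, std of predictions, polar angle in degrees)."""
    mean_obs, mean_pred = fmean(observed), fmean(predicted)
    std_obs, std_pred = pstdev(observed), pstdev(predicted)
    cov = fmean(
        [(o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted)]
    )
    corr = cov / (std_obs * std_pred)
    corr = max(-1.0, min(1.0, corr))  # guard against rounding outside [-1, 1]
    return corr, std_pred, math.degrees(math.acos(corr))


obs = [1.0, 2.0, 3.0, 4.0]
pred = [2.0, 4.0, 6.0, 8.0]  # toy values: perfectly correlated, twice the spread
corr, pred_std, angle = taylor_statistics(obs, pred)
```

The toy prediction lies on the zero-angle axis because its correlation is 1, but at twice the radial distance of the observations, which illustrates why both coordinates are needed: a model is closest to the reference point only when its correlation is high and its standard deviation matches that of the observed data.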

5.2.6. Visualization Analysis

To give a more intuitive picture of ST-MAFNet’s behavior, we visualize the model’s predictions against the ground truth. Due to space constraints, we use 20 and 21 August 2016 from the PEMS08 dataset as input and randomly select four sensors. The traffic flow forecasting results for sensors 17, 28, 35, and 119 are shown in Figure 11, where sensors 17 and 28 exhibit relatively stable patterns, while sensors 35 and 119 show significant variations, indicating the presence of substantial noise. Nevertheless, our model achieves satisfactory performance in all cases, which we attribute to its effective capture of spatio-temporal dependencies.

We further employ t-SNE to visualize the spatial features on the PEMS08 and PEMS04 datasets. The results are presented in Figure 12. It can be clearly observed that nodes in the predefined graph are scattered and disorganized, with each node appearing as an isolated individual. However, with our designed DSPM, distinct clustering patterns emerge among nodes. This demonstrates that our designed DSPM forms a complementary relationship with the predefined graph, effectively capturing the underlying spatial structure that is not explicitly represented in the predefined adjacency matrix.

6. Conclusions

This paper proposes ST-MAFNet, a novel spatio-temporal multi-scale adaptive fusion network for traffic flow forecasting. The model employs a multi-scale temporal encoder to decompose inputs into temporal patterns for the Cross-Scale Hierarchical Anchoring (CSHA) strategy. The Dual Spatial Perception Module (DSPM) models node heterogeneity and dynamic spatial correlations to mitigate static graph bias, while the Spatio-Temporal Adaptive Fusion Module (STAFM) integrates temporal and spatial dependencies. Extensive experiments on four benchmark datasets show that ST-MAFNet is particularly effective for short-term forecasting and achieves strong MAE performance compared with existing methods.

Several limitations remain. ST-MAFNet is less competitive on some long-horizon metrics, particularly on PEMS04 and PEMS08 at the 60-min horizon, where STAEformer and HimNet perform better on several metrics. This indicates that the current CSHA design is more effective for stabilizing short-term predictions than for capturing long-range uncertainty. Dynamic graph learning increases modeling flexibility, but it also adds computational cost through adaptive node embeddings and spatial attention. Moreover, the learned dynamic graph may deviate from the physical road topology under stable traffic conditions. This deviation can be evaluated by comparing learned adjacency weights with physical adjacency or distance-based graphs using top-k neighbor overlap, correlation, or graph consistency metrics. Future work will explore adaptive weighting between physical and learned graphs, long-horizon-oriented anchor weighting, and the integration of external factors such as incidents, weather, and events to improve robustness in long-term traffic forecasting.

Author Contributions

Conceptualization, F.G. and X.W. (Xunhuang Wang); methodology, X.W. (Xunhuang Wang) and F.Z.; software, X.W. (Xunhuang Wang); validation, X.W. (Xunhuang Wang), L.Z. and T.F.; formal analysis, X.W. (Xunhuang Wang) and F.Z.; investigation, X.W. (Xunhuang Wang), X.W. (Xueming Wu) and H.J.; resources, F.G. and F.Z.; data curation, X.W. (Xunhuang Wang) and H.J.; writing—original draft preparation, F.G. and X.W. (Xunhuang Wang); writing—review and editing, F.G., X.W. (Xunhuang Wang), F.Z., L.Z., T.F., X.W. (Xueming Wu), H.J. and J.W.; visualization, X.W. (Xunhuang Wang) and T.F.; supervision, F.G. and J.W.; project administration, F.G.; funding acquisition, F.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Scientific Research Projects of Fujian University of Technology under Grant GY-Z24043 and by the Renewable Energy Technology Research Institute of Fujian University of Technology under Grant PT4300101.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The highway data used in this study are publicly available from the Caltrans Performance Measurement System (PeMS) data source at https://github.com/RDXiaoHuang/ST-MAFNet/tree/master/datasets (accessed on 5 June 2026).

Acknowledgments

The authors thank the anonymous reviewers and editors for their constructive comments, which improved the clarity and quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Chen, W.; Chen, L.; Xie, Y.; Cao, W.; Gao, Y.; Feng, X. Multi-range Attentive Bicomponent Graph Convolutional Network for Traffic Forecasting. Proc. AAAI Conf. Artif. Intell. 2020, 34, 3529–3536.
[2] Chen, J.; Wang, Q.; Cheng, H.H.; Peng, W.; Xu, W. A Review of Vision-Based Traffic Semantic Understanding in ITSs. IEEE Trans. Intell. Transp. Syst. 2022, 23, 19954–19979.
[3] Jin, G.; Liang, Y.; Fang, Y.; Shao, Z.; Huang, J.; Zhang, J.; Zheng, Y. Spatio-Temporal Graph Neural Networks for Predictive Learning in Urban Computing: A Survey. IEEE Trans. Knowl. Data Eng. 2024, 36, 5388–5408.
[4] Davis, G.A.; Nihan, N.L. Using time-series designs to estimate changes in freeway level of service, despite missing data. Transp. Res. Part A Gen. 1984, 18, 431–438.
[5] Hamed, M.M.; Al-Masaeid, H.R.; Said, Z.M.B. Short-Term Prediction of Traffic Volume in Urban Arterials. J. Transp. Eng. 1995, 121, 249–254.
[6] Kumar, S.V.; Vanajakshi, L. Short-term traffic flow prediction using seasonal ARIMA model with limited input data. Eur. Transp. Res. Rev. 2015, 7, 21.
[7] Cui, Y.; Xie, J.; Zheng, K. Historical Inertia: A Neglected but Powerful Baseline for Long Sequence Time-series Forecasting. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management; Association for Computing Machinery: New York, NY, USA, 2021; pp. 2965–2969.
[8] Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In Proceedings of the International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 30 April–3 May 2018; Available online: https://openreview.net/forum?id=SJiHXGWAZ (accessed on 5 June 2026).
[9] Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph wavenet for deep spatial-temporal graph modeling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence; AAAI Press: Menlo Park, CA, USA, 2019; pp. 1907–1913.
[10] Shao, Z.; Zhang, Z.; Wei, W.; Wang, F.; Xu, Y.; Cao, X.; Jensen, C.S. Decoupled dynamic spatial-temporal graph neural network for traffic forecasting. Proc. VLDB Endow. 2022, 15, 2733–2746.
[11] Zhang, A. Dynamic graph convolutional networks with Temporal representation learning for traffic flow prediction. Sci. Rep. 2025, 15, 17270.
[12] Zhang, Q.; Chang, J.; Meng, G.; Xiang, S.; Pan, C. Spatio-Temporal Graph Structure Learning for Traffic Forecasting. Proc. AAAI Conf. Artif. Intell. 2020, 34, 1177–1185.
[13] Zhang, T.Y.; Wang, Y.; Wei, Z. STGAT: A Spatio-Temporal Graph Attention Network for Travel Demand Prediction; IEEE: New York, NY, USA, 2023; pp. 434–439.
[14] Zheng, C.; Fan, X.; Pan, S.; Jin, H.; Peng, Z.; Wu, Z.; Wang, C.; Yu, P.S. Spatio-Temporal Joint Graph Convolutional Networks for Traffic Forecasting. IEEE Trans. Knowl. Data Eng. 2024, 36, 372–385.
[15] Gao, H.; Jiang, R.; Dong, Z.; Deng, J.; Ma, Y.; Song, X. Spatial-temporal-decoupled masked pre-training for spatiotemporal forecasting. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), Jeju, Republic of Korea, 3–9 August 2024.
[16] Tang, J.; Wei, W.; Xia, L.; Huang, C. EasyST: A Simple Framework for Spatio-Temporal Prediction. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management; Association for Computing Machinery: New York, NY, USA, 2024; pp. 2220–2229.
[17] Cheng, S.; Lu, F.; Peng, P.; Wu, S. Short-term traffic forecasting: An adaptive ST-KNN model that considers spatial heterogeneity. Comput. Environ. Urban Syst. 2018, 71, 186–198.
[18] Valente, J.M.; Maldonado, S. SVR-FFS: A novel forward feature selection approach for high-frequency time series forecasting using support vector regression. Expert Syst. Appl. 2020, 160, 113729.
[19] Fu, J.; Zhou, W.; Chen, Z. Bayesian graph convolutional network for traffic prediction. Neurocomputing 2024, 582, 127507.
[20] Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A machine learning approach for precipitation nowcasting. In Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 1; MIT Press: Cambridge, MA, USA, 2015; pp. 802–810.
[21] Cao, L.; Wang, B.; Jiang, G.; Yu, Y.; Dong, J. Spatiotemporal-aware trend-seasonality decomposition network for traffic flow forecasting. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence; AAAI Press: Menlo Park, CA, USA, 2025.
[22] Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting; AAAI Press: Menlo Park, CA, USA, 2021; Volume 35, pp. 11106–11115.
[23] Jiang, J.; Han, C.; Zhao, W.X.; Wang, J. PDFormer: Propagation delay-aware dynamic long-range transformer for traffic flow prediction. Proc. AAAI Conf. Artif. Intell. 2023, 37, 4365–4373.
[24] Müller, M. Dynamic Time Warping. In Information Retrieval for Music and Motion; Springer: Berlin/Heidelberg, Germany, 2007; pp. 69–84.
[25] Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence; AAAI Press: Menlo Park, CA, USA, 2023.
[26] Shao, Z.; Wang, F.; Xu, Y.; Wei, W.; Yu, C.; Zhang, Z.; Yao, D.; Sun, T.; Jin, G.; Cao, X.; et al. Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis. IEEE Trans. Knowl. Data Eng. 2025, 37, 291–305.
[27] Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence; AAAI Press: Menlo Park, CA, USA, 2018; pp. 3634–3640.
[28] Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence; AAAI Press: Menlo Park, CA, USA, 2019.
[29] Lea, C.; Flynn, M.; Vidal, R.; Reiter, A.; Hager, G. Temporal convolutional networks for action segmentation and detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2017; pp. 1003–1012.
[30] Ismail Fawaz, H.; Lucas, B.; Forestier, G.; Pelletier, C.; Schmidt, D.F.; Weber, J.; Webb, G.I.; Idoumghar, L.; Muller, P.A.; Petitjean, F. InceptionTime: Finding AlexNet for time series classification. Data Min. Knowl. Discov. 2020, 34, 1936–1962.
[31] Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Proceedings of the 35th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2021.
[32] Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. arXiv 2023, arXiv:2310.06625.
[33] Bai, L.; Yao, L.; Li, C.; Wang, X.; Wang, C. Adaptive graph convolutional recurrent network for traffic forecasting. In Proceedings of the 34th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2020.
[34] Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2020; pp. 753–763.
[35] Cao, D.; Wang, Y.; Duan, J.; Zhang, C.; Zhu, X.; Huang, C.; Tong, Y.; Xu, B.; Bai, J.; Tong, J.; et al. Spectral temporal graph neural network for multivariate time-series forecasting. In Proceedings of the 34th International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2020.
[36] Liu, H.; Dong, Z.; Jiang, R.; Deng, J.; Deng, J.; Chen, Q.; Song, X. Spatio-Temporal Adaptive Embedding Makes Vanilla Transformer SOTA for Traffic Forecasting. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management; Association for Computing Machinery: New York, NY, USA, 2023; pp. 4125–4129.
[37] Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection; IEEE: New York, NY, USA, 2017.
[38] Zhang, M.; Rong, Q.; Jing, H. TTSDA-YOLO: A Two Training Stage Domain Adaptation Framework for Object Detection in Adverse Weather. IEEE Trans. Instrum. Meas. 2025, 74, 5000213.
[39] Rong, Q.; Jing, H.; Zhang, M. Scale Sensitivity Mamba Network for Object Detection in Remote Sensing Images. IEEE Sens. J. 2025, 25, 43339–43351.
[40] Song, C.; Lin, Y.; Guo, S.; Wan, H. Spatial-Temporal Synchronous Graph Convolutional Networks: A New Framework for Spatial-Temporal Network Data Forecasting. Proc. AAAI Conf. Artif. Intell. 2020, 34, 914–921.
[41] Chen, C.; Petty, K.; Skabardonis, A.; Varaiya, P.; Jia, Z. Freeway performance measurement system: Mining loop detector data. Transp. Res. Rec. 2001, 1748, 96–102.
[42] Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015.
[43] Hamilton, J.D. Time Series Analysis; Princeton University Press: Princeton, NJ, USA, 1994.
[44] Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2; MIT Press: Cambridge, MA, USA, 2014; pp. 3104–3112.
[45] Shang, C.; Chen, J.; Bi, J. Discrete Graph Structure Learning for Forecasting Multiple Time Series. arXiv 2021, arXiv:2101.06861.
[46] Deng, J.; Chen, X.; Jiang, R.; Song, X.; Tsang, I.W. ST-Norm: Spatial and Temporal Normalization for Multi-variate Time Series Forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2021; pp. 269–278.
[47] Fang, Z.; Long, Q.; Song, G.; Xie, K. Spatial-Temporal Graph ODE Networks for Traffic Flow Forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2021; pp. 364–373.
[48] Shao, Z.; Zhang, Z.; Wang, F.; Wei, W.; Xu, Y. Spatial-Temporal Identity: A Simple yet Effective Baseline for Multivariate Time Series Forecasting. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management; Association for Computing Machinery: New York, NY, USA, 2022; pp. 4454–4458.
[49] Fang, Y.; Qin, Y.; Luo, H.; Zhao, F.; Xu, B.; Zeng, L.; Wang, C. When Spatio-Temporal Meet Wavelets: Disentangled Traffic Forecasting via Efficient Spectral Graph Attention Networks. In Proceedings of the 2023 IEEE 39th International Conference on Data Engineering (ICDE); IEEE: New York, NY, USA, 2023; pp. 517–529.
[50] Dong, Z.; Jiang, R.; Gao, H.; Liu, H.; Deng, J.; Wen, Q.; Song, X. Heterogeneity-Informed Meta-Parameter Learning for Spatiotemporal Time Series Forecasting. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2024; pp. 631–641.
[51] Zheng, C.; Fan, X.; Wang, C.; Qi, J. GMAN: A Graph Multi-Attention Network for Traffic Prediction. Proc. AAAI Conf. Artif. Intell. 2020, 34, 1234–1241.
[52] Jiang, R.; Wang, Z.; Yong, J.; Jeph, P.; Chen, Q.; Kobayashi, Y.; Song, X.; Fukushima, S.; Suzumura, T. Spatio-Temporal Meta-Graph Learning for Traffic Forecasting. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence; AAAI Press: Menlo Park, CA, USA, 2023; Volume 37, pp. 8078–8086.
[53] van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499.

Figure 1. Multi-source spatio-temporal relationships.

Figure 2. Pattern anchoring mechanism.

Figure 3. The architecture of the proposed ST-MAFNet. TCNs: Multilayer temporal convolutional network. FCs: Multilayer linear mapping layers.

Figure 4. The architecture of the proposed DSPM.

Figure 5. A comparison of the proposed FGPNet and BGPNet.

Figure 6. Ablation study of ST-MAFNet on PEMS03, PEMS04 and PEMS08.

Figure 7. Efficiency comparison of different models on PEMS04 and PEMS08 datasets.

Figure 8. Parameter study on PEMS04 and PEMS08 datasets.

Figure 9. Prediction error distributions of different models on the PEMS04 and PEMS08 datasets. The prediction error for each model is computed as the average MAE across all 12 forecasting horizons.

Figure 10. Taylor diagram of different models on the PEMS04 and PEMS08 datasets.

Figure 11. Visualization of predictions on the PEMS08 dataset.

Figure 12. t-SNE visualization of spatial features on the PEMS04 and PEMS08 datasets.

Table 1. Comparison between representative methods and ST-MAFNet.

Existing Method	Main Limitation	Corresponding Module	Expected Benefit
Multi-scale temporal models	Extract features at multiple scales but often lack explicit cross-scale anchor-refinement interactions.	CSHA	Uses coarse-scale temporal patterns to stabilize and refine fine-scale predictions.
FPN-style hierarchical fusion	Transfers hierarchical features but is not designed for traffic-specific temporal anchoring or multi-source spatial dependencies.	CSHA and STAFM	Enables traffic-oriented coarse-to-fine prediction refinement across temporal scales.
STAEformer	Learns adaptive spatio-temporal embeddings but does not explicitly model cross-scale anchoring.	CSHA and DSPM	Combines node heterogeneity with anchor-guided multi-scale prediction.
PDFormer	Models propagation delays and long-range dependencies but does not focus on anchor-refinement interactions among temporal scales.	CSHA	Strengthens short-term prediction through cross-scale temporal constraints.
STWave	Uses wavelet-based decomposition and graph attention but does not explicitly integrate multi-source spatial views with anchor refinement.	DSPM and STAFM	Integrates adaptive, forward, backward, and temporal features within a unified fusion module.
HimNet	Captures heterogeneity through meta-parameters but does not explicitly exploit coarse-to-fine temporal anchoring.	CSHA and DSPM	Jointly models node heterogeneity, dynamic correlations, and hierarchical temporal refinement.

Table 2. Statistics of the datasets.

Dataset	Nodes	Frames	Degree	Time Range	Data Points
PEMS03	358	26,208	1.5	1 September 2018–30 November 2018	9.38 M
PEMS04	307	16,992	1.1	1 January 2018–28 February 2018	5.22 M
PEMS07	883	28,224	1.0	1 May 2017–6 August 2017	24.92 M
PEMS08	170	17,856	1.6	1 July 2016–31 August 2016	3.04 M
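
The columns of Table 2 can be cross-checked against each other. The sketch below assumes one frame every 5 minutes (288 frames per day), a sampling interval that Table 2 does not state, and treats the "Data Points" column as nodes × frames. The helper names are illustrative only.

```python
from datetime import date

# Assumption (not stated in Table 2): one frame every 5 minutes.
FRAMES_PER_DAY = 24 * 60 // 5  # 288

# (nodes, frames, first day, last day), copied from Table 2.
DATASETS = {
    "PEMS03": (358, 26208, date(2018, 9, 1), date(2018, 11, 30)),
    "PEMS04": (307, 16992, date(2018, 1, 1), date(2018, 2, 28)),
    "PEMS07": (883, 28224, date(2017, 5, 1), date(2017, 8, 6)),
    "PEMS08": (170, 17856, date(2016, 7, 1), date(2016, 8, 31)),
}


def expected_frames(start, end):
    """Frames implied by an inclusive date range at the assumed 5-minute interval."""
    days = (end - start).days + 1
    return days * FRAMES_PER_DAY


def data_points_millions(nodes, frames):
    """Nodes x frames, in millions, rounded to two decimals."""
    return round(nodes * frames / 1e6, 2)


for name, (nodes, frames, start, end) in DATASETS.items():
    matches = expected_frames(start, end) == frames
    print(name, matches, data_points_millions(nodes, frames))
```

Under these assumptions, the frame counts and data-point totals in Table 2 agree with the listed time ranges for all four datasets.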

Table 3. Implementation details of ST-MAFNet.

Item	Setting
Input/output length	12 historical steps/12 future steps
Data split	Chronological 6:2:2 split
Normalization	Training-set statistics applied to train/validation/test sets
Input/target features	Input: flow, time-of-day, day-of-week; target: flow only
Random seed	1
Repeated runs	One deterministic run for each setting
Model selection	Best validation MAE checkpoint
Optimizer	Adam
Batch size	32
Learning-rate schedule	MultiStepLR, milestones [1, 18, 36, 54, 72], gamma 0.5
Maximum epochs	300
Gradient clipping	Max norm 3.0
Hidden dimension	64
Adaptive embedding dimension	64
Adaptive attention heads	4
Adaptive attention layers	2
Dropout	0.2
Number of temporal scales	4
Search range of S	$\{2, 3, \dots, 12\}$, selected by validation MAE
Decoder layers/fusion layers	2/2
TCN configuration	Cascaded temporal convolution with kernel size $\lfloor P / S \rfloor + 1$
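
Two of the settings in Table 3 are easier to read as numbers. The sketch below evaluates the step-decay learning-rate factor implied by the listed milestones and gamma, and the TCN kernel size $\lfloor P / S \rfloor + 1$ over the search range of $S$. It rests on several assumptions that are not in Table 3: the base learning rate `BASE_LR` is a placeholder, the scheduler is taken to step once per epoch, and $P$ is taken to be the input length of 12 steps.

```python
from bisect import bisect_right

MILESTONES = [1, 18, 36, 54, 72]  # from Table 3
GAMMA = 0.5                       # from Table 3
BASE_LR = 1e-3                    # hypothetical: Table 3 gives no base learning rate
P = 12                            # assumed: P equals the 12-step input length


def lr_at_epoch(epoch, base_lr=BASE_LR):
    """Multi-step decay: multiply by GAMMA once for every milestone already reached."""
    return base_lr * GAMMA ** bisect_right(MILESTONES, epoch)


def tcn_kernel_size(s, p=P):
    """Kernel size floor(P / S) + 1 from the TCN configuration row."""
    return p // s + 1


for epoch in (0, 1, 18, 36, 54, 72, 299):
    print(f"epoch {epoch:3d}: lr = {lr_at_epoch(epoch):.6g}")

for s in range(2, 13):
    print(f"S = {s:2d}: kernel size = {tcn_kernel_size(s)}")
```

With these milestones the rate is halved five times by epoch 72 and then stays constant for the remaining epochs. For $P = 12$, every $S \geq 7$ gives the same kernel size of 2.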

Table 4. Performance comparison across different datasets (bold: best; underline: second-best).

Method	PEMS03 MAE	PEMS03 RMSE	PEMS03 MAPE	PEMS04 MAE	PEMS04 RMSE	PEMS04 MAPE	PEMS07 MAE	PEMS07 RMSE	PEMS07 MAPE	PEMS08 MAE	PEMS08 RMSE	PEMS08 MAPE
HA	32.62	49.89	30.60%	42.35	61.66	29.92%	49.03	71.18	22.75%	36.66	50.45	21.63%
VAR	17.48	29.40	18.27%	20.87	32.26	15.70%	44.85	62.53	23.30%	18.66	27.35	12.81%
LSTM	17.47	28.71	16.79%	23.62	37.01	16.08%	25.79	40.19	11.14%	18.23	28.75	11.99%
DCRNN	15.54	27.18	15.62%	19.63	31.26	13.59%	21.16	34.14	9.02%	15.22	24.17	10.21%
GWNet	14.59	25.24	15.52%	18.53	29.92	12.89%	20.47	33.47	8.61%	14.40	23.39	9.21%
AGCRN	15.24	26.65	15.89%	19.38	31.25	13.40%	20.57	34.40	8.74%	15.32	24.41	10.03%
MTGNN	14.85	25.23	14.55%	19.17	31.70	13.37%	20.89	34.06	9.00%	15.18	24.24	10.20%
GTS	15.41	26.15	15.39%	20.96	32.95	14.66%	22.15	35.10	9.38%	16.49	26.08	10.54%
STNorm	15.32	25.93	14.37%	18.96	30.98	12.69%	20.50	34.66	8.75%	15.41	24.77	9.76%
STGODE	16.50	27.84	16.69%	20.84	32.82	13.77%	22.29	37.54	10.14%	16.81	25.97	10.62%
DLinear	21.36	34.48	22.03%	27.93	43.84	19.14%	31.71	49.37	14.62%	22.42	35.41	14.68%
STID	15.33	27.40	16.40%	18.38	29.95	12.04%	19.61	32.79	8.30%	14.21	23.28	9.27%
PDFormer	14.94	25.39	15.82%	18.36	30.03	12.00%	19.97	32.95	8.55%	13.58	23.41	9.05%
STAEformer	15.35	27.55	15.18%	18.22	30.18	11.98%	19.14	32.60	8.01%	13.46	23.25	8.88%
STWave	15.18	26.87	15.81%	18.53	30.29	12.50%	19.65	33.13	8.56%	13.96	23.93	9.04%
HimNet	15.11	26.56	15.49%	18.14	29.88	12.00%	19.21	32.75	8.03%	13.57	23.22	8.98%
ST-MAFNet (ours)	14.16	25.08	14.04%	17.88	29.76	12.14%	18.90	32.50	7.97%	13.41	23.26	8.63%
Rel. change	+2.95%	+0.59%	+2.30%	+1.43%	+0.40%	−1.34%	+1.25%	+0.31%	+0.50%	+0.37%	−0.17%	+2.82%
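
The "Rel. change" row can be reproduced from the MAE columns of Table 4. The sketch below assumes the row is computed as (best baseline − ours) / best baseline × 100, so that a positive value means ST-MAFNet has the lower error. This sign convention is inferred from the table, not stated in it, and the function name is illustrative.

```python
# MAE values copied from Table 4, ordered PEMS03, PEMS04, PEMS07, PEMS08.
DATASETS = ["PEMS03", "PEMS04", "PEMS07", "PEMS08"]
BASELINE_MAE = {
    "HA": [32.62, 42.35, 49.03, 36.66],
    "VAR": [17.48, 20.87, 44.85, 18.66],
    "LSTM": [17.47, 23.62, 25.79, 18.23],
    "DCRNN": [15.54, 19.63, 21.16, 15.22],
    "GWNet": [14.59, 18.53, 20.47, 14.40],
    "AGCRN": [15.24, 19.38, 20.57, 15.32],
    "MTGNN": [14.85, 19.17, 20.89, 15.18],
    "GTS": [15.41, 20.96, 22.15, 16.49],
    "STNorm": [15.32, 18.96, 20.50, 15.41],
    "STGODE": [16.50, 20.84, 22.29, 16.81],
    "DLinear": [21.36, 27.93, 31.71, 22.42],
    "STID": [15.33, 18.38, 19.61, 14.21],
    "PDFormer": [14.94, 18.36, 19.97, 13.58],
    "STAEformer": [15.35, 18.22, 19.14, 13.46],
    "STWave": [15.18, 18.53, 19.65, 13.96],
    "HimNet": [15.11, 18.14, 19.21, 13.57],
}
OURS_MAE = [14.16, 17.88, 18.90, 13.41]


def relative_change(best_baseline, ours):
    """Percentage error reduction relative to the best baseline (positive = improvement)."""
    return round((best_baseline - ours) / best_baseline * 100, 2)


mae_changes = {}
for i, name in enumerate(DATASETS):
    best = min(values[i] for values in BASELINE_MAE.values())
    mae_changes[name] = relative_change(best, OURS_MAE[i])

print(mae_changes)
```

Under this convention the computed MAE changes match the first entry for each dataset in the "Rel. change" row (2.95%, 1.43%, 1.25%, and 0.37%).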

Table 5. Performance comparison across different horizons (bold: best; underline: second-best).

Dataset	Method	Horizon 15 MAE	Horizon 15 RMSE	Horizon 15 MAPE	Horizon 30 MAE	Horizon 30 RMSE	Horizon 30 MAPE	Horizon 45 MAE	Horizon 45 RMSE	Horizon 45 MAPE	Horizon 60 MAE	Horizon 60 RMSE	Horizon 60 MAPE
PEMS03	ST-MAFNet (ours)	13.12	23.20	13.47%	14.19	25.33	14.13%	15.02	26.74	14.68%	15.73	27.79	15.25%
	GWNet	13.47	23.02	14.08%	14.58	25.13	14.51%	15.53	26.60	15.29%	16.32	27.78	15.79%
	MegaCRN	13.57	24.27	14.92%	14.82	26.53	15.33%	15.77	27.93	16.14%	16.52	28.99	16.53%
	MTGNN	13.74	23.45	14.08%	14.97	26.54	14.93%	15.81	27.95	14.82%	16.71	29.13	15.63%
	STNorm	14.31	24.87	13.83%	15.39	26.94	14.28%	16.18	28.24	14.76%	16.70	28.96	15.34%
	STWave	14.27	24.72	13.98%	15.51	27.02	14.68%	16.21	28.21	16.03%	16.96	29.43	15.82%
	STID	13.88	24.63	14.78%	15.30	27.91	16.16%	16.41	31.12	17.29%	17.48	33.27	18.78%
	HimNet	13.73	24.00	14.58%	15.29	27.39	15.84%	16.50	29.16	16.96%	17.56	30.75	17.99%
	GTS	13.98	23.90	14.11%	15.34	26.09	15.34%	16.43	27.67	16.30%	17.39	29.02	17.37%
	GMAN	14.77	24.59	14.98%	15.48	25.81	15.51%	16.21	26.99	16.20%	16.99	28.20	17.07%
	STAEformer	14.06	24.69	14.30%	15.46	27.25	15.51%	16.56	29.06	16.45%	17.59	30.62	17.53%
	DCRNN	14.25	24.59	14.45%	15.54	27.18	15.42%	16.64	28.85	16.53%	17.56	30.15	17.27%
	STGCN	14.99	25.85	14.48%	16.07	27.96	15.14%	17.03	29.56	15.92%	18.13	31.14	16.98%
	AGCRN	14.95	25.64	14.39%	16.27	27.84	15.49%	17.27	29.50	16.52%	18.38	31.11	18.02%
	StemGNN	14.61	24.68	15.49%	16.38	27.71	16.54%	17.76	29.83	17.72%	19.04	31.68	19.07%
	LSTM	16.03	26.58	15.26%	19.22	31.52	18.20%	22.49	36.20	21.44%	26.03	41.27	25.25%
	WaveNet	16.14	26.91	15.30%	19.37	31.58	18.22%	22.73	36.64	21.56%	26.36	41.72	25.39%
PEMS04	ST-MAFNet (ours)	17.13	28.43	11.61%	17.94	29.90	12.15%	18.72	30.96	12.26%	18.99	31.54	13.00%
	STAEformer	18.14	29.99	11.92%	18.15	30.05	11.93%	18.72	31.20	12.26%	18.15	30.06	11.94%
	HimNet	18.16	29.87	12.00%	18.16	29.89	12.04%	18.75	31.01	12.42%	18.13	29.87	12.03%
	STID	18.44	30.03	12.48%	18.40	29.97	12.77%	18.98	30.96	12.92%	18.38	29.97	12.84%
	MegaCRN	18.71	30.38	12.73%	18.67	30.41	12.78%	19.50	31.77	13.25%	18.66	30.44	12.73%
	STNorm	18.90	31.13	12.38%	18.89	31.12	12.40%	19.48	32.32	12.72%	18.90	31.15	12.37%
	MTGNN	18.99	31.82	12.65%	18.94	31.79	12.62%	19.69	33.43	13.21%	18.93	31.78	12.75%
	GWNet	19.29	30.82	12.70%	19.06	30.63	12.54%	19.61	31.55	12.95%	18.87	30.41	12.57%
	GMAN	19.57	31.03	12.95%	18.98	30.73	12.90%	19.26	31.25	13.06%	18.97	30.56	12.98%
	STWave	19.54	30.99	12.92%	19.43	30.99	12.79%	19.93	31.93	13.17%	19.49	31.12	12.79%
	AGCRN	19.47	31.24	12.88%	19.57	31.40	13.01%	20.15	32.33	13.21%	19.61	31.47	12.94%
	DCRNN	20.03	31.78	12.93%	20.00	31.67	13.17%	20.64	32.78	13.52%	19.82	31.61	13.01%
	STGCN	20.17	32.02	13.39%	20.15	32.02	13.37%	20.96	33.30	13.79%	20.14	32.00	13.34%
	GTS	20.96	32.92	14.79%	20.92	32.88	14.87%	22.27	34.69	15.83%	20.89	32.84	14.70%
	StemGNN	21.05	33.24	14.01%	20.94	33.15	14.11%	22.39	35.13	15.19%	21.06	33.26	14.12%
	LSTM	25.75	39.86	17.54%	25.79	39.93	17.16%	28.95	43.83	19.72%	25.74	39.85	17.43%
	WaveNet	25.77	39.91	17.27%	25.77	39.87	17.45%	29.01	43.86	19.88%	25.75	39.90	17.33%
PEMS08	ST-MAFNet (ours)	12.51	21.32	8.06%	13.44	23.40	8.64%	14.25	25.08	9.09%	14.67	25.56	9.46%
	STAEformer	13.52	23.35	8.87%	13.52	23.36	8.87%	14.18	24.65	9.30%	13.52	23.37	8.89%
	HimNet	13.54	23.15	8.99%	13.53	23.16	8.96%	14.23	24.49	9.37%	13.53	23.17	8.98%
	STID	14.27	23.69	9.27%	14.30	23.68	9.29%	14.93	24.83	9.75%	14.26	23.64	9.28%
	GWNet	14.56	23.45	9.40%	14.60	23.47	9.39%	15.26	24.72	9.75%	14.56	23.48	9.44%
	GMAN	14.53	24.19	9.64%	14.61	24.05	9.58%	14.79	24.75	9.93%	14.70	24.49	9.69%
	MegaCRN	14.93	24.00	10.20%	15.00	24.05	9.71%	15.87	25.65	10.15%	15.03	24.20	9.63%
	STNorm	15.35	25.07	9.79%	15.41	25.14	9.77%	16.14	26.53	10.52%	15.38	25.15	9.70%
	MTGNN	15.40	24.43	9.67%	15.44	24.50	9.71%	16.20	25.79	10.28%	15.41	24.49	9.76%
	DCRNN	15.35	24.38	9.87%	15.39	24.46	10.00%	16.11	25.70	10.46%	15.60	24.63	9.98%
	AGCRN	15.78	24.94	10.33%	15.68	24.83	10.29%	16.41	26.07	10.72%	15.82	24.96	10.43%
	STWave	16.02	25.20	11.12%	16.44	25.88	10.84%	16.78	26.85	10.76%	16.23	25.72	10.65%
	STGCN	16.29	25.48	10.67%	16.32	25.50	10.71%	17.18	26.91	11.12%	16.32	25.50	10.70%
	GTS	16.39	25.81	10.45%	16.37	25.81	10.42%	17.55	27.58	11.29%	16.36	25.81	10.60%
	StemGNN	16.55	26.09	11.51%	16.50	26.03	11.50%	17.57	27.77	11.85%	16.50	26.03	11.49%
	LSTM	19.86	31.42	12.53%	19.87	31.41	12.59%	22.34	34.78	14.25%	19.86	31.42	12.45%
	WaveNet	20.29	31.79	12.70%	20.27	31.82	12.65%	22.87	35.26	14.41%	20.25	31.77	12.66%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
