Article

STAE-BiSSSM: A Traffic Flow Forecasting Model with High Parameter Effectiveness

School of Electronic and Information Engineering, University of Science and Technology Liaoning, Anshan 114051, China
* Author to whom correspondence should be addressed.
ISPRS Int. J. Geo-Inf. 2025, 14(10), 388; https://doi.org/10.3390/ijgi14100388
Submission received: 12 August 2025 / Revised: 29 September 2025 / Accepted: 30 September 2025 / Published: 4 October 2025

Abstract

Traffic flow forecasting plays a significant role in intelligent transportation systems (ITSs) and is instructive for traffic planning, management and control. Increasingly complex traffic conditions pose further challenges to traffic flow forecasting. Beyond improving forecasting accuracy, the parameter effectiveness of a model is also an issue that cannot be ignored. In addition, existing traffic prediction models have failed to organically integrate the data with well-designed model architectures. To address these two issues, we propose the STAE-BiSSSM model. STAE-BiSSSM consists of Spatio-Temporal Adaptive Embedding (STAE) and a Bidirectional Selective State Space Model (BiSSSM), where STAE processes features to obtain richer spatio-temporal feature representations. BiSSSM is a novel structural design serving as an alternative to the Transformer, capable of extracting patterns of traffic flow changes from both the forward and backward directions of the time series with far fewer parameters. Comparative tests between baseline models and STAE-BiSSSM on five real-world datasets illustrate the advanced performance of STAE-BiSSSM, especially on the METRLA and PeMSBAY datasets in comparison with the SOTA model STAEformer. In the short-term forecasting task (horizon: 15 min), the MAE, RMSE and MAPE of STAE-BiSSSM decrease by 1.89%/13.74%, 3.72%/16.19% and 1.46%/17.39% on METRLA/PeMSBAY, respectively. In the long-term forecasting task (horizon: 60 min), the MAE, RMSE and MAPE of STAE-BiSSSM decrease by 3.59%/13.83%, 7.26%/16.36% and 2.16%/15.65%, respectively.

1. Introduction

With the continuous development of urbanization, the number of vehicles has increased steadily [1], which has put great pressure on transportation systems. The reduced efficiency of a transportation system directly leads to a series of phenomena, such as increased travel costs, frequent traffic congestion and a significant increase in the rate of traffic accidents, and indirectly leads to a series of problems, such as energy waste, environmental pollution and economic losses. Geographic information science (GIS)-enabled intelligent transportation systems (ITSs), integrated with efficient and intelligent road health monitoring functions [2], are emerging as an effective solution to improve the traffic environment [3]. An ITS can anticipate future traffic states by collecting and processing diverse historical data through smart infrastructure and advanced algorithms [4]. As a vital metric of traffic states, traffic flow reflects the traffic pressure on roads. Advanced path planning and traffic intervention based on traffic flow forecasting are efficient ways to alleviate traffic congestion. Therefore, our proposed model STAE-BiSSSM is of practical significance.
In general, traffic forecasting models can be divided into three main categories: parametric models, machine learning-based models and deep learning-based models [5]. Representative parametric models include autoregressive integrated moving average (ARIMA)-based models [6,7,8] and Kalman filter-based models [9,10,11]. Representative machine learning-based models include support vector machine (SVM)-based models [12,13,14] and k-nearest neighbor (KNN)-based models [15]. However, parametric models require prior knowledge and the construction of mathematical models in advance, and machine learning-based models lack the ability to process large-scale data. In recent years, the application of deep neural networks to traffic forecasting has become a hot topic, because deep neural networks offer high accuracy, strong nonlinear modeling capability, strong generalization, the capacity to handle large-scale data, real-time forecasting and response, and strong adaptability [16]. These deep learning models include the nonlinear autoregressive neural network with exogenous inputs (NARX), deep belief networks (DBNs), recurrent neural networks (RNNs), convolutional neural networks (CNNs), graph neural networks (GNNs) and attention mechanisms [17]. Ref. [18] is a successful application of NARX to traffic flow forecasting combined with data denoising. Ref. [19] is a representative example of a DBN applied to traffic flow forecasting, which learns features with limited prior knowledge. Subsequently, DBNs were overtaken by newer architectures such as RNNs and CNNs. RNNs were first used to extract temporal features from traffic flow time series; the Jordan network, a kind of RNN, was applied in [20,21]. The success of LSTM in NLP extended its application to traffic forecasting; for example, refs. [22,23,24,25] applied LSTM and GRU to short-term traffic flow forecasting. Since traffic flow also has spatial properties, CNNs can be used to model the spatial relationships between roads; for example, the models in refs. [26,27,28,29,30,31,32,33] combine LSTM and CNN. GNNs' efficient processing of graph data makes them naturally suited to traffic flow forecasting. For example, STGCN [34] combines CNN and GCN, and models such as refs. [35,36,37,38,39], DCRNN [40] and TGCN [41] are characterized by novel graph convolutions. Models such as AGCRN [42], GWNet [43], GTS [44], MTGNN [45] and refs. [46,47] are able to learn graph structures to model more accurate, fine-grained spatial relationships between roads. Attention mechanism-based models have become popular for their outstanding performance; GMAN [48], PDFormer [49], Autoformer [50], TrafficFormer [51] and MVSTT [52] are representative models achieving advanced performance. While ever more sophisticated models are being designed to improve forecasting performance, attention is returning to the traffic data itself. For example, Shao et al. proposed the STID [53] model, which uses one-hot encoding to integrate spatial and temporal identity information into the input data, thereby enhancing sample distinguishability. Liu et al. proposed the STAEformer [54] model, which introduces a spatio-temporal adaptive embedding that makes the vanilla Transformer effective for traffic flow forecasting.
Existing traffic flow forecasting models have not yet effectively addressed two key issues.
1.
Currently, mainstream traffic forecasting models are Transformer-based. The quadratic computational complexity of the attention mechanism results in a large number of parameters for this class of models, leading to high hardware requirements and training costs and making the models difficult to deploy. Therefore, finding models with fewer parameters and excellent performance to replace the Transformer has become a trend, as exemplified by RetNet [55], RWKV [56] and Mamba [57].
2.
There exists a disconnect in the optimization dimensions of current research efforts: some studies focus solely on improving the model's network structure to enhance its fitting capability, while others concentrate exclusively on mining the features of traffic flow data. These two aspects fail to achieve organic integration, making it difficult to fully exert the synergistic effect of "structure adapting to data and data feeding back to structure".
To address the above issues, we propose the STAE-BiSSSM model, combining STAE and BiSSSM. BiSSSM is a newly designed model whose core advantage lies in its ability to effectively capture the bidirectional dependencies of traffic flow from both the forward and backward directions of the time series, thereby fully learning and modeling the dynamic evolution patterns of traffic flow. In addition, STAE enriches the representation of traffic flow features, which significantly enhances the learning outcomes of BiSSSM. Compared with mainstream Transformer-based models, STAE-BiSSSM not only maintains performance advantages but also significantly reduces the number of model parameters, thus achieving higher efficiency. The main contributions of this paper are summarized as follows:
  • We design a novel bidirectional selective state space model (BiSSSM) that captures the bidirectional dependencies of traffic flow from both the forward and backward directions of the time series and better learns the dynamic change patterns of traffic flow.
  • We combine spatio-temporal adaptive embedding (STAE) and BiSSSM into a new model, STAE-BiSSSM, where STAE enriches the representation of traffic flow features and thus enhances the learning outcomes of BiSSSM.
  • STAE-BiSSSM shows outstanding long- and short-term forecasting ability across five authoritative real-world datasets. In particular, on the METRLA and PeMSBAY datasets, STAE-BiSSSM outperforms the SOTA model STAEformer.
  • Compared with STAEformer, STAE-BiSSSM has higher parameter effectiveness: while achieving superior performance, its number of trainable parameters is much smaller than that of STAEformer.
The rest of this paper is organized as follows. Section 2 introduces the prerequisite knowledge and details of the STAE-BiSSSM model. In Section 3, we verify the effectiveness of the proposed model through a series of experiments. Finally, we conclude this paper in Section 4.

2. Methodology

2.1. Problem Definition

In this paper, we use the STAE-BiSSSM model to predict traffic flow over a certain time horizon based on historical traffic flow. Given a traffic flow time series with $l$ steps, we use $X_{t-l+1:t}$ to infer the future $l'$ steps by training a model $F(\cdot)$ with parameters $\theta$, which can be formulated as

$$X_{t+1}, X_{t+2}, \ldots, X_{t+l'} = F\left(X_{t-l+1}, X_{t-l+2}, \ldots, X_{t}; \theta\right)$$

where each traffic flow step $X_i \in \mathbb{R}^{n \times d_{raw}}$, $n$ is the number of spatial nodes and $d_{raw}$ is the dimension of the input features. With the 5 min sample interval used in all five datasets, $l = 12$ corresponds to one hour of history.

2.2. Embedding Layer

The embedding layer contains three types of embedding: feature embedding, periodicity embedding and spatio-temporal adaptive embedding. In order to further enrich the feature representation of traffic flow while preserving the native information of the data, we use a fully connected layer to obtain the feature embedding $E_{fea} \in \mathbb{R}^{l \times n \times d_{fea}}$:

$$E_{fea} = FC(X_{t-l+1:t})$$

where $d_{fea}$ is the dimension of the feature embedding and $FC(\cdot)$ denotes a fully connected layer. Since traffic is time-varying, labeling timestamps is critical. We denote the learnable timestamp-of-day embedding dictionary as $Dict_{tod} \in \mathbb{R}^{n_{tod} \times d_{tod}}$ and the learnable day-of-week embedding dictionary as $Dict_{dow} \in \mathbb{R}^{7 \times d_{dow}}$, where $n_{tod}$ is the number of timestamps in one day (set to 288 in our case, given the 5 min sample interval), $d_{tod}$ is the dimension of the timestamp-of-day embedding and $d_{dow}$ is the dimension of the day-of-week embedding. The timestamp-of-day embedding of timestamp $i$ in one day is $E_{tod}^{i} \in \mathbb{R}^{d_{tod} \times 1}$, and the day-of-week embedding of timestamp $i$ in one week is $E_{dow}^{i} \in \mathbb{R}^{d_{dow} \times 1}$. By concatenating and broadcasting them, we obtain the periodicity embedding $E_{per} \in \mathbb{R}^{l \times n \times (d_{tod} + d_{dow})}$ for the traffic flow time series. Furthermore, periodicity is not the only factor affecting temporal relationships. From a localized perspective, the temporal relationship is decided by the chronological order within a traffic time series. From a global perspective, temporal relationships between different roads may be affected by their spatial arrangement, and time series from different roads tend to exhibit different temporal patterns. In order to dispense with the high computational cost of a pre-defined or dynamic adjacency matrix and to better model local and global spatio-temporal relationships, we use a spatio-temporal adaptive embedding $E_{adp} \in \mathbb{R}^{l \times n \times d_{adp}}$ to capture the complicated spatio-temporal relationships efficiently, where $d_{adp}$ is its dimension. In particular, $E_{adp}$ is shared across different traffic time series. By concatenating the above three embeddings, we obtain the final spatio-temporal feature representation $F_{in} \in \mathbb{R}^{l \times n \times d_{in}}$ as follows:

$$F_{in} = E_{fea} \,\|\, E_{per} \,\|\, E_{adp}$$

where $d_{in} = d_{fea} + (d_{tod} + d_{dow}) + d_{adp}$ is the input dimension of STAE-BiSSSM. The structure of the embedding layer is shown in Figure 1b.
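To make the embedding layer concrete, the following is a minimal PyTorch sketch under the dimension settings of Table 2. The module layout and names are our own illustration rather than the authors' released code, and the handling of the time-of-day/day-of-week indices is one plausible reading of the broadcasting step.

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Sketch: feature + periodicity + spatio-temporal adaptive embedding."""
    def __init__(self, l=12, n=207, d_raw=1, d_fea=24, d_tod=24,
                 d_dow=24, d_adp=80, n_tod=288):
        super().__init__()
        self.fc = nn.Linear(d_raw, d_fea)                    # feature embedding
        self.dict_tod = nn.Embedding(n_tod, d_tod)           # timestamp-of-day dictionary
        self.dict_dow = nn.Embedding(7, d_dow)               # day-of-week dictionary
        self.e_adp = nn.Parameter(torch.empty(l, n, d_adp))  # shared adaptive embedding
        nn.init.xavier_uniform_(self.e_adp)

    def forward(self, x, tod_idx, dow_idx):
        # x: [b, l, n, d_raw]; tod_idx, dow_idx: [b, l] integer (long) indices
        b, l, n, _ = x.shape
        e_fea = self.fc(x)                                              # [b, l, n, d_fea]
        e_tod = self.dict_tod(tod_idx).unsqueeze(2).expand(-1, -1, n, -1)
        e_dow = self.dict_dow(dow_idx).unsqueeze(2).expand(-1, -1, n, -1)
        e_adp = self.e_adp.unsqueeze(0).expand(b, -1, -1, -1)           # broadcast over batch
        return torch.cat([e_fea, e_tod, e_dow, e_adp], dim=-1)          # [b, l, n, d_in]
```

With the Table 2 settings, the concatenation yields $d_{in} = 24 + (24 + 24) + 80 = 152$.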

2.3. Selective State Space Model (SSSM)

The state space model (SSM) is the basic structure of the selective state space model (SSSM) and can construct long-distance temporal dependencies. The inference process of the SSM [58] for time series modeling is shown in Algorithm 1:
Algorithm 1 Inference process of SSM for time series modeling.
  • Inputs: u : [b, l, d_in]
  • Parameter matrices initialization:
  •      A, B, C : [d_in, d_hid],  Δ : [d_in],  h(0) : [b, d_in, d_hid]
  • start:
  •      Discretization:
  •         Ā = exp(einsum("d_in, d_in d_hid → d_in d_hid", Δ, A))
  •         B̄ = einsum("d_in, d_in d_hid → d_in d_hid", Δ, B)
  •         B̄u = einsum("d_in d_hid, b l d_in → b l d_in d_hid", B̄, u)
  •         Ā, B̄ : [d_in, d_hid],  B̄u : [b, l, d_in, d_hid]
  •      Hidden state iteration:
  •         for k = 0 : l−1 do:
  •               Āh(k) = einsum("d_in d_hid, b d_in d_hid → b d_in d_hid", Ā, h(k))
  •               h(k+1) = Āh(k) + B̄u(k)
  •         h = [h(1), h(2), ..., h(l)]
  •         y = einsum("d_in d_hid, b l d_in d_hid → b l d_in", C, h)
  •         h(k), Āh(k), B̄u(k) : [b, d_in, d_hid],  h : [b, l, d_in, d_hid]
  • end
  • Outputs: y : [b, l, d_in]
where $A$ and $\bar{A}$ are the system matrix and discrete system matrix; $B$ and $\bar{B}$ are the input matrix and discrete input matrix; $C$ is the output matrix; $\Delta$ is the discretization parameter; and $h(k)$ is the hidden state. $b$ is the batch size, $l$ is the length of the time series, $d_{in}$ is the feature dimension of one time step and $d_{hid}$ is the dimension of the hidden state. $\mathrm{einsum}(\cdot)$ denotes the Einstein summation convention.
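As a didactic illustration of Algorithm 1, the scan below implements the same discretization and hidden-state iteration in PyTorch. This is a minimal sketch, not an optimized implementation, and the function name is our own.

```python
import torch

def ssm_inference(u, A, B, C, delta, h0):
    """Static SSM scan following Algorithm 1.
    u: [b, l, d_in]; A, B, C: [d_in, d_hid]; delta: [d_in]; h0: [b, d_in, d_hid]."""
    # Discretization, matching the einsum patterns in Algorithm 1
    A_bar = torch.exp(torch.einsum('i,ih->ih', delta, A))   # [d_in, d_hid]
    B_bar = torch.einsum('i,ih->ih', delta, B)              # [d_in, d_hid]
    Bu = torch.einsum('ih,bli->blih', B_bar, u)             # [b, l, d_in, d_hid]
    # Hidden state iteration
    h, hs = h0, []
    for k in range(u.shape[1]):
        h = A_bar * h + Bu[:, k]                            # [b, d_in, d_hid]
        hs.append(h)
    hs = torch.stack(hs, dim=1)                             # [b, l, d_in, d_hid]
    y = torch.einsum('ih,blih->bli', C, hs)                 # [b, l, d_in]
    return y
```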
However, the parameter matrices of the conventional SSM are static, which means they cannot adjust to the inputs. In order to make the parameter matrices sensitive to the inputs for the purpose of selectively retaining critical information, the SSSM introduces a selection mechanism to the SSM that builds a relationship between the parameter matrices and the inputs. This relationship is defined as a linear projection from the inputs $u$ to $\Delta$, $B$ and $C$, which can be formulated as follows:

$$\delta_{b \times l \times d_{mid}} = \mathrm{Linear}_{d_{in} \to d_{mid}}(u) = u_{b \times l \times d_{in}} W_{d_{in} \times d_{mid}} + bias_{b \times l \times d_{mid}}$$

$$\Delta_{b \times l \times d_{in}} = \mathrm{Softplus}\left(\mathrm{Linear}_{d_{mid} \to d_{in}}(\delta)\right) = \mathrm{Softplus}\left(\delta_{b \times l \times d_{mid}} W_{d_{mid} \times d_{in}} + bias_{b \times l \times d_{in}}\right)$$

$$B_{b \times l \times d_{hid}} = \mathrm{Linear}_{d_{in} \to d_{hid}}(u) = u_{b \times l \times d_{in}} W_{d_{in} \times d_{hid}} + bias_{b \times l \times d_{hid}}$$

$$C_{b \times l \times d_{hid}} = \mathrm{Linear}_{d_{in} \to d_{hid}}(u) = u_{b \times l \times d_{in}} W_{d_{in} \times d_{hid}} + bias_{b \times l \times d_{hid}}$$

where $\delta$ is the intermediate (middle-rank) state and $d_{mid} < d_{in}/2$ is its dimension. $\mathrm{Softplus}(\cdot)$ denotes the activation function $f(x) = \log(1 + e^{x})$. The introduction of $\delta$ reduces the number of projection parameters and prevents model overfitting. The inference process of the Selective SSM [57] for time series modeling is shown in Algorithm 2:
Algorithm 2 Inference process of Selective SSM (SSSM) for time series modeling.
  • Inputs: u : [b, l, d_in]
  • Parameter matrices initialization:
  •      A : [d_in, d_hid],  h(0) : [b, d_in, d_hid]
  • start:
  •      Linear projection:
  •         δ = Linear_{d_in → d_mid}(u),      δ : [b, l, d_mid]
  •         Δ = Softplus(Linear_{d_mid → d_in}(δ)),      Δ : [b, l, d_in]
  •         B = Linear_{d_in → d_hid}(u),      B : [b, l, d_hid]
  •         C = Linear_{d_in → d_hid}(u),      C : [b, l, d_hid]
  •      Discretization:
  •         Ā = exp(einsum("b l d_in, d_in d_hid → b l d_in d_hid", Δ, A))
  •         B̄ = einsum("b l d_in, b l d_hid → b l d_in d_hid", Δ, B)
  •         B̄u = einsum("b l d_in d_hid, b l d_in → b l d_in d_hid", B̄, u)
  •         Ā, B̄, B̄u : [b, l, d_in, d_hid]
  •      Hidden state iteration:
  •         for k = 0 : l−1 do:
  •               Ā(k)h(k) = einsum("b d_in d_hid, b d_in d_hid → b d_in d_hid", Ā(k), h(k))
  •               h(k+1) = Ā(k)h(k) + B̄u(k)
  •         h = [h(1), h(2), ..., h(l)]
  •         y = einsum("b l d_hid, b l d_in d_hid → b l d_in", C, h)
  •         h(k), Ā(k), Ā(k)h(k), B̄u(k) : [b, d_in, d_hid],  h : [b, l, d_in, d_hid]
  • end
  • Outputs: y : [b, l, d_in]
where the "for" loop can be replaced by a parallel scan algorithm that accelerates the inference of the SSSM. The structure of the SSSM is shown in Figure 1d.
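A minimal PyTorch sketch of Algorithm 2 follows, assuming the projection shapes of the equations above; the sequential loop stands in for the parallel scan, all names are illustrative, and the negative initialization of $A$ is an assumption that keeps $\exp(\Delta \cdot A)$ stable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Sketch of the selective SSM (Algorithm 2): input-dependent Δ, B, C."""
    def __init__(self, d_in, d_hid=64, d_mid=16):
        super().__init__()
        # Negative real parts keep exp(Δ·A) <= 1 (assumed stabilizing choice)
        self.A = nn.Parameter(-torch.rand(d_in, d_hid))
        self.to_delta = nn.Sequential(nn.Linear(d_in, d_mid),  # δ (middle-rank state)
                                      nn.Linear(d_mid, d_in))  # then Δ before Softplus
        self.to_B = nn.Linear(d_in, d_hid)
        self.to_C = nn.Linear(d_in, d_hid)

    def forward(self, u):                                   # u: [b, l, d_in]
        b, l, d_in = u.shape
        delta = F.softplus(self.to_delta(u))                # [b, l, d_in]
        B, C = self.to_B(u), self.to_C(u)                   # [b, l, d_hid] each
        A_bar = torch.exp(torch.einsum('bli,ih->blih', delta, self.A))
        B_bar = torch.einsum('bli,blh->blih', delta, B)
        Bu = B_bar * u.unsqueeze(-1)                        # [b, l, d_in, d_hid]
        h = u.new_zeros(b, d_in, self.A.shape[1])
        ys = []
        for k in range(l):                                  # replaceable by a parallel scan
            h = A_bar[:, k] * h + Bu[:, k]
            ys.append(torch.einsum('bh,bih->bi', C[:, k], h))
        return torch.stack(ys, dim=1)                       # [b, l, d_in]
```

The input-dependent $\bar{A}(k)$ is the key difference from Algorithm 1: the recurrence can now selectively retain or forget information per time step.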

2.4. STAE-Bidirectional SSSM (STAE-BiSSSM)

Traffic flow exhibits bidirectional dependence: the current traffic flow state in a time series is both influenced by historical flow states and implicitly constrained by potential future flow states. For example, the forward dependence of traffic flow is reflected in the fact that the upward trend of morning peak flow is influenced by the previous hour's flow states, and the backward dependence is reflected in the fact that the congestion dissipation process in the evening peak is related to the flow states in the next 30 min. Therefore, we propose STAE-BiSSSM for traffic flow forecasting, where STAE enriches the representation of traffic flow features, contributing to model learning, and BiSSSM, comprising a forward SSSM and a backward SSSM, extracts the bidirectional dependencies in the traffic flow series, contributing to the forecasting accuracy of the model. The inference process of STAE-BiSSSM is shown in Algorithm 3. The overall structure of STAE-BiSSSM is shown in Figure 1a and the structure of BiSSSM in Figure 1c.
Algorithm 3 Inference process of STAE-Bidirectional SSSM (STAE-BiSSSM) for traffic flow forecasting (ours).
  • Inputs: u : [b, l, n, d_raw]
  • start:
  •      u = transpose(u, shape=(b, n, l, d_raw))
  •      u = reshape(u, shape=(b·n, l, d_raw))
  •      Embedding layer:
  •         F_in = embedding_{d_raw → d_in}(u),  F_in : [b·n, l, d_in]
  •      BiSSSM:
  •         Convolution layer:
  •             Conv_out = conv1D(F_in; d_conv),  Conv_out : [b·n, l, d_in]
  •             Conv_out = silu(Conv_out),  Conv_out : [b·n, l, d_in]
  •         Position switching:
  •             BSSSM_in = positionswitching_{i → l−1−i}(Conv_out),  BSSSM_in : [b·n, l, d_in]
  •         Forward SSSM (FSSSM):
  •             FSSSM_out = FSSSM(Conv_out),  FSSSM_out : [b·n, l, d_in]
  •         Backward SSSM (BSSSM):
  •             BSSSM_out = BSSSM(BSSSM_in),  BSSSM_out : [b·n, l, d_in]
  •         Concatenation:
  •             BiSSSM_out = FSSSM_out || BSSSM_out,  BiSSSM_out : [b·n, l, 2·d_in]
  •         Reduced-dimensional projection:
  •             BiSSSM_out = Linear_{2·d_in → d_in}(BiSSSM_out),  BiSSSM_out : [b·n, l, d_in]
  •      Residual connection:
  •         BiSSSM_out = BiSSSM_out ⊙ silu(F_in) + F_in,  BiSSSM_out : [b·n, l, d_in]
  •      Regression layer:
  •         Reg_in = RMSnorm(BiSSSM_out),  Reg_in : [b·n, l, d_in]
  •         Reg_in = reshape(Reg_in, shape=(b·n, l·d_in))
  •         Reg_out = Linear_{l·d_in → l′·d_out}(Reg_in),  Reg_out : [b·n, l′·d_out]
  •      y = reshape(Reg_out, shape=(b, n, l′, d_out))
  •      y = transpose(y, shape=(b, l′, n, d_out))
  • end
  • Outputs: y : [b, l′, n, d_out]
where transpose(·) denotes tensor transposition; reshape(·) denotes tensor reshaping; shape(·) gives the shape of the target tensor; silu(·) denotes the activation function $f(x) = x \cdot \frac{1}{1 + e^{-x}}$; conv1D denotes one-dimensional convolution; RMSnorm(·) denotes Root Mean Square Layer Normalization [59]; $d_{conv}$ is the kernel size of the one-dimensional convolution; and $d_{out}$ is the output dimension of STAE-BiSSSM.
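Combining the pieces, a hedged sketch of the BiSSSM block in Algorithm 3 is shown below, reusing the SelectiveSSM class from the previous listing. Flipping the backward branch's output back to chronological order before concatenation is our reading of the position-switching step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiSSSM(nn.Module):
    """Sketch of the BiSSSM block (Algorithm 3); SelectiveSSM as defined above."""
    def __init__(self, d_in, d_hid=64, d_mid=16, d_conv=5):
        super().__init__()
        self.conv = nn.Conv1d(d_in, d_in, d_conv, padding=d_conv // 2)  # length-preserving
        self.fsssm = SelectiveSSM(d_in, d_hid, d_mid)   # forward-direction branch
        self.bsssm = SelectiveSSM(d_in, d_hid, d_mid)   # backward-direction branch
        self.proj = nn.Linear(2 * d_in, d_in)           # reduced-dimensional projection

    def forward(self, f_in):                            # f_in: [b*n, l, d_in]
        # Convolution layer with SiLU activation
        c = F.silu(self.conv(f_in.transpose(1, 2)).transpose(1, 2))
        fwd = self.fsssm(c)                             # forward SSSM on original order
        # Position switching i -> l-1-i, backward SSSM, then flip back
        bwd = self.bsssm(c.flip(1)).flip(1)
        out = self.proj(torch.cat([fwd, bwd], dim=-1))  # [b*n, l, d_in]
        return out * F.silu(f_in) + f_in                # gated residual connection
```

The gated residual (multiplying by silu(F_in) before adding F_in) lets the block attenuate its own output where the embedded input already carries the needed signal.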

3. Experimental Study

In this section, we briefly introduce the datasets and baseline models used in the experiments in Section 3.1. Metrics for evaluating the performance of the model are given in Section 3.2. Implementation details of the experiments are introduced in Section 3.3. The results of the experiments and the real case forecasting are shown in Section 3.4. The results of the ablation study for STAE-BiSSSM are shown in Section 3.5. The parameter effectiveness analysis is included in Section 3.6.

3.1. Description of Experimental Datasets and Baseline Models

3.1.1. Experimental Datasets

In the experiments, we apply five widely used authoritative datasets, PeMSD4, PeMSD7, PeMSD8, METRLA and PeMSBAY, to evaluate the forecasting performance of STAE-BiSSSM. Details of the five datasets are shown in Table 1.

3.1.2. Baseline Models

In this section, we introduce the baseline models included in the comparative tests based on their characteristics.
1.
Historical Inertia (HI) [60]: HI serves as the conventional benchmark, reflecting standard industry practices.
2.
STGCN [34]: STGCN proposes a spatio-temporal convolution block (ST-Conv Block) that consists of a spatial graph convolutional layer (GCN) and a temporally gated convolutional layer (Gated CNN) to realize the joint extraction of spatio-temporal features.
3.
DCRNN [40]: DCRNN proposes diffusion convolution to model the spatial dependence of directed graphs via bidirectional random walks (forward and backward diffusion).
4.
AGCRN [42]: AGCRN dynamically generates the adjacency matrix through learnable node embedding.
5.
GWNet [43]: GWNet introduces an adaptive graph modeling approach to discover the implicit path dependence in traffic flow, and dilated causal convolution (DCC) is used to extract the long time series dependency.
6.
GTS [44]: GTS develops a forecasting approach for multiple interrelated time series, learning a graph structure in conjunction with a GNN, which addresses the shortcomings of earlier methods.
7.
MTGNN [45]: MTGNN generates an asymmetric neighborhood matrix by learning the node embedding matrix to capture unidirectional causality in multidimensional time series.
8.
STNorm [61]: STNorm proposes spatial and temporal normalization modules that separately refine the high-frequency and local components underlying the raw data, which enhances model learning.
9.
GMAN [48]: GMAN enhances the model’s ability to resolve complex spatio-temporal patterns by parallelizing multiple attention modules (spatial attention, temporal attention and transformational attention) to capture different dimensional dependencies in traffic data separately.
10.
PDFormer [49]: PDFormer introduces a Delayed Feature Transformation (DFT) module based on attention mechanism. The DFT Module can better model information delay in real traffic situations, thus improving the forecasting accuracy of the model.
11.
STID [53]: STID proposes a spatio-temporal identity (one-hot) encoding, which enriches the feature representation and makes samples distinguishable. Thanks to this encoding, STID greatly improves the performance of fully connected networks on traffic flow forecasting.
12.
STAEformer [54]: STAEformer proposes a spatio-temporal adaptive embedding to enhance feature representation and distinguishability and thus makes the vanilla Transformer SOTA for traffic flow forecasting.

3.2. Metrics of Model Evaluation

To evaluate the predictive performance of the model, we use the following three metrics to assess the difference between the real traffic flow $V$ of a road and the predicted traffic flow $\hat{V}$:
1.
Mean Absolute Error (MAE):

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| V_i - \hat{V}_i \right|$$

2.
Root Mean Square Error (RMSE):

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( V_i - \hat{V}_i \right)^2}$$

3.
Mean Absolute Percentage Error (MAPE):

$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{V_i - \hat{V}_i}{V_i} \right| \times 100\%$$

MAE, RMSE and MAPE measure the forecasting error; the smaller their values, the better the forecasting performance of the model.
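For reference, the three metrics translate directly into NumPy. The masked variants used later for METRLA and PeMSBAY would additionally exclude null (zero) ground-truth entries.

```python
import numpy as np

def mae(v, v_hat):
    """Mean absolute error."""
    return np.mean(np.abs(v - v_hat))

def rmse(v, v_hat):
    """Root mean square error."""
    return np.sqrt(np.mean((v - v_hat) ** 2))

def mape(v, v_hat):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((v - v_hat) / v)) * 100.0
```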

3.3. Experiments Implementation

We implement the model with the PyTorch toolkit on a Linux server with one Nvidia A100 GPU. For the dataset settings, PeMSD4, PeMSD7 and PeMSD8 are divided into training, validation and test sets in a ratio of 6:2:2. The METRLA and PeMSBAY datasets contain many more time steps than PeMSD4, PeMSD7 and PeMSD8, which means that they capture more variation in traffic flow. To ensure sufficient samples for the models to fully learn the dynamic patterns of the traffic flow data while retaining sufficient data to test model performance, METRLA and PeMSBAY are divided in a ratio of 7:1:2 instead of 6:2:2. Before feeding the training data into the baseline models and STAE-BiSSSM, we apply standard normalization:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the training data, and $\mu$ and $\sigma$ are the mean and standard deviation of the data, respectively.
Model parameters are crucial to model performance; a description and the details of the model parameters are shown in Table 2. For the PeMSD4, PeMSD7 and PeMSD8 datasets, we set both the input length and output length to 1 h ($l = l' = 12$). For the METRLA and PeMSBAY datasets, we set the input length to 1 h ($l = 12$) and the output length to 15 min, 30 min and 60 min ($l' = 3, 6, 12$), respectively. The batch size is 16, and Adam is chosen as the optimizer with the learning rate decaying from 0.001. An early-stopping mechanism halts training if the validation error does not improve within 20 consecutive epochs for PeMSD4, PeMSD7 and PeMSBAY, and within 30 consecutive epochs for PeMSD8 and METRLA. For the loss function settings, we apply the Huber loss for the PeMSD4, PeMSD7 and PeMSD8 datasets:

$$L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}\left(y - \hat{y}\right)^2, & |y - \hat{y}| \le \delta \\ \delta |y - \hat{y}| - \frac{1}{2}\delta^2, & |y - \hat{y}| > \delta \end{cases}$$

where $\delta$ is a hyperparameter that balances the emphasis of the Huber loss between MSE and MAE; in our case, $\delta$ is set to 1. For the PeMSD4, PeMSD7 and PeMSD8 datasets, which contain a large amount of noise, the Huber loss is more robust to outliers while ensuring relatively fast convergence. For the METRLA and PeMSBAY datasets, we use the masked MAE as the loss function.
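A short sketch of both loss settings follows. The Huber loss with δ = 1 matches the text; treating zero ground-truth entries as missing in the masked MAE is an assumption that follows common practice on METRLA and PeMSBAY.

```python
import torch
import torch.nn.functional as F

def huber(y_hat, y, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond; delta = 1 as in the text.
    return F.huber_loss(y_hat, y, delta=delta)

def masked_mae(y_hat, y, null_val=0.0):
    # MAE over valid entries only; zeros treated as missing (assumption).
    mask = (y != null_val).float()
    mask = mask / (mask.mean() + 1e-8)   # renormalize so the loss scale stays stable
    return torch.mean(torch.abs(y_hat - y) * mask)
```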

3.4. Experiment Results

To illustrate the advancement of STAE-BiSSSM, we conducted comprehensive comparative experiments. Section 3.4.1 illustrates the advantages of STAE-BiSSSM over the other baseline models according to the metrics, and Section 3.4.2 shows the performance of STAE-BiSSSM through real-case forecasting.

3.4.1. Metrics Analysis

Table 3 shows the results of comparative tests between HI, STGCN, DCRNN, AGCRN, GWNet, GTS, MTGNN, STNorm, GMAN and STAE-BiSSSM on the test sets of the PeMSD4, PeMSD7 and PeMSD8 datasets. The results of comparative tests between the above baseline models, the SOTA models (PDFormer, STID and STAEformer) and STAE-BiSSSM on the test sets of the METRLA and PeMSBAY datasets are shown in Table 4 and Table 5. In each table, the lowest value of each metric is the optimal one.
As shown in Table 3, STAE-BiSSSM achieves the best overall performance among the compared models on the PeMSD4, PeMSD7 and PeMSD8 datasets. Compared with the strongest baseline, GWNet, the MAE and MAPE of STAE-BiSSSM decrease by 0.38% and 5.04%, respectively, on PeMSD4; the MAE, RMSE and MAPE decrease by 2.74%, 0.93% and 2.09% on PeMSD7; and the MAE and RMSE decrease by 3.89% and 0.68% on PeMSD8. As shown in Table 4 and Table 5, STAE-BiSSSM achieves the best performance among the compared models on the METRLA and PeMSBAY datasets under different forecasting horizons. In the short-term forecasting task (horizon: 15 min), compared with the SOTA model STAEformer, the MAE, RMSE and MAPE of STAE-BiSSSM decrease by 1.89%/13.74%, 3.72%/16.19% and 1.46%/17.39% on the METRLA/PeMSBAY datasets. In the long-term forecasting task (horizon: 60 min), compared with STAEformer, the MAE, RMSE and MAPE of STAE-BiSSSM decrease by 3.59%/13.83%, 7.26%/16.36% and 2.16%/15.65% on the METRLA/PeMSBAY datasets. In short, STAE-BiSSSM not only performs well on short-term forecasting but performs even better on long-term forecasting.

3.4.2. Real Case Analysis

As shown in Figure 2, the forecasting curve of STAE-BiSSSM fits the ground truth closely under the long-term forecasting task (horizon: 12) on the PeMSD4 dataset. As shown in Figure 3, Figure 4 and Figure 5, the forecasting curves of STAE-BiSSSM reflect the trend of the ground truth well under the long-term forecasting task (horizon: 12) on the PeMSD7, PeMSD8 and METRLA datasets, which contain noisy data. Figure 6 shows the forecasting curves of STAE-BiSSSM under the short-term and long-term forecasting tasks on the PeMSBAY dataset. For short-term forecasting (horizon: 3), the forecasting curve fits the ground truth well; for long-term forecasting (horizon: 12), it also reflects the trend of the ground truth well. In conclusion, STAE-BiSSSM is able to perform real-world traffic flow forecasting under different forecasting tasks.

3.5. Ablation Study

In order to illustrate the validity of critical components of STAE-BiSSSM, we introduce an ablation study including two variants of our model as follows:
  • FSSSM: It removes the spatio-temporal adaptive embedding E a d p and backward SSSM in STAE-BiSSSM.
  • STAE-FSSSM: It removes the backward SSSM in STAE-BiSSSM.
The results of the ablation study on the PeMSD4, METRLA and PeMSBAY datasets under forecasting horizon 12 are shown in Table 6. Compared with FSSSM, the MAE, RMSE and MAPE of STAE-FSSSM decrease by 15.15%, 13.98% and 15.59% on PeMSD4; 17.35%, 16.71% and 17.76% on METRLA; and 19.21%, 17.61% and 19.50% on PeMSBAY, respectively. It can be seen that $E_{adp}$ enriches the feature representation and thus contributes to model learning. Compared with STAE-FSSSM, the MAE, RMSE and MAPE of STAE-BiSSSM decrease by 0.43%, 0.13% and 2.16% on PeMSD4; 0.62%, 0.31% and 3.36% on METRLA; and 1.22%, 0.55% and 4.12% on PeMSBAY. It can be seen that bidirectional feature extraction over the time series better mines the dynamic change patterns of traffic flow. Therefore, each critical component of STAE-BiSSSM is necessary.

3.6. Parameter Effectiveness Analysis

Table 7 and Figure 7 show the total trainable parameters of STAEformer and STAE-BiSSSM on the METRLA and PeMSBAY datasets under different forecasting horizons. Figure 8 and Figure 9 compare the three performance metrics of STAEformer and STAE-BiSSSM under different forecasting horizons. For the METRLA dataset, the number of trainable parameters of STAE-BiSSSM is about 26–27% of that of STAEformer; for the PeMSBAY dataset, it is about 32–33%. Nevertheless, STAE-BiSSSM outperforms STAEformer on both datasets, which suggests that the parameter effectiveness of STAE-BiSSSM is superior. This provides further evidence that the BiSSSM structure is more effective than the attention mechanism and that fewer parameters suffice to retain the critical information in a time series. At the same time, a smaller number of trainable parameters means that the model is cheaper to train and less demanding on hardware, which facilitates model deployment and application.
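The trainable-parameter counts reported in Table 7 can be reproduced for any PyTorch model with a one-line helper:

```python
def count_trainable(model):
    """Total number of trainable parameters of a PyTorch model (as in Table 7)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```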

4. Conclusions

In this paper, we focus on both new structural design and the nature of the data itself. We propose a novel BiSSSM based on the emerging, efficient selective SSM to extract the bidirectional dependencies of traffic flow time series, and we use spatio-temporal adaptive embedding to enrich the feature representation. We combine BiSSSM and STAE to obtain STAE-BiSSSM, incorporating the advantages of both for better traffic flow forecasting. Comparative tests across five authoritative real-world datasets demonstrate that STAE-BiSSSM is capable of both long- and short-term forecasting and achieves advanced performance. The ablation study illustrates that each critical component of STAE-BiSSSM is necessary, and the parameter effectiveness analysis suggests that STAE-BiSSSM achieves excellent forecasting results with a small number of parameters. In a nutshell, STAE-BiSSSM shows a promising direction for addressing the challenges in traffic forecasting. However, one limitation is that we have not incorporated the interactions and connections between traffic flow and other traffic states into the forecasting of traffic flow. Joint traffic flow forecasting using multiple traffic features is a potential direction for future research.

Author Contributions

Methodology, Duoliang Liu; software, Duoliang Liu; validation, Duoliang Liu; investigation, Xuebo Chen; resources, Qiang Qu; writing—original draft preparation, Qiang Qu; writing—review and editing, Qiang Qu and Xuebo Chen; visualization, Duoliang Liu; supervision, Xuebo Chen; funding acquisition, Xuebo Chen. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Acknowledgments

The research reported herein was supported by the National Natural Science Foundation of China (NSFC) under Grants No. 71571091 and 71771112.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
SSM    State space model
SSSM    Selective state space model
BiSSSM    Bidirectional selective state space model
STAE    Spatio-temporal adaptive embedding
STAE-BiSSSM    Spatio-temporal adaptive embedding–bidirectional selective state space model
SOTA    State-of-the-art

References

  1. Chen, S.; Cheng, K.; Yang, J.; Zang, X.; Luo, Q.; Li, J. Driving Behavior Risk Measurement and Cluster Analysis Driven by Vehicle Trajectory Data. Appl. Sci. 2023, 13, 5675. [Google Scholar] [CrossRef]
  2. Huang, S.; Zhu, G.; Tang, J.; Li, W.; Fan, Z. Multi-Perspective Semantic Segmentation of Ground Penetrating Radar Images for Pavement Subsurface Objects. IEEE Trans. Intell. Transp. Syst. 2025, 26, 14339–14352. [Google Scholar] [CrossRef]
  3. Boukerche, A.; Wang, J. Machine Learning-based traffic prediction models for Intelligent Transportation Systems. Comput. Netw. 2020, 181, 107530. [Google Scholar] [CrossRef]
  4. Zhou, Z.; Yang, Z.; Zhang, Y.; Huang, Y.; Chen, H.; Yu, Z. A comprehensive study of speed prediction in transportation system: From vehicle to traffic. iScience 2022, 25, 103909. [Google Scholar] [CrossRef]
  5. Medina-Salgado, B.; Sánchez-DelaCruz, E.; Pozos-Parra, P.; Sierra, J.E. Urban traffic flow prediction techniques: A review. Sustain. Comput. Inform. Syst. 2022, 35, 100739. [Google Scholar] [CrossRef]
  6. Williams, B.M.; Hoel, L.A. Modeling and Forecasting Vehicular Traffic Flow as a Seasonal ARIMA Process: Theoretical Basis and Empirical Results. J. Transp. Eng. 2003, 129, 664–672. [Google Scholar] [CrossRef]
  7. Min, W.; Wynter, L. Real-time road traffic prediction with spatio-temporal correlations. Transp. Res. Part Emerg. Technol. 2011, 19, 606–616. [Google Scholar] [CrossRef]
  8. Hou, Q.; Leng, J.; Ma, G.; Liu, W.; Cheng, Y. An adaptive hybrid model for short-term urban traffic flow prediction. Phys. Stat. Mech. Its Appl. 2019, 527, 121065. [Google Scholar] [CrossRef]
  9. Emami, A.; Sarvi, M.; Bagloee, S.A. Short-term traffic flow prediction based on faded memory Kalman Filter fusing data from connected vehicles and Bluetooth sensors. Simul. Model. Pract. Theory 2020, 102, 102025. [Google Scholar] [CrossRef]
  10. Bakibillah, A.; Tan, Y.H.; Loo, J.Y.; Tan, C.P.; Kamal, M.; Pu, Z. Robust estimation of traffic density with missing data using an adaptive-R extended Kalman filter. Appl. Math. Comput. 2022, 421, 126915. [Google Scholar] [CrossRef]
  11. Chang, S.Y.; Wu, H.C.; Kao, Y.C. Tensor Extended Kalman Filter and its Application to Traffic Prediction. Trans. Intell. Transport. Syst. 2023, 24, 13813–13829. [Google Scholar] [CrossRef]
  12. Smola, A.J.; Schölkopf, B.S. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [Google Scholar] [CrossRef]
  13. Wu, C.H.; Ho, J.M.; Lee, D. Travel-time prediction with support vector regression. IEEE Trans. Intell. Transp. Syst. 2004, 5, 276–281. [Google Scholar] [CrossRef]
  14. Yao, Z.; Shao, C.F.; Gao, Y.L. Research on methods of short-term traffic forecasting based on support vector regression. J. Beijing Jiaotong Univ. 2006, 30, 19–22. [Google Scholar]
  15. Cheng, S.; Lu, F.; Peng, P.; Wu, S. Short-term traffic forecasting: An adaptive ST-KNN model that considers spatial heterogeneity. Comput. Environ. Urban Syst. 2018, 71, 186–198. [Google Scholar] [CrossRef]
  16. Sun, Y.; Shi, Y.; Jia, K.; Zhang, Z.; Qin, L. A Dual-Stream Cross AGFormer-GPT Network for Traffic Flow Prediction Based on Large-Scale Road Sensor Data. Sensors 2024, 24, 3905. [Google Scholar] [CrossRef]
  17. Carianni, A.; Gemma, A. Overview of Traffic Flow Forecasting Techniques. IEEE Open J. Intell. Transp. Syst. 2025, 6, 848–882. [Google Scholar] [CrossRef]
  18. Chen, X.; Wu, S.; Shi, C.; Huang, Y.; Yang, Y.; Ke, R.; Zhao, J. Sensing Data Supported Traffic Flow Prediction via Denoising Schemes and ANN: A Comparison. IEEE Sens. J. 2020, 20, 14317–14328. [Google Scholar] [CrossRef]
  19. Huang, W.; Song, G.; Hong, H.; Xie, K. Deep Architecture for Traffic Flow Prediction: Deep Belief Networks With Multitask Learning. IEEE Trans. Intell. Transp. Syst. 2014, 15, 2191–2201. [Google Scholar] [CrossRef]
  20. Yasdi, R. Prediction of Road Traffic using a Neural Network Approach. Neural Comput. Appl. 1999, 8, 135–142. [Google Scholar] [CrossRef]
  21. More, R.; Mugal, A.; Rajgure, S.; Adhao, R.B.; Pachghare, V.K. Road traffic prediction and congestion control using Artificial Neural Networks. In Proceedings of the 2016 International Conference on Computing, Analytics and Security Trends (CAST), Pune, India, 19–21 December 2016; pp. 52–57. [Google Scholar] [CrossRef]
  22. Tian, Y.; Pan, L. Predicting Short-Term Traffic Flow by Long Short-Term Memory Recurrent Neural Network. In Proceedings of the 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), Chengdu, China, 19–21 December 2015; pp. 153–158. [Google Scholar] [CrossRef]
  23. Fu, R.; Zhang, Z.; Li, L. Using LSTM and GRU neural network methods for traffic flow prediction. In Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), Wuhan, China, 11–13 November 2016; pp. 324–328. [Google Scholar] [CrossRef]
  24. Tian, Y.; Zhang, K.; Li, J.; Lin, X.; Yang, B. LSTM-based traffic flow prediction with missing data. Neurocomputing 2018, 318, 297–305. [Google Scholar] [CrossRef]
  25. Yang, B.; Sun, S.; Li, J.; Lin, X.; Tian, Y. Traffic flow prediction using LSTM with feature enhancement. Neurocomputing 2019, 332, 320–327. [Google Scholar] [CrossRef]
  26. Du, S.; Li, T.; Gong, X.; Yang, Y.; Horng, S.J. Traffic flow forecasting based on hybrid deep learning framework. In Proceedings of the 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), Nanjing, China, 24–26 November 2017; pp. 1–6. [Google Scholar] [CrossRef]
  27. Liu, Y.; Zheng, H.; Feng, X.; Chen, Z. Short-term traffic flow prediction with Conv-LSTM. In Proceedings of the 2017 9th International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, 11–13 October 2017; pp. 1–6. [Google Scholar] [CrossRef]
  28. Duan, Z.; Yang, Y.; Zhang, K.; Ni, Y.; Bajgain, S. Improved Deep Hybrid Networks for Urban Traffic Flow Prediction Using Trajectory Data. IEEE Access 2018, 6, 31820–31827. [Google Scholar] [CrossRef]
  29. Ma, D.; Sheng, B.; Jin, S.; Ma, X.; Gao, P. Short-Term Traffic Flow Forecasting by Selecting Appropriate Predictions Based on Pattern Matching. IEEE Access 2018, 6, 75629–75638. [Google Scholar] [CrossRef]
  30. Yao, H.; Tang, X.; Wei, H.; Zheng, G.; Li, Z. Revisiting Spatial-Temporal Similarity: A Deep Learning Framework for Traffic Prediction. arXiv 2018, arXiv:1803.01254. [Google Scholar] [CrossRef]
  31. Narmadha, S.; Vijayakumar, V. Spatio-Temporal vehicle traffic flow prediction using multivariate CNN and LSTM model. Mater. Today Proc. 2023, 81, 826–833. [Google Scholar] [CrossRef]
  32. Zhang, X.; Huang, K.; Liu, C.; Xu, X. Urban Short - Term Traffic Flow Prediction Algorithm Based on CNN-LSTM Model. In Proceedings of the 2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE), Guangzhou, China, 6–8 January 2023; pp. 214–217. [Google Scholar] [CrossRef]
  33. Zhao, Z.; Chen, W.; Wu, X.; Chen, P.C.Y.; Liu, J. LSTM network: A deep learning approach for short-term traffic forecast. IET Intell. Transp. Syst. 2017, 11, 68–75. [Google Scholar] [CrossRef]
  34. Yu, B.; Yin, H.; Zhu, Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. International Joint Conferences on Artificial Intelligence Organization, Stockholm, Sweden, 13–19 July 2018; pp. 3634–3640. [Google Scholar] [CrossRef]
  35. Cao, D.; Wang, Y.; Duan, J.; Zhang, C.; Zhu, X.; Huang, C.; Tong, Y.; Xu, B.; Bai, J.; Tong, J.; et al. Spectral Temporal Graph Neural Network for Multivariate Time-series Forecasting. arXiv 2021, arXiv:2103.07719. [Google Scholar] [CrossRef]
  36. Chen, Y.; Segovia-Dominguez, I.; Gel, Y.R. Z-GCNETs: Time Zigzags at Graph Convolutional Networks for Time Series Forecasting. arXiv 2021, arXiv:2105.04100. [Google Scholar] [CrossRef]
  37. Diao, Z.; Wang, X.; Zhang, D.; Liu, Y.; Xie, K.; He, S. Dynamic spatial-temporal graph convolutional neural networks for traffic forecasting. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar] [CrossRef]
  38. Fang, Z.; Long, Q.; Song, G.; Xie, K. Spatial-Temporal Graph ODE Networks for Traffic Flow Forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, New York, NY, USA, 14–18 August 2021; pp. 364–373. [Google Scholar] [CrossRef]
  39. Guo, K.; Hu, Y.; Sun, Y.; Qian, S.; Gao, J.; Yin, B. Hierarchical Graph Convolution Network for Traffic Forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 151–159. [Google Scholar] [CrossRef]
  40. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. arXiv 2018, arXiv:1707.01926. [Google Scholar] [CrossRef]
  41. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Trans. Intell. Transp. Syst. 2020, 21, 3848–3858. [Google Scholar] [CrossRef]
  42. Bai, L.; Yao, L.; Li, C.; Wang, X.; Wang, C. Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting. arXiv 2020, arXiv:2007.02842. [Google Scholar] [CrossRef]
  43. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph WaveNet for Deep Spatial-Temporal Graph Modeling. arXiv 2019, arXiv:1906.00121. [Google Scholar]
  44. Shang, C.; Chen, J.; Bi, J. Discrete Graph Structure Learning for Forecasting Multiple Time Series. arXiv 2021, arXiv:2101.06861. [Google Scholar] [CrossRef]
  45. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. arXiv 2020, arXiv:2005.11650. [Google Scholar] [CrossRef]
  46. Jiang, R.; Wang, Z.; Yong, J.; Jeph, P.; Chen, Q.; Kobayashi, Y.; Song, X.; Fukushima, S.; Suzumura, T. Spatio-Temporal Meta-Graph Learning for Traffic Forecasting. arXiv 2023, arXiv:2211.14701. [Google Scholar] [CrossRef]
  47. Zhang, Q.; Chang, J.; Meng, G.; Xiang, S.; Pan, C. Spatio-Temporal Graph Structure Learning for Traffic Forecasting. Proc. AAAI Conf. Artif. Intell. 2020, 34, 1177–1185. [Google Scholar] [CrossRef]
  48. Zheng, C.; Fan, X.; Wang, C.; Qi, J. GMAN: A Graph Multi-Attention Network for Traffic Prediction. arXiv 2019, arXiv:1911.08415. [Google Scholar] [CrossRef]
  49. Jiang, J.; Han, C.; Zhao, W.X.; Wang, J. PDFormer: Propagation Delay-Aware Dynamic Long-Range Transformer for Traffic Flow Prediction. Proc. AAAI Conf. Artif. Intell. 2023, 37, 4365–4373. [Google Scholar] [CrossRef]
  50. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. arXiv 2022, arXiv:2106.13008. [Google Scholar]
  51. Zhou, G.; Guo, X.; Liu, Z.; Li, T.; Li, Q.; Xu, K. TrafficFormer: An Efficient Pre-trained Model for Traffic Data. In Proceedings of the 2025 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 12–15 May 2025; pp. 1844–1860. [Google Scholar] [CrossRef]
  52. Pu, B.; Liu, J.; Kang, Y.; Chen, J.; Yu, P.S. MVSTT: A Multiview Spatial-Temporal Transformer Network for Traffic-Flow Forecasting. IEEE Trans. Cybern. 2024, 54, 1582–1595. [Google Scholar] [CrossRef]
  53. Shao, Z.; Zhang, Z.; Wang, F.; Wei, W.; Xu, Y. Spatial-Temporal Identity: A Simple yet Effective Baseline for Multivariate Time Series Forecasting. arXiv 2022, arXiv:2208.05233. [Google Scholar] [CrossRef]
  54. Liu, H.; Dong, Z.; Jiang, R.; Deng, J.; Deng, J.; Chen, Q.; Song, X. Spatio-Temporal Adaptive Embedding Makes Vanilla Transformer SOTA for Traffic Forecasting. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, New York, NY, USA, 21–25 October 2023; pp. 4125–4129. [Google Scholar] [CrossRef]
  55. Sun, Y.; Dong, L.; Huang, S.; Ma, S.; Xia, Y.; Xue, J.; Wang, J.; Wei, F. Retentive Network: A Successor to Transformer for Large Language Models. arXiv 2023, arXiv:2307.08621. [Google Scholar] [CrossRef]
  56. Peng, B.; Alcaide, E.; Anthony, Q.; Albalak, A.; Arcadinho, S.; Biderman, S.; Cao, H.; Cheng, X.; Chung, M.; Grella, M.; et al. RWKV: Reinventing RNNs for the Transformer Era. arXiv 2023, arXiv:2305.13048. [Google Scholar] [CrossRef]
  57. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
  58. Gu, A.; Goel, K.; Ré, C. Efficiently Modeling Long Sequences with Structured State Spaces. arXiv 2022, arXiv:2111.00396. [Google Scholar] [CrossRef]
  59. Zhang, B.; Sennrich, R. Root Mean Square Layer Normalization. arXiv 2019, arXiv:1910.07467. [Google Scholar] [CrossRef]
  60. Cui, Y.; Xie, J.; Zheng, K. Historical Inertia: An Ignored but Powerful Baseline for Long Sequence Time-series Forecasting. arXiv 2021, arXiv:2103.16349. [Google Scholar]
  61. Deng, J.; Chen, X.; Jiang, R.; Song, X.; Tsang, I.W. ST-Norm: Spatial and Temporal Normalization for Multi-variate Time Series Forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, New York, NY, USA, 14–18 August 2021; pp. 269–278. [Google Scholar] [CrossRef]
Figure 1. (a) An overall structure of STAE-BiSSSM. (b) Structure of embedding layer. (c) Structure of BiSSSM. (d) Structure of SSSM.
Figure 2. Actual forecasting of the STAE-BiSSSM model on the PeMSD4 datasets for the road where sensor No. 200 is located from 2018-2-21 to 2018-2-28.
Figure 3. Actual forecasting of the STAE-BiSSSM model on the PeMSD7 datasets for the road where sensor No. 800 is located from 2017-8-24 to 2017-8-31.
Figure 4. Actual forecasting of the STAE-BiSSSM model on the PeMSD8 datasets for the road where sensor No. 150 is located from 2016-8-24 to 2016-8-31.
Figure 5. Actual forecasting of the STAE-BiSSSM model on the METRLA datasets for the road where sensor No. 200 is located from 2012-6-23 to 2012-6-30.
Figure 6. Actual forecasting of the STAE-BiSSSM model on the PeMSBAY datasets for the road where sensor No. 200 is located from 2017-5-24 to 2017-5-31. (a) Forecasting curves under forecasting horizon 15 min. (b) Forecasting curves under forecasting horizon 60 min.
Figure 7. (a) Total trainable parameters of STAEformer and STAE-BiSSSM on METRLA datasets under different forecasting horizons. (b) Total trainable parameters of STAEformer and STAE-BiSSSM on PeMSBAY datasets under different forecasting horizons.
Figure 8. (a) MAE of STAEformer and STAE-BiSSSM on METRLA datasets with different forecasting horizons. (b) RMSE of STAEformer and STAE-BiSSSM on METRLA datasets with different forecasting horizons. (c) MAPE of STAEformer and STAE-BiSSSM on METRLA datasets with different forecasting horizons.
Figure 9. (a) MAE of STAEformer and STAE-BiSSSM on PeMSBAY datasets with different forecasting horizons. (b) RMSE of STAEformer and STAE-BiSSSM on PeMSBAY datasets with different forecasting horizons. (c) MAPE of STAEformer and STAE-BiSSSM on PeMSBAY datasets with different forecasting horizons.
Table 1. Details of traffic flow datasets in experiments.

Dataset     Number of Sensors    Time Steps    Sample Interval    Time Range         Location
PeMSD4      307                  16,992        5 min              01/2018–02/2018    California District 4
PeMSD7      883                  28,224        5 min              05/2017–08/2017    California District 7
PeMSD8      170                  17,856        5 min              07/2016–08/2016    California District 8
METRLA      207                  34,272        5 min              03/2012–06/2012    Los Angeles
PeMSBAY     325                  52,116        5 min              01/2017–05/2017    San Francisco Bay Area
Table 2. Description and details of model parameters.

Hyperparameter    Description                                                  Dimension Size
d_fea             Embedding dimension of feature embedding                     24
d_tod             Embedding dimension of timestamp-of-day embedding            24
d_dow             Embedding dimension of day-of-week embedding                 24
d_adp             Embedding dimension of spatio-temporal adaptive embedding    80
d_conv            Size of convolution kernel                                   5
d_hid             Dimension of hidden state                                    64
d_mid             Dimension of middle-rank state                               16
Table 3. Predictive metrics of STAE-BiSSSM and other baseline models on the PeMSD4, PeMSD7 and PeMSD8 datasets (each cell: MAE / RMSE / MAPE).

Model                 PeMSD4                    PeMSD7                    PeMSD8
HI                    42.35 / 61.66 / 29.92%    49.03 / 71.18 / 22.75%    36.66 / 50.45 / 21.63%
GWNet                 18.53 / 29.92 / 12.89%    20.47 / 33.47 / 8.61%     14.40 / 23.39 / 9.21%
DCRNN                 19.63 / 31.26 / 13.59%    21.16 / 34.14 / 9.02%     15.22 / 24.17 / 10.21%
AGCRN                 19.38 / 31.25 / 13.40%    20.57 / 34.40 / 8.74%     15.32 / 24.41 / 10.03%
STGCN                 19.57 / 31.38 / 13.44%    21.74 / 35.27 / 9.24%     16.08 / 25.39 / 10.60%
GTS                   20.96 / 32.95 / 14.66%    22.15 / 35.10 / 9.38%     16.49 / 26.08 / 10.54%
MTGNN                 19.17 / 31.70 / 13.37%    20.89 / 34.06 / 9.00%     15.18 / 24.24 / 10.20%
STNorm                18.96 / 30.98 / 12.69%    20.50 / 34.66 / 8.75%     15.41 / 24.77 / 9.76%
GMAN                  19.14 / 31.60 / 13.19%    20.97 / 34.10 / 9.05%     15.31 / 24.92 / 10.13%
STAE-BiSSSM (Ours)    18.46 / 30.16 / 12.24%    19.91 / 33.16 / 8.43%     13.84 / 23.23 / 9.24%
Table 4. Predictive metrics of STAE-BiSSSM and other baseline models on METRLA (each cell: MAE / RMSE / MAPE).

Model                 Horizon 15 min           Horizon 30 min           Horizon 60 min
HI                    6.80 / 14.21 / 16.72%    6.80 / 14.21 / 16.72%    6.80 / 14.20 / 10.15%
GWNet                 2.69 / 5.15 / 6.99%      3.08 / 6.20 / 8.47%      3.51 / 7.28 / 9.96%
DCRNN                 2.67 / 5.16 / 6.86%      3.12 / 6.27 / 8.42%      3.54 / 7.47 / 10.32%
AGCRN                 2.85 / 5.53 / 7.63%      3.20 / 6.52 / 9.00%      3.59 / 7.45 / 10.47%
STGCN                 2.75 / 5.29 / 7.10%      3.15 / 6.35 / 8.62%      3.60 / 7.43 / 10.35%
GTS                   2.75 / 5.27 / 7.12%      3.14 / 6.33 / 8.62%      3.59 / 7.44 / 10.25%
MTGNN                 2.69 / 5.16 / 6.89%      3.05 / 6.13 / 8.16%      3.47 / 7.21 / 9.70%
STNorm                2.81 / 5.57 / 7.40%      3.18 / 6.59 / 8.47%      3.57 / 7.51 / 10.24%
GMAN                  2.80 / 5.55 / 7.41%      3.12 / 6.49 / 8.73%      3.44 / 7.35 / 10.07%
PDFormer              2.83 / 5.45 / 7.77%      3.20 / 6.46 / 9.19%      3.62 / 7.47 / 10.91%
STID                  2.82 / 5.53 / 7.75%      3.19 / 6.57 / 9.39%      3.55 / 7.55 / 10.95%
STAEformer            2.65 / 5.11 / 6.85%      2.97 / 6.00 / 8.13%      3.34 / 7.02 / 9.70%
STAE-BiSSSM (Ours)    2.60 / 4.92 / 6.75%      2.86 / 5.59 / 7.88%      3.22 / 6.51 / 9.49%
Table 5. Predictive metrics of STAE-BiSSSM and other baseline models on PeMSBAY (each cell: MAE / RMSE / MAPE).

Model                 Horizon 15 min          Horizon 30 min          Horizon 60 min
HI                    3.06 / 7.05 / 6.85%     3.06 / 7.04 / 6.84%     3.05 / 7.03 / 6.83%
GWNet                 1.30 / 2.73 / 2.71%     1.63 / 3.73 / 3.73%     1.99 / 4.60 / 4.71%
DCRNN                 1.31 / 2.76 / 2.73%     1.65 / 3.75 / 3.71%     1.97 / 4.60 / 4.68%
AGCRN                 1.35 / 2.88 / 2.91%     1.67 / 3.82 / 3.81%     1.94 / 4.50 / 4.55%
STGCN                 1.36 / 2.88 / 2.86%     1.70 / 3.84 / 3.79%     2.02 / 4.63 / 4.72%
GTS                   1.37 / 2.92 / 2.85%     1.72 / 3.86 / 3.88%     2.06 / 4.60 / 4.88%
MTGNN                 1.33 / 2.80 / 2.81%     1.66 / 3.77 / 3.75%     1.95 / 4.50 / 4.62%
STNorm                1.33 / 2.82 / 2.76%     1.65 / 3.77 / 3.66%     1.92 / 4.45 / 4.46%
GMAN                  1.35 / 2.90 / 2.87%     1.65 / 3.82 / 3.74%     1.92 / 4.49 / 4.52%
PDFormer              1.32 / 2.83 / 2.78%     1.64 / 3.79 / 3.71%     1.91 / 4.43 / 4.51%
STID                  1.31 / 2.79 / 2.78%     1.64 / 3.73 / 3.73%     1.91 / 4.42 / 4.55%
STAEformer            1.31 / 2.78 / 2.76%     1.62 / 3.68 / 3.62%     1.88 / 4.34 / 4.41%
STAE-BiSSSM (Ours)    1.13 / 2.33 / 2.28%     1.36 / 2.96 / 2.93%     1.62 / 3.63 / 3.72%
Table 6. Ablation study on the PeMSD4, METRLA and PeMSBAY datasets.

Dataset    Metric    FSSSM     STAE-FSSSM    STAE-BiSSSM
PeMSD4     MAE       21.85     18.54         18.46
           RMSE      35.11     30.20         30.16
           MAPE      14.82%    12.51%        12.24%
METRLA     MAE       3.92      3.24          3.22
           RMSE      7.84      6.53          6.51
           MAPE      11.94%    9.82%         9.49%
PeMSBAY    MAE       2.03      1.64          1.62
           RMSE      4.43      3.65          3.63
           MAPE      4.82%     3.88%         3.72%
Table 7. Total trainable parameters of STAEformer and STAE-BiSSSM on the METRLA and PeMSBAY datasets under different forecasting horizons.

Dataset    Horizon        STAEformer    STAE-BiSSSM
METRLA     3 (15 min)     1,242,555     327,195
           6 (30 min)     1,248,030     332,670
           12 (60 min)    1,258,980     343,620
PeMSBAY    3 (15 min)     1,355,835     440,475
           6 (30 min)     1,361,310     445,950
           12 (60 min)    1,372,260     456,900