Review

An Overview of Spatiotemporal Network Forecasting: Current Research Status and Methodological Evolution

1 School of Environmental Science and Engineering, Yangzhou University, Yangzhou 225127, China
2 School of Mathematics, Yangzhou University, Yangzhou 225127, China
3 School of Automation and Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(1), 18; https://doi.org/10.3390/math14010018
Submission received: 9 November 2025 / Revised: 8 December 2025 / Accepted: 19 December 2025 / Published: 21 December 2025
(This article belongs to the Special Issue Advanced Machine Learning Research in Complex System)

Abstract

Time series and spatio-temporal forecasting are fundamental tasks for complex system modeling and intelligent decision-making, with broad applications in transportation, meteorology, finance, healthcare, and public safety. Compared with simple univariate time series, real-world spatio-temporal data exhibit rich temporal dynamics and intricate spatial interactions, leading to heterogeneity, non-stationarity, and evolving topologies. Addressing these challenges requires modeling frameworks that can simultaneously capture temporal evolution, spatial correlations, and cross-domain regularities. This survey provides a comprehensive synthesis of forecasting methods, spanning statistical algorithms, traditional machine learning approaches, neural architectures, and recent generative and causal paradigms. We review the methodological evolution from classical linear models to deep learning–based temporal modules and emphasize the role of attention-based Transformers as general-purpose sequence architectures. In parallel, we distinguish these architectural advances from pre-trained foundation models for time series and spatio-temporal data (e.g., large models trained across diverse domains), which leverage self-supervised objectives and exhibit strong zero-/few-shot transfer capabilities. We organize the review along both data-type and architectural dimensions—single long-term time series, Euclidean-structured spatio-temporal data, and graph-structured spatio-temporal data—while also examining advanced paradigms such as diffusion models, causal modeling, multimodal-driven frameworks, and pre-trained foundation models. Through this taxonomy, we highlight common strengths and limitations across approaches, including issues of scalability, robustness, real-time efficiency, and interpretability. 
Finally, we summarize open challenges and future directions, with a particular focus on the joint evolution of graph-based, causal, diffusion, and foundation-model paradigms for next-generation spatio-temporal forecasting.

1. Research Background and Overview

Time series and spatio-temporal forecasting have emerged as core technologies for complex system modeling and intelligent decision-making, playing pivotal roles in domains such as traffic flow management, weather forecasting, financial modeling, medical diagnosis, and public safety. In urban mobility, for example, network-based spatio-temporal prediction frameworks have been used to analyze and forecast ridesourcing behavior at the city scale, where origin–destination flows and interaction patterns among drivers and passengers are modeled as a dynamic graph [1]. Compared with single time series modeling, real-world data often exhibit both temporal dependencies and spatial interactions, forming complex spatio-temporal network structures. Such data simultaneously present short- and long-term dynamic characteristics, accompanied by heterogeneity, non-stationarity, and evolving topologies, which pose significant modeling challenges. Consequently, developing systematic approaches to model and forecast spatio-temporal data is not only a pressing research frontier but also a practical necessity for improving system efficiency and generating substantial socio-economic benefits.
In recent years, a number of surveys have systematically summarized the development of time series and spatio-temporal forecasting. Lim and Zohren provided an early review of deep learning architectures for time series, including RNNs, CNNs, and attention mechanisms, and highlighted hybrid and interpretable models; however, their focus remained on classical deep architectures and did not cover the later wave of Transformer variants, diffusion models, or causal frameworks for spatio-temporal forecasting [2]. Cheng et al. offered a comprehensive overview of core concepts, modeling challenges, and task formulations for time series forecasting, thus providing a useful conceptual roadmap, but they mainly concentrated on temporal models and did not systematically discuss graph-based spatio-temporal networks or generative approaches [3]. Kong et al. proposed a taxonomy of recent deep architectures and feature-enhancement strategies for univariate and multivariate time series, yet large pre-trained models and cross-domain spatio-temporal prediction were only lightly touched upon [4]. Meanwhile, Su et al. systematically examined large language models (LLMs) for forecasting and anomaly detection, including some spatio-temporal applications such as traffic flow and human mobility prediction, but did not develop a unified methodological taxonomy that jointly covers graph neural networks, diffusion-based generative models, and causal reasoning frameworks [5]. Complementary to forecasting-focused surveys, Chen et al. provide a large-scale review of critical node identification in complex networks, organizing methods into centrality-based measures, critical-node deletion, influence maximization, network control, machine-learning approaches, and higher-order or dynamic-network formulations [6]. 
Their taxonomy emphasizes how node importance and structural vulnerabilities evolve in time-varying graphs and highlights open challenges such as algorithmic universality, scalable evaluation in dynamic settings, and the integration of temporal dynamics into node-ranking algorithms, which are closely related to the robustness and deployment issues faced by spatio-temporal forecasting models. In parallel, a number of domain-specific studies further illustrate how spatio-temporal and networked dynamical models are deployed in practice. For example, work on directional switches in pigeon flocks combines high-resolution trajectory data with control-theoretic models to analyze how local interaction rules propagate through the flock and trigger collective reorientation events [7]. Finite-time consensus analyses of nonlinear multi-agent systems with input saturation and disturbances establish rigorous convergence guarantees under realistic actuator constraints and exogenous perturbations, highlighting the role of robust distributed control in networked systems [8]. Likewise, sliding-observer-based mSEIR frameworks for pandemic spread prediction couple compartmental epidemic dynamics with online state estimation to track and forecast infection trajectories under partial observability and time-varying contact patterns [9]. These application-driven studies, while not surveys, underscore the need for forecasting models that remain reliable under coordination shocks, control constraints, and structural changes in complex spatio-temporal environments. In summary, existing surveys are not simply “outdated” in terms of publication year; rather, they tend to emphasize either temporal models or LLM-based forecasting in isolation and rarely analyze, within a single framework, the intersection of advanced paradigms such as graph-based spatio-temporal networks, causal modeling, diffusion models, and pre-trained foundation models.
Motivated by this gap, this survey seeks to provide a more integrated and forward-looking synthesis. While reviewing traditional statistical methods and neural architectures, we place particular emphasis on (i) the role of temporal modeling components (e.g., RNNs, CNNs, and Transformers) within larger spatio-temporal networks, (ii) the methodological evolution of Euclidean-structured and graph-structured spatio-temporal models, and (iii) the emergence of diffusion-based generative models, causality-aware frameworks, multimodal integration, and pre-trained foundation models for time series and spatio-temporal forecasting. Taken together, these perspectives aim to clarify the complementary strengths, limitations, and transfer capabilities of different paradigms across application domains.
The remainder of the paper is organized as follows: Section 2 reviews temporal modeling methods for time series forecasting and for the temporal components embedded in spatio-temporal networks, covering statistical approaches, traditional machine learning methods, deep neural networks, RNN-based architectures, Transformer-based architectures, and pre-trained foundation models for time series. Section 3 provides a comprehensive review of spatio-temporal forecasting models, including Euclidean-structured and graph-structured methods, diffusion-model–based approaches, causality-based frameworks, multimodal integration, and Transformer-based spatio-temporal architectures, and it explicitly discusses how temporal modules from Section 2 are embedded into these networks. Section 4 analyzes the limitations and challenges of current research and outlines promising directions such as continuous-time modeling, heterogeneous and dynamic graph structure learning, cross-modal fusion, causal reasoning, and foundation-model pretraining, and Section 5 presents the overall conclusions. Figure 1 summarizes the resulting methodological taxonomy and its evolutionary relationships. Table 1 provides an overview of representative forecasting method families.

2. Time Series Forecasting Model

In this section, we mainly view time series forecasting methods as temporal modeling components that can be used either as stand-alone predictors for univariate or multivariate sequences, or as building blocks within larger spatio-temporal networks introduced in Section 3. In practice, many spatio-temporal models adopt a modular design in which a temporal backbone (e.g., ARIMA, RNN, Transformer) is combined with a spatial module (e.g., CNN, GNN); therefore, clarifying the strengths and limitations of temporal models in isolation helps explain their role when embedded into full spatio-temporal architectures.
Time series forecasting methods can be broadly divided into two categories: traditional statistical approaches and deep learning–based approaches. Traditional time series forecasting methods generally treat spatiotemporal sequences as multiple independent time series, making it difficult to capture spatial correlations. Representative examples include the Historical Average (HA) model and the Autoregressive Integrated Moving Average (ARIMA) model. These models typically assume that current observations are linear functions of historical observations and historical errors, which enables effective modeling of linear patterns in time series. However, they struggle to model nonlinear patterns. A time series can be expressed as follows:
$Y_t = [Y_{1t}, Y_{2t}, \ldots, Y_{Nt}]$,
where $Y_t$ is the multidimensional observation vector at time $t$, $Y_{it}$ denotes the observation of the $i$-th variable at time $t$, and $N$ is the total number of variables in the time series.
For a time series, the predicted value $\hat{Y}_t$ is calculated based on the historical observations $Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}$, which can be represented as
$\hat{Y}_t = f(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p})$,
which highlights the essence of time series forecasting as learning a mapping from historical observations to future values. In the following subsections, we start from classical statistical models and then move to neural architectures and Transformer-based designs, before finally distinguishing between generic Transformer architectures and pre-trained foundation models for time series and spatio-temporal forecasting.
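The mapping $f$ above can be made concrete by casting forecasting as supervised learning over sliding windows. The following minimal pure-Python sketch is illustrative (the window length p and the toy series are our own assumptions, not from any cited work):

```python
def make_windows(series, p):
    """Turn a univariate series into (history, target) pairs:
    each input is [Y_{t-p}, ..., Y_{t-1}] and the target is Y_t."""
    pairs = []
    for t in range(p, len(series)):
        history = series[t - p:t]  # the p most recent observations
        target = series[t]         # the value to be predicted
        pairs.append((history, target))
    return pairs

# Example: p = 3 past observations predict the next value.
series = [1.0, 2.0, 3.0, 4.0, 5.0]
pairs = make_windows(series, p=3)
print(pairs[0])  # ([1.0, 2.0, 3.0], 4.0)
```

Any model family reviewed below, from ARIMA to Transformers, can then be read as a particular choice of $f$ fitted to such pairs.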

2.1. Statistical Methods

Traditional statistical approaches have historically served as the foundation of time series forecasting, especially for short- and medium-term prediction tasks. These methods typically assume (weak) stationarity and linearity, and model the current observation as a linear function of past values and error terms. Representative examples include the Historical Average (HA) model, the Autoregressive (AR) and Moving Average (MA) families, the combined Autoregressive Moving Average (ARMA) and Autoregressive Integrated Moving Average (ARIMA) models, multivariate extensions such as Vector Autoregressive (VAR) models, Kalman filtering in state–space form, and additive decomposition models such as Prophet.
A canonical ARIMA$(p, d, q)$ model can be written as
$\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d \, Y_t = \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) \varepsilon_t$,
where $B$ is the backshift operator, $\phi_i$ and $\theta_j$ denote autoregressive and moving average coefficients, and $d$ controls the order of differencing. Seasonal variants such as SARIMA and multivariate VAR models have been widely used in traffic flow, demand forecasting, and econometrics. Kalman filters provide recursive minimum-variance estimates under linear–Gaussian assumptions, while Prophet decomposes a series into trend, seasonality, and holiday components and is robust to missing data and change points.
Overall, statistical methods offer clear interpretability, low training cost and well-understood uncertainty quantification under model assumptions. However, their reliance on linearity and (quasi-)stationarity makes it difficult to capture strong nonlinearities, regime shifts and complex spatial interactions. In modern spatio-temporal forecasting pipelines, these models are therefore often used as baselines, components in hybrid systems (e.g., for residual modeling), or lightweight predictors in scenarios with strict computational constraints.
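As a concrete illustration, the differencing operator $(1 - B)^d$ used by ARIMA and the HA baseline can both be written in a few lines. The sketch below is our own, not from any cited implementation:

```python
def difference(series, d=1):
    """Apply the differencing operator (1 - B)^d:
    each pass replaces Y_t with Y_t - Y_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out

def historical_average(series):
    """Historical Average (HA) baseline: forecast the mean of past values."""
    return sum(series) / len(series)

# A linear trend is reduced to constant increments by first-order differencing,
# which is exactly why d > 0 helps ARIMA handle non-stationary trends.
trend = [2.0, 4.0, 6.0, 8.0]
print(difference(trend, d=1))       # [2.0, 2.0, 2.0]
print(historical_average(trend))    # 5.0
```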

2.2. DNN-Based Methods

To overcome the limitations of linear statistical models, deep neural networks (DNNs) have been widely adopted for time series forecasting. Early work relied on feed-forward networks trained by backpropagation (often referred to as BPNNs), which approximate nonlinear mappings from past observations and exogenous variables to future values. In a typical BPNN, an input layer is followed by several hidden layers and an output layer, and the network parameters are optimized to minimize a prediction loss. Such models have been applied to short-term traffic forecasting, financial series and sales prediction, and often outperform ARIMA-type baselines when strong nonlinearities are present.
However, plain feed-forward networks do not explicitly encode temporal order or long-range dependencies, and their interpretability remains limited. As a result, the focus of temporal modeling has gradually shifted towards architectures with explicit sequence structure, such as recurrent networks, temporal convolutions and attention-based Transformers, which are reviewed in Section 2.3 and Section 2.4. In modern spatio-temporal forecasting systems, BPNNs and other fully connected DNNs are more often used as auxiliary components—for example, for feature fusion, residual correction or mapping learned representations to task-specific outputs—rather than as stand-alone temporal backbones.
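For concreteness, the forward pass of a one-hidden-layer feed-forward network of this kind can be sketched as follows; the toy weights are hypothetical and would in practice be learned by backpropagation:

```python
def relu(x):
    """Rectified linear activation."""
    return max(0.0, x)

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer feed-forward network:
    hidden = relu(W1 x + b1), output = W2 hidden + b2."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# Toy setup: 3 past observations -> 2 hidden units -> 1 forecast.
x = [1.0, 2.0, 3.0]
W1 = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.5]]
b1 = [0.0, 0.0]
W2 = [[1.0, 1.0]]
b2 = [0.0]
print(mlp_forward(x, W1, b1, W2, b2))  # [2.0]
```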

2.3. RNN-Based Methods

Recurrent Neural Networks (RNNs) have demonstrated strong performance in modeling temporal dependencies and are widely applied to time series processing tasks. The basic architecture of a standard RNN is illustrated in Figure 2, where each hidden state is updated recurrently based on the previous state and the current input.
However, standard RNN architectures suffer from limited memory capacity, which hampers their ability to model long-range temporal dependencies effectively. To overcome this limitation, Long Short-Term Memory (LSTM) networks [10] and Gated Recurrent Units (GRU) [11] were proposed, both providing enhanced memory mechanisms for spatio-temporal sequence modeling. By employing gating structures to selectively retain and discard information, LSTM and GRU alleviate the gradient vanishing and exploding problems, and thus have become the most influential RNN variants.
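The gating idea can be illustrated with a single GRU step for scalar inputs and states. This is a simplified sketch using one common convention for combining the update gate with the previous state (formulations in the literature differ in which of $z$ and $1 - z$ weights the old state); the weights are toy values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step for scalar input x and state h.
    The gates z, r lie in (0, 1) and control how much of the
    past state is kept (update gate) or suppressed (reset gate)."""
    z = sigmoid(Wz * x + Uz * h)                 # update gate
    r = sigmoid(Wr * x + Ur * h)                 # reset gate
    h_tilde = math.tanh(Wh * x + Uh * (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde           # convex combination

# Run a short sequence through the cell with toy weights.
h = 0.0
for x in [0.5, 1.0, -0.5]:
    h = gru_step(x, h, Wz=1.0, Uz=0.0, Wr=1.0, Ur=0.0, Wh=1.0, Uh=1.0)
print(round(h, 4))
```

Because the new state is a convex combination of the old state and the candidate, gradients can flow through the $(1 - z) \cdot h$ path largely unattenuated, which is the mechanism that alleviates vanishing gradients.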
Building on these innovations, LSTM has been extensively applied in spatio-temporal forecasting. Its core structure, consisting of forget, input, and output gates alongside the cell state, enables the effective modeling of long-term dependencies. Fu et al. [12] first introduced LSTM and GRU into network-based spatio-temporal prediction, providing both theoretical and practical foundations. While these models achieved strong performance in short-term forecasting, their accuracy degraded in medium- and long-term horizons as network depth increased, reintroducing gradient-related challenges. To address this, MacKenzie et al. [13] incorporated HTM-inspired mechanisms into the LSTM framework, extending its predictive horizon and improving long-term dependency modeling. Consequently, RNNs and their variants have gradually become a mainstream paradigm in temporal information modeling.
To further enhance stability and multi-step forecasting, stacked architectures were proposed. The Stacked-LSTM, for example, has been widely applied in multi-step prediction and missing data recovery. Cui et al. [14] developed the SBU-LSTM model, combining bidirectional and unidirectional layers, which proved robust in traffic prediction with missing observations. Similarly, Karevan and Suykens [15] integrated spatial information into a stacked spatio-temporal LSTM for weather forecasting, enabling the capture of multi-scale spatio-temporal dependencies.
Compared with LSTM, the GRU offers a more compact structure with fewer parameters and higher computational efficiency, making it attractive for real-time and resource-constrained applications. On this basis, the Bidirectional GRU (BiGRU) was proposed to capture both forward and backward dependencies. Li et al. [16] designed a BiGRU network for human activity recognition using high-resolution radar echo data, achieving 97.5% accuracy across three activity categories, thereby validating its expressive power and generalization capability.
Despite these advances, RNN-based methods still face bottlenecks when dealing with non-stationary dynamics, abrupt changes, and long-range dependencies. Their sequential chain structure limits long-term information propagation, making them susceptible to errors in the presence of sudden events or structural shifts. Furthermore, RNNs are sensitive to input scale variations and often exhibit slow convergence or instability during training. Consequently, recent studies have explored integrating RNNs as auxiliary modules to complement other architectures through error correction or feature augmentation.
Since 2023, research has increasingly highlighted the limited adaptability of RNN-type models in long-horizon or non-stationary forecasting. This has paved the way for Transformer-based architectures, which offer global modeling capabilities. For instance, Hou and Yu [17] proposed the RWKV-TS model, which combines the local sequential modeling strengths of RNNs with the global representation power of Transformers, outperforming traditional RNN architectures and underscoring their potential in complex dynamic scenarios.

2.4. Transformer-Based Methods

Motivated by the limitations of RNNs in modeling long-range dependencies, the Transformer architecture has emerged as a new paradigm that leverages attention mechanisms for global representation. Proposed in 2017, the Transformer is composed of a self-attention mechanism and a feed-forward neural network [18]. By employing attention instead of recurrence, it overcomes the sequential computation limitation of RNNs and enables parallel computation.
The core of the Transformer model is the multi-head attention mechanism. The basic idea is to compute an attention distribution from the input vector $X$ and then form weighted averages of the input information based on the attention weights. The forward propagation process can be summarized as follows:
  • Key, Query, and Value Vector Generation:
    The input vector $X$ is linearly projected into three representations: query vector $Q$, key vector $K$, and value vector $V$, computed as:
    $Q = X W_Q, \quad K = X W_K, \quad V = X W_V$.
  • Self-Attention Aggregation: Each query vector $Q$ is compared with each key vector $K$ through inner products to measure similarity. The process includes: (1) computing the scaled inner product between $Q$ and $K$, normalized by $\sqrt{d_k}$; (2) applying the softmax operation to obtain the attention distribution; (3) taking the weighted sum with the value vector $V$. The attention function is defined as:
    $\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{Q K^{\top}}{\sqrt{d_k}}\right) V$.
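The attention function translates directly into code. Below is a minimal pure-Python sketch for a single head, without batching, masking, or the output projection:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention softmax(Q K^T / sqrt(d_k)) V,
    with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs: the key closer to the
# query receives the larger attention weight.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0], [0.0]]
print(attention(Q, K, V))
```

Because every query attends to every key, the cost is quadratic in sequence length, which is precisely the bottleneck targeted by the efficiency-oriented variants reviewed in Section 2.4.2.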
In recent years, Transformer models have gradually emerged as a dominant paradigm for long-term time series forecasting and network-based spatiotemporal prediction, thanks to their ability to capture global temporal dependencies in parallel and their flexibility for extension across data modalities.

2.4.1. Early Explorations

Initial applications, such as Wu et al.’s influenza prevalence forecasting [19], validated the feasibility of employing self-attention to temporal dynamics, while the Adversarial Sparse Transformer [20] incorporated adversarial training and sparse attention to enhance robustness under noisy and limited data conditions. These pioneering works laid the foundation for subsequent general-purpose Transformer designs.

2.4.2. Efficiency and Long-Horizon Modeling

A major line of research seeks to overcome the quadratic cost of vanilla attention and to extend prediction horizons. Informer proposed the ProbSparse mechanism to reduce computational burden [21]; Reformer adopted reversible residuals and LSH attention to improve memory efficiency [22]; Pyraformer [23] further introduced a pyramidal attention design achieving linear complexity; FEDformer enhanced long-term dependency capture via frequency-domain decomposition [24]; Autoformer combined autocorrelation with decomposition to better extract periodic patterns [25]; Crossformer [26] explicitly modeled cross-dimensional dependencies in multivariate forecasting; PatchTST segmented series into local patches to stabilize multi-variate learning [27]; iTransformer [28] inverted encoder–decoder design for efficient ultra-long forecasting; and CASA [29] replaced the scoring stage with a CNN autoencoder, achieving large memory savings while retaining accuracy. Collectively, these advances define the efficiency-oriented lineage of Transformer models.

2.4.3. Robustness and Adaptability

Another branch focuses on handling distribution shifts and domain-specific dynamics. The Non-stationary Transformer [30] addressed time-series non-stationarity by combining series stationarization with de-stationary attention; PDFormer [31] introduced a propagation-delay-aware dual-graph attention for traffic networks; and ChannelTokenFormer [32] innovated with channel tokens, frequency-adaptive patching, and mask-aware attention to provide robustness under asynchronous sampling and test-time missing intervals. These works highlight the trend of adapting Transformer architectures to realistic spatiotemporal data conditions.

2.4.4. Multi-Modal, Spectral and Exogenous Integration

To expand modeling power, several recent studies integrated external structures or modalities. The Sentinel Multi-Patch Transformer [33] exploited temporal and channel attention across multi-patch representations to enhance remote-sensing sequence prediction; CITRAS [34] embedded covariates directly into the Transformer encoder, bridging causal exogenous drivers with time-series forecasting; and Sonnet [35] incorporated spectral operators to stabilize frequency components, improving the modeling of long-term seasonal trends.

2.4.5. Causality and Historical Dependency

Beyond correlation-based forecasting, researchers have introduced mechanisms to incorporate causality and historical accumulation. SCFormer [36] employed structured channel-wise attention with cumulative HiPPO states, thereby breaking the Markovian limitation of fixed-length windows; CAIFormer [37] embedded causal graph priors, distinguishing endogenous, direct causal, and collider dependencies while filtering spurious correlations.

2.4.6. Summary

Overall, Transformer-based time series forecasting has evolved from early explorations and efficiency improvements to robust, multi-modal, and causally informed designs. Despite this rapid progress, key challenges remain: integrating graph-based spatial structures, handling irregular sampling and missing values, and ensuring stability and interpretability in ultra-long sequences. These constitute critical directions for the next generation of Transformer research.

2.5. Pre-Trained Foundation Models for Time Series and Spatio-Temporal Forecasting

While the Transformer architectures reviewed above are typically trained from scratch or fine-tuned on a specific dataset, a new line of work seeks to build pre-trained foundation models for time series and spatio-temporal data. These models are trained on large collections of heterogeneous series drawn from many domains (e.g., energy, finance, industry, climate) using self-supervised objectives such as masked forecasting, reconstruction, contrastive learning, or next-step prediction. The goal is to learn universal temporal representations that can be adapted to downstream tasks via zero-shot prompting, few-shot adaptation, or light-weight fine-tuning.
Representative examples include large pre-trained forecasters such as TimeGPT- and Lag-Llama-style models, which treat time series as sequences analogous to tokens in language models and leverage scalable Transformer backbones with temporal positional encodings. These models can be applied to new datasets with minimal or no fine-tuning, often performing competitively with task-specific architectures in short-term forecasting while providing strong cross-domain generalization. In addition, GPT-ST [38] extends the generative pre-training paradigm to spatio-temporal graphs: it pre-trains Graph Neural Networks using masked node–time prediction on large collections of spatio-temporal graphs, thereby learning long-range dependency structures that transfer across traffic networks and other spatial domains.
From a design perspective, pre-trained foundation models differ from conventional task-specific Transformers in three key aspects: (i) their pre-training corpora, which aggregate diverse domains and temporal resolutions; (ii) their learning objectives, which emphasize self-supervised representation learning rather than single-task supervised training; and (iii) their adaptation mechanisms, which include prompting, parameter-efficient fine-tuning, and cross-domain zero-/few-shot forecasting. These properties make foundation models particularly attractive for applications where labeled data are scarce, deployment environments shift over time, or multi-task transfer is required. At the same time, they raise new challenges regarding pre-training data biases, calibration of uncertainty under distribution shifts, and the integration of spatial structure and causality into large-scale pre-trained architectures.
Compared with task-specific Transformers, pre-trained spatio-temporal foundation models aim to learn generic representations from large heterogeneous corpora, so that downstream tasks can be solved via prompting or lightweight fine-tuning. This design raises several open questions: how to construct sufficiently diverse yet domain-relevant pretraining datasets; how to adapt foundation models to specific cities or domains without catastrophic forgetting; and how to ensure interpretability, safety and calibrated uncertainty in high-stakes forecasting applications. Addressing these issues is crucial for moving from “black-box” sequence models to trustworthy spatio-temporal assistants.

3. Spatio-Temporal Forecasting Model

Building on the temporal modeling components discussed in Section 2, this section focuses on architectures that jointly model spatial dependencies and temporal dynamics. In most spatio-temporal networks, a temporal backbone (e.g., ARIMA, RNN, Transformer, or a foundation model) is embedded into a spatial module (e.g., CNNs on grids or GNNs on graphs), so that temporal feature extraction and spatial interaction modeling are tightly coupled. Therefore, rather than treating time series forecasting and spatio-temporal forecasting as disjoint areas, we view the former as providing reusable temporal blocks that are instantiated within the latter.
Traditional time series forecasting methods generally treat spatiotemporal sequences as multiple independent time series, making it difficult to capture spatial correlations between sequences. However, in real-world complex networks, not only do nodes exhibit temporal dynamic evolution, but there also exist spatial dependencies arising from interactions among a large number of nodes. Therefore, conducting research on network-based spatiotemporal sequence forecasting is of greater practical significance.
Network spatiotemporal data can be further categorized by data type into Euclidean-structured spatiotemporal data and graph-structured spatiotemporal data. Euclidean-structured spatiotemporal data refers to structured data such as two-dimensional images and one-dimensional text sequences, which can be considered as special cases of graph-structured data.
Euclidean-structured spatiotemporal prediction methods can be classified into three categories: traditional machine learning–based methods, convolutional neural network–based methods, and autoencoder-based methods.

3.1. Euclidean-Structured Methods

(1) Traditional Machine Learning Methods
Support Vector Machines (SVMs), as a classical supervised machine learning method, are widely applied to nonlinear and high-dimensional data modeling tasks. The core idea is to map the original data into a high-dimensional feature space via kernel functions, thereby achieving linear separability of samples in that space and effectively addressing nonlinear classification and regression problems. In regression modeling, Support Vector Regression (SVR) constructs an epsilon-insensitive margin around the target function and only penalizes samples lying outside this margin, thereby optimizing both model complexity and prediction error.
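The epsilon-insensitive loss at the heart of SVR can be stated in a few lines; the following sketch is our own (with an illustrative epsilon) and shows that only residuals outside the margin are penalized:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """SVR's epsilon-insensitive loss: residuals inside the
    [-epsilon, +epsilon] tube incur zero penalty; samples outside
    the tube contribute |residual| - epsilon each."""
    return sum(max(0.0, abs(yt - yp) - epsilon)
               for yt, yp in zip(y_true, y_pred))

# Residuals of -0.05 and 0.08 fall inside the tube (epsilon = 0.1);
# only the residual of 0.3 is penalized, contributing 0.3 - 0.1 = 0.2.
loss = epsilon_insensitive_loss([1.0, 2.0, 3.0], [1.05, 1.92, 3.3],
                                epsilon=0.1)
print(round(loss, 10))  # 0.2
```

Minimizing this loss plus a norm penalty on the model weights is what lets SVR trade off prediction error against model complexity, as described above.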
Considering the characteristics of spatiotemporal data modeling, ref. [39] introduced a spatiotemporal kernel function into the traditional SVM framework to develop the Spatio-Temporal Support Vector Regression (STSVR) model, which enhances the model’s ability to jointly represent temporal and spatial features. Additionally, ref. [40] incorporated the Hidden Markov Model (HMM), transforming the spatiotemporal sequence modeling task into a problem of modeling state transition processes, providing an alternative approach for handling sequences with latent dynamic patterns.
Overall, traditional machine learning methods have certain advantages in small-scale, structured spatiotemporal data modeling, especially when dealing with nonlinear relationships and low-dimensional features, where they often demonstrate stable performance. However, such methods generally rely heavily on sophisticated feature engineering, making it difficult to capture complex spatial structures or long-range temporal dependencies. As a result, they face limitations in modeling capability and generalization performance in large-scale, high-dimensional, and multi-source heterogeneous data scenarios.
Nevertheless, in practical industrial and edge computing applications, traditional machine learning models remain valuable. Methods such as K-Nearest Neighbors (KNN), SVM, Random Forests, and Decision Trees are often used in lightweight forecasting scenarios due to their simple structures, low training overhead, and deployment flexibility, meeting the needs for fast response and real-time processing. In recent years, some studies have attempted to integrate traditional methods with deep learning models, such as combining XGBoost with LSTM for coarse feature screening and multi-stage forecasting, or combining SVM with Temporal Convolutional Networks (TCNs) to enhance feature representation and prediction accuracy.
In addition, Automated Machine Learning (AutoML) technologies have gradually been applied to spatiotemporal modeling tasks, further simplifying the modeling process. AutoML frameworks such as Google Vizier and Microsoft NNI support automated search for regression model structures and optimal feature combinations, enabling model selection and hyperparameter tuning without manual intervention. This provides efficient and scalable modeling solutions for spatiotemporal prediction in complex systems.
(2)
Convolutional Neural Networks
CNNs [41,42] are a type of neural network that has been successfully applied across various domains such as computer vision and signal processing. In addition to the input and output layers, the CNN architecture includes hidden layers such as convolutional layers, pooling layers, and fully connected layers. The convolutional layer convolves input features with different convolution kernels, which are sets of weights learned during training, to produce output features. Pooling layers, typically applied after convolutional layers, perform subsampling on their input, for example, applying a max operation to adjacent values in a matrix to obtain a smaller matrix composed of the most salient features. Unlike traditional feedforward neural networks in which two adjacent layers are fully connected, convolutional layers in CNNs connect only local elements via sliding kernels. Through such local connectivity, output neurons are connected only to a subset of input neurons in their spatial neighborhood, enabling the model to capture local spatial features of the input matrix.
A typical CNN workflow is shown in Figure 3: input data pass through successive convolutional and pooling layers to extract multi-level spatial features, which are then integrated by fully connected layers before reaching the output.
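The convolution-then-pooling workflow above can be sketched with a naive NumPy implementation; the kernel size, pooling window, and ramp-shaped input are illustrative choices only:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 2-D 'valid' convolution (cross-correlation) of input x with kernel k."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool2(x):
    """2x2 max pooling with stride 2 (assumes even dimensions):
    keeps the most salient value in each local block."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "spatial" input
k = np.ones((3, 3)) / 9.0                      # 3x3 averaging kernel
feat = conv2d_valid(x, k)                      # 4x4 feature map
pooled = max_pool2(feat)                       # 2x2 map after subsampling
```

Note the local connectivity described above: each output value in `feat` depends only on a 3×3 neighborhood of the input, never on the whole matrix.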
Due to their powerful feature extraction capabilities, CNNs have been widely applied to modeling spatial-domain correlations in spatiotemporal sequence prediction [43,44,45]. By transforming irregular network topologies into regular grids, CNNs have also been used for spatial feature extraction [46,47]; however, such grid transformations result in the loss of inherent topological information from irregular networks.
Subsequently, methods combining CNNs with RNNs for joint spatiotemporal feature extraction began to emerge. For example, ref. [48] proposed a Diffusion Convolutional Recurrent Neural Network model, which applies diffusion convolution and recurrent neural networks to model spatial and temporal features, respectively, on directed graphs. ConvLSTM [49], a variant of LSTM, replaces the fully connected operations with convolutional operations, enabling convolution-based transformations in both the input-to-hidden state mapping and the gated structures. As a result, ConvLSTM can effectively extract spatial information in addition to capturing temporal features.
In recent years, for gridded spatiotemporal data such as urban traffic, crowd density, and air quality indices, researchers have proposed architectures such as ST-ResNet and ST-UNet. These models leverage deep residual convolutional modules to achieve multi-scale spatial modeling while effectively capturing temporal dynamics, achieving state-of-the-art performance on multiple benchmark datasets. In meteorological forecasting tasks, CNN-based approaches remain dominant. Models such as DeepST and Google's MetNet series, for instance, utilize CNNs to extract spatial structure information from multi-source observational data (e.g., radar and satellite imagery), and integrate temporal convolutions with attention mechanisms to deliver high-resolution, short-term, high-frequency precipitation forecasting capabilities [46,50].
While CNNs provide an effective framework for processing Euclidean-structured data, they are not capable of fully capturing the complex spatial relationships in non-Euclidean spatiotemporal data. Therefore, even frameworks that combine CNNs with time series models still face limitations in extracting the complex spatial dependencies present in spatiotemporal networks.
It is also important to note that the boundary between Euclidean-structured and graph-structured methods is increasingly blurred. On the one hand, irregular sensor networks are often embedded into regular grids (e.g., via interpolation or spatial aggregation) so that CNNs and ConvLSTM-style architectures can be applied efficiently, at the cost of discarding part of the original network topology. On the other hand, regular grids themselves can be represented as graphs whose nodes correspond to grid cells and whose edges encode local neighborhood relations, allowing GNN-based models to operate on nominally Euclidean data. In practice, choosing between Euclidean CNN-based architectures and graph-based architectures involves a trade-off: CNNs provide high local spatial resolution and are computationally efficient on dense grids, whereas GNNs better preserve complex topological structure and long-range relational patterns, especially when the underlying domain is inherently non-Euclidean.
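The "regular grids as graphs" view mentioned above can be made concrete with a small sketch that builds a 4-neighborhood adjacency matrix for an h × w grid; this is a minimal illustration of the idea, not any specific model's graph construction:

```python
import numpy as np

def grid_adjacency(h, w):
    """Adjacency matrix of an h x w grid graph: each cell is a node and
    edges connect 4-neighborhood cells (the 'grid as graph' view)."""
    n = h * w
    A = np.zeros((n, n), dtype=int)
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:                    # right neighbor
                A[i, i + 1] = A[i + 1, i] = 1
            if r + 1 < h:                    # bottom neighbor
                A[i, i + w] = A[i + w, i] = 1
    return A

A = grid_adjacency(3, 3)   # 9-node graph; corners have degree 2, the center degree 4
```

Once the grid is expressed this way, any GNN layer that consumes an adjacency matrix can operate on nominally Euclidean data unchanged.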
(3)
Autoencoder Models
Autoencoder-based architectures can be designed as end-to-end models specifically tailored for spatiotemporal sequence forecasting problems. The encoder and decoder structures can be constructed through combinations of fundamental frameworks such as CNNs and RNNs, enabling efficient extraction and prediction of long-term complex spatiotemporal features.
In [51], an end-to-end model for multi-step forecasting was proposed based on a deep ConvLSTM encoder–decoder architecture, which demonstrated the capability of capturing spatiotemporal dependencies through convolutional operations on sequential inputs. Building on this line of research, the DGMR model further introduced a U-Net architecture combined with a ConvGRU decoder, where the U-Net component compresses spatial features and the ConvGRU decoder performs step-by-step temporal prediction [50]. Extending beyond convolutional and recurrent designs, the DCRNN model incorporated graph convolution into the encoder–decoder framework, enabling the integration of spatial graph structures with temporal dynamics for more effective modeling of spatiotemporal dependencies [48].

3.2. GNN-Based Methods

While Euclidean-structured methods such as CNNs and autoencoder-based models have shown strong performance in extracting spatial correlations on regular grids, their reliance on grid transformation inevitably leads to the loss of intrinsic topological information when applied to irregular, graph-like data. In many real-world spatio-temporal scenarios—such as traffic networks, power grids, and sensor systems—the spatial domain is naturally represented as a graph, where nodes correspond to observation points and edges capture complex and dynamic interactions. To overcome the limitations of Euclidean assumptions and better preserve structural dependencies, research has increasingly turned to GNNs, which extend deep learning to non-Euclidean domains and provide a powerful framework for modeling spatio-temporal graph data.
Moreover, GNNs can also be applied to gridded data by constructing adjacency relations among grid cells, making them a flexible tool that unifies irregular sensor networks and regular Euclidean lattices under a single graph-based formulation.
In the real world, the spatial features of spatiotemporal data are often represented as a graph network in which nodes exhibit complex coupling relationships. Such structures cannot simply be modeled as standard 2D or 3D grid data; instead, they can be cast as a complex decentralized learning problem over several interacting nodes [52,53,54]. Moreover, data collected from irregularly distributed sensors in practical applications, when processed through a gridding operation followed by CNN-based feature extraction, inevitably lose the inherent complex network topological information of the system.
Graph embedding techniques can preserve structural information to a greater extent; however, their limitation lies in the fact that node encoding weights cannot be shared, causing the number of weight parameters to increase linearly with the number of nodes. This makes them less suitable for modeling networks with large-scale nodes. The emergence of GNNs [55] has extended the capabilities of deep learning to non-Euclidean domains.
In terms of implementation, methods for graph-structured spatiotemporal data can be divided into two types: graph convolution methods that extract spatial features, and complete spatiotemporal prediction models built upon them.
(1)
Graph Convolutional Networks
To address the challenge of spatial feature extraction in spatiotemporal sequence prediction, Joan Bruna et al. proposed the first generation of graph convolutional networks based on the graph Laplacian matrix, providing a new approach for spatial feature extraction [56]. Its structure is defined as:
$$H^{(l+1)} = \sigma\left( U \, g_\theta \, U^{\top} H^{(l)} \right),$$
where $U$ denotes the matrix of eigenvectors of the graph Laplacian, $H^{(l)}$ is the hidden representation at layer $l$, and $g_\theta$ is a set of trainable spectral filter parameters.
Subsequently, Defferrard et al. [57] improved the training parameters by employing Chebyshev polynomials, leading to the second generation of graph convolution. The structure is given by:
$$g_\theta \star x \approx \sum_{k=0}^{K} \theta_k \, T_k(\tilde{L}) \, x,$$
where $T_k$ represents the Chebyshev polynomial of order $k$, and $\tilde{L} = \frac{2}{\lambda_{\max}} L - I_N$ is the rescaled Laplacian matrix.
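The Chebyshev filter can be evaluated without an eigendecomposition by using the polynomial recursion T_0(L̃)x = x, T_1(L̃)x = L̃x, T_k = 2L̃T_{k−1} − T_{k−2}. The sketch below uses a toy diagonal L̃ and unit coefficients purely for illustration:

```python
import numpy as np

def cheb_filter(x, L_tilde, theta):
    """Chebyshev graph filter sum_{k=0}^{K} theta_k T_k(L~) x, computed
    via the recursion T_0 x = x, T_1 x = L~ x, T_k = 2 L~ T_{k-1} - T_{k-2}."""
    K = len(theta) - 1
    Tx = [x, L_tilde @ x]
    for _ in range(2, K + 1):
        Tx.append(2.0 * (L_tilde @ Tx[-1]) - Tx[-2])
    return sum(theta[k] * Tx[k] for k in range(K + 1))

L_tilde = np.diag([0.5, -0.5, 0.0])     # toy rescaled Laplacian (illustrative)
x = np.ones(3)                          # toy graph signal
y_filt = cheb_filter(x, L_tilde, theta=[1.0, 1.0, 1.0])   # K = 2 -> 2-hop filter
```

Because only sparse matrix–vector products with L̃ appear, the cost scales with the number of edges rather than requiring the dense spectral transform of the first-generation formulation.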
Later, Kipf and Welling [58] further simplified the parameterization of Chebyshev polynomials to obtain the third generation of graph convolutions, which significantly reduced computational complexity. The formulation is:
$$H^{(l+1)} = f\left(H^{(l)}, A\right) = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right),$$
where $\tilde{A} = A + I_N$ is the adjacency matrix with self-loops, $\tilde{D}$ is the corresponding degree matrix, and $W^{(l)}$ is the trainable weight matrix at layer $l$.
In the second-generation graph convolutional networks, the selection of Chebyshev polynomial order K reflects the K-hop neighborhood information aggregation of a given node in the graph. More generally, spatial-domain GCNs update a node’s representation by aggregating information from its neighbors. Through neighbor selection and normalization, the convolution operation can effectively capture structural information from graph data.
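A minimal NumPy sketch of the third-generation (Kipf–Welling) propagation rule follows; the path graph, random features, and ReLU nonlinearity are illustrative assumptions:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One Kipf-Welling GCN layer:
    H' = ReLU(D~^{-1/2} A~ D~^{-1/2} H W), with A~ = A + I (self-loops)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)                       # degrees including self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetrically normalized adjacency
    return np.maximum(0.0, A_hat @ H @ W)

# Toy 4-node path graph, 3-dim node features, 2 output channels
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 2))
H_next = gcn_layer(H, A, W)    # each row aggregates a node's 1-hop neighborhood
```

Stacking $L$ such layers lets information propagate over $L$-hop neighborhoods, which is the spatial-aggregation view described above.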
(2)
Spatiotemporal Prediction Models Based on GCN
With the introduction of Graph Convolutional Networks (GCNs), graph-based convolutional methods have been increasingly applied in spatiotemporal prediction tasks. The early stage of this research combined GCNs with recurrent architectures to jointly capture spatial and temporal dependencies. Zhao et al. proposed T-GCN, integrating a GCN for spatial correlations with a GRU for temporal dynamics in transportation networks and verifying the effectiveness of GCNs in capturing complex topological dependencies [59]. Building on this, Li et al. developed the Diffusion Convolutional Recurrent Neural Network (DCRNN), embedding graph diffusion into the recurrent process through an encoder–decoder structure and achieving remarkable results on the METR-LA dataset [48]. Yu et al. further proposed STGCN, which separates spatial and temporal convolutions into spatiotemporal blocks, enabling modularized learning [60]. Later, MGCN [61] enhanced representational power through multi-graph structures, while STGSL [62] introduced adaptive graph structure learning to jointly optimize graph topology and prediction. Meanwhile, SGL and TLE frameworks [63,64] improved robustness and generalization by learning sparse graph structures and local perturbations, further expanding GCN's adaptability in dynamic environments.
Subsequent studies extended GCN-based frameworks toward more expressive and adaptive architectures. Zhu et al. proposed ASTGCN to integrate node attributes and positional encoding, enhancing the representation of heterogeneous nodes [65], while Guo et al. introduced an attention-based ASTGCN variant with multi-scale modeling and spatiotemporal attention mechanisms that outperformed STGCN on the PEMS dataset [66]. Around the same period, STG2Seq [67] and attention-based models [68] improved long-horizon forecasting. GCN-GAN [69] leveraged adversarial learning for dynamic network prediction, MC-STGCN [70] incorporated multivariate correlation modeling, and GraphWaveNet [71] introduced residual and adaptive graph mechanisms to achieve state-of-the-art results. More recently, GMAN [72] utilized multi-head attention to capture dynamic topological evolution. Overall, GCN-based spatiotemporal prediction has evolved from static graph–RNN hybrids into adaptive, multi-graph, attention-enhanced frameworks with strong scalability, robustness, and generalization for large-scale networked systems.
(3)
Optimized Prediction Models Based on Spatiotemporal Feature Pattern Mining
In recent years, network-based spatiotemporal prediction models have often adopted a modular spatiotemporal architecture to extract, analyze, and predict input traffic data, typically combining Graph Neural Networks with Recurrent Neural Networks to model spatiotemporal relationships [73,74]. Common components within these spatiotemporal modules include attention mechanisms, graph neural networks, and temporal convolutions. While GNN- and RNN-based spatiotemporal prediction models have achieved promising results, they still exhibit certain limitations: (1) In temporal feature extraction, RNN or Temporal Convolutional Network methods are often employed, but lack the ability to capture global dependencies in temporal data, leading to reduced accuracy in long-horizon forecasting; (2) In spatial feature extraction, most models focus on local neighborhood information around nodes while neglecting global feature extraction; (3) Most approaches assume uniformly sampled spatiotemporal data, lacking exploration of continuous-time evolution in complex systems.
[I]
Spatiotemporal Prediction Models Based on Global Information Extraction
Most existing spatiotemporal feature extraction methods emphasize modeling local neighborhood information around nodes, with limited consideration for long-range or global dependencies. While frameworks such as RNN and LSTM can effectively model dynamic features within short local time spans, they suffer from representation degradation when modeling long-term dependencies as the distance between input and output sequences increases. To address this challenge, Chen et al. proposed the Spatio-Temporal Hypergraph Separation Networks (STH-SepNet) framework, which explicitly decouples temporal and spatial modeling, improving predictive performance while maintaining computational efficiency [75]. This model employs lightweight Large Language Models to extract low-rank dynamic features along the temporal dimension and introduces an adaptive hypergraph neural network to dynamically construct high-order spatial hyperedges, thereby capturing complex spatial relationships. During the fusion stage, a gating mechanism integrates temporal and spatial representations to achieve efficient global-level prediction. Li et al. proposed GPT-ST (Generative Pre-Training for Spatio-Temporal Graph Neural Networks), which further advances the direction of global spatiotemporal modeling [38]. GPT-ST adopts the masked prediction paradigm from language models, pre-training graph neural networks on spatiotemporal graph structures to learn awareness of long-range spatiotemporal dependencies, thereby improving generalization and robustness across multiple downstream forecasting tasks. Jeon and Kim proposed a context-enhanced learning framework for solar irradiance forecasting [76]. In scenarios lacking long-term observational data, this method utilizes a context-driven multi-scale historical window mechanism to significantly improve learning of solar radiation intensity trends, demonstrating effective integration of global temporal information. 
The long-term context modeling concept can also be transferred to other continuous spatiotemporal tasks such as traffic and energy consumption, showing strong adaptability for complex scenarios requiring global structural modeling. Furthermore, ref. [77] proposed a spatiotemporal graph neural network architecture integrating Spatial Graph Neural Networks (S-GNNs), GRUs, and Transformers to respectively model spatial dependencies, local temporal dependencies, and global temporal dependencies, thereby broadening the scope of spatiotemporal feature extraction. From the perspective of graph representation improvement, ref. [78] employs Dynamic Time Warping (DTW) to construct temporal graph structures and combines them with a threshold-based diffusion convolution mechanism to strengthen modeling of latent temporal patterns and long-range spatial correlations. Meanwhile, Pyraformer incorporates a multi-scale temporal connection mechanism and utilizes a hierarchical attention structure to better capture long-range dependencies [23].
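As a concrete reference point for the DTW-based graph construction mentioned above, the classic dynamic-programming DTW distance can be sketched as follows (the toy series are chosen purely for illustration):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D series,
    usable e.g. as an edge weight when building similarity-based
    temporal graphs between sensor nodes."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.array([0.0, 1.0, 2.0, 1.0])
b = np.array([0.0, 0.0, 1.0, 2.0, 1.0])   # same shape, shifted in time
d = dtw_distance(a, b)                     # warping absorbs the time shift
```

Unlike pointwise Euclidean distance, DTW aligns series that share a shape but are shifted or stretched in time, which is why it is attractive for constructing temporal similarity graphs.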
[II]
Spatiotemporal Prediction Models Based on Dynamic Spatiotemporal Feature Extraction
Many existing models overlook the dynamic structural characteristics of temporal networks in real-world applications, where spatiotemporal patterns evolve over time. However, in many early models, spatial correlations become fixed after training, making them less adaptable to subsequent structural changes. To better model such spatiotemporal dynamics, Graph Attention Networks (GATs) introduce attention mechanisms to emphasize the relative importance of neighboring nodes, thereby enhancing the dynamic modeling of spatial dependencies [79]. Li et al. proposed the Gated Attention Network (GaAN), which integrates GAT with LSTM to balance spatial dynamic perception and temporal modeling capabilities, demonstrating strong transferability in heterogeneous scenarios [80]. In 2019, Park et al. introduced the Spatio-Temporal Graph Attention Network (STGRAT), which uses self-attention to jointly model spatial and temporal information, avoiding the sequential constraints of traditional RNN-based methods, and achieving better performance than STGCN, DCRNN, and GaAN on traffic datasets such as METR-LA [81]. Building on this, Zheng et al. introduced the Graph Multi-Attention Network (GMAN), which applies multi-head attention to differentially model graph structures, enabling dynamic adjustment of spatiotemporal dependencies over time [72]. Nonetheless, these approaches generally construct graphs based on k-nearest neighbors, limiting their ability to capture high-order dependencies beyond adjacency ranges. To overcome inherent limitations in graph construction, Cini et al. proposed the Taming Local Effects (TLE) framework, which analyzes and corrects the issue of local node feature dominance in GNN predictions [64]. This model employs a shared node representation mechanism to enhance cross-region representation consistency, thereby improving the stability of graph structures in evolving spatiotemporal scenarios. Guo et al. further incorporated heterogeneous spatial modeling concepts and proposed the Spatial Heterogeneity-aware GNN (SHGNN), which models heterogeneous temporal and spatial characteristics across different subregions, making it suitable for heterogeneous transportation networks such as buses, taxis, and bike-sharing systems [82]. Wang et al. approached the problem from a long-term deployment perspective, proposing the Robust-Spatial GNN framework, which introduces a dynamic adversarial training mechanism to enhance model generalization under long-term spatial shifts (e.g., cross-year forecasting), demonstrating robustness in real-world traffic scenarios with frequent infrastructure changes [83]. The Spatio-Temporal Transformer Networks (STTNs) model presents a novel dynamic modeling structure in which spatiotemporal blocks consist of spatial and temporal Transformers. These blocks explicitly model dynamic dependencies that vary with changes in road topology and temporal steps. By mapping input features into a high-dimensional latent space, STTNs effectively capture complex association patterns under dynamic evolution [84].
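The neighborhood attention that underlies GAT-style models can be sketched in a few lines. The triangle graph, random features, and single-head additive parameterization below are illustrative assumptions, not any cited model's exact design:

```python
import numpy as np

def leaky_relu(v, slope=0.2):
    return np.where(v > 0, v, slope * v)

def gat_attention(h, A, W, a_vec):
    """Single-head GAT-style attention coefficients:
    e_ij = LeakyReLU(a^T [W h_i || W h_j]), normalized by a softmax
    over each node's neighborhood so that weights adapt to the data."""
    z = h @ W
    n = A.shape[0]
    alpha = np.zeros((n, n))
    for i in range(n):
        nbrs = np.where(A[i] > 0)[0]
        e = np.array([float(leaky_relu(a_vec @ np.concatenate([z[i], z[j]])))
                      for j in nbrs])
        w = np.exp(e - e.max())                 # numerically stable softmax
        alpha[i, nbrs] = w / w.sum()
    return alpha

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)   # toy triangle graph
rng = np.random.default_rng(0)
h = rng.standard_normal((3, 4))
W = rng.standard_normal((4, 2))
a_vec = rng.standard_normal(4)
alpha = gat_attention(h, A, W, a_vec)   # each row sums to 1 over its neighbors
```

Because the coefficients depend on the current node features rather than fixed edge weights, the effective spatial dependencies can change from one time step to the next, which is the dynamic-modeling property emphasized above.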
[III]
Spatiotemporal Prediction Models Based on Neural ODE for Continuous-Time Modeling
Although research on spatiotemporal sequence prediction continues to advance, most existing studies are based on modeling and analyzing large-scale discrete spatiotemporal data collected by sensors, typically under regular, uniformly sampled time intervals [85,86]. If the spatiotemporal patterns of a network undergo a switching event, the exact time of this event may be unknown. When relying on discrete data, such a switch may occur between two sampling points, meaning that temporal changes are not precisely observed. Consequently, uniformly sampled data face inherent bottlenecks in capturing time-varying spatiotemporal transformation patterns. Chen et al. proposed a novel neural network paradigm that generalizes discrete deep neural networks by parameterizing the derivatives of latent states, enabling the use of Neural Ordinary Differential Equation modules for continuous-time dynamic data modeling [87]. Recent studies have tested ODE-based continuous-time modeling in spatiotemporal traffic flow forecasting. Coupled Graph ODEs, for instance, employ Neural ODE–based GNN algorithms to learn the coupled dynamics of nodes and edges in a network, with sequence forecasting experiments validating the method’s effectiveness [88]. Ref. [89] proposed the Spatio-Temporal Graph Neural Controlled Differential Equation (STG-NCDE) approach, which designs two separate Neural Controlled Differential Equations (NCDEs) to model the temporal and spatial domains, respectively, and combines them into a unified framework. Experiments demonstrated that the model achieved state-of-the-art performance in traffic flow forecasting scenarios.
A key advantage of Neural ODE- and NCDE-based spatio-temporal models is their ability to evolve latent states in continuous time. Unlike discrete-time RNNs or Transformers, which typically assume regularly sampled sequences (or require ad hoc interpolation to handle irregular intervals and missing values), continuous-time formulations can naturally query the latent trajectory at arbitrary time stamps. This property makes Neural ODE/NCDE–based architectures particularly attractive for settings with irregular sampling, asynchronous sensors, and intermittently missing observations—challenges that are highlighted in Section 4 as central obstacles for real-world spatio-temporal forecasting.
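The core Neural ODE idea, parameterizing dh/dt and integrating it so the latent state can be queried at arbitrary time stamps, can be illustrated with a simple explicit-Euler sketch; the linear-decay dynamics stand in for a learned network and the irregular time grid is an arbitrary example:

```python
import numpy as np

def odeint_euler(f, h0, t_grid):
    """Integrate dh/dt = f(h, t) with explicit Euler steps; the latent
    trajectory can be evaluated on an arbitrary (possibly irregular)
    grid of time stamps, unlike a fixed-step RNN."""
    h = np.array(h0, dtype=float)
    traj = [h.copy()]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        h = h + (t1 - t0) * f(h, t0)     # step size adapts to the time gap
        traj.append(h.copy())
    return np.array(traj)

# Toy "learned" dynamics: linear decay dh/dt = -h (closed form: h0 * exp(-t))
f = lambda h, t: -h
t_irregular = np.array([0.0, 0.1, 0.35, 0.4, 1.0])   # non-uniform time stamps
traj = odeint_euler(f, [1.0], t_irregular)
```

In practice f would be a neural network and the solver an adaptive method with adjoint-based gradients, but the key property, evaluating the state at irregular observation times, is already visible here.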

3.3. Diffusion-Model-Based Methods

Although GNN-based approaches have achieved strong performance in capturing spatial dependencies, they remain largely discriminative in nature. To better address uncertainty, distribution shifts, and long-horizon forecasting, research has increasingly explored generative paradigms, with diffusion probabilistic models emerging as a promising direction.
In recent years, Diffusion Probabilistic Models have gained attention for their remarkable performance in image generation and have gradually expanded to time series and spatiotemporal forecasting tasks, becoming a significant research direction in generative modeling. Unlike traditional discriminative models, diffusion models perform progressive “denoising” sampling on data distributions, enabling effective characterization of latent dynamic evolution processes and high-order correlation structures in complex systems, particularly suited for modeling non-stationary spatiotemporal data with high uncertainty.
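The forward "noising" half of a denoising diffusion model admits a closed form, q(x_t | x_0) = N(√ᾱ_t x_0, (1 − ᾱ_t)I); the sketch below applies it to a toy 1-D signal with an assumed linear noise schedule (the schedule values and signal are illustrative, not from the cited works):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 100
betas = np.linspace(1e-4, 0.02, T)    # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)        # cumulative signal-retention factor

def q_sample(x0, t):
    """Forward diffusion: sample x_t ~ N(sqrt(a_bar_t) x0, (1 - a_bar_t) I).
    The reverse (denoising) process learned by the model inverts these steps."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise, noise

x0 = np.sin(np.linspace(0, 2 * np.pi, 32))   # toy 1-D "spatiotemporal" signal
x_small, _ = q_sample(x0, 5)                  # early step: still close to x0
x_large, _ = q_sample(x0, T - 1)              # late step: mostly noise
```

Training fits a network to predict the injected noise at each step; sampling then runs the learned reverse chain from pure noise, which is why inference requires many sequential denoising steps.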
To incorporate diffusion mechanisms into graph-structured modeling tasks, Wen et al. proposed the DiffSTG framework, which integrates the denoising diffusion process with Spatio-Temporal Graph Neural Networks by introducing a graph-structure-preserving constraint along the diffusion path. This enables unified probabilistic prediction and generative modeling for tasks such as traffic forecasting. The method is notable for its combination of multi-step prediction accuracy, generalization capability, and interpretability, making it one of the most representative applications of DPMs in graph prediction tasks [90].
Cheng et al. proposed SparseDiff, a sparse diffusion autoencoder framework designed for dynamic adaptability modeling in complex systems during the testing phase. This approach introduces a sparse graph encoder and a Graph Neural Ordinary Differential Equation module, enabling the generation process to automatically reconstruct graph structures and guide the diffusion decoder to adapt to distribution shifts during testing, thereby improving robustness and continuous modeling capabilities in real-world complex systems [91].
In the direction of dynamic graph structure learning, Jung and Jang introduced the DiffGSL framework, which reconstructs graph topology based on the diffusion process and guides neural networks to dynamically learn time-varying relationships between nodes. By combining a noise-driven graph reconstruction mechanism with Graph Attention Networks, this method effectively models dynamic dependency structures in heterogeneous graph scenarios [92].
Further integrating system physical dynamics, Cachay et al. proposed the DYffusion model, which injects dynamics-informed constraints derived from system differential equations into the diffusion path. This improves structural consistency and physical interpretability in physical domains such as traffic flow, meteorology, and energy consumption. Experiments demonstrated its superior multi-step forecasting performance on multiple real-world datasets [93].
From a methodological review perspective, Yang et al. conducted a systematic survey of diffusion-based time series and spatiotemporal modeling methods, identifying three primary research directions:
  • Designing reverse diffusion paths with strong task adaptability, such as residual graph structure prediction and modulation-attention-based temporal modeling;
  • Improving inference efficiency and stability, including sparse decoders and multi-scale reconstruction paths;
  • Integrating with Graph Neural Networks or Neural Ordinary Differential Equations to achieve unified frameworks for continuous-time modeling and dynamic graph forecasting [94].
However, diffusion-based approaches also introduce non-trivial computational overhead compared with one-shot predictors such as standard GCN- or Transformer-based models. Each prediction typically requires iterating a denoising process over dozens or even hundreds of time steps, which increases inference latency and energy consumption. This cost can be particularly problematic in real-time applications such as traffic signal control, ride-hailing dispatch, or short-term weather nowcasting, where decisions must be updated at high frequency. Recent work on sparse decoders, reduced-step samplers, and distillation of diffusion models into faster surrogates partially mitigates these issues, but the trade-off between probabilistic accuracy and deployment-time efficiency remains an important consideration when comparing diffusion-based architectures with simpler discriminative models.
Overall, diffusion models offer powerful generative capabilities and modeling flexibility for spatiotemporal forecasting tasks, showing strong potential in challenging scenarios such as non-uniform observations, continuous-time modeling, and long-horizon multi-step forecasting. With the advancement of Graph Neural Networks, self-supervised learning, and differential equation-based modeling, diffusion-based spatiotemporal prediction methods are expected to become a mainstream generative modeling paradigm in the future.

3.4. Causality-Based Methods

Beyond diffusion-based generative approaches, another emerging direction is causality-based modeling, which focuses on disentangling true causal drivers from spurious correlations to improve robustness and interpretability in spatiotemporal prediction.
Traditional spatiotemporal prediction models predominantly rely on correlation-driven learning mechanisms, making it difficult to disentangle causal drivers from accompanying associations in complex network dynamics. As a result, such models often exhibit limited generalization capability when facing structural perturbations, distribution shifts, or intervention reasoning tasks. In recent years, researchers have attempted to incorporate causal inference frameworks to enhance adaptability to structural changes and mechanism transfer, thereby improving robustness and interpretability in spatiotemporal sequence modeling.
Xia et al. systematically examined mainstream graph neural network–based spatiotemporal prediction models and identified a key issue in their prediction mechanisms: the “mixing of causal paths and confounding variables.” They proposed reconstructing the data generation mechanism in the prediction process from the perspective of causal graph analysis. Based on this, the authors designed the Causal-Spatio-Temporal Graph Neural Network (Causal-STGNN) framework, which decomposes predictive causal chains in the graph structure, identifies confounding variables, and performs intervention-based modeling. This significantly improves predictive robustness in scenarios involving structural perturbations or regional transfer. Experiments on datasets such as METR-LA and PEMS-BAY demonstrated that this method not only outperforms various traditional approaches in terms of average prediction error but also exhibits superior generalization performance after subregional topological changes [95].
In terms of modeling methodology, Causal-STGNN constructs a causal graph structure incorporating intervention variables and applies do-calculus operations to perform causal intervention modeling on spatial graph information. The design emphasizes disentangling static correlations among observed variables into dynamic causal paths, followed by explicit modeling and solution of the causal graph structure. Experimental results show that when node numbers or connectivity patterns change, Causal-STGNN maintains high predictive accuracy, significantly outperforming mainstream models based on graph attention or adaptive adjacency matrices.
Complementary to graph-level causal intervention, Chen et al. proposed a causality-induced distributed spatio-temporal feature extraction framework that enforces consistency between local features and global causal structures across sensor nodes [96]. By injecting causal priors into the feature learning process and designing distributed modules that operate under communication and computation constraints, their method improves robustness under structural perturbations and noisy observations on traffic and mobility benchmarks.
Furthermore, Einizade et al. proposed the Causal Graph Process Neural Networks (CGPN) model, which leverages causal graph processes to characterize structurally driven mechanisms underlying spatiotemporal evolution. They recover latent causal graph structures from observed time series and transform them into graph process terms for modeling, thereby capturing dynamic causal relationships within the system. The CGPN framework integrates graph neural networks with causal inference and adopts a parameter-efficient design that substantially reduces computational cost, making it particularly suitable for large-scale scenarios such as high-dimensional traffic networks and energy consumption forecasting [97].
Unlike previous methods, CGPN emphasizes causal structure sparsity and structural interpretability in the prediction process. Its inference procedure can be decomposed into three stages: (1) structure identification—estimating the dynamic causal influence matrix between nodes by minimizing a causal consistency error; (2) process modeling—using sparse graph processes to model the evolution of node states; and (3) structure-enhanced prediction—embedding the causal structure into neural networks to achieve predictions with stronger transferability and generalization.
More broadly, existing causal frameworks for spatio-temporal forecasting can roughly be grouped into causal discovery and causal inference paradigms. The former focuses on learning directed graphs or Granger-causal relations from observational data, while the latter studies how interventions on nodes or edges influence future trajectories. In practice, both aspects are challenging: scalable discovery on large, noisy graphs is computationally demanding, realistic intervention data are scarce, and existing evaluation protocols rarely stress-test robustness under edge or weight perturbations, exogenous shocks, or policy changes. Developing benchmarks and metrics that explicitly target these scenarios would greatly facilitate progress in this direction.
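Both Granger-style causal discovery and the structure-identification stage reduce to a common primitive: regressing each node's next state on lagged states and sparsifying the resulting influence matrix. The sketch below works through this primitive on a linear toy system (an illustrative stand-in, not the CGPN estimator; the dynamics matrix and threshold are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a 4-node linear system x_{t+1} = A x_t + noise with a sparse true A.
A_true = np.array([[0.5, 0.0, 0.0, 0.3],
                   [0.4, 0.5, 0.0, 0.0],
                   [0.0, 0.4, 0.5, 0.0],
                   [0.0, 0.0, 0.4, 0.5]])
T, n = 2000, 4
X = np.zeros((T, n))
for t in range(T - 1):
    X[t + 1] = X[t] @ A_true.T + 0.1 * rng.standard_normal(n)

# Structure identification: least-squares fit of lag-1 influences,
# followed by hard thresholding to enforce sparsity.
past, future = X[:-1], X[1:]
A_hat = np.linalg.lstsq(past, future, rcond=None)[0].T
A_sparse = np.where(np.abs(A_hat) > 0.1, A_hat, 0.0)

# With enough data, the recovered support matches the true causal edges.
support_true = A_true != 0
support_est = A_sparse != 0
```

In practice, lasso-type penalties replace the hard threshold, and the recovered sparse structure is then embedded into the downstream predictive network.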
In summary, spatiotemporal prediction methods based on causal structures demonstrate significant potential in enhancing robustness, interpretability, and adaptability to complex topological perturbations. This research direction is gradually emerging as a cross-disciplinary frontier linking machine learning and causal inference, offering a new paradigm for addressing spatiotemporal prediction problems in real-world, dynamically heterogeneous structures.

3.5. Multimodal Methods

In addition to causality-based approaches that enhance robustness and interpretability, another important trend is the integration of multimodal information. By combining diverse data sources and incorporating social or contextual knowledge, multimodal methods further enrich spatiotemporal representations and enable more comprehensive modeling of complex real-world dynamics.
To more comprehensively capture spatiotemporal dynamics in complex scenarios, researchers have gradually incorporated multimodal fusion and social-awareness mechanisms to strengthen the modeling of spatial proximity, behavioral commonalities, and semantic complementarity. Malla et al. proposed the Social-STAGE framework, which integrates visual and social-aware modeling and employs a graph attention mechanism to characterize interaction dynamics among multiple agents. It also introduces a spatial gating module and a Conditional Variational Autoencoder (CVAE)-based prediction branch to address path prediction under diverse social scenarios, achieving significant performance improvements on trajectory prediction datasets such as ETH and UCY [98].
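The CVAE-based branch is what yields multiple plausible futures per agent: at test time, several latent codes are drawn from a context-conditioned Gaussian and decoded into distinct trajectories. The sampling step relies on the reparameterization trick, sketched below with a hypothetical 16-dimensional latent (the conditioning encoder and trajectory decoder are omitted):

```python
import numpy as np

rng = np.random.default_rng(2)

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); differentiable in (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical context-conditioned prior for one agent: 16-dim latent.
mu, log_var = np.zeros(16), np.full(16, -1.0)

# Drawing K latent samples yields K distinct decoded trajectories downstream.
K = 20
zs = np.stack([reparameterize(mu, log_var, rng) for _ in range(K)])
```

Because the noise enters additively, gradients flow through mu and log_var during training, which is what makes the stochastic prediction branch end-to-end trainable.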
In emergency scenarios, such as crowd evacuation triggered by natural disasters, Jiang et al. proposed the Social Meta-Knowledge Guided Transformer framework. This approach incorporates social meta-knowledge into the Transformer architecture to construct migration graphs between urban spaces. By integrating a multi-layer graph attention encoding module, the model enhances its capacity to capture structural heterogeneity, effectively meeting the demands of human mobility prediction under extreme conditions [99].
To address the strong heterogeneity present in multimodal data modeling, Deng et al. further proposed the Multi-Modality Spatio-Temporal Forecasting via Self-Supervised Learning (MoSSL) framework. MoSSL introduces a multimodal enhancer, global self-supervised learning (GSSL), and modality-specific self-supervised learning (MSSL) modules, which analyze latent dynamic patterns from temporal, spatial, and modality-specific perspectives, respectively. This approach significantly improves adaptability in joint prediction tasks involving multiple transportation modalities such as bike-sharing and taxi services [100]. On real-world urban datasets such as NYC and BJ, MoSSL achieved state-of-the-art performance, demonstrating the substantial potential of combining multimodal modeling with self-supervised strategies to enhance generalization and robustness.
Overall, multimodal and social-aware modeling provides higher-dimensional information representation and interaction modeling pathways for spatiotemporal prediction in complex scenarios. From behavioral interaction and modality heterogeneity to the incorporation of external structural knowledge, these approaches enrich the understanding of spatiotemporal heterogeneity and dynamic evolution, representing an important development trend in the current field.

3.6. Transformer-Based Methods

While multimodal methods enrich spatiotemporal forecasting by integrating heterogeneous sources and social-contextual information, their effectiveness still depends on the ability to model long-range dependencies and dynamic interactions across modalities. Transformer-based architectures, with their powerful self-attention mechanisms, naturally complement this need by providing a unified framework for capturing global temporal correlations and adaptive spatial relations at scale.
In recent years, the Transformer architecture has demonstrated strong capabilities in modeling global dependencies and scalability for spatio-temporal forecasting tasks, gradually emerging as an alternative to traditional RNN–GCN hybrid paradigms. Its core advantage lies in the use of multi-head self-attention mechanisms to capture long-range temporal dependencies and dynamic spatial correlations, while maintaining flexibility in multi-scale and uncertainty modeling.
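At the heart of these architectures is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d)V, which can be applied along the time axis to capture long-range temporal dependencies or across nodes to capture dynamic spatial correlations. A minimal single-head numpy sketch with illustrative sizes:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(3)
T, d = 12, 8                       # 12 time steps, 8-dim features (illustrative sizes)
X = rng.standard_normal((T, d))

# Self-attention: queries, keys, and values all derived from the same sequence,
# so every time step can weight every other step directly.
out, w = scaled_dot_product_attention(X, X, X)
```

Multi-head attention simply runs several such maps in parallel on learned linear projections of X and concatenates the results; spatial attention replaces the time axis with the node axis.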
Zhang et al. proposed the Hierarchical Spatial-Temporal Transformer Network (HSTTN) [101], which adopts an hourglass-like encoder–decoder structure. By combining down-sampling and up-sampling operations to extract multi-scale temporal features and introducing a Context Fusion Block (CFB) between spatial and temporal Transformer branches, the model significantly improved the accuracy of long-term wind farm forecasting. This design effectively reduces the computational cost of point-level self-attention and is particularly suitable for highly intermittent spatio-temporal data such as wind energy.
In the transportation domain, Jiang et al. proposed PDFormer [31], addressing challenges in traditional graph neural networks related to dynamic dependencies, long-range correlations, and propagation delays. They introduced a geographic–semantic dual-graph masked spatial attention mechanism and designed a propagation delay-aware module to capture the delayed effects of neighborhood states. Experiments showed that PDFormer outperformed mainstream baselines such as DCRNN and STGCN across multiple real-world traffic datasets, while also providing enhanced interpretability.
Liang et al. developed AirFormer [102], the first model to achieve nationwide air quality prediction across more than one thousand monitoring stations. The model introduced Dartboard Spatial Multi-Head Self-Attention (DS-MSA) and Causal Temporal Multi-Head Self-Attention (CT-MSA), which reduced the spatial complexity from quadratic to linear while retaining the parallelism advantage of Transformers. Additionally, it incorporated variational latent variables to capture uncertainty in air quality data. In 72-h forecasting tasks, AirFormer reduced prediction error by 5–8% compared to existing models, demonstrating scalability for large-scale environmental monitoring.
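The quadratic-to-linear reduction can be illustrated in simplified form: instead of attending over all N stations, each station attends to R pooled region summaries, so the score matrix is N × R rather than N × N. The sketch below uses a random region assignment as a hypothetical stand-in for AirFormer's dartboard partition:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, R = 1000, 16, 8              # N stations, feature dim d, R regions (R << N)

H = rng.standard_normal((N, d))    # per-station features
region = rng.integers(0, R, N)     # hypothetical region assignment per station

# Pool each region into one key/value; attention then costs O(N * R), not O(N^2).
K = np.stack([H[region == r].mean(axis=0) for r in range(R)])
V = K.copy()

scores = H @ K.T / np.sqrt(d)      # (N, R) score matrix instead of (N, N)
scores -= scores.max(axis=1, keepdims=True)
W = np.exp(scores)
W /= W.sum(axis=1, keepdims=True)
out = W @ V                        # (N, d) attended station representations
```

A dartboard partition refines this idea by pooling with distance- and direction-aware sectors around each query station, but the complexity argument is the same: the number of keys per query is fixed rather than growing with N.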
To address computational efficiency in large-scale spatio-temporal forecasting, Sun et al. proposed the Improved Inverted Transformer [103], which optimizes the inverted Transformer structure with sparse attention mechanisms. This approach significantly reduced computational overhead for long-sequence modeling while improving parallelism and memory efficiency. Applied to large-scale traffic and meteorological forecasting tasks, the method maintained prediction accuracy while greatly reducing computational costs, highlighting its potential for ultra-large-scale spatio-temporal applications.
Recent studies have also explored the integration of Transformers with graph-structured modeling. Bai and Liu proposed T-Graphormer [104], a graph-structure-aware Transformer that jointly models spatial topologies and temporal dependencies, enhancing cross-regional generalization capability. Zhang et al. introduced MLH-Trans [105], which leverages multivariate long-horizon representations to improve long-term forecasting stability and generalization in multi-step prediction tasks. The Unified Weather Transformer, published in Nature Machine Intelligence [106], applied shared cross-station attention mechanisms for global weather prediction, underscoring the potential of Transformers for large-scale and multimodal spatio-temporal data modeling.
In summary, Transformer-based spatio-temporal prediction methods are emerging as an independent research path. Their strengths lie in global dependency modeling, dynamic spatial adaptation, and large-scale scalability. However, challenges remain in ultra-long sequence stability, cross-modal information integration, and training efficiency, which require further investigation in future research.
Summary of empirical performance across benchmarks. Across commonly used benchmarks such as METR-LA, PEMS-BAY, PEMS-D3/D4/D7/D8, TaxiBJ, BikeNYC and nationwide air-quality datasets, classical statistical and traditional machine learning baselines (e.g., HA, ARIMA, VAR, SVR) typically serve as lower bounds on performance, with errors substantially higher than deep architectures under comparable settings. CNN- and RNN-based spatio-temporal models (e.g., ST-ResNet, ST-UNet, ConvLSTM, DCRNN, STGCN) generally achieve clear gains over these baselines, especially in short-horizon forecasting, but their improvements saturate when horizons lengthen or network topologies become more complex. More recent GNN- and Transformer-based architectures (e.g., GraphWaveNet, GMAN, HSTTN, PDFormer, AirFormer) tend to deliver the best trade-off between accuracy and scalability on large-scale traffic and environmental datasets, often yielding several percentage points of relative error reduction compared with earlier STGNNs. Diffusion-based generative models (e.g., DiffSTG, DYffusion, SparseDiff, DiffGSL) further improve long-horizon and uncertainty-aware forecasting at the cost of higher inference overhead, while causality-based frameworks (e.g., Causal-STGNN, CGPN) and multimodal methods (e.g., Social-STAGE, Social Meta-Knowledge Transformer, MoSSL) provide robustness gains under structural perturbations, distribution shifts and heterogeneous sensing conditions. Overall, empirical results across benchmarks suggest a consistent trend: methods that more explicitly encode graph structure, long-range dependencies, and uncertainty modeling achieve stronger performance, but also raise new challenges in terms of computational efficiency and deployment complexity.

4. Limitations and Future Directions

Despite substantial methodological and applied advances in time series and spatio-temporal forecasting over the past decade, several critical limitations continue to constrain the field. Existing studies often rely heavily on a narrow range of benchmark datasets—such as METR-LA and PEMS-BAY for traffic flow, PEMS-D3/D4/D7/D8 and related PEMS variants for freeway volume, TaxiBJ and BikeNYC for urban mobility, as well as a few gridded meteorological or air-quality datasets—and on single-dimensional evaluation metrics, thereby limiting their ability to capture the complexity of real-world deployment scenarios. These benchmarks are typically drawn from a small set of metropolitan regions, preprocessed to be relatively clean and regularly sampled, and they seldom stress-test non-stationarity, infrastructure changes, or severe distributional shifts (e.g., policy changes, pandemics, extreme weather events). As a result, models that perform well on these datasets may still struggle to generalize under realistic domain shifts and data quality issues. In particular, issues such as cross-domain generalization, rigorous uncertainty quantification, and stability under multi-step rolling predictions remain insufficiently examined. At the algorithmic level, recurrent neural networks (RNNs) and their variants still exhibit performance degradation when modeling long-range dependencies or highly non-stationary dynamics. Transformer-based models, although effective in learning global dependencies, face notable challenges in maintaining stability and computational efficiency for ultra-long sequence forecasting. Convolutional neural networks (CNNs) and graph convolutional networks (GCNs) provide complementary advantages in spatial feature extraction, yet their capacity to model dynamic topologies, propagation delays, and multi-relational heterogeneous interactions remains limited. 
Furthermore, real-world spatio-temporal data are frequently characterized by irregular sampling, asynchronous observations, and missing values, complexities that current approaches address only superficially. While generative and diffusion-based models introduce new perspectives for probabilistic forecasting and uncertainty representation, they impose substantial inference costs and often produce inadequately calibrated predictive intervals. Similarly, causal modeling approaches demonstrate potential in enhancing interpretability and transferability across structural variations, but their methodological maturity and empirical validation remain at an early stage.
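Whether predictive intervals are adequately calibrated can be checked directly: a nominal 90% interval should cover roughly 90% of held-out observations. A minimal empirical-coverage sketch (the Gaussian predictive distributions here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical held-out targets plus a model's predictive mean and std.
n = 5000
y_true = rng.standard_normal(n)
mu = np.zeros(n)                  # predictive means
sigma_good = np.ones(n)           # well-calibrated predictive std
sigma_over = np.full(n, 0.5)      # overconfident predictive std

def coverage(y, mu, sigma, z=1.645):
    """Fraction of targets inside the central 90% Gaussian interval mu +/- z*sigma."""
    lo, hi = mu - z * sigma, mu + z * sigma
    return ((y >= lo) & (y <= hi)).mean()

cov_good = coverage(y_true, mu, sigma_good)   # close to the nominal 0.90
cov_over = coverage(y_true, mu, sigma_over)   # far below 0.90: intervals too narrow
```

Reporting such coverage curves alongside point-error metrics would expose exactly the miscalibration that current generative forecasters often leave undetected.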
Looking ahead, advancing spatio-temporal forecasting research necessitates progress on several interrelated fronts. First, at the data and evaluation level, more stringent and multi-dimensional protocols are required, moving beyond single-city, regularly sampled benchmarks toward suites that explicitly stress-test irregular sampling, missing values, sensor failures, structural breaks and cross-city/cross-domain transfer. Concretely, future benchmark design should couple accuracy metrics with calibration, robustness, latency and memory footprints, and provide standardized splits for cross-region generalization and long-horizon rolling evaluation. Second, for modeling non-stationarity, long-horizon dependencies and dynamically evolving structures, a promising path is to integrate continuous-time neural formulations (e.g., Neural ODE/NCDE) with adaptive graph learning, jointly learning latent dynamics and time-varying adjacency. Key challenges here include numerical stability of ODE solvers on large graphs, scalable training under limited supervision, and interpretable separation of regime shifts from noise.
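The continuous-time formulations mentioned above treat hidden states as solutions of dh/dt = f_θ(h, t) obtained by a numerical solver. A minimal fixed-step Euler sketch with a hypothetical linear vector field (actual Neural ODE/NCDE models learn f as a neural network and typically use adaptive solvers, where the stability concerns noted above arise):

```python
import numpy as np

def odeint_euler(f, h0, t0, t1, steps):
    """Fixed-step Euler integration of dh/dt = f(h, t) (a stand-in for an ODE solver)."""
    h, t = h0.copy(), t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h, t)
        t += dt
    return h

# Hypothetical latent dynamics: exponential decay toward zero, dh/dt = -0.5 h.
f = lambda h, t: -0.5 * h
h0 = np.array([1.0, 2.0])
h1 = odeint_euler(f, h0, 0.0, 2.0, steps=1000)   # approaches h0 * exp(-1)
```

On graphs, f would additionally mix states across a (possibly time-varying) adjacency, which is where adaptive graph learning couples into the latent dynamics.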
Third, the convergence of generative modeling and causal inference offers a route to robust, policy-aware forecasting, for example by injecting causal graph priors into diffusion or autoregressive generative models and evaluating them on intervention and counterfactual tasks. However, this direction faces non-trivial obstacles, such as causal graph identifiability under purely observational data, the computational cost of probabilistic sampling, and the design of reliable evaluation metrics for counterfactual predictions. Fourth, multimodal integration and resilience to missing or corrupted modalities should be prioritized in transportation, meteorology and human mobility applications. Beyond simple feature concatenation, future work can explore alignment modules that handle asynchronous sampling and conflicting signals, as well as fallback mechanisms that gracefully degrade to subsets of modalities under real-time constraints.
Finally, large-scale pretraining and cross-domain transfer learning for time series and spatio-temporal data constitute an emerging foundation-model paradigm. Promising research paths include constructing diverse, de-biased pretraining corpora with rich metadata, developing parameter-efficient adaptation schemes for resource-constrained deployments, and systematically studying negative transfer and failure modes under distribution shifts. These developments need to be supported by standardized evaluation frameworks and reproducible pipelines, so as to bridge the gap between academic innovation and real-world deployment and to enable the design of spatio-temporal forecasting systems that are not only accurate, but also interpretable, reliable and cost-efficient in operation.

5. Conclusions

Network-based spatiotemporal prediction serves as a vital bridge between complex system perception and intelligent decision-making and has undergone rapid methodological evolution in recent years. Research in this field has progressively advanced from statistical and traditional machine learning approaches toward deep learning and graph-based paradigms, forming three major pathways: single long time series modeling, Euclidean-structured modeling, and graph-structured modeling. Statistical and machine learning methods such as ARIMA, Kalman filtering, and SVM laid the foundation for spatiotemporal forecasting but remain limited by linear assumptions and reliance on feature engineering. With the rise of deep neural networks, architectures based on RNN, LSTM, and GRU have significantly improved temporal feature extraction, while Transformer-based frameworks introduced attention mechanisms for capturing long-range dependencies and complex multivariate relationships. Concurrently, CNNs and GCNs have become core tools for spatial and graph-structured data modeling, exemplified by representative models including DCRNN, STGCN, ASTGCN, GMAN, and GraphWaveNet, which effectively capture heterogeneous and dynamic spatial dependencies. Recent developments—such as GPT-ST, Pyraformer, STGRAT, SHGNN, and neural ODE-based architectures—extend spatiotemporal prediction toward global dependency modeling, dynamic graph adaptation, and continuous-time forecasting. Furthermore, diffusion models, causal inference frameworks (e.g., Causal-STGNN, CGPN), and multimodal fusion mechanisms are redefining robustness, interpretability, and adaptability, revealing new potentials for uncertainty modeling and heterogeneous data integration. 
Overall, the field is evolving toward a unified paradigm that integrates graph neural networks, attention mechanisms, generative models, and causal reasoning, paving the way for intelligent systems with stronger dynamic understanding, structural adaptability, and cross-domain generalization in complex real-world environments.

Author Contributions

C.Y.: Conceptualization, methodology, software, formal analysis, writing—original draft preparation; W.Z.: Validation, investigation, supervision, project administration, funding acquisition; Y.Z.: Conceptualization, methodology, software, investigation. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported in part by the National Natural Science Foundation of China (Grant No. 62376242) and in part by Yangzhou Innovation Capability Enhancement Fund (Grant No. YZ2024245).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data used in this study are included within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, D.; Shao, Q.; Liu, Z.; Yu, W.; Chen, C.P. Ridesourcing behavior analysis and prediction: A network perspective. IEEE Trans. Intell. Transp. Syst. 2020, 23, 1274–1283. [Google Scholar] [CrossRef]
  2. Lim, B.; Zohren, S. Time-series forecasting with deep learning: A survey. Philos. Trans. R. Soc. A 2021, 379, 20200209. [Google Scholar] [CrossRef]
  3. Cheng, M.; Liu, Z.; Tao, X.; Liu, Q.; Zhang, J.; Pan, T.; Zhang, S.; He, P.; Zhang, X.; Wang, D.; et al. A comprehensive survey of time series forecasting: Concepts, challenges, and future directions. Authorea 2025, preprint. [Google Scholar]
  4. Kong, X.; Chen, Z.; Liu, W.; Ning, K.; Zhang, L.; Muhammad Marier, S.; Liu, Y.; Chen, Y.; Xia, F. Deep learning for time series forecasting: A survey. Int. J. Mach. Learn. Cybern. 2025, 16, 5079–5112. [Google Scholar] [CrossRef]
  5. Su, J.; Jiang, C.; Jin, X.; Qiao, Y.; Xiao, T.; Ma, H.; Wei, R.; Jing, Z.; Xu, J.; Lin, J. Large language models for forecasting and anomaly detection: A systematic literature review. arXiv 2024, arXiv:2402.10350. [Google Scholar] [CrossRef]
  6. Chen, D.; Chen, J.; Zhang, X.; Jia, Q.; Liu, X.; Sun, Y.; Lv, L.; Yu, W. Critical nodes identification in complex networks: A survey. arXiv 2025, arXiv:2507.06164. [Google Scholar] [CrossRef]
  7. Chen, D.; Sun, Y.; Shao, G.; Yu, W.; Zhang, H.T.; Lin, W. Coordinating directional switches in pigeon flocks: The role of nonlinear interactions. R. Soc. Open Sci. 2021, 8, 210649. [Google Scholar] [CrossRef] [PubMed]
  8. Chen, D.; Lu, T.; Liu, X.; Yu, W. Finite-time consensus of multiagent systems with input saturation and disturbance. Int. J. Robust Nonlinear Control 2021, 31, 2097–2109. [Google Scholar] [CrossRef]
  9. Chen, D.; Yang, Y.; Zhang, Y.; Yu, W. Prediction of COVID-19 spread by sliding mSEIR observer. Sci. China Inf. Sci. 2020, 63, 222203. [Google Scholar] [CrossRef]
  10. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  11. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
  12. Fu, R.; Zhang, Z.; Li, L. Using LSTM and GRU Neural Network Methods for Traffic Flow Prediction. In Proceedings of the 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC 2016), Wuhan, China, 26–28 May 2016; pp. 324–328. [Google Scholar]
  13. Mackenzie, J.; Roddick, J.F.; Zito, R. An Evaluation of HTM and LSTM for Short-Term Arterial Traffic Flow Prediction. IEEE Trans. Intell. Transp. Syst. 2019, 20, 1847–1857. [Google Scholar] [CrossRef]
  14. Cui, Z.; Ke, R.; Pu, Z.; Wang, Y. Stacked bidirectional and unidirectional LSTM recurrent neural network for forecasting network-wide traffic state with missing values. Transp. Res. Part C Emerg. Technol. 2020, 118, 102674. [Google Scholar] [CrossRef]
  15. Karevan, Z.; Suykens, J.A. Spatio-temporal stacked LSTM for temperature prediction in weather forecasting. arXiv 2018, arXiv:1811.06341. [Google Scholar] [CrossRef]
  16. Li, C.; He, Y.; Li, X.; Jing, X. BiGRU Network for Human Activity Recognition in High Resolution Range Profile. In Proceedings of the 2019 International Radar Conference (RADAR), Toulon, France, 23–27 September 2019; pp. 1–5. [Google Scholar]
  17. Hou, H.; Yu, F.R. Rwkv-ts: Beyond traditional recurrent neural network for time series tasks. arXiv 2024, arXiv:2401.09093. [Google Scholar] [CrossRef]
 18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Nice, France, 2017; Volume 30. [Google Scholar]
  19. Wu, N.; Green, B.; Ben, X.; O’Banion, S. Deep transformer models for time series forecasting: The influenza prevalence case. arXiv 2020, arXiv:2001.08317. [Google Scholar] [CrossRef]
  20. Wu, S.; Xiao, X.; Ding, Q.; Zhao, P.; Wei, Y.; Huang, J. Adversarial Sparse Transformer for Time Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates, Inc.: Nice, France, 2020; Volume 33, pp. 17105–17115. [Google Scholar]
  21. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
  22. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar] [CrossRef]
  23. Liu, S.; Yu, H.; Liao, C.; Li, J.; Lin, W.; Liu, A.X.; Dustdar, S. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  24. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), Baltimore, MD, USA, 17–23 July 2022; Proceedings of Machine Learning Research. pp. 27268–27286. [Google Scholar]
  25. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2021), Virtual, 6–14 December 2021; Volume 34, pp. 22419–22430. [Google Scholar]
  26. Zhang, Y.; Yan, J. Crossformer: Transformer Utilizing Cross-Dimension Dependency for Multivariate Time Series Forecasting. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  27. Liu, M.; Zeng, A.; Xu, Z.; Lai, Q.; Xu, Q. Time Series is a Special Sequence: Forecasting with Sample Convolution and Interaction. arXiv 2021, arXiv:2106.09305. [Google Scholar]
  28. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. itransformer: Inverted transformers are effective for time series forecasting. arXiv 2023, arXiv:2310.06625. [Google Scholar]
  29. Lee, M.; Yoon, H.; Kang, M. CASA: CNN Autoencoder-based Score Attention for Efficient Multivariate Long-term Time-series Forecasting. arXiv 2025, arXiv:2505.02011. [Google Scholar]
  30. Liu, Y.; Wu, H.; Wang, J.; Long, M. Non-stationary transformers: Exploring the stationarity in time series forecasting. Adv. Neural Inf. Process. Syst. 2022, 35, 9881–9893. [Google Scholar]
  31. Jiang, J.; Han, C.; Zhao, W.X.; Wang, J. PDFormer: Propagation Delay-Aware Dynamic Long-Range Transformer for Traffic Flow Prediction. In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI 2023), Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 4365–4373. [Google Scholar]
  32. Jang, J.; Park, H.; Choi, J.; Kim, T. Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness. arXiv 2025, arXiv:2506.08660. [Google Scholar] [CrossRef]
  33. Villaboni, D.; Castellini, A.; Danesi, I.L.; Farinelli, A. Sentinel: Multi-Patch Transformer with Temporal and Channel Attention for Time Series Forecasting. arXiv 2025, arXiv:2503.17658. [Google Scholar] [CrossRef]
  34. Yamaguchi, Y.; Suemitsu, I.; Wei, W. Citras: Covariate-informed transformer for time series forecasting. arXiv 2025, arXiv:2503.24007. [Google Scholar] [CrossRef]
  35. Shu, Y.; Lampos, V. Sonnet: Spectral Operator Neural Network for Multivariable Time Series Forecasting. arXiv 2025, arXiv:2505.15312. [Google Scholar] [CrossRef]
  36. Guo, S.; Chen, Z.; Ma, Y.; Han, Y.; Wang, Y. SCFormer: Structured Channel-wise Transformer with Cumulative Historical State for Multivariate Time Series Forecasting. arXiv 2025, arXiv:2505.02655. [Google Scholar] [CrossRef]
  37. Zhang, X.; Qiang, W.; Zhao, S.; Guo, H.; Li, J.; Sun, C.; Zheng, C. CAIFormer: A Causal Informed Transformer for Multivariate Time Series Forecasting. arXiv 2025, arXiv:2505.16308. [Google Scholar] [CrossRef]
  38. Li, Z.; Xia, L.; Xu, Y.; Huang, C. GPT-ST: Generative pre-training of spatio-temporal graph neural networks. Adv. Neural Inf. Process. Syst. 2023, 36, 70229–70246. [Google Scholar]
  39. Wang, S.; Cao, J.; Yu, P.S. Deep Learning for Spatio-Temporal Data Mining: A Survey. IEEE Trans. Knowl. Data Eng. 2022, 34, 3681–3700. [Google Scholar] [CrossRef]
  40. Eddy, S.R. Hidden markov models. Curr. Opin. Struct. Biol. 1996, 6, 361–365. [Google Scholar] [CrossRef]
  41. Miao, S.; Wang, Z.J.; Liao, R. A CNN Regression Approach for Real-Time 2D/3D Registration. IEEE Trans. Med. Imaging 2016, 35, 1352–1363. [Google Scholar] [CrossRef]
  42. Liu, F.; Lin, G.; Shen, C. CRF Learning with CNN Features for Image Segmentation. Pattern Recognit. 2015, 48, 2983–2992. [Google Scholar] [CrossRef]
  43. Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional Sequence to Sequence Learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, NSW, Australia, 6–11 August 2017; Proceedings of Machine Learning Research. pp. 1243–1252. [Google Scholar]
  44. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2012), Lake Tahoe, NV, USA, 3–8 December 2012; Volume 25. [Google Scholar]
  45. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  46. Zhang, J.; Zheng, Y.; Qi, D. Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), San Francisco, CA, USA, 4–9 February 2017; Volume 31, pp. 1655–1661. [Google Scholar]
  47. Zhang, J.; Zheng, Y.; Sun, J.; Qi, D. Flow Prediction in Spatio-Temporal Networks Based on Multitask Deep Learning. IEEE Trans. Knowl. Data Eng. 2020, 32, 468–478. [Google Scholar] [CrossRef]
  48. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv 2017, arXiv:1707.01926. [Google Scholar]
  49. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2015), Montréal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  50. Ravuri, S.; Lenc, K.; Willson, M.; Kangin, D.; Lam, R.; Mirowski, P.; Fitzsimons, M.; Athanassiadou, M.; Kashem, S.; Madge, S.; et al. Skilful Precipitation Nowcasting Using Deep Generative Models of Radar. Nature 2021, 597, 672–677. [Google Scholar] [CrossRef]
  51. Essien, A.; Giannetti, C. A Deep Learning Model for Smart Manufacturing Using Convolutional LSTM Neural Network Autoencoders. IEEE Trans. Ind. Informatics 2020, 16, 6069–6078. [Google Scholar] [CrossRef]
  52. Wei, M.; Yang, J.; Zhao, Z.; Zhang, X.; Li, J.; Deng, Z. DeFedHDP: Fully Decentralized Online Federated Learning for Heart Disease Prediction in Computational Health Systems. IEEE Trans. Comput. Soc. Syst. 2024, 11, 6854–6867. [Google Scholar] [CrossRef]
  53. Jiang, L.; Ming, X.; Zhang, X. DT-DOFL: Digital-Twin-Empowered Decentralized Online Federated Learning for User-Centered Smart Healthcare Service Systems. IEEE Trans. Comput. Soc. Syst. 2025, 12, 4441–4455. [Google Scholar] [CrossRef]
  54. Wei, M.; Yu, W.; Chen, D. AccDFL: Accelerated Decentralized Federated Learning for Healthcare IoT Networks. IEEE Internet Things J. 2025, 12, 5329–5345. [Google Scholar] [CrossRef]
  55. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph Neural Networks: A Review of Methods and Applications. AI Open 2020, 1, 57–81. [Google Scholar] [CrossRef]
  56. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv 2013, arXiv:1312.6203. [Google Scholar]
  57. Defferrard, M.; Bresson, X.; Vandergheynst, P. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  58. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  59. Zhao, L.; Song, Y.; Zhang, C.; Liu, Y.; Wang, P.; Lin, T.; Deng, M.; Li, H. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Trans. Intell. Transp. Syst. 2020, 21, 3848–3858. [Google Scholar] [CrossRef]
  60. Yu, B.; Yin, H.; Zhu, Z. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv 2017, arXiv:1709.04875. [Google Scholar]
  61. Chai, D.; Wang, L.; Yang, Q. Bike Flow Prediction with Multi-Graph Convolutional Networks. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL 2018), Seattle, WA, USA, 6–9 November 2018; pp. 397–400. [Google Scholar]
  62. Zhang, Q.; Chang, J.; Meng, G.; Xiang, S.; Pan, C. Spatio-Temporal Graph Structure Learning for Traffic Forecasting. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1177–1185. [Google Scholar]
  63. Cini, A.; Zambon, D.; Alippi, C. Sparse graph learning from spatiotemporal time series. J. Mach. Learn. Res. 2023, 24, 1–36. [Google Scholar]
  64. Cini, A.; Marisca, I.; Zambon, D.; Alippi, C. Taming local effects in graph-based spatiotemporal forecasting. Adv. Neural Inf. Process. Syst. 2023, 36, 55375–55393. [Google Scholar]
  65. Zhu, J.; Wang, Q.; Tao, C.; Deng, H.; Zhao, L.; Li, H. AST-GCN: Attribute-augmented spatiotemporal graph convolutional network for traffic forecasting. IEEE Access 2021, 9, 35973–35983. [Google Scholar] [CrossRef]
  66. Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention-Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019), Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 922–929. [Google Scholar]
  67. Bai, L.; Yao, L.; Kanhere, S.; Wang, X.; Sheng, Q. Stg2seq: Spatial-temporal graph to sequence model for multi-step passenger demand forecasting. arXiv 2019, arXiv:1905.10069. [Google Scholar]
  68. Do, L.N.N.; Vu, H.L.; Vo, B.Q.; Liu, Z.; Phung, D. An Effective Spatial-Temporal Attention Based Neural Network for Traffic Flow Prediction. Transp. Res. Part C Emerg. Technol. 2019, 108, 12–28. [Google Scholar] [CrossRef]
  69. Lei, K.; Qin, M.; Bai, B.; Zhang, G.; Yang, M. GCN-GAN: A Non-Linear Temporal Link Prediction Model for Weighted Dynamic Networks. In Proceedings of the IEEE Conference on Computer Communications (IEEE INFOCOM 2019), Paris, France, 29 April–2 May 2019; pp. 388–396. [Google Scholar]
  70. Wang, S.; Zhang, M.; Miao, H.; Peng, Z.; Yu, P.S. Multivariate Correlation-aware Spatio-temporal Graph Convolutional Networks for Multi-scale Traffic Prediction. ACM Trans. Intell. Syst. Technol. 2022, 13, 38. [Google Scholar] [CrossRef]
  71. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph wavenet for deep spatial-temporal graph modeling. arXiv 2019, arXiv:1906.00121. [Google Scholar]
  72. Zheng, C.; Fan, X.; Wang, C.; Qi, J. GMAN: A Graph Multi-Attention Network for Traffic Prediction. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1234–1241. [Google Scholar]
  73. Wei, M.; Yu, W.; Liu, H.; Xu, Q. Distributed Weakly Convex Optimization Under Random Time-Delay Interference. IEEE Trans. Netw. Sci. Eng. 2024, 11, 212–224. [Google Scholar] [CrossRef]
  74. Wei, M.; Chen, G.; Guo, Z. A Fixed-Time Optimal Consensus Algorithm over Undirected Networks. In Proceedings of the 2018 Chinese Control and Decision Conference (CCDC 2018), Shenyang, China, 9–11 June 2018; pp. 725–730. [Google Scholar]
  75. Chen, J.; Shao, Q.; Chen, D.; Yu, W. Decoupling Spatio-Temporal Prediction: When Lightweight Large Models Meet Adaptive Hypergraphs. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025), Toronto, ON, Canada, 24–28 August 2025; pp. 167–178. [Google Scholar]
  76. Jeon, B.K.; Kim, E.J. Solar irradiance prediction using reinforcement learning pre-trained with limited historical data. Energy Rep. 2023, 10, 2513–2524. [Google Scholar] [CrossRef]
  77. Wang, X.; Ma, Y.; Wang, Y.; Jin, W.; Wang, X.; Tang, J.; Jia, C.; Yu, J. Traffic Flow Prediction via Spatial-Temporal Graph Neural Network. In Proceedings of the Web Conference 2020 (WWW 2020), Taipei, Taiwan, 20–24 April 2020; pp. 1082–1092. [Google Scholar]
  78. Li, M.; Zhu, Z. Spatial-Temporal Fusion Graph Neural Networks for Traffic Flow Forecasting. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI 2021), Virtual, 2–9 February 2021; Volume 35, pp. 4189–4196. [Google Scholar]
  79. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  80. Zhang, J.; Shi, X.; Xie, J.; Ma, H.; King, I.; Yeung, D.Y. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. arXiv 2018, arXiv:1803.07294. [Google Scholar] [CrossRef]
  81. Park, C.; Lee, C.; Bahng, H.; Kim, K.; Jin, S.; Ko, S.; Choo, J. ST-GRAT: A Spatio-Temporal Graph Attention Network for Traffic Forecasting. arXiv 2019, arXiv:1911.13181. [Google Scholar]
  82. Guo, S.; Lin, Y.; Wan, H.; Li, X.; Cong, G. Learning Dynamics and Heterogeneity of Spatial-Temporal Graph Data for Traffic Forecasting. IEEE Trans. Knowl. Data Eng. 2022, 34, 5415–5428. [Google Scholar] [CrossRef]
  83. Wang, H.; Chen, J.; Pan, T.; Dong, Z.; Zhang, L.; Jiang, R.; Song, X. Robust Traffic Forecasting against Spatial Shift over Years. arXiv 2024, arXiv:2410.00373. [Google Scholar] [CrossRef]
  84. Xu, M.; Dai, W.; Liu, C.; Gao, X.; Lin, W.; Qi, G.J.; Xiong, H. Spatial-temporal transformer networks for traffic flow forecasting. arXiv 2020, arXiv:2001.02908. [Google Scholar]
  85. Wei, M.; Yu, W.; Chen, D.; Kang, M.; Cheng, G. Privacy Distributed Constrained Optimization Over Time-Varying Unbalanced Networks and Its Application in Federated Learning. IEEE/CAA J. Autom. Sin. 2025, 12, 335–346. [Google Scholar] [CrossRef]
  86. Wei, M.; Yang, Z.; Ji, Q.; Zhao, Z. Privacy-preserving distributed projected one-point bandit online optimization over directed graphs. Asian J. Control 2023, 25, 4705–4720. [Google Scholar] [CrossRef]
  87. Chen, R.T.Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D.K. Neural Ordinary Differential Equations. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  88. Huang, Z.; Sun, Y.; Wang, W. Coupled Graph ODE for Learning Interacting System Dynamics. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’21), Virtual, Singapore, 14–18 August 2021; pp. 705–715. [Google Scholar]
  89. Choi, J.; Choi, H.; Hwang, J.; Park, N. Graph Neural Controlled Differential Equations for Traffic Forecasting. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI 2022), Virtual, 22 February–1 March 2022; Volume 36, pp. 6367–6374. [Google Scholar]
  90. Wen, H.; Lin, Y.; Xia, Y.; Wan, H.; Wen, Q.; Zimmermann, R.; Liang, Y. DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models. In Proceedings of the 31st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL 2023), Hamburg, Germany, 13–16 November 2023; pp. 1–12. [Google Scholar]
  91. Cheng, J.; Li, R.; Wang, H.; Li, Y. Sparse Diffusion Autoencoder for Test-time Adapting Prediction of Complex Systems. arXiv 2025, arXiv:2505.17459. [Google Scholar] [CrossRef]
  92. Jung, C.; Jang, Y. DiffGSL: A Graph Structure Learning Diffusion Model for Dynamic Spatio-Temporal Forecasting. In Proceedings of the 2024 IEEE International Conference on Big Data (IEEE BigData 2024), Washington, DC, USA, 15–18 December 2024; pp. 5785–5793. [Google Scholar]
  93. Rühling Cachay, S.; Zhao, B.; Joren, H.; Yu, R. Dyffusion: A dynamics-informed diffusion model for spatiotemporal forecasting. Adv. Neural Inf. Process. Syst. 2023, 36, 45259–45287. [Google Scholar]
  94. Yang, Y.; Jin, M.; Wen, H.; Zhang, C.; Liang, Y.; Ma, L.; Wang, Y.; Liu, C.; Yang, B.; Xu, Z.; et al. A survey on diffusion models for time series and spatio-temporal data. arXiv 2024, arXiv:2404.18886. [Google Scholar] [CrossRef]
  95. Xia, Y.; Liang, Y.; Wen, H.; Liu, X.; Wang, K.; Zhou, Z.; Zimmermann, R. Deciphering spatio-temporal graph forecasting: A causal lens and treatment. Adv. Neural Inf. Process. Syst. 2023, 36, 37068–37088. [Google Scholar]
  96. Chen, D.; Yu, W.; Shao, Q.; Liu, X. Causality Induced Distributed Spatio-Temporal Feature Extraction. In Proceedings of the 2021 8th International Conference on Information, Cybernetics, and Computational Social Systems (ICCSS 2021), Beijing, China, 10–12 December 2021; pp. 68–73. [Google Scholar]
  97. Einizade, A.; Malliaros, F.D.; Giraldo, J.H. Spatiotemporal Forecasting Meets Efficiency: Causal Graph Process Neural Networks. arXiv 2024, arXiv:2405.18879. [Google Scholar] [CrossRef]
  98. Malla, S.; Choi, C.; Dariush, B. Social-STAGE: Spatio-Temporal Multi-Modal Future Trajectory Forecast. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA 2021), Xi’an, China, 30 May–5 June 2021; pp. 13938–13944. [Google Scholar]
  99. Jiang, R.; Wang, Z.; Tao, Y.; Yang, C.; Song, X.; Shibasaki, R.; Chen, S.C.; Shyu, M.L. Learning Social Meta-Knowledge for Nowcasting Human Mobility in Disaster. In Proceedings of the ACM Web Conference 2023 (WWW 2023), Austin, TX, USA, 30 April–4 May 2023; pp. 2655–2665. [Google Scholar]
  100. Deng, J.; Jiang, R.; Zhang, J.; Song, X. Multi-modality spatio-temporal forecasting via self-supervised learning. arXiv 2024, arXiv:2405.03255. [Google Scholar]
  101. Zhang, Y.; Liu, L.; Xiong, X.; Li, G.; Wang, G.; Lin, L. Long-term wind power forecasting with hierarchical spatial-temporal transformer. arXiv 2023, arXiv:2305.18724. [Google Scholar]
  102. Liang, Y.; Xia, Y.; Ke, S.; Wang, Y.; Wen, Q.; Zhang, J.; Zheng, Y.; Zimmermann, R. AirFormer: Predicting Nationwide Air Quality in China with Transformers. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-23), Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 14329–14337. [Google Scholar]
  103. Sun, J.; Yeh, C.C.M.; Fan, Y.; Dai, X.; Fan, X.; Jiang, Z.; Saini, U.S.; Lai, V.; Wang, J.; Chen, H.; et al. Towards Efficient Large Scale Spatial-Temporal Time Series Forecasting via Improved Inverted Transformers. arXiv 2025, arXiv:2503.10858. [Google Scholar] [CrossRef]
  104. Bai, H.Y.; Liu, X. T-graphormer: Using transformers for spatiotemporal forecasting. arXiv 2025, arXiv:2501.13274. [Google Scholar] [CrossRef]
  105. Zhang, H.; Wu, D.; Zinflou, A.; Dellacherie, S.; Dione, M.M.; Boulet, B. Leveraging Multivariate Long-Term History Representation for Time Series Forecasting. arXiv 2025, arXiv:2505.14737. [Google Scholar] [CrossRef]
  106. Wu, H.; Zhou, H.; Long, M.; Wang, J. Interpretable weather forecasting for worldwide stations with a unified deep model. Nat. Mach. Intell. 2023, 5, 602–611. [Google Scholar] [CrossRef]
Figure 1. Methodological taxonomy and evolution of time series and spatio-temporal forecasting models. The inner ring groups methods into three major families (temporal modeling backbones, spatio-temporal architectures, and advanced paradigms), while the outer ring lists representative subclasses and models such as ARIMA/VAR/Kalman/Prophet and SVM/SVR (statistical & classical ML), RNN/LSTM/GRU (RNN-based deep learning), Transformer-based time-series forecasters (Informer, Autoformer, FEDformer), Euclidean CNN-based and graph-based STGNNs (ST-ResNet, ST-UNet, DGMR, T-GCN, DCRNN, STGCN, MGCN, GraphWaveNet, GMAN), continuous-time Neural ODE ST models (Coupled Graph ODE, STG-NCDE), diffusion-based generative ST models (DiffSTG, DYffusion, SparseDiff, DiffGSL), causality-aware models (Causal-STGNN, CGPN, causal graph processes), and multimodal or knowledge-enhanced ST models (Social-STAGE, Social Meta-Knowledge Transformer, MoSSL). This radial diagram also serves as a high-level roadmap of method evolution, illustrating how research has progressed from classical statistical models towards graph-based, diffusion, causal, multimodal, and foundation-model paradigms.
Figure 2. Schematic illustration of a Recurrent Neural Network (RNN).
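To complement the schematic in Figure 2, the vanilla RNN recurrence h_t = tanh(W_x x_t + W_h h_{t−1} + b) can be written out in a few lines of NumPy. This is a minimal illustrative sketch, not code from any cited work; all names and dimensions are assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b, h0):
    """Run a vanilla (Elman) RNN over a sequence.

    x_seq : (T, d_in) input sequence
    W_x   : (d_h, d_in) input weights; W_h : (d_h, d_h) recurrent weights
    b     : (d_h,) bias; h0 : (d_h,) initial hidden state
    Returns the (T, d_h) sequence of hidden states.
    """
    h, states = h0, []
    for x_t in x_seq:
        # classic recurrence: the new state mixes the current input
        # with the previous hidden state through a tanh nonlinearity
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

# toy example: sequence length 5, input dim 3, hidden dim 4
rng = np.random.default_rng(0)
T, d_in, d_h = 5, 3, 4
H = rnn_forward(rng.normal(size=(T, d_in)),
                rng.normal(size=(d_h, d_in)) * 0.1,
                rng.normal(size=(d_h, d_h)) * 0.1,
                np.zeros(d_h), np.zeros(d_h))
print(H.shape)  # (5, 4)
```

The sequential dependence of h on its own previous value is what gives the RNN family the "Medium–High (sequential computation)" complexity entry in Table 1: unlike attention, the time steps cannot be parallelized.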
Figure 3. Typical CNN Architecture.
Table 1. Overview of representative methods and their qualitative properties.
| Category | Representative Methods | Temporal Backbone | Spatial Structure | Training/Inference Complexity |
|---|---|---|---|---|
| Channel-independent (pure time series) | | | | |
| Statistical | HA, ARIMA, VAR, Kalman, Prophet | Linear AR/MA/state-space models | No explicit spatial structure (per series) | Low–Medium |
| DNN-based | BPNN/MLP-style DNNs | Feedforward fully connected layers | No explicit spatial structure (per series) | Medium |
| RNN-based | LSTM, GRU, stacked/bi-directional RNNs | Recurrent neural units (RNN/LSTM/GRU) | No explicit spatial structure (per series) | Medium–High (sequential computation) |
| Transformer-based | Informer, Autoformer, FEDformer, PatchTST, iTransformer | Self-attention with feed-forward layers | No explicit spatial structure (per series) | Medium–High (attention-dependent) |
| Channel-dependent (spatio-temporal) | | | | |
| Euclidean-structured | CNN, ConvLSTM, ST-ResNet, ST-UNet, DGMR | Convolutional and temporal convolutional layers | Regular grids (images, rasters) | Medium |
| GNN-based | T-GCN, DCRNN, STGCN, GraphWaveNet, GMAN | Recurrent, temporal convolutional, and attention layers | Explicit graph structures (static or adaptive) | Medium–High (graph operations) |
| Diffusion-based | DiffSTG, DYffusion, SparseDiff, DiffGSL | Denoising diffusion steps with GNN/Transformer backbones | Graphs or grids, often dynamic | High–Very High (multi-step sampling) |
| Causality-based | Causal-STGNN, CGPN, causal graph processes | GNN/RNN/Transformer backbones with causal priors | Learned causal graphs or sparse structures | Medium (extra structure learning) |
| Multimodal | Social-STAGE, Social Meta-Knowledge Transformer, MoSSL | RNN/Transformer-based multimodal fusion layers | Grids and/or graphs with multimodal inputs | Medium–High (fusion overhead) |
| Transformer-based ST | HSTTN, PDFormer, AirFormer, T-Graphormer | Spatio-temporal self-attention layers | Graphs or grids with learned relations | Medium–High |
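The GNN-based row of Table 1 can be made concrete with a small NumPy sketch of one spatio-temporal update in the spirit of T-GCN [59]: a graph convolution aggregates information across neighboring nodes at each time step, and a recurrent mix carries the hidden state forward in time. This is a deliberately simplified illustration under assumed shapes and weights (the original T-GCN uses GRU gates rather than the plain tanh recurrence shown here), not the authors' implementation.

```python
import numpy as np

def gcn_layer(A_hat, X, W):
    """One graph-convolution step.
    A_hat : (N, N) normalized adjacency with self-loops
    X     : (N, d) node features; W : (d, d') weight matrix."""
    return np.tanh(A_hat @ X @ W)

def tgcn_like_step(A_hat, X_t, H_prev, W_g, W_h):
    """Simplified T-GCN-style update: spatial aggregation by a GCN,
    then a recurrent combination with the previous hidden state."""
    Z = gcn_layer(A_hat, X_t, W_g)       # spatial message passing
    return np.tanh(Z + H_prev @ W_h)     # temporal recurrence

# toy example: 4 sensors on a line graph, 2 features, 3 time steps
N, d = 4, 2
A = np.eye(N) + np.diag(np.ones(N - 1), 1) + np.diag(np.ones(N - 1), -1)
A_hat = A / A.sum(axis=1, keepdims=True)  # row-normalized adjacency
rng = np.random.default_rng(1)
W_g = rng.normal(size=(d, d)) * 0.1
W_h = rng.normal(size=(d, d)) * 0.1
H = np.zeros((N, d))
for t in range(3):
    H = tgcn_like_step(A_hat, rng.normal(size=(N, d)), H, W_g, W_h)
print(H.shape)  # (4, 2)
```

The per-step cost is dominated by the sparse product A_hat @ X, which is why the table lists "Medium–High (graph operations)" for this family; diffusion-based models pay this cost once per denoising step, hence their "High–Very High" entry.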

Yang, C.; Zhang, W.; Zhou, Y. An Overview of Spatiotemporal Network Forecasting: Current Research Status and Methodological Evolution. Mathematics 2026, 14, 18. https://doi.org/10.3390/math14010018
