Review

Deep Learning for Regular Raster Spatio-Temporal Prediction: An Overview

by
Vincenzo Capone
,
Angelo Casolaro
and
Francesco Camastra
*,†
Department of Science and Technology, Parthenope University of Naples, Centro Direzionale Isola C4, 80143 Naples, Italy
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Information 2025, 16(10), 917; https://doi.org/10.3390/info16100917
Submission received: 26 September 2025 / Revised: 13 October 2025 / Accepted: 17 October 2025 / Published: 19 October 2025
(This article belongs to the Special Issue New Deep Learning Approach for Time Series Forecasting, 2nd Edition)

Abstract

The raster is the most common type of spatio-temporal data, and it can be either regularly or irregularly spaced. Spatio-temporal prediction on regular raster data is crucial for modelling and understanding dynamics in disparate realms, such as environment, traffic, astronomy, remote sensing, gaming and video processing, to name a few. Historically, statistical and classical machine learning methods have been used to model spatio-temporal data, and, in recent years, deep learning has shown outstanding results in regular raster spatio-temporal prediction. This work provides a self-contained review of effective deep learning methods for the prediction of regular raster spatio-temporal data. Each deep learning technique is described in detail, underlining its advantages and drawbacks. Finally, a discussion of relevant aspects and further developments in deep learning for regular raster spatio-temporal prediction is presented.

1. Introduction

In several scientific areas [1,2,3,4,5], data measurements are generally accompanied by additional metadata denoting the location and the time at which the data were recorded. Recently, in statistics [6,7] and machine learning [8,9], particular attention has been paid to how to effectively model spatio-temporal data while fully leveraging the additional knowledge on acquisition location and time.
There are different types of spatio-temporal data, and each can be properly analysed by specific methodologies. The most common spatio-temporal data type is raster spatio-temporal data [8], characterised by a set of spatio-temporal observations acquired at fixed locations and timestamps. Both locations and timestamps can be regularly or irregularly sampled, meaning that the distance between neighbouring locations, and timestamps, can be either constant (regular case) or not constant (irregular case) over the whole spatio-temporal domain. Spatial sampling is crucial, since, depending on how raster spatio-temporal data are spaced, some deep learning models may be better suited than others to process them. As shown in Figure 1, regularly spaced spatio-temporal data can be organised on a regular three-dimensional lattice, with one dimension being time and the remaining dimensions being space. Observations are localised within cells of a grid (e.g., pixels of a multichannel image), defining the spatio-temporal neighbourhood of each observation. On the contrary, irregularly spaced data are better represented as a time series of spatial graphs, whose edges encode space–time neighbouring relations among observations.
Some deep learning architectures, such as Convolutional Neural Networks [10] and transformers [11], can leverage spatial regularity and are, therefore, better suited to process regularly spaced spatio-temporal data.
Regular raster spatio-temporal data are widely encountered in several applicative fields, such as environmental monitoring [12,13,14,15], remote sensing [16,17,18,19], traffic modelling [20,21,22,23], energy management [24,25] and video processing [26,27,28,29], among others. In recent years, deep learning has been widely applied to tasks from the aforementioned fields. To address properties and challenges of each specific task, several deep learning architectures have been employed. For example, Convolutional Neural Networks, transformers and Diffusion models have been widely applied to video processing, while Recurrent Neural Networks and hybrid convolutional–recurrent models have found great application in the environmental field.
Although there are a few surveys addressing the use of deep learning for spatio-temporal data in general, a dedicated survey focusing specifically on regular raster data is absent in the literature. Such a survey, therefore, presents a remarkable degree of novelty and can be potentially relevant for scholars and researchers working in the aforementioned fields, where grid-based spatio-temporal data are most commonly encountered. The aim of this work is to provide a review of the deep learning methodologies commonly applied to regularly spaced spatio-temporal raster data. Instead of focusing on single specific methodologies, this work provides a detailed description of the key concepts and core techniques behind deep-learning-based approaches for spatio-temporal prediction. Moreover, the work discusses specific benefits and drawbacks of the reviewed deep learning methodologies, as well as the applicative domains in which they are more effective. Finally, tables summarising related papers, along with their applicative domains and used datasets, are reported.
The manuscript is organised as follows. Section 2 formalises the foundations of spatio-temporal prediction; Section 3 gives an overview of Convolutional Neural Networks for regular raster spatio-temporal prediction; Section 4 describes Recurrent Neural Networks and their application to spatio-temporal prediction; Section 5 presents hybrid models combining both recurrent and Convolutional Neural Networks; Section 6 reviews transformers and their application to spatio-temporal prediction and video processing; Section 7 describes diffusion models for the same task; Section 8 provides an application-oriented discussion of the presented methodologies and outlines future research directions; finally, Section 9 draws conclusions and suggests some directions for future investigations.

2. Preliminaries

A spatio-temporal process describes a phenomenon evolving through space and time and it can be generally formalised as
$$ z(s, t), \quad s \in S, \; t \in T, \tag{1} $$
where the locations $S = \{s_1, s_2, s_3, \ldots, s_N\}$ and the timestamps $T = \{t_1, t_2, t_3, \ldots, t_M\}$ are the spatial and temporal domains of the spatio-temporal process being observed. Spatio-temporal processes have been widely studied in statistics, and they have been modelled following the so-called conditional (or dynamic) approaches [30]. These approaches model the dynamics of the spatio-temporal process by grasping how spatio-temporal variables influence each other at different locations $s$ and times $t$. Statistical spatio-temporal autoregressive models [7], such as STAR, STARMA, STARIMA and STARMAX [6,31,32,33], follow this modelling approach by trying to capture spatio-temporal dynamics in a data-driven way.
In an analogy with statistical approaches, a spatio-temporal series can be organised as a temporal collection of vectors $\{z_t\}_{t=1}^{T}$, $z_t \in \mathbb{R}^N$, where each dimension corresponds to a location, and $N$ is the number of locations being observed. The behaviour of spatio-temporal series, generated by the underlying nonlinear dynamical system [7], can be modelled by a nonlinear spatio-temporal autoregressive (NSTAR) model as follows:
$$ z_t = F(z_{t-1}, z_{t-2}, \ldots, z_{t-p}) + \epsilon_t, \tag{2} $$
where $p$ is the model order (i.e., how many past samples are required to model the spatio-temporal series adequately), and $F(\cdot)$ is the skeleton, a generic nonlinear function modelling the spatio-temporal relationships. Following such a data-driven paradigm, machine learning techniques, such as Support Vector Machines [34] and Random Forests [35], and more recently deep learning methodologies [36], have been employed to approximate $F(\cdot)$ by effectively catching complex dependencies in spatio-temporal data.
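As an illustration of this data-driven paradigm, the following minimal sketch (not taken from the cited works; the network size, the model order and all variable names are assumptions) approximates the skeleton $F(\cdot)$ with a small feed-forward network mapping the $p$ lagged observations onto the next spatio-temporal vector:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: approximate the NSTAR skeleton F(.) with a small
# feed-forward network mapping p lagged observations onto z_t.
N, p = 64, 4                                   # number of locations, model order

class NSTARNet(nn.Module):
    def __init__(self, n_locations, order, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_locations * order, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_locations),
        )

    def forward(self, lags):                   # lags: (batch, p, N)
        return self.net(lags.flatten(1))       # estimate of z_t: (batch, N)

model = NSTARNet(N, p)
z_lags = torch.randn(8, p, N)                  # z_{t-1}, ..., z_{t-p}
z_hat = model(z_lags)                          # predicted z_t, shape (8, N)
```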
The remaining sections of this work are devoted to describing the various deep learning methods applied to spatio-temporal prediction.

3. Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are deep learning models that rely on convolutions to extract spatio-temporal relationships from spatial time series. Thanks to the use of convolutions, CNNs have three peculiar characteristics: local connectivity, weight sharing and translation equivariance [10]. Furthermore, it is common in CNNs to employ not only convolutions but also other kinds of layers, such as pooling [10], fully connected layers and residual connections [37]. In CNNs, different types of convolutions can be adopted to model spatio-temporal data, according to the input structure, namely, mono-dimensional, bi-dimensional and three-dimensional convolutions.
CNNs based on 1D convolutions can process data evolving along a single dimension, e.g., usual time series. Given an input sequence $Z = \{z_t\}_{t=1}^{T}$, with $z_t \in \mathbb{R}^D$, and a kernel $W \in \mathbb{R}^{Q \times D}$, where $Q$ is the kernel size, the 1D convolution between $W$ and $Z$ is
$$ y_t = (W * Z)_t = \sum_{q=0}^{Q-1} \left\langle w_q,\, z_{t-q+Q/2} \right\rangle \quad \forall t. \tag{3} $$
$(W * Z)$ denotes the convolution operator, $w_q \in \mathbb{R}^D$ is an element of the kernel and $\langle \cdot, \cdot \rangle$ is the dot product. Standard 1D convolution has a limited receptive field, i.e., while processing the $t$-th time step, it only considers $z_t$ and a few contiguous elements in the close past and future. Therefore, it might not be a good choice when modelling temporal dependencies that span many non-contiguous time steps. The receptive field of a CNN can be enlarged by increasing kernel sizes and stacking multiple convolutional kernels. However, the increase in the receptive field size implies a larger computational cost. In these cases, causal dilated convolutions can be a better choice to model temporal data. Unlike standard 1D convolutions, which apply the kernel over both past and future elements, causal dilated convolutions only consider elements from the past, motivating the term causal. Moreover, while standard 1D convolutions apply the kernel to contiguous elements, causal dilated convolutions consider a fixed step between element pairs by introducing a positive dilation factor $d \in \mathbb{N}$, hence dilated. A 1D causal dilated convolution is defined as
$$ y_t = (W * Z)_t = \sum_{q=0}^{Q-1} \left\langle w_q,\, z_{t - d \cdot q} \right\rangle \quad \forall t \in [1, T]. \tag{4} $$
Causal dilated convolutions are the core of Temporal Convolutional Networks (TCNs) [38], CNNs specifically designed to process temporal data using convolutions. TCNs are made of multiple stacked dilated convolution layers with a different dilation factor at each layer; typically, the $i$-th layer uses a dilation factor $d_i = 2^i$, $i \geq 0$. Causal dilated convolutions provide TCNs with two advantages: they enable the modelling of temporal dependencies across multiple time scales, by applying different dilation factors at each layer, and they allow grasping long-range dependencies with fewer stacked layers than standard convolutions. Due to their inherent limitations, 1D convolutions cannot cope with spatial time series. Nevertheless, they can be useful building blocks in other deep learning architectures for spatio-temporal prediction, since they can model temporal information effectively.
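As a minimal illustration of causal dilated convolutions (a PyTorch sketch under simple assumptions; layer sizes and names are ours, not those of the TCN paper), the kernel is applied only to past elements by left-padding the sequence, and the receptive field is enlarged by stacking layers with dilation factors $d_i = 2^i$:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """Sketch of a TCN-style layer: the kernel only sees the past (causal)
    and skips d - 1 steps between taps (dilated)."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation      # left padding keeps causality
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size, dilation=dilation)

    def forward(self, z):                 # z: (batch, D, T)
        z = F.pad(z, (self.pad, 0))       # pad only on the past side
        return self.conv(z)               # output: (batch, out_channels, T)

# Stack layers with dilation d_i = 2**i to enlarge the receptive field.
layers = nn.Sequential(*[CausalDilatedConv1d(16, 16, kernel_size=3, dilation=2 ** i)
                         for i in range(4)])
y = layers(torch.randn(8, 16, 100))       # (8, 16, 100)
```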
Two-dimensional convolutions are used with spatially evolving data, e.g., still images. Given a tensor $Z \in \mathbb{R}^{U \times V \times D}$, where $z_{u,v} \in \mathbb{R}^D$ is the observation at location $(u, v)$, and a kernel $W \in \mathbb{R}^{Q \times K \times D}$, the 2D convolution between $Z$ and $W$ is defined as follows:
$$ y_{u,v} = (W * Z)_{u,v} = \sum_{q=0}^{Q-1} \sum_{k=0}^{K-1} \left\langle w_{q,k},\, z_{u-q+Q/2,\, v-k+K/2} \right\rangle \quad \forall u, v. \tag{5} $$
$(W * Z)$ is the 2D convolution operator, and $w_{q,k} \in \mathbb{R}^D$ is an element of the kernel. $Q$ and $K$ are the spatial dimensions of the kernel, typically with $K = Q$. In regular raster spatio-temporal modelling, 2D convolutions can be used to jointly process the spatial and temporal domains by considering the first two dimensions of the input tensor $Z$ as spatial dimensions and the third one as a temporal dimension [39,40,41,42,43]. Given a sequence of spatial data, e.g., remote sensing images, evolving over time $\{I_t\}_{t=1}^{T}$, $I_t \in \mathbb{R}^{U \times V}$, spatio-temporal tensors $Z_t$ are built by concatenating, along the third dimension, the spatial data at time $t$, $I_t$, with the previous $T - 1$ ones:
$$ Z_t = \mathrm{concatenate}(I_t, I_{t-1}, \ldots, I_{t-T+1}) \in \mathbb{R}^{U \times V \times T}, $$
where $T$ is a fixed time lag. Spatio-temporal features can be extracted by 2D convolutions between $Z_t$ and a kernel $W \in \mathbb{R}^{Q \times K \times T}$, according to Equation (5). With this approach, 2D convolutions can be used to model spatio-temporal dynamics within a limited time window of length $T$, which might not be optimal. Furthermore, this approach requires a kernel whose depth is exactly equal to $T$, making the resulting 2D convolution operation computationally burdensome when $T$ is large. This limits the size of the temporal window that can be processed by 2D convolutions. Furthermore, 2D convolution transforms the 3D spatio-temporal input tensor into a 2D feature map, flattening the temporal dimension and thus losing the temporal localisation of CNN-extracted features. Figure 2 shows an example application of 2D convolution to a spatio-temporal input tensor. Notice how the kernel has the same size as the input tensor along the temporal dimension and how this dimension is lost in the convolution output.
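The following sketch (illustrative only; tensor sizes are arbitrary assumptions) shows this approach in practice: $T$ frames are stacked along the channel axis and processed by a 2D convolution whose kernel depth equals the time lag, so the temporal dimension is flattened in the output:

```python
import torch
import torch.nn as nn

# Sketch: T grayscale frames are stacked along the channel axis and processed
# with a 2D kernel whose depth equals the time lag T (here T = 8).
T, U, V = 8, 64, 64
frames = torch.randn(1, T, U, V)            # (batch, T, U, V): time as channels

conv2d = nn.Conv2d(in_channels=T, out_channels=32, kernel_size=3, padding=1)
features = conv2d(frames)                   # (1, 32, U, V): the temporal axis is flattened
```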
Three-dimensional convolutions overcome the aforementioned limitations, since the kernel does not need to have a depth equal to the window length $T$, which can therefore be much larger. Furthermore, the kernel is allowed to move not only along the spatial dimensions but along the temporal one, too. Three-dimensional convolutions [20,44,45,46] can be applied to spatio-temporal tensors $Z \in \mathbb{R}^{U \times V \times T \times D}$, where each space–time observation $z_{u,v,t} \in \mathbb{R}^D$ is a multidimensional vector. The 3D convolution can be defined as
$$ y_{u,v,t} = (W * Z)_{u,v,t} = \sum_{q=0}^{Q-1} \sum_{k=0}^{K-1} \sum_{r=0}^{R-1} \left\langle w_{q,k,r},\, z_{u-q+Q/2,\, v-k+K/2,\, t-r+R/2} \right\rangle \quad \forall u, v, t, \tag{6} $$
where $W \in \mathbb{R}^{Q \times K \times R \times D}$ is the kernel; $Q$ and $K$ are the spatial dimensions of the kernel; and $R \leq T$ is the temporal dimension. Unlike 2D convolution, which outputs a 2D feature map, 3D convolution results in a 3D feature map, preserving the temporal dimension. In 3D convolutions, the kernel also moves along the temporal dimension, and thus the properties of 2D convolution in the space domain (e.g., weight sharing and translation equivariance) are inherited by 3D convolution in time, too. Figure 3 shows an example application of 3D convolution to a spatio-temporal input tensor. Notice how, unlike 2D convolution, the kernel does not need to have the same size as the input tensor along the temporal dimension. Moreover, since it moves in time as well, the temporal dimension is preserved in the output tensor, although shrunk.
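A corresponding sketch for 3D convolution (again with arbitrary, assumed tensor sizes) shows that the kernel can be shallower than the clip along time and that the temporal dimension is preserved, although shrunk, in the output:

```python
import torch
import torch.nn as nn

# Sketch: an 8-frame clip processed with a 3D kernel that also slides in time.
T, U, V, D = 8, 64, 64, 3
clip = torch.randn(1, D, T, U, V)           # (batch, channels, time, height, width)

conv3d = nn.Conv3d(in_channels=D, out_channels=32,
                   kernel_size=(3, 3, 3))   # R = 3 <= T: kernel shallower than the clip
features = conv3d(clip)                     # (1, 32, T-2, U-2, V-2): time preserved, shrunk
```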
Another type of convolution relevant for spatio-temporal prediction is the transposed convolution [47]. Transposed convolutions are used to upsample input images through learnable weights. They work in the opposite way to convolutions: while a convolution with a kernel of size $Q \times Q$ maps an equally sized input patch onto a single value of the output feature map, a transposed convolution maps a single input value onto a patch of size $Q \times Q$ of the output feature map through an equally sized kernel.
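A minimal sketch of learnable upsampling with a transposed convolution, as used in the decoder part of encoder–decoder CNNs (channel numbers and sizes are arbitrary assumptions), is the following:

```python
import torch
import torch.nn as nn

# Sketch: learnable upsampling with a transposed convolution.
x = torch.randn(1, 32, 16, 16)              # low-resolution feature map
up = nn.ConvTranspose2d(in_channels=32, out_channels=16,
                        kernel_size=2, stride=2)
y = up(x)                                   # (1, 16, 32, 32): spatial size doubled
```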
The approaches analysed herein can be categorised according to the type of convolution used, i.e., 2D or 3D, and to how convolution impacts spatial time-series modelling. More precisely, in some cases, convolutions in CNNs are used as spatio-temporal feature extraction mechanisms, and the extracted features are then used for non-spatio-temporal tasks, e.g., human action recognition, video classification or index prediction. In other cases, convolutions are used for end-to-end spatio-temporal tasks, i.e., to map spatio-temporal inputs onto spatial or spatio-temporal outputs, such as in spatio-temporal forecasting tasks. In this context, encoder–decoder architectures are also used, which first spatially downsample the input images, e.g., using valid convolutions and pooling operations, and then produce pixel-wise predictions over all the input spatial cells through a series of upsampling layers. For instance, a CNN based on U-Net [48] performs the spatio-temporal forecasting of arctic sea ice [43], whereas an encoder–decoder CNN based on transposed convolutions in its upsampling module carries out missing data imputation on remote sensing imagery [49]. Table 1 summarises all those articles, analysed in this work, that rely exclusively on convolutions to model spatio-temporal dynamics.

Benefits and Drawbacks of Convolutional Neural Networks

Two-dimensional and three-dimensional convolutions can be effectively used to extract spatio-temporal dependencies among data evolving through space and time. The convolution operation has some properties that can benefit many applications, such as translation equivariance, allowing CNNs to detect features regardless of their location in space and time, and weight sharing, i.e., reusing the same kernel weights across spatio-temporal locations, thus reducing the overall parameter number.
Two-dimensional convolutions are naturally suited to extracting spatial dependencies, and they can easily be extended to catch temporal dependencies as well. This extension involves gathering multiple temporal observations into a single tensor and making the convolution kernel as deep as the number of stacked temporal steps. This last aspect limits the number of temporal steps that can be gathered, since aggregating too many steps would make the application of 2D convolution computationally unfeasible. Furthermore, the application of 2D convolution to a spatio-temporal tensor flattens the temporal dimension, thus losing temporal localisation in the convolution output while also hindering model interpretability. A possible solution could process small batches of temporal observations and then stack the spatio-temporal features extracted by 2D convolutions from different temporal batches along the temporal dimension. However, this mechanism risks losing temporal coherence and may thus make the model unable to grasp spatio-temporal dynamics correctly.
Three-dimensional convolutions overcome these issues by allowing the kernel to move along both space and time, thus capturing spatio-temporal dependencies jointly and providing richer feature representations. This, however, implies a higher computational cost than 2D convolution for an equally sized kernel.
Two-dimensional and three-dimensional convolutions share some limitations. Both rely on Euclidean kernels [53] with a rigid structure, where only a contiguous and commonly small local spatio-temporal neighbourhood is considered at each kernel application. While in some cases this can be a strong inductive bias, in others it may not be. Although local, spatio-temporal dependencies can be non-Euclidean and follow irregular spatio-temporal patterns. In such cases, some techniques, e.g., deformable convolutions [54], can help. Furthermore, in many spatio-temporal scenarios, especially in Earth and climate domains, there are the so-called teleconnections [55], that is, strong dependencies between events occurring very far apart in space and time. Two-dimensional and three-dimensional convolutions can hardly cope with teleconnections due to their local Euclidean neighbourhoods. Although these phenomena could be modelled by stacking multiple convolutional layers and using larger kernel sizes, a better choice could be using dilated convolutions (see Section 3) or attention mechanisms (see Section 6).

4. Recurrent Neural Networks

Recurrent Neural Networks (RNNs) [10,36] are deep learning architectures designed to model temporal relationships among elements in a sequence. One of the earliest RNN formulations is commonly referred to as the vanilla RNN [56]. A vanilla RNN layer with $K$ units sequentially processes an input sequence $Z = \{z_1, z_2, \ldots, z_T\}$, handling one element $z_t \in \mathbb{R}^D$ at a time, and updates a hidden representation $h_t \in \mathbb{R}^K$ at each time step according to
$$ h_t = g\!\left(W h_{t-1} + U z_t + b\right) \quad \forall t \in [1, T], \tag{7} $$
where $W \in \mathbb{R}^{K \times K}$, $U \in \mathbb{R}^{K \times D}$ and $b \in \mathbb{R}^K$ are a hidden–hidden weight matrix, an input–hidden weight matrix and a bias, respectively; $g(\cdot)$ can be any generic nonlinear function. Although the vanilla RNN has been widely adopted in time-series modelling, two RNN variants are preferred: Long Short-Term Memory (LSTM) networks [57] and Gated Recurrent Units (GRUs) [58]. The main advantage of LSTM networks and GRUs over the vanilla RNN consists of their increased stability during training and, therefore, of their better modelling of temporal dependencies. Vanilla RNNs, trained with backpropagation, can suffer from vanishing or exploding gradients [10], namely, as the error gradient is back-propagated through a deep computational graph, some of its components might either get very small, giving an equally small contribution to the corresponding weight update, or very large, making training unstable. Both LSTM networks and GRUs adopt gating mechanisms to control the information flow in order to prevent derivatives from getting too small or too large.
Information flow within an LSTM unit is controlled through three gates, namely, the input, forget and output gates. Given the previous LSTM output $h_{t-1}$, an encoded representation $i_t$ of the current sequence element $z_t$ is first computed as
$$ i_t = \beta(W_i z_t + U_i h_{t-1} + b_i), \tag{8} $$
where $\beta(\cdot)$ can be any sigmoidal function, and $\{W_i, U_i, b_i\}$ are learnable parameters. Then, the activations of the three gates are computed according to the following equations:
$$ g_t = \sigma(W_g z_t + U_g h_{t-1} + b_g), \quad f_t = \sigma(W_f z_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o z_t + U_o h_{t-1} + b_o), \tag{9} $$
where $\sigma(\cdot)$ is the logistic function, $g_t$, $f_t$ and $o_t$ are the input, forget and output gates, respectively, and $\{W_g, U_g, b_g\}$, $\{W_f, U_f, b_f\}$ and $\{W_o, U_o, b_o\}$ are the corresponding sets of parameters. An inner state $c_t$ is then updated by a linear combination of the encoded input $i_t$ and $c_{t-1}$:
$$ c_t = g_t \odot i_t + f_t \odot c_{t-1}, \tag{10} $$
where $\odot$ is the element-wise product. Finally, the LSTM output $h_t$ is given by the current inner state, weighted by the output gate:
$$ h_t = o_t \odot \tanh(c_t). \tag{11} $$
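A compact sketch of a single LSTM unit implementing Equations (8)–(11) is reported below (an illustration, not the implementation used in the cited works; here $\beta$ is chosen as tanh and $\sigma$ as the logistic function):

```python
import torch
import torch.nn as nn

class LSTMCellFromEquations(nn.Module):
    """Sketch of an LSTM unit following Equations (8)-(11);
    beta is chosen as tanh and sigma as the logistic function."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        D, K = input_size, hidden_size
        # One (W, U, b) triplet per branch: encoded input i and gates g, f, o.
        self.W = nn.ModuleDict({n: nn.Linear(D, K, bias=True) for n in "igfo"})
        self.U = nn.ModuleDict({n: nn.Linear(K, K, bias=False) for n in "igfo"})

    def branch(self, name, z_t, h_prev):
        return self.W[name](z_t) + self.U[name](h_prev)

    def forward(self, z_t, h_prev, c_prev):
        i_t = torch.tanh(self.branch("i", z_t, h_prev))       # encoded input, Eq. (8)
        g_t = torch.sigmoid(self.branch("g", z_t, h_prev))    # input gate, Eq. (9)
        f_t = torch.sigmoid(self.branch("f", z_t, h_prev))    # forget gate, Eq. (9)
        o_t = torch.sigmoid(self.branch("o", z_t, h_prev))    # output gate, Eq. (9)
        c_t = g_t * i_t + f_t * c_prev                        # inner state, Eq. (10)
        h_t = o_t * torch.tanh(c_t)                           # output, Eq. (11)
        return h_t, c_t

cell = LSTMCellFromEquations(input_size=10, hidden_size=32)
h = c = torch.zeros(1, 32)
for z_t in torch.randn(5, 1, 10):          # unroll over a short sequence
    h, c = cell(z_t, h, c)
```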
A GRU is simpler than an LSTM unit as it is composed of just two gates: a reset and an update gate, respectively weighting the influence of the previous GRU output $h_{t-1}$ and of the current input $z_t$ on the current GRU output $h_t$. In a GRU layer, the two gates are computed first, according to the following equations:
$$ r_t = \sigma\!\left(W_r h_{t-1} + U_r z_t + b_r\right), \quad u_t = \sigma\!\left(W_u h_{t-1} + U_u z_t + b_u\right). \tag{12} $$
$r_t$ and $u_t$ are the reset and update gates, respectively; $\{W_r, U_r, b_r\}$ and $\{W_u, U_u, b_u\}$ are dedicated sets of parameters; $\sigma(\cdot)$ is a sigmoid activation function. An intermediate output $\tilde{h}_t$ is then computed using the reset gate:
$$ \tilde{h}_t = \tanh\!\left(W (r_t \odot h_{t-1}) + U z_t + b\right). \tag{13} $$
The final output $h_t$ of the GRU is then computed using the update gate as follows:
$$ h_t = u_t \odot h_{t-1} + (e - u_t) \odot \tilde{h}_t, \tag{14} $$
where $e$ is a column vector of ones.
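Analogously, a minimal sketch of a GRU following Equations (12)–(14) is the following (illustrative only; weight names mirror the equations above):

```python
import torch
import torch.nn as nn

class GRUCellFromEquations(nn.Module):
    """Sketch of a GRU unit following Equations (12)-(14)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        D, K = input_size, hidden_size
        self.Wr, self.Ur = nn.Linear(K, K), nn.Linear(D, K, bias=False)
        self.Wu, self.Uu = nn.Linear(K, K), nn.Linear(D, K, bias=False)
        self.W,  self.U  = nn.Linear(K, K), nn.Linear(D, K, bias=False)

    def forward(self, z_t, h_prev):
        r_t = torch.sigmoid(self.Wr(h_prev) + self.Ur(z_t))        # reset gate, Eq. (12)
        u_t = torch.sigmoid(self.Wu(h_prev) + self.Uu(z_t))        # update gate, Eq. (12)
        h_tilde = torch.tanh(self.W(r_t * h_prev) + self.U(z_t))   # intermediate output, Eq. (13)
        return u_t * h_prev + (1.0 - u_t) * h_tilde                # final output, Eq. (14)

cell = GRUCellFromEquations(input_size=10, hidden_size=32)
h = torch.zeros(1, 32)
for z_t in torch.randn(5, 1, 10):
    h = cell(z_t, h)
```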
Another relevant variant of the vanilla RNN is the Echo State Network (ESN) [59], belonging to the reservoir computing paradigm. In this framework, the latent input representation is generated by a fixed reservoir, which, in the case of ESNs, corresponds to the recurrent layer. ESNs employ the same update equations as a vanilla RNN layer (see Equation (7)). However, the matrices W and U are non-trainable and randomly fixed parameters. Additionally, a predefined sparsity level is enforced on these matrices by setting a subset of their elements to zero. The adoption of a fixed recurrent reservoir confers two primary advantages: ESNs are largely unaffected by vanishing or exploding gradients since no backpropagation is performed through the recurrent layer, and they can be trained significantly faster than vanilla RNNs, LSTMs or GRUs.
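A minimal NumPy sketch of an ESN is reported below (hyperparameters such as the reservoir size, sparsity level and spectral radius are arbitrary assumptions): the reservoir matrices $W$ and $U$ are random and fixed, the state update follows Equation (7), and only a linear readout is trained:

```python
import numpy as np

# Sketch of an Echo State Network: fixed sparse random reservoir, trained readout.
rng = np.random.default_rng(0)
D, K, T = 8, 300, 1000                       # input size, reservoir size, sequence length

W = rng.uniform(-0.5, 0.5, (K, K))
W[rng.random((K, K)) > 0.1] = 0.0            # enforce sparsity on hidden-hidden weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep the spectral radius below 1
U = rng.uniform(-0.5, 0.5, (K, D))           # fixed input-hidden weights

Z = rng.standard_normal((T, D))              # input sequence
H = np.zeros((T, K))
h = np.zeros(K)
for t in range(T):                           # same update as Equation (7), W and U not trained
    h = np.tanh(W @ h + U @ Z[t])
    H[t] = h

targets = rng.standard_normal((T, 1))        # placeholder targets
W_out = np.linalg.lstsq(H, targets, rcond=None)[0]   # only the readout is trained
```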
Despite being naturally suited to process temporal data, the RNN and its variants have also been adopted to process spatio-temporal data. For instance, a Spatio-Temporal RNN (STRNN) [60] can model spatio-temporal dependencies by cascading two vanilla RNNs: the former processes spatial dependencies and the latter the temporal ones. The spatial RNN, firstly applied, is a quad-directional RNN processing spatial data at each time step through recurrent formulations applied along four spatial directions. The temporal RNN is a bidirectional RNN processing the output of the spatial RNN in the time domain. Another approach is implemented by a Structural RNN (S-RNN) [61]. S-RNN was developed to tackle spatio-temporal tasks that can be modelled as generic spatio-temporal graphs. Hence, S-RNN can be seen as an alternative to spatio-temporal Graph Neural Networks [62]. S-RNN models relationships among graph nodes by assigning a dedicated RNN, in particular, a LSTM network, to each node and edge of the graph. In order to reduce the computational cost of S-RNN, RNNs are shared among nodes and edges with the same semantic meaning according to the task at hand. RNNs are used for the temporal processing of node and edge properties and are distinguished into nodeRNNs and edgeRNNs. When processing a given node v of the graph, edgeRNNs, assigned to all edges that are incident on v, are used to process the temporal sequence of the corresponding edge features first. Then, the nodeRNN assigned to v is used to process the temporal sequence of features of v, along with the output previously computed by edgeRNNs.
Echo State Networks have also been used for spatio-temporal forecasting tasks. For example, a quadratic ESN [63] has been employed to forecast sea surface temperature. Additionally, uncertainty quantification has been achieved through a bootstrap ensemble of ESNs, built by sampling multiple random reservoirs and averaging their predictions. The work has been extended [64] by considering deep ESNs, along with a Bayesian approach to uncertainty quantification, applied to the problem of soil moisture forecasting.
Table 2 summarises the works analysed herein that rely exclusively on RNNs and their variants to model spatio-temporal dependencies.

Benefits and Drawbacks of Recurrent Neural Networks

RNNs excel at processing sequences. While they can naturally model temporal dependencies, adaptations are required to properly capture spatial dependencies too.
In the literature, LSTMs, GRUs and ESNs are the most widely used RNN variants, since vanilla RNNs are characterised by unstable training, which prevents the network from capturing long-term temporal dependencies (see Section 4). Such a drawback is mostly overcome by LSTM networks and GRUs, which use gating mechanisms to control the information flow within recurrent cells. Nevertheless, it must be remarked that these models still suffer from short-term memory, i.e., the inability to grasp long-term dependencies. Consider the RNN output $y_t$ at time $t$. It is predicted starting from the current input $z_t$ and the hidden state $h_t$, which is a rough condensation of the past inputs. $h_t$ is the only way for the output layer to have knowledge about past inputs in the sequence. Unfortunately, there is no certainty that, after several recurrent steps $h_1, h_2, \ldots, h_{t-1}$, adequate information about temporally distant inputs survives into $h_t$. This effect gives the RNN and its most common variants a short-term memory, making them unable to preserve information from the distant past. A solution to this drawback is to use methods that can efficiently consider the whole temporal sequence (or most of it, at least) as one. Attention mechanisms are an example of such methods (see Section 6).
The short-term memory of RNNs is partially due to their sequential nature, which represents yet another limitation. Sequential processing in RNNs, in fact, limits parallelization of such models, making them hard to scale to longer sequences or more data.

5. Hybrid Convolutional–Recurrent Networks

A major research direction in spatio-temporal modelling has been the integration of convolutions with RNNs. On the one hand, convolutions are naturally suited to model spatial data (e.g., images); on the other, RNNs give their best when modelling temporal sequences. The idea is to join both approaches into a single, hybrid convolutional–recurrent network. In this domain, some works apply convolutions and RNNs as two separate modules [16,75,76,77], either convolution-first or RNN-first. Much attention, however, has been devoted to directly integrating convolutions within recurrent units. One of the earliest works in this direction is the Convolutional LSTM (ConvLSTM) [12]. ConvLSTM was proposed to tackle precipitation nowcasting tasks, but it is applicable to general spatio-temporal series modelling tasks. The core idea is to nest convolutions within the recurrent LSTM equations, i.e., Equations (8) and (9), by substituting matrix products with convolution operators. ConvLSTM is a modification of an alternative LSTM formulation [78], slightly different from the one described in Section 4. A ConvLSTM layer is controlled by the following equations:
$$ \begin{aligned} i_t &= \tanh(W_i * z_t + U_i * h_{t-1} + b_i), \\ g_t &= \sigma(W_g * z_t + U_g * h_{t-1} + V_g \odot c_{t-1} + b_g), \\ f_t &= \sigma(W_f * z_t + U_f * h_{t-1} + V_f \odot c_{t-1} + b_f), \\ c_t &= g_t \odot i_t + f_t \odot c_{t-1}, \\ o_t &= \sigma(W_o * z_t + U_o * h_{t-1} + V_o \odot c_t + b_o), \\ h_t &= o_t \odot \tanh(c_t). \end{aligned} \tag{15} $$
ConvLSTM processes a series of spatial data, such as multichannel images, thus $z_t \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels (e.g., RGB), and $H$ and $W$ are the spatial dimensions. The variables $c_t$ and $h_t$ denote, respectively, the inner state and the output of ConvLSTM and have the same spatial dimensions as the input $z_t$ ($i_t$, $g_t$, $f_t$ and $o_t$ also share the same dimensionality). Their number of channels depends on the number of filters used in the convolution (i.e., $*$). The triplets $(W_i, U_i, b_i)$, $(W_g, U_g, b_g)$, $(W_f, U_f, b_f)$ and $(W_o, U_o, b_o)$ are learnable parameters, as are the Hadamard weights $V_g$, $V_f$ and $V_o$; the first two elements of each triplet are convolutional kernels, while the last one is a bias tensor. Figure 4 shows a simplified example of ConvLSTM's inner mechanisms. Each cell in the hidden state $h_t$ depends on a small local neighbourhood from both the hidden state at the previous time step $h_{t-1}$ and the input at the current time step $z_t$. Such dependency is modelled through convolutions, which preserve the spatial structure.
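A simplified sketch of a ConvLSTM cell is given below (an illustration of the structure of Equation (15), not the original implementation; the peephole terms $V_g$, $V_f$ and $V_o$ are omitted for brevity, and the four branches are computed with a single convolution):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Sketch of a ConvLSTM unit: the matrix products of Equation (15) are
    replaced by 2D convolutions; the peephole terms V_g, V_f, V_o are omitted."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution computes i, g, f, o jointly from [z_t, h_{t-1}].
        self.conv = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                              kernel_size, padding=padding)

    def forward(self, z_t, h_prev, c_prev):          # all tensors: (batch, channels, H, W)
        gates = self.conv(torch.cat([z_t, h_prev], dim=1))
        i_t, g_t, f_t, o_t = torch.chunk(gates, 4, dim=1)
        i_t = torch.tanh(i_t)                        # encoded input
        g_t, f_t, o_t = map(torch.sigmoid, (g_t, f_t, o_t))
        c_t = g_t * i_t + f_t * c_prev
        h_t = o_t * torch.tanh(c_t)
        return h_t, c_t

cell = ConvLSTMCell(in_channels=1, hidden_channels=16)
z = torch.randn(2, 1, 32, 32)
h = c = torch.zeros(2, 16, 32, 32)
h, c = cell(z, h, c)                                 # spatial structure is preserved
```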
An extension of ConvLSTM is the Trajectory GRU (TrajGRU) [79]. TrajGRU is based on ConvGRU, a GRU where all matrix products in the reset and update gates have been replaced with convolutions, similarly to ConvLSTM. A key innovation of TrajGRU resides in using location-variant convolutions, where the neighbourhood structure at a given position is not fixed by the kernel size and shape but is learned during training. This is similar to the concept of deformable convolutions, and it is implemented through a trainable structure-generating network, that is, a small CNN that receives the current input $z_t$ and the previous GRU output $h_{t-1}$ and produces a flow field indicating the structure of the neighbourhood at any given location.
Another extension of ConvLSTM is the Predictive RNN (PredRNN) [14,80]. The key innovation of PredRNN resides in how information flows across time and between multiple ConvLSTM layers. In a multilayer scenario, the $l$-th ConvLSTM layer, at the $t$-th time step, receives two inputs: its own output from the previous time step, $h_{t-1}^{l}$, and the output of its preceding ConvLSTM layer, $h_t^{l-1}$, which is the current input $z_t$ if $l = 0$ (i.e., the first layer). In this way, layers are loosely interconnected and almost mutually independent: each layer tracks its own temporal evolution but has almost no knowledge of the features extracted so far by the whole network. The main finding of PredRNN is that a unified memory pool, tracking the evolution of information in time and across all layers, is useful for spatio-temporal predictive learning. PredRNN implements such a mechanism by adding an additional path along which information can flow: at each time step, the first ConvLSTM layer also receives the output of the last layer of the network from the previous time step. This new path is integrated in the canonical ConvLSTM layer through dedicated equations that can be found in the original papers [14,80].
The Memory in Memory (MiM) network [81] further extends PredRNN in order to better model non-stationary dynamics in spatio-temporal processes. The MiM authors noted that the forget gate of ConvLSTM (and of other LSTM-like units) often gets saturated, indicating that such units cannot adapt to non-stationary dynamics. They propose to overcome this limitation by replacing the forget gate with two cascaded modules, dedicated, respectively, to capturing non-stationary and stationary behaviours.
Finally, the aforementioned works are summarised in Table 3.

Benefits and Drawbacks of Hybrid Convolutional–Recurrent Networks

Convolutional and recurrent neural networks are suited to the processing of spatial and temporal data, respectively. Hybrid CNN–RNN models combine the best of both worlds, becoming a reasonable choice for the modelling of spatio-temporal data. Unlike fully connected RNNs, such hybrid models can process images as tensors directly, without the need to linearise them into vectors, therefore preserving the input spatial structure. Furthermore, the use of convolutional kernels gives these hybrid models the additional weight-sharing property, typical of CNNs. Therefore, hybrid models can generally have a smaller number of parameters than fully connected RNNs. However, since they are often based on Euclidean kernels, they may not capture long-range or irregular spatial dependencies (see Section 3).
Unlike convolutional models, thanks to the use of recurrent units, CNN-RNN hybrid models can capture temporal dependencies more precisely and at a finer level. However, similar to RNNs, these hybrid models can still suffer from vanishing/exploding gradients and from short-term memory problems when modelling very long sequences (see Section 4).

6. Transformers

Transformers [11] were originally designed for sequence-to-sequence transduction tasks in natural language processing, and they have soon been adapted for different tasks in the spatio-temporal field as well, such as spatio-temporal forecasting and video processing. Videos are actual examples of regular raster spatio-temporal data since any temporal sequence of spatial maps (i.e., images) can be seen as a video. Like RNNs, transformers can process sequences. Transformers, however, compare favourably against RNNs in sequence modelling tasks as they overcome two main RNN drawbacks: the difficulty in parallelising computations due to RNNs’ inherent sequential nature and short-term memory [11] (see Section 4). Transformers address these limitations through the use of attention mechanisms, which enable models to selectively focus on specific parts of the input depending on the information being processed. An attention mechanism operates on a query vector q and a set of N value vectors v i , each associated with a corresponding key vector k i . The query encodes the information currently being processed, while the keys and values represent the input elements over which attention is applied. Typically, keys and values are the same vectors. Among several definitions of attention, transformers adopt the scaled dot-product attention [11], which implements a dot product between query and key vectors. A transformer operates on sequences of queries, keys and values, which are generally referred to as tokens, a term borrowed from natural language processing, indicating an element of a sequence as processed by a transformer. Typically, there are multiple queries, and an attention mechanism is applied to each one of them, yielding multiple output vectors as follows:
$$ Y = \mathrm{Attention}(K, V, Q) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{D_k}}\right) V. \tag{16} $$
Here, $Q \in \mathbb{R}^{M \times D}$ denotes the matrix whose rows correspond to the query vectors. Analogously, the keys and values are arranged as the rows of the matrices $K \in \mathbb{R}^{N \times D}$ and $V \in \mathbb{R}^{N \times D}$, respectively. The attention output $Y \in \mathbb{R}^{M \times D}$ is a matrix whose $i$-th row represents the output vector associated with the $i$-th query. Within transformers, attention mechanisms are typically distinguished into self-attention and cross-attention. In self-attention, queries, keys and values originate from the same sequence, enabling attention among its elements, whereas in cross-attention, queries and keys/values come from different sequences.
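A minimal sketch of scaled dot-product attention, Equation (16), is the following (illustrative only; tensor sizes are arbitrary assumptions):

```python
import math
import torch

def scaled_dot_product_attention(K, V, Q):
    """Equation (16): Q is (M, D), K and V are (N, D)."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.shape[-1])   # (M, N)
    weights = torch.softmax(scores, dim=-1)                     # one distribution per query
    return weights @ V                                          # (M, D)

Q = torch.randn(10, 64)      # 10 queries
K = V = torch.randn(20, 64)  # 20 key/value tokens (self-attention when K, V, Q coincide)
Y = scaled_dot_product_attention(K, V, Q)   # (10, 64)
```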
In practice, transformers employ Multi-Head Attention (MHA), in which queries, keys and values are linearly projected $H$ times into subspaces of dimensions $D_k$, $D_k$ and $D_v$, respectively, using learned projection matrices. Scaled dot-product attention is then applied independently to each set of projections, after which the resulting outputs are concatenated and linearly mapped back to the original space. Each projection–attention couple defines an attention head $h_i$ with three learned matrices $W_i^K \in \mathbb{R}^{D \times D_k}$, $W_i^V \in \mathbb{R}^{D \times D_v}$ and $W_i^Q \in \mathbb{R}^{D \times D_k}$, used to project keys, values and queries, respectively. Each attention head computes scaled dot-product attention, as defined in Equation (16), over its projected inputs:
$$ h_i = \mathrm{Attention}(K W_i^K, V W_i^V, Q W_i^Q) \quad \forall i \in [1, H] \tag{17} $$
The outputs $h_i$ from all attention heads are concatenated into a single matrix and linearly re-projected to the original $D$-dimensional space via an additional projection matrix $W^o \in \mathbb{R}^{H D_v \times D}$:
$$ \mathrm{MHA}(K, V, Q) = \mathrm{Concatenate}(h_1, h_2, \ldots, h_H)\, W^o \tag{18} $$
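A compact sketch of Multi-Head Attention, Equations (17) and (18), is reported below (an illustration under the common assumption $D_k = D_v = D / H$; names and sizes are ours):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Sketch of Equations (17)-(18) with D_k = D_v = D // H."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.H, self.d_head = num_heads, d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)   # stacks W_i^Q for all heads
        self.W_k = nn.Linear(d_model, d_model, bias=False)   # stacks W_i^K
        self.W_v = nn.Linear(d_model, d_model, bias=False)   # stacks W_i^V
        self.W_o = nn.Linear(d_model, d_model, bias=False)   # output projection W^o

    def forward(self, K, V, Q):                              # (batch, tokens, d_model)
        def split(x):                                        # -> (batch, H, tokens, d_head)
            b, n, _ = x.shape
            return x.view(b, n, self.H, self.d_head).transpose(1, 2)
        q, k, v = split(self.W_q(Q)), split(self.W_k(K)), split(self.W_v(V))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        heads = torch.softmax(scores, dim=-1) @ v            # per-head attention, Eq. (17)
        b, _, m, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, m, self.H * self.d_head)
        return self.W_o(concat)                              # Equation (18)

mha = MultiHeadAttention(d_model=64, num_heads=8)
tokens = torch.randn(2, 50, 64)
out = mha(tokens, tokens, tokens)                            # self-attention: (2, 50, 64)
```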
Standard transformers adopt an encoder–decoder architecture [11]. The encoder processes the input sequence and, through self-attention, transforms it into a hidden representation. The decoder generates the output sequence by combining cross-attention over the encoder's hidden representation with self-attention over its previous outputs. Variants of the architecture often rely exclusively on either the encoder [87] or the decoder [88], depending on the specific task and the modelling paradigm. It is worthwhile to remark that transformers assume no intrinsic ordering of inputs, that is, the input is considered as a collection of samples rather than an ordered stream of samples. If a specific input ordering must be considered for the task at hand, it must be explicitly encoded in the input embedding. A common approach to encode sample position within a sequence consists in summing a positional embedding $E_{pos}$ to the main sample embedding $X \in \mathbb{R}^{L \times D}$ [11]:
$$ \mathrm{input} = X + E_{pos}. \tag{19} $$
Traditional transformers are naturally suited to process sequences. Efforts have been spent to adapt the transformer architecture for the processing of spatio-temporal data, mostly in the video processing domain. The extension has been gradual: first, transformers have been adapted to process images (i.e., regularly sampled spatial data), with models such as the vision transformer [89] and the Swin transformer [90]. Then, these models have been extended for the processing of videos [27,28], which are a further example of spatio-temporal data. In this section, two families of spatio-temporal transformers are discussed: spatio-temporal transformers developed for video processing tasks, primarily video classification tasks (Section 6.1), and spatio-temporal transformers built for specific, spatio-temporal tasks, such as Earth systems modelling or traffic flow prediction (Section 6.2).

6.1. Vision Transformers and Transformers for Video Processing

Vision transformers (ViT) [89] were introduced to tackle the problem of image classification without relying on convolution-based models. ViTs are based exclusively on the transformer architecture [11], and their main innovation concerns how the input is tokenized. Unlike in the natural language processing domain, representing an image as a set of tokens is not straightforward. ViT tokenizes an image by dividing it into small, non-overlapping 2D patches of fixed size. Each patch matrix is flattened and encoded through a linear embedding. A position embedding is added to encode the patch position within the original image. The sequence of patches is then processed by self-attention through a standard transformer encoder [11].
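A minimal sketch of ViT-style tokenization is the following (illustrative only; the image and patch sizes are the usual $224 \times 224$ and $16 \times 16$, but any compatible values would do): the image is split into non-overlapping patches, each patch is flattened and linearly embedded, and a learnable positional embedding is added:

```python
import torch
import torch.nn as nn

# Sketch of ViT-style tokenization: split an image into non-overlapping P x P
# patches, flatten each patch, map it to a D-dimensional embedding and add
# a learnable positional embedding.
B, C, H, W, P, D = 2, 3, 224, 224, 16, 768
img = torch.randn(B, C, H, W)

patches = img.unfold(2, P, P).unfold(3, P, P)            # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3)   # (B, H/P, W/P, C*P*P)
tokens = patches.flatten(1, 2)                           # (B, N, C*P*P), N = (H/P)*(W/P)

embed = nn.Linear(C * P * P, D)
pos = nn.Parameter(torch.zeros(1, tokens.shape[1], D))   # learnable positional embedding
x = embed(tokens) + pos                                  # (B, 196, 768), ready for the encoder
```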
A video-transformer-based model, inspired by ViT, is the time–space transformer (TimeSformer) [91]. TimeSformer tokenizes videos by sampling N frames at random and encoding each frame as a sequence of small patches, as in ViT. Patch sequences from all frames are concatenated in a single sequence, representing the whole video, which is processed by a standard transformer encoder, as in ViT. The computational efficiency of TimeSformer is improved by a divided space–time attention mechanism, which applies attention separately in time and space. Firstly, TimeSformer applies time attention among all tokens from the same temporal index; then, it applies space attention among all tokens with the same spatial index. A similar approach is followed by the video vision transformer (ViViT) [27]. It tokenizes the video in one of two ways: uniform frame sampling, i.e., the same tokenization procedure adopted in TimeSformer, and tubelet embedding, which extracts tokens from small non-overlapping spatio-temporal boxes that cover the entire input video. Each spatio-temporal box is flattened, encoded through a linear embedding and concatenated into one long token sequence representing the whole video. The resulting sequence is finally processed by a standard transformer encoder, as in ViT. However, ViViT offers some less computationally demanding variants, which factorise the processing of the spatial and temporal dimensions of the video at either architecture level or self-attention level. Figure 5 shows an example pipeline of a generic video vision transformer inspired by ViViT. First, patches are extracted from video frames according to the patching mechanism adopted by the specific model. A D-dimensional embedding is then computed for each patch, and positional embedding is added to encode patch position within the video. The sequence of patches is then input to a stack composed of L transformer encoder layers. Finally, a prediction head produces the output according to the task being tackled.
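A minimal sketch of tubelet embedding is reported below (illustrative only; the tubelet size and embedding dimension are assumptions): non-overlapping $t \times p \times p$ spatio-temporal boxes are extracted and embedded in a single step by a 3D convolution whose stride equals its kernel size:

```python
import torch
import torch.nn as nn

# Sketch of ViViT-style tubelet embedding: non-overlapping t x p x p spatio-temporal
# boxes are extracted and embedded in one step by a 3D convolution whose stride
# equals its kernel size.
B, C, T, H, W = 2, 3, 16, 224, 224
video = torch.randn(B, C, T, H, W)

t, p, D = 2, 16, 768
tubelet_embed = nn.Conv3d(C, D, kernel_size=(t, p, p), stride=(t, p, p))
tokens = tubelet_embed(video)                 # (B, D, T/t, H/p, W/p)
tokens = tokens.flatten(2).transpose(1, 2)    # (B, N, D) with N = (T/t)*(H/p)*(W/p)
```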
The multiview transformer (MTV) [92] is another spatio-temporal transformer model for video processing. MTV, based on ViViT, introduces a multiview architecture to process videos by considering multiple temporal scales. MTV tokenizes videos using the aforementioned tubelet embedding. Multi-scale views of the input video are constructed by tokenizing the video multiple times, and tubelets with different temporal sizes are used at each time. Views with smaller tubelets capture fine-grained video details, while larger tubelets capture coarser aspects of the video. Thus, MTV builds a sequence of multiscale views ordered from the coarsest to the finest scale. Each view is processed with a proper transformer encoder, just as ViT and ViViT. Furthermore, in MTV, each encoder has a cross-view fusion module that aggregates information sequentially between encoders of two adjacent views. Since MTV was proposed for video classification, the encoder of each view maintains a classification token that aggregates from all other tokens in the view the information required for the classification. Finally, classification tokens from each view are concatenated and processed by a further transformer encoder to yield the final classification.
Unlike the aforementioned ViT-based models, the unified transformer (UniFormer) [93] unifies, under one transformer architecture, the spatio-temporal locality awareness typical of 3D convolutions and the ability to capture long-term dependencies typical of transformers. UniFormer introduces a specific relation aggregator block with two variants: a local variant, capturing local spatio-temporal dependencies, and a global one, grasping global spatio-temporal dependencies across the whole video. The former block type is based on 3D convolutions, which model dependencies in small spatio-temporal neighbourhoods in a computationally cheap way, while the latter is based on self-attention. Both block types work with multiple heads, each computing an output with its own parameters according to the block type. The outputs of all heads are concatenated and linearly projected back to the input space. These relation aggregator blocks are organised in such a way that local dependencies are grasped in shallow layers of the network, while global ones are modelled by deeper layers. In fact, UniFormer is composed of four relation aggregator blocks: the first two are local, while the remaining two, the deepest in the network, are global. Three-dimensional convolution layers are employed between each block pair in order to downsample the original input video both spatially and temporally.
The Swin transformer [90] is an alternative to ViT in the image processing domain. Its main novelty is the shifted window attention. As in ViT, Swin tokenizes images from small non-overlapping patches, which are concatenated along the depth and encoded through a linear embedding. Tokens are then divided into small, non-overlapping, fixed-size squares (windows), and multi-head attention is applied only among tokens of the same window. To connect tokens of different windows, attention is applied in two steps. First, attention is applied as just described; then, each window is shifted by half its size towards the bottom-right corner of the image, and attention is reapplied on these shifted windows. In this way, tokens take into account the content of adjacent windows. The application of windowed attention is, therefore, always followed by the application of a shifted window attention. Furthermore, patch merging mechanisms downsample the input image in deeper layers. The video Swin transformer [28] is a straightforward extension of the Swin transformer to the video processing domain. First, video Swin tokenizes videos by extracting small, non-overlapping spatio-temporal boxes. Just like in Swin, these boxes are concatenated along the depth and projected linearly. Video Swin also adopts the shifted window attention, but, in this case, 3D windows of a fixed size are considered. Attention is again applied in two phases: first, using a spatio-temporal windowing of all tokens and then considering a shifted window configuration in order to connect the neighbouring windows. Patch merging is employed before each attention block, and it works exactly as in the Swin transformer, namely, the spatial dimensions are each downsampled, while the temporal dimension is left untouched.
There are several other relevant transformer-based models for video processing. For instance, the Video Transformer Network (VTN) [94], for video classification, builds on the well-known Longformer model [95] to process longer video clips; the Video Action Transformer Network [26] identifies actions in videos; Trackformer [96] performs multiple object tracking in videos; and the Spatio-temporal Masked Autoencoder [97] explores unsupervised pre-training and transfer learning on videos. Table 4 summarises all transformer-based models for video processing that have been analysed herein.

6.2. Transformers for Spatio-Temporal Processing in Heterogeneous Domains

In addition to video processing, the transformer architecture has also been adapted for spatio-temporal modelling in other applicative domains, such as Earth systems forecasting and traffic flow prediction. A relevant work in this area is Earthformer [13], a transformer-based model for spatio-temporal Earth systems forecasting. Earthformer proposes a generic space–time attention, the cuboid attention, which can capture long-term dependencies between input values while maintaining a relatively low computational complexity. Another relevant work using transformers to tackle Earth-related tasks is Contextformer [15], specifically proposed to forecast vegetation health from satellite imagery by leveraging the GreenEarthNet dataset [15], proposed in the same article. Contextformer relies solely on transformers to model both space and time. It follows a similar methodology to MMST-ViT [112], and it is based on pyramid vision transformers [113,114] and Presto [115], a pre-trained transformer for satellite time series, to encode visual inputs and to model temporal relationships, respectively. The SwinLSTM network [116] is yet another architecture that tackles heterogeneous spatio-temporal prediction tasks, such as traffic flow prediction, by integrating the Swin transformer attention block (see Section 6.1) into the LSTM unit (see Section 4). Papers using transformer-based deep learning models to tackle spatio-temporal tasks in heterogeneous domains are summarised in Table 5.

6.3. Benefits and Drawbacks of Transformers

Transformers offer several advantages for regular raster spatio-temporal prediction tasks. First, the use of attention mechanisms allows transformers to efficiently catch global dependencies, which is fundamental in many spatio-temporal prediction tasks, where long-range dependencies exist among input data. Moreover, attention mechanisms can be easily parallelised, making transformers generally more scalable than recurrent models. This property is particularly useful in the regular raster spatio-temporal context, where dependencies need to be captured simultaneously across space and time. However, the computational cost of self-attention scales quadratically with the number of queries involved, making it crucial to limit the sequence length or adopt strategies that reduce the effective number of queries. Over the past few years, several works have focused on dealing with longer sequences while keeping a relatively low computational cost, e.g., the Longformer [95].
Transformers easily support the integration of data from multiple sources and modalities through attention mechanisms, particularly cross-attention. This capability is very useful in spatio-temporal prediction tasks, such as those in the environmental domain, where data are often heterogeneous and multimodal.
Transformers are also well known for being massively data-hungry, and their performance typically depends on the availability of large datasets. Fortunately, there are nowadays many pre-trained transformers, especially in the vision domain, which have been trained on extensive datasets and can be fine-tuned to regular raster spatio-temporal prediction tasks, e.g., video analysis or environmental forecasting.

7. Diffusion Models for Spatio-Temporal Modelling

Generative models have been a breakthrough in machine learning due to their large applicability and their high performance in various applicative domains, such as computer vision [124,125,126], natural language processing [87,88] and time-series prediction [127,128,129]. Generative Adversarial Networks (GANs) [130] were, historically, at the forefront of generative modelling as they showed a superior performance compared to other deep learning methods in data generation tasks. However, in recent years, diffusion models (DMs) [131,132] have noticeably outperformed GANs in generative modelling tasks, gaining significant attention not only for their performance but also for several key advantages. First, they are relatively simple: the neural networks employed in DMs often rely on simple architectures, e.g., autoencoders, and require comparatively few parameters given the complexity of the problems that they address. In addition, they are firmly grounded in statistical physics, dynamics and Bayesian theory, which provides them with strong theoretical foundations. Finally, their versatility makes them highly applicable to many relevant real-world tasks, often requiring simple and limited modifications to their original architectures. Such large applicability of DMs has also manifested itself in a rising interest in spatio-temporal problems, such as generation [133], forecasting [134], classification [135] and anomaly detection [136].
The goal of generative models is to learn the underlying probability distribution $p(x)$ generating the observed data $x$. If this probability distribution is known, generating new samples $x$ simply consists of sampling from $p(x)$. Diffusion models can be considered as an extension of Variational Auto-Encoders (VAEs) [137] with three additional characteristics. First, a diffusion model is a Markovian cascade of VAEs, yielding a sequence of latent variables $\{x_t\}_{t \in [1, T]}$ whose dimensionality is the same as that of the input space. Second, the structure of the encoder is fixed as a linear Gaussian model that uses the output sampled from the Gaussian distribution at the previous iteration as the mean of the Gaussian distribution at the next iteration. Finally, the encoder process should guarantee that the latent distribution of the final iteration $T$ is a standard Gaussian distribution $\mathcal{N}(0, I)$. According to the first characteristic, the encoder distribution of a DM can be written as
$$ q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \tag{20} $$
where $x_0$ is the input sample, and $T$ is the number of diffusion iterations. Based on the second point, the distribution $q(x_t \mid x_{t-1})$ is a Gaussian distribution whose mean depends on $x_{t-1}$ due to the Markovian assumption. Specifically, the mean $\mu_t(x_t) = \sqrt{\alpha_t}\, x_{t-1}$ and the variance $\Sigma_t(x_t) = (1 - \alpha_t) I$ of the $t$-th distribution incorporate the coefficient $\alpha_t$ that preserves the variance among all the latent variables. In short, the second assumption can be mathematically defined as
$$ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{\alpha_t}\, x_{t-1},\, (1 - \alpha_t) I\right). \tag{21} $$
The final characteristic is guaranteed by letting $\alpha_t$ evolve gradually and sufficiently over time in such a way that $x_T \sim \mathcal{N}(0, I)$. The application of such an iterative Markovian process to the input sample $x_0$ defines the so-called forward process, which produces, after $T$ iterations, the final latent sample $x_T$. The goal is now to define the reverse process, whose objective is to generate a sample $x_0$, starting from $x_T$, which is as similar as possible to the original one. Mathematically, this process is defined by the joint distribution:
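A minimal sketch of the forward process of Equations (20) and (21) is the following (the linear noise schedule is an assumption for illustration): each step rescales the previous sample and adds Gaussian noise so that, after $T$ iterations, the result is approximately distributed as $\mathcal{N}(0, I)$:

```python
import torch

# Sketch of the forward (noising) process of Equations (20)-(21): each step adds
# Gaussian noise while rescaling the previous sample so that variance is preserved.
T = 1000
alphas = 1.0 - torch.linspace(1e-4, 0.02, T)      # a simple (assumed) noise schedule

def forward_step(x_prev, t):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, (1 - alpha_t) I)."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(alphas[t]) * x_prev + torch.sqrt(1.0 - alphas[t]) * noise

x = torch.randn(1, 3, 32, 32)                     # x_0: an input image
for t in range(T):                                # after T steps, x_T is close to N(0, I)
    x = forward_step(x, t)
```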
$$ p(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \tag{22} $$
where $p(x_T)$ is a standard Gaussian, as stated before, and $p_\theta(x_{t-1} \mid x_t)$ is a learned conditional distribution that effectively models the unknown transition distribution of the reverse process, $q(x_{t-1} \mid x_t)$. The distribution $p_\theta$ is generally implemented as a neural network such that
$$ p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\, \mu_t^{\theta}(x_t),\, \sigma_t^2 I\right), \tag{23} $$
where $\mu_t^{\theta}(x_t)$ is the mean predicted by a $\theta$-parameterised neural network, and $\sigma_t^2 I$ defines a fixed noise scale according to the schedule $\sigma_t = \sqrt{1 - \alpha_t}$.
In principle, an Evidence Lower Bound (ELBO) objective function could be used to train DMs. Unfortunately, it can be proven that such an objective has a high variance, causing instabilities during training [131]. To avoid such a drawback, Denoising Diffusion Probabilistic Models (DDPMs) [132] have been proposed, recasting the training of DMs as a denoising task according to the following objective:
$$ \underset{\theta}{\operatorname{argmin}} \; \gamma(\alpha_t)\, \left\lVert \epsilon - \epsilon_t^{\theta}(x_t) \right\rVert_2^2, \tag{24} $$
where $\epsilon \sim \mathcal{N}(0, I)$ is the noise used to corrupt the original sample $x_0$, $\epsilon_t^{\theta}(x_t)$ is the amount of noise to subtract, predicted by a $\theta$-parameterised neural network, and $\gamma(\alpha_t)$ indicates a positive coefficient that depends on the noise schedule parameter $\alpha_t$.
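A sketch of one DDPM training step for the objective in Equation (24) is reported below (illustrative only: the denoising network is a placeholder that ignores the timestep, $\gamma(\alpha_t)$ is set to 1 and, as in standard DDPM implementations, the corruption uses the cumulative product of the schedule coefficients in closed form):

```python
import torch
import torch.nn as nn

# Sketch of one DDPM training step for Equation (24). alpha_bar[t] is the
# cumulative product of the schedule coefficients, so that, in closed form,
# x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) * eps (standard DDPM practice).
T = 1000
alphas = 1.0 - torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(alphas, dim=0)

denoiser = nn.Sequential(                      # placeholder epsilon-prediction network
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))
optim = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

x0 = torch.randn(8, 3, 32, 32)                 # a batch of training samples
t = torch.randint(0, T, (x0.shape[0],))
eps = torch.randn_like(x0)
a = alpha_bar[t].view(-1, 1, 1, 1)
x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps   # corrupted samples

optim.zero_grad()
loss = ((eps - denoiser(x_t)) ** 2).mean()     # || eps - eps_theta(x_t) ||^2
loss.backward()
optim.step()
```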
A crucial limitation of DDPMs is their slow sampling speed. To generate a single sample x 0 , the entire sequence of reverse diffusion steps must be performed as the forward process is assumed to be Markovian. If the number T of diffusion iterations is large, the inference time of DMs can become unfeasible. Denoising Diffusion Implicit Models (DDIMs) [138] overcome this limitation, reformulating the forward transition q ( x t | x t 1 ) by conditioning it on the original sample x 0 via Bayes’ rule:
q ( x t | x t 1 , x 0 ) = q ( x t 1 | x t , x 0 ) q ( x t | x 0 ) q ( x t 1 | x 0 ) .
The resulting forward process is, therefore, non-Markovian; however, it must be noted that DDIM’s forward process retains the same marginal distributions  q ( x t | x t 1 ) as DDPMs. DDIM’s sampling procedure is derived by conditioning the learned reverse distribution on the original sample x 0 :
$p_{\sigma,\theta}(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left( \sqrt{\alpha_{t-1}}\, x_0 + \sqrt{1 - \alpha_{t-1} - \sigma_t^2}\; \dfrac{x_t - \sqrt{\alpha_t}\, x_0}{\sqrt{1 - \alpha_t}},\; \sigma_t^2 I \right),$
where $\sigma_t \in \mathbb{R}_{+}$ introduces a controllable level of stochasticity in the reverse process. Notably, in the denoising setting, the term $x_t$ in Equation (26) is only used to estimate the source noise $\epsilon$, which can be expressed as follows:
$\epsilon = \dfrac{x_t - \sqrt{\alpha_t}\, x_0}{\sqrt{1 - \alpha_t}}.$
Since the original sample $x_0$ is not available during the reverse process, it has to be estimated. This can be achieved by inverting Equation (27) and introducing a neural network $\epsilon_{\theta}(x_t)$ that predicts the noise $\epsilon$:
$x_0 \approx f_{\theta}(x_t) = \dfrac{x_t - \sqrt{1 - \alpha_t}\; \epsilon_{\theta}(x_t)}{\sqrt{\alpha_t}}.$
Equations (27) and (28) provide a first interpretation of DMs, in which the training goal is to predict the noise that corrupts the source signal. However, other equivalent interpretations of DMs can be found in the literature [131]. In Table 6, some applications of DMs to spatio-temporal prediction tasks are reported.
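To show how Equations (26)–(28) are used at inference time, the hedged sketch below performs one deterministic DDIM update (i.e., $\sigma_t = 0$); the noise network `eps_net` and the cumulative schedule `a_bar` follow the conventions of the previous sketch and are illustrative assumptions.

```python
import torch

@torch.no_grad()
def ddim_step(eps_net, x_t, t, t_prev, a_bar):
    """One deterministic DDIM update from step t to step t_prev (sigma_t = 0).

    a_bar -- tensor with the cumulative products of the noise schedule, indexed by step.
    """
    eps_hat = eps_net(x_t, t)                                             # predicted source noise
    x0_hat = (x_t - (1 - a_bar[t]).sqrt() * eps_hat) / a_bar[t].sqrt()    # Equation (28)
    # Move the estimate to the marginal of step t_prev (Equation (26) with sigma_t = 0).
    return a_bar[t_prev].sqrt() * x0_hat + (1 - a_bar[t_prev]).sqrt() * eps_hat
```

Because the update only requires the marginals, `t_prev` may skip several intermediate steps, which is what makes DDIM sampling considerably faster than running the full reverse chain.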

Benefits and Drawbacks of Diffusion Models

DMs are powerful probabilistic generative models that have attracted significant attention for their outstanding performance in generative modelling and beyond. In addition to their strong performance, their probabilistic nature enables them to estimate prediction uncertainty, which can greatly enhance the value of model predictions. Furthermore, the DM framework is highly flexible, as models trained only for data generation can easily be readapted to different tasks, such as inpainting or missing data imputation.
DMs are not free of drawbacks. By construction, they are bound to use a Gaussian as the starting distribution [144]. This limits the solution space of the Markovian process that DMs have to approximate, since it is constrained to always start from a normal distribution. This limitation is overcome by Flow Models, which can start from any distribution [145,146].
The generation process of DMs is stochastic, as randomness is inherent in their generative process. Unlike other models, such as Flow Models, this property causes the reverse process of DMs to converge more slowly to the target, often requiring many reverse diffusion steps. To address this, alternative sampling methods have been developed that reduce the number of steps in the reverse process, thereby accelerating sampling while still preserving the marginal distributions [138,147].
DMs and Flow Models alike enforce a very specific instantaneous transformation of the sample at each time step of the generation process. This aspect, which may limit model expressiveness, is overcome by the latest frontier of generative modelling, Generator Matching [148], which encompasses DMs, Flow Models and jump processes [146] in a single framework.

8. Discussion and Future Research Directions

Previous sections presented the most commonly adopted deep learning architectures for regular raster spatio-temporal prediction. As shown in Figure 6, there is a fairly even distribution of articles with respect to the reviewed architectures, with a slightly higher concentration of works on transformers for video processing.
Literature analysis highlights five applicative domains in which regular raster spatio-temporal data are most common, namely, video processing [26,27,28,29], environmental monitoring [12,13,14,15], remote sensing [16,17,18,19], traffic modelling [20,21,22,23] and energy production prediction [24,25]. As displayed in Figure 7, video processing is the realm in which spatio-temporal prediction on regular raster data is most often encountered, followed by environmental monitoring, remote sensing, traffic modelling and energy prediction. It should be noted that environmental monitoring and remote sensing may sometimes overlap, as environmental applications often handle remote sensing data. In this review, the works categorised as remote sensing are those applying their methodology to remote sensing data without necessarily performing environmental monitoring (e.g., super-resolution of satellite images).
Each architecture relies on specific functional mechanisms and principles, making it more or less suitable for a specific task. Figure 8 shows, for each reviewed architecture, the percentage of articles related to particular domains.
Since the introduction of ViTs, the transformer architecture has become the state of the art in the video processing field. The reasons for this success are twofold: first, the flexibility of attention mechanisms makes transformers a solid foundation for building customised and more complex models for video processing; second, attention mechanisms can adaptively model long-range spatio-temporal dependencies. This latter property also makes transformers ideal candidates for environmental modelling tasks, where teleconnections and long-term dependencies generally exist within spatio-temporal data. Diffusion models have also achieved remarkable performance on video processing tasks, especially in video generation. It is important to note, however, that diffusion models denote a deep learning framework rather than a specific architecture. Early diffusion models were commonly built around CNNs based on U-Net [48] and, more recently, on diffusion transformers [149], a transformer architecture specifically designed for generative modelling and the integration of data from multiple modalities. Despite their effectiveness, transformer models also have some major drawbacks (see Section 6.3), which keep more traditional models, such as CNNs, RNNs and CNN-RNN hybrids, still relevant nowadays, especially in the environmental sciences, as can be seen from Figure 8. Although limited in capturing global or long-range dependencies, the inductive biases of these models (e.g., the locality of CNNs) can make them competitive against more advanced architectures in several scenarios, e.g., when data availability is scarce or when an estimate of prediction reliability is not crucial.
For the sake of completeness, it must be remarked that, although all the aforementioned architectures are presented for the processing of regularly spaced raster spatio-temporal data, they can also be adapted to process irregularly spaced data. However, this type of data is more naturally modelled using spatio-temporal extensions of Graph Neural Networks [150,151], which are out of the scope of this review. For an in-depth discussion of Graph Neural Networks for spatio-temporal prediction, the reader is referred to [62]. The topic of regular raster spatio-temporal prediction is wide, disparate and complex; hence, some aspects not analysed herein deserve much attention. The major future lines of research are reported in the following.
  • Uncertainty estimation for spatio-temporal predictions. With the sole exception of diffusion models, the reviewed deep learning approaches do not yield uncertainty estimates for their spatio-temporal predictions. This aspect is crucial in many spatio-temporal prediction tasks concerning environmental safety [152,153] or extreme event forecasting [154]. Prediction uncertainty quantification can be introduced through the Bayesian learning framework [155], or it can be achieved using generative AI models, such as Diffusion or Flow Models (see Section 7).
  • Physics constraints in spatio-temporal prediction. The spatio-temporal dynamics of many real-world phenomena follow physical laws and properties. Deep learning is of great utility for inferring such spatio-temporal dynamics, but, in many cases, its data-driven nature can be empowered by embedding physical constraints, governing equations and first principles directly into the training process [156] (a minimal sketch of a physics-informed loss is given after this list). In this way, predictions can be not only statistically accurate but also coherent with known physical laws. This property is particularly important in many regular raster spatio-temporal prediction tasks, such as climate modelling [157,158] and fluid dynamics, where respecting physical laws is critical.
  • Model explainability. While deep learning models can achieve strong predictive performance, they act as black boxes, making it difficult, especially for users who are not familiar with deep learning, to understand why certain predictions are obtained, thus limiting trust and applicability. Therefore, it is important to have strategies and methods that make deep learning models explainable in spatio-temporal prediction tasks [159]. A relevant direction towards model explainability is causality [160], namely, discovering cause–effect relationships within spatio-temporal input data. Integrating causality into deep learning models can not only provide more interpretable insights into the underlying spatio-temporal dynamics, but also yield models that generalise more robustly.
  • Foundation models. Foundation models [161] are a relevant recent topic in deep learning. Representing the next generation of pre-trained models, they are trained on massive and heterogeneous datasets, providing general-purpose backbones that can be fine-tuned or extended to diverse tasks. Recently, pre-trained foundation models have also been proposed for spatio-temporal data, especially for Earth observation. Some of the most relevant models in this field are Aurora [162], Prithvi [163], Clay [164], Presto [115] and DOFA [165], the last of which leverages the concept of neural plasticity to adaptively integrate multiple data modalities according to the task being tackled. In fact, multimodal learning is another relevant topic in regular raster spatio-temporal prediction. Multimodal learning refers to the ability to integrate and use multiple data sources to address a given task, leveraging the complementary information provided by heterogeneous data modalities. This capability plays a crucial role in many Earth observation tasks, such as cloud removal, where the combined use of optical and radar data has proven particularly advantageous [166]. Furthermore, it is worth remarking that a few foundation models based on Large Language Models (LLMs) have been applied to spatio-temporal prediction [167]. Finally, in the field of pre-trained models, the Mixture-of-Experts [168] is a further relevant topic. Mixture-of-experts models, generally empowered by transformers [169], can be used to build large pre-trained models while maintaining moderate computational complexity. They are deep learning architectures in which different experts, themselves neural networks, process different parts of the input. A routing network decides which experts should process a given input, activating only a few at a time (a toy routing sketch is provided after this list). Such sparse computation allows mixture-of-experts models to achieve performance similar to traditional dense models while being less computationally demanding and more scalable.
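As anticipated in the item on physics constraints, a common recipe is to augment the data-fitting loss with the residual of a governing equation evaluated via automatic differentiation. The sketch below is a minimal, hypothetical PyTorch example; the 1D advection equation, the coordinate-based network `u_net` and the weight `lam` are illustrative assumptions, not a prescribed recipe.

```python
import torch

def physics_informed_loss(u_net, x, t, u_obs, lam=1.0, c=1.0):
    """Data-fitting loss plus the residual of a 1D advection equation u_t + c * u_x = 0."""
    x = x.clone().requires_grad_(True)     # spatial coordinates, shape (N, 1)
    t = t.clone().requires_grad_(True)     # temporal coordinates, shape (N, 1)
    u = u_net(x, t)                        # predicted field at the (x, t) locations
    data_loss = torch.mean((u - u_obs) ** 2)
    # Automatic differentiation provides the partial derivatives needed by the PDE residual.
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    residual = u_t + c * u_x
    return data_loss + lam * torch.mean(residual ** 2)
```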
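Similarly, the sparse routing mechanism described in the foundation-models item can be sketched in a few lines; the toy layer below uses arbitrary choices (expert width, number of experts and top-k value) purely for illustration and activates only the top-k experts selected by a gating network for each token.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Toy mixture-of-experts layer: a gating network activates only top_k experts per token."""

    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.gate(x)                    # routing scores, (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e            # each expert processes only the tokens routed to it
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```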

9. Conclusions

Herein, a survey of the most relevant deep learning architectures for regular raster spatio-temporal prediction is provided. For each reviewed deep learning architecture, the following aspects have been analysed and discussed:
  • Methodology and techniques underlying reviewed models;
  • Benefits and drawbacks, allowing the identification of domains where some models can be more effective than others;
  • Diverse applicative realms, with their respective datasets, where regular raster spatio-temporal prediction is relevant.
Specifically, the literature analysis highlighted that the most commonly used deep learning methodologies for processing regular raster spatio-temporal data are CNNs, RNNs, hybrid CNN-RNN models, transformers and diffusion models. Moreover, the fields in which regular raster spatio-temporal prediction is most often encountered are environmental monitoring and video processing, although it also remains popular in remote sensing, traffic modelling and energy production prediction. Finally, a brief discussion of other very recent and future research directions concerning regular raster spatio-temporal prediction has been given, paying particular attention to prediction uncertainty estimation, physics-constrained models, explainability, pre-trained foundation models and multimodal learning.

Author Contributions

Conceptualization, V.C., A.C. and F.C.; methodology, V.C. and A.C.; validation, V.C., A.C. and F.C.; formal analysis, F.C.; investigation, V.C. and A.C.; resources, V.C., A.C. and F.C.; writing—original draft preparation, V.C.; writing—review and editing, V.C., A.C. and F.C.; visualization, V.C.; supervision, F.C.; project administration, F.C. All authors have read and agreed to the published version of the manuscript.

Funding

Francesco Camastra’s work was supported by the Digital Twin and Fintech services for sustainable supply chain (SmarTwin) project (Fondo per la Crescita Sostenibile—Accordi per l’innovazione di cui al D.M. 31 dicembre 2021 e D.D. 18 marzo 2022-CUP B69J23000500005) of the Ministero dello Sviluppo Economico (MISE); the context-AwaRe deCision-making for Autonomus unmmaneD vehicles in mArine environmental monitoring (ARCAD-IA) project (PE00000013_1-CUP E63C22002150007), cascade call of the Future Artificial Intelligence Research (FAIR) project Spoke 3-Resilient AI, within the National Recovery and Resilience Plan (PNRR) of the Italian Ministry of University and Research (MUR).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xiao, C.; Chen, N.; Hu, C.; Wang, K.; Xu, Z.; Cai, Y.; Xu, L.; Chen, Z.; Gong, J. A spatiotemporal deep learning model for sea surface temperature field prediction using time-series satellite data. Environ. Model. Softw. 2019, 120, 104502. [Google Scholar] [CrossRef]
  2. Censi, A.M.; Ienco, D.; Gbodjo, Y.J.E.; Pensa, R.G.; Interdonato, R.; Gaetano, R. Attentive Spatial Temporal Graph CNN for Land Cover Mapping From Multi Temporal Remote Sensing Data. IEEE Access 2021, 9, 23070–23082. [Google Scholar] [CrossRef]
  3. Glomb, K.; Rué Queralt, J.; Pascucci, D.; Defferrard, M.; Tourbier, S.; Carboni, M.; Rubega, M.; Vulliémoz, S.; Plomp, G.; Hagmann, P. Connectome spectral analysis to track EEG task dynamics on a subsecond scale. NeuroImage 2020, 221, 117137. [Google Scholar] [CrossRef] [PubMed]
  4. Pillai, K.G.; Angryk, R.A.; Banda, J.M.; Schuh, M.A.; Wylie, T. Spatio-temporal Co-occurrence Pattern Mining in Data Sets with Evolving Regions. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining Workshops, Brussels, Belgium, 10–13 December 2012; pp. 805–812. [Google Scholar] [CrossRef]
  5. Morosin, R.; De La Cruz Rodríguez, J.; Díaz Baso, C.J.; Leenaarts, J. Spatio-temporal analysis of chromospheric heating in a plage region. Astron. Astrophys. 2022, 664, A8. [Google Scholar] [CrossRef]
  6. Cressie, N.A.C. Statistics for Spatial Data; Wiley series in probability and mathematical statistics; Wiley: New York, NY, USA, 1993. [Google Scholar]
  7. Cressie, N.A.C.; Wikle, C.K. Statistics for Spatio-Temporal Data; Wiley series in probability and statistics; Wiley: Hoboken, NJ, USA, 2011. [Google Scholar]
  8. Atluri, G.; Karpatne, A.; Kumar, V. Spatio-Temporal Data Mining: A Survey of Problems and Methods. Acm Comput. Surv. 2019, 51, 1–41. [Google Scholar] [CrossRef]
  9. Wang, S.; Cao, J.; Yu, P.S. Deep Learning for Spatio-Temporal Data Mining: A Survey. IEEE Trans. Knowl. Data Eng. 2020, 34, 3681–3700. [Google Scholar] [CrossRef]
  10. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 6000–6010. [Google Scholar]
  12. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-k.; WOO, W.-c. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28, pp. 802–810. [Google Scholar]
  13. Gao, Z.; Shi, X.; Wang, H.; Zhu, Y.; Wang, Y.B.; Li, M.; Yeung, D.Y. Earthformer: Exploring Space-Time Transformers for Earth System Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 25390–25403. [Google Scholar]
  14. Wang, Y.; Wu, H.; Zhang, J.; Gao, Z.; Wang, J.; Yu, P.S.; Long, M. PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2208–2225. [Google Scholar] [CrossRef]
  15. Benson, V.; Robin, C.; Requena-Mesa, C.; Alonso, L.; Carvalhais, N.; Cortés, J.; Gao, Z.; Linscheid, N.; Weynants, M.; Reichstein, M. Multi-modal Learning for Geospatial Vegetation Forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 27788–27799. [Google Scholar]
  16. Mou, L.; Bruzzone, L.; Zhu, X.X. Learning Spectral-Spatial-Temporal Features via a Recurrent Convolutional Neural Network for Change Detection in Multispectral Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 924–935. [Google Scholar] [CrossRef]
  17. Xiao, Y.; Yuan, Q.; He, J.; Zhang, Q.; Sun, J.; Su, X.; Wu, J.; Zhang, L. Space-time super-resolution for satellite video: A joint framework based on multi-scale spatial-temporal transformer. Int. J. Appl. Earth Obs. Geoinf. 2022, 108, 102731. [Google Scholar] [CrossRef]
  18. Ebel, P.; Fare Garnot, V.S.; Schmitt, M.; Wegner, J.D.; Zhu, X.X. UnCRtainTS: Uncertainty Quantification for Cloud Removal in Optical Satellite Time Series. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 2086–2096. [Google Scholar] [CrossRef]
  19. Panboonyuen, T.; Charoenphon, C.; Satirapod, C. SatDiff: A Stable Diffusion Framework for Inpainting Very High-Resolution Satellite Imagery. IEEE Access 2025, 13, 51617–51631. [Google Scholar] [CrossRef]
  20. Guo, S.; Lin, Y.; Li, S.; Chen, Z.; Wan, H. Deep Spatial–Temporal 3D Convolutional Neural Networks for Traffic Data Forecasting. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3913–3926. [Google Scholar] [CrossRef]
  21. Cui, Z.; Ke, R.; Pu, Z.; Wang, Y. Stacked bidirectional and unidirectional LSTM recurrent neural network for forecasting network-wide traffic state with missing values. Transp. Res. Part Emerg. Technol. 2020, 118, 102674. [Google Scholar] [CrossRef]
  22. Yan, H.; Ma, X.; Pu, Z. Learning Dynamic and Hierarchical Traffic Spatiotemporal Features with Transformer. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22386–22399. [Google Scholar] [CrossRef]
  23. Yu, D.; Guo, G.; Wang, D.; Zhang, H.; Li, B.; Xu, G.; Deng, S. Modeling dynamic spatio-temporal correlations and transitions with time window partitioning for traffic flow prediction. Expert Syst. Appl. 2024, 252, 124187. [Google Scholar] [CrossRef]
  24. Huang, H.; Castruccio, S.; Genton, M.G. Forecasting High-Frequency Spatio-Temporal Wind Power with Dimensionally Reduced Echo State Networks. J. R. Stat. Soc. Ser. Appl. Stat. 2022, 71, 449–466. [Google Scholar] [CrossRef]
  25. Žalik, M.; Mongus, D.; Lukač, N. High-resolution spatiotemporal assessment of solar potential from remote sensing data using deep learning. Renew. Energy 2024, 222, 119868. [Google Scholar] [CrossRef]
  26. Girdhar, R.; Joao Carreira, J.; Doersch, C.; Zisserman, A. Video Action Transformer Network. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 244–253. [Google Scholar] [CrossRef]
  27. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A Video Vision Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846. [Google Scholar] [CrossRef]
  28. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar] [CrossRef]
  29. Zhai, S.; Ye, Z.; Liu, J.; Xie, W.; Hu, J.; Peng, Z.; Xue, H.; Chen, D.; Wang, X.; Yang, L.; et al. StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2025; pp. 26822–26833. [Google Scholar]
  30. Wikle, C.K.; Zammit Mangion, A.; Cressie, N.A.C. Spatio-Temporal Statistics with R; Chapman & Hall/CRC: The R Series; CRC Press: Boca Raton, FL, USA; Taylor & Francis Group: London, UK; New York, NY, USA, 2019. [Google Scholar]
  31. Pfeifer, P.E.; Deutsch, S.J. A Three-Stage Iterative Procedure for Space-Time Modeling. Technometrics 1980, 22, 35. [Google Scholar] [CrossRef]
  32. Pfeifer, P.E.; Deutsch, S.J. Seasonal Space-Time ARIMA Modeling. Geogr. Anal. 1981, 13, 117–133. [Google Scholar] [CrossRef]
  33. Stoffer, D.S. Estimation and Identification of Space-Time ARMAX Models in the Presence of Missing Data. J. Am. Stat. Assoc. 1986, 81, 762–772. [Google Scholar] [CrossRef]
  34. Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 2000. [Google Scholar] [CrossRef]
  35. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  36. Bishop, C.M.; Bishop, H. Deep Learning: Foundations and Concepts, 1st ed.; Springer International Publishing: Berlin/Heidelberg, Germany, 2024. [Google Scholar] [CrossRef]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  38. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal Convolutional Networks for Action Segmentation and Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1003–1012. [Google Scholar] [CrossRef]
  39. Zhang, J.; Zheng, Y.; Qi, D.; Li, R.; Yi, X.; Li, T. Predicting citywide crowd flows using deep spatio-temporal residual networks. Artif. Intell. 2018, 259, 147–166. [Google Scholar] [CrossRef]
  40. Ham, Y.G.; Kim, J.H.; Luo, J.J. Deep learning for multi-year ENSO forecasts. Nature 2019, 573, 568–572. [Google Scholar] [CrossRef] [PubMed]
  41. Ayzel, G.; Heistermann, M.; Sorokin, A.; Nikitin, O.; Lukyanova, O. All convolutional neural networks for radar-based precipitation nowcasting. Procedia Comput. Sci. 2019, 150, 186–192. [Google Scholar] [CrossRef]
  42. Zammit-Mangion, A.; Wikle, C.K. Deep integro-difference equation models for spatio-temporal forecasting. Spat. Stat. 2020, 37, 100408. [Google Scholar] [CrossRef]
  43. Andersson, T.R.; Hosking, J.S.; Pérez-Ortiz, M.; Paige, B.; Elliott, A.; Russell, C.; Law, S.; Jones, D.C.; Wilkinson, J.; Phillips, T.; et al. Seasonal Arctic sea ice forecasting with probabilistic deep learning. Nat. Commun. 2021, 12, 5124. [Google Scholar] [CrossRef]
  44. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231. [Google Scholar] [CrossRef]
  45. Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar] [CrossRef]
  46. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  47. Dumoulin, V.; Visin, F. A guide to convolution arithmetic for deep learning. arXiv 2016, arXiv:1603.07285. [Google Scholar] [CrossRef]
  48. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. [Google Scholar]
  49. Wu, P.; Yin, Z.; Yang, H.; Wu, Y.; Ma, X. Reconstructing Geostationary Satellite Land Surface Temperature Imagery Based on a Multiscale Feature Connected Convolutional Neural Network. Remote Sens. 2019, 11, 300. [Google Scholar] [CrossRef]
  50. Hutchison, D.; Kanade, T.; Kittler, J.; Kleinberg, J.M.; Mattern, F.; Mitchell, J.C.; Naor, M.; Nierstrasz, O.; Pandu Rangan, C.; Steffen, B.; et al. Convolutional Learning of Spatio-temporal Features. In Computer Vision—ECCV 2010; Daniilidis, K., Maragos, P., Paragios, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6316, pp. 140–153. [Google Scholar]
  51. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  52. Gao, Z.; Tan, C.; Wu, L.; Li, S.Z. SimVP: Simpler yet Better Video Prediction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3160–3170. [Google Scholar] [CrossRef]
  53. Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Process. Mag. 2017, 34, 18–42. [Google Scholar] [CrossRef]
  54. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  55. Yang, X.; DelSole, T. Systematic Comparison of ENSO Teleconnection Patterns between Models and Observations. J. Clim. 2012, 25, 425–446. [Google Scholar] [CrossRef]
  56. Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
  57. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  58. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  59. Jaeger, H. The “echo state” approach to analysing and training recurrent neural networks-with an erratum note. GMD Tech. Rep. 148; German National Research Center for Information Technology: Bonn, Germany, 2001; p. 13. [Google Scholar]
  60. Zhang, T.; Zheng, W.; Cui, Z.; Zong, Y.; Li, Y. Spatial–Temporal Recurrent Neural Network for Emotion Recognition. IEEE Trans. Cybern. 2019, 49, 839–847. [Google Scholar] [CrossRef] [PubMed]
  61. Jain, A.; Zamir, A.R.; Savarese, S.; Saxena, A. Structural-RNN: Deep Learning on Spatio-Temporal Graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  62. Capone, V.; Casolaro, A.; Camastra, F. Spatio-temporal prediction using graph neural networks: A survey. Neurocomputing 2025, 643, 130400. [Google Scholar] [CrossRef]
  63. McDermott, P.L.; Wikle, C.K. An ensemble quadratic echo state network for non-linear spatio-temporal forecasting. Stat 2017, 6, 315–330. [Google Scholar] [CrossRef]
  64. McDermott, P.L.; Wikle, C.K. Deep echo state networks with uncertainty quantification for spatio-temporal forecasting. Environmetrics 2019, 30, e2553. [Google Scholar] [CrossRef]
  65. Fragkiadaki, K.; Levine, S.; Felsen, P.; Malik, J. Recurrent Network Models for Human Dynamics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
  66. Srivastava, N.; Mansimov, E.; Salakhudinov, R. Unsupervised Learning of Video Representations using LSTMs. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 843–852. [Google Scholar]
  67. Jia, X.; Khandelwal, A.; Nayak, G.; Gerber, J.; Carlson, K.; West, P.; Kumar, V. Predict Land Covers with Transition Modeling and Incremental Learning. In Proceedings of the 2017 SIAM International Conference on Data Mining (SDM), Houston, TX, USA, 27–29 April 2017; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2017; pp. 171–179. [Google Scholar] [CrossRef]
  68. Jia, X.; Khandelwal, A.; Nayak, G.; Gerber, J.; Carlson, K.; West, P.; Kumar, V. Incremental Dual-memory LSTM in Land Cover Prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 867–876. [Google Scholar] [CrossRef]
  69. Reddy, D.S.; Prasad, P.R.C. Prediction of vegetation dynamics using NDVI time series data and LSTM. Model. Earth Syst. Environ. 2018, 4, 409–419. [Google Scholar] [CrossRef]
  70. Ndikumana, E.; Ho Tong Minh, D.; Baghdadi, N.; Courault, D.; Hossard, L. Deep Recurrent Neural Network for Agricultural Classification using multitemporal SAR Sentinel-1 for Camargue, France. Remote Sens. 2018, 10, 1217. [Google Scholar] [CrossRef]
  71. McDermott, P.L.; Wikle, C.K. Bayesian Recurrent Neural Network Models for Forecasting and Quantifying Uncertainty in Spatial-Temporal Data. Entropy 2019, 21, 184. [Google Scholar] [CrossRef]
  72. Vlachas, P.; Pathak, J.; Hunt, B.; Sapsis, T.; Girvan, M.; Ott, E.; Koumoutsakos, P. Backpropagation algorithms and Reservoir Computing in Recurrent Neural Networks for the forecasting of complex spatiotemporal dynamics. Neural Netw. 2020, 126, 191–217. [Google Scholar] [CrossRef] [PubMed]
  73. Lees, T.; Tseng, G.; Atzberger, C.; Reece, S.; Dadson, S. Deep Learning for Vegetation Health Forecasting: A Case Study in Kenya. Remote Sens. 2022, 14, 698. [Google Scholar] [CrossRef]
  74. Liu, Q.; Yang, M.; Mohammadi, K.; Song, D.; Bi, J.; Wang, G. Machine Learning Crop Yield Models Based on Meteorological Features and Comparison with a Process-Based Model. Artif. Intell. Earth Syst. 2022, 1, e220002. [Google Scholar] [CrossRef]
  75. Interdonato, R.; Ienco, D.; Gaetano, R.; Ose, K. DuPLO: A DUal view Point deep Learning architecture for time series classificatiOn. Isprs J. Photogramm. Remote Sens. 2019, 149, 91–104. [Google Scholar] [CrossRef]
  76. Qiu, C.; Mou, L.; Schmitt, M.; Zhu, X.X. Local climate zone-based urban land cover classification from multi-seasonal Sentinel-2 images with a recurrent residual network. Isprs J. Photogramm. Remote Sens. 2019, 154, 151–162. [Google Scholar] [CrossRef]
  77. Kaur, A.; Goyal, P.; Sharma, K.; Sharma, L.; Goyal, N. A Generalized Multimodal Deep Learning Model for Early Crop Yield Prediction. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 1272–1279. [Google Scholar] [CrossRef]
  78. Graves, A. Generating Sequences with Recurrent Neural Networks. arXiv 2013, arXiv:1308.0850. [Google Scholar] [CrossRef]
  79. Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.Y.; Wong, W.k.; WOO, W.c. Deep Learning for Precipitation Nowcasting: A Benchmark and A New Model. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5622–5632. [Google Scholar]
  80. Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  81. Wang, Y.; Zhang, J.; Zhu, H.; Long, M.; Wang, J.; Yu, P.S. Memory in Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity From Spatiotemporal Dynamics. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9146–9154. [Google Scholar] [CrossRef]
  82. Yang, Y.; Dong, J.; Sun, X.; Lima, E.; Mu, Q.; Wang, X. A CFCC-LSTM Model for Sea Surface Temperature Prediction. IEEE Geosci. Remote Sens. Lett. 2018, 15, 207–211. [Google Scholar] [CrossRef]
  83. Ienco, D.; Interdonato, R.; Gaetano, R.; Ho Tong Minh, D. Combining Sentinel-1 and Sentinel-2 Satellite Image Time Series for land cover mapping via a multi-source deep learning architecture. Isprs J. Photogramm. Remote Sens. 2019, 158, 11–22. [Google Scholar] [CrossRef]
  84. Boulila, W.; Ghandorh, H.; Khan, M.A.; Ahmed, F.; Ahmad, J. A novel CNN-LSTM-based approach to predict urban expansion. Ecol. Inform. 2021, 64, 101325. [Google Scholar] [CrossRef]
  85. Wu, H.; Yao, Z.; Wang, J.; Long, M. MotionRNN: A Flexible Model for Video Prediction with Spacetime-Varying Motions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15430–15439. [Google Scholar] [CrossRef]
  86. Robin, C.; Requena-Mesa, C.; Benson, V.; Alonso, L.; Poehls, J.; Carvalhais, N.; Reichstein, M. Learning to Forecast Vegetation Greenness at Fine Resolution over Africa with ConvLSTMs. arXiv 2022, arXiv:2210.13648. [Google Scholar] [CrossRef]
  87. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar] [CrossRef]
  88. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. Preprint, 2018; in press. [Google Scholar]
  89. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
  90. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  91. Bertasius, G.; Wang, H.; Torresani, L. Is Space-Time Attention All You Need for Video Understanding? In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Cambridge MA, USA, 2021; Volume 139, pp. 813–824. [Google Scholar]
  92. Yan, S.; Xiong, X.; Arnab, A.; Lu, Z.; Zhang, M.; Sun, C.; Schmid, C. Multiview Transformers for Video Recognition. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3323–3333. [Google Scholar] [CrossRef]
  93. Li, K.; Wang, Y.; Peng, G.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  94. Neimark, D.; Bar, O.; Zohar, M.; Asselmann, D. Video Transformer Network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Montreal, BC, Canada, 11–17 October 2021; pp. 3163–3172. [Google Scholar]
  95. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar] [CrossRef]
  96. Meinhardt, T.; Kirillov, A.; Leal-Taixé, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8844–8854. [Google Scholar]
  97. Feichtenhofer, C.; Fan, H.; Li, Y.; He, K. Masked Autoencoders As Spatiotemporal Learners. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 35946–35958. [Google Scholar]
  98. Caballero, J.; Ledig, C.; Aitken, A.; Acosta, A.; Totz, J.; Wang, Z.; Shi, W. Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2848–2857. [Google Scholar] [CrossRef]
  99. Zeng, Y.; Fu, J.; Chao, H. Learning Joint Spatial-Temporal Transformations for Video Inpainting. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12361, pp. 528–543. [Google Scholar]
  100. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10428–10437. [Google Scholar] [CrossRef]
  101. Alfasly, S.; Chui, C.K.; Jiang, Q.; Lu, J.; Xu, C. An Effective Video Transformer with Synchronized Spatiotemporal and Spatial Self-Attention for Action Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 2496–2509. [Google Scholar] [CrossRef]
  102. Lin, K.; Li, L.; Lin, C.C.; Ahmed, F.; Gan, Z.; Liu, Z.; Lu, Y.; Wang, L. SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17928–17937. [Google Scholar] [CrossRef]
  103. Yu, Y.; Ni, R.; Zhao, Y.; Yang, S.; Xia, F.; Jiang, N.; Zhao, G. MSVT: Multiple Spatiotemporal Views Transformer for DeepFake Video Detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4462–4471. [Google Scholar] [CrossRef]
  104. Tao, R.; Huang, B.; Zou, X.; Zheng, G. SVT-SDE: Spatiotemporal Vision Transformers-Based Self-Supervised Depth Estimation in Stereoscopic Surgical Videos. IEEE Trans. Med. Robot. Bionics 2023, 5, 42–53. [Google Scholar] [CrossRef]
  105. Zhou, W.; Zhao, Y.; Zhang, F.; Luo, B.; Yu, L.; Chen, B.; Yang, C.; Gui, W. TSDTVOS: Target-guided spatiotemporal dual-stream transformers for video object segmentation. Neurocomputing 2023, 555, 126582. [Google Scholar] [CrossRef]
  106. Hsu, T.C.; Liao, Y.S.; Huang, C.R. Video Summarization with Spatiotemporal Vision Transformer. IEEE Trans. Image Process. 2023, 32, 3013–3026. [Google Scholar] [CrossRef]
  107. Gupta, A.; Tian, S.; Zhang, Y.; Wu, J.; Martín-Martín, R.; Fei-Fei, L. MaskViT: Masked Visual Pre-Training for Video Prediction. In Proceedings of the The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  108. Liang, J.; Cao, J.; Fan, Y.; Zhang, K.; Ranjan, R.; Li, Y.; Timofte, R.; Van Gool, L. VRT: A Video Restoration Transformer. IEEE Trans. Image Process. 2024, 33, 2171–2182. [Google Scholar] [CrossRef]
  109. Korban, M.; Youngs, P.; Acton, S.T. A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6055–6069. [Google Scholar] [CrossRef]
  110. Gu, F.; Lu, J.; Cai, C.; Zhu, Q.; Ju, Z. RTSformer: A Robust Toroidal Transformer with Spatiotemporal Features for Visual Tracking. IEEE Trans.-Hum.-Mach. Syst. 2024, 54, 214–225. [Google Scholar] [CrossRef]
  111. Li, M.; Li, F.; Meng, B.; Bai, R.; Ren, J.; Huang, Z.; Gao, C. Spatiotemporal Representation Enhanced ViT for Video Recognition. In MultiMedia Modeling; Rudinac, S., Hanjalic, A., Liem, C., Worring, M., Jonsson, B., Liu, B., Yamakata, Y., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2024; Volume 14554, pp. 28–40. [Google Scholar] [CrossRef]
  112. Lin, F.; Crawford, S.; Guillot, K.; Zhang, Y.; Chen, Y.; Yuan, X.; Chen, L.; Williams, S.; Minvielle, R.; Xiao, X.; et al. MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 5751–5761. [Google Scholar] [CrossRef]
  113. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 548–558. [Google Scholar] [CrossRef]
  114. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  115. Tseng, G.; Cartuyvels, R.; Zvonkov, I.; Purohit, M.; Rolnick, D.; Kerner, H. Lightweight, Pre-trained Transformers for Remote Sensing Timeseries. arXiv 2023, arXiv:2304.14065. [Google Scholar] [CrossRef]
  116. Tang, S.; Li, C.; Zhang, P.; Tang, R. SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 13470–13479. [Google Scholar]
  117. Li, Z.; Chen, G.; Zhang, T. A CNN-Transformer Hybrid Approach for Crop Classification Using Multitemporal Multisensor Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 847–858. [Google Scholar] [CrossRef]
  118. Aksan, E.; Kaufmann, M.; Cao, P.; Hilliges, O. A Spatio-temporal Transformer for 3D Human Motion Prediction. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 565–574. [Google Scholar] [CrossRef]
  119. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607514. [Google Scholar] [CrossRef]
  120. Huang, L.; Mao, F.; Zhang, K.; Li, Z. Spatial-Temporal Convolutional Transformer Network for Multivariate Time Series Forecasting. Sensors 2022, 22, 841. [Google Scholar] [CrossRef] [PubMed]
  121. Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  122. Wang, Y.; Hong, D.; Sha, J.; Gao, L.; Liu, L.; Zhang, Y.; Rong, X. Spectral–Spatial–Temporal Transformers for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5536814. [Google Scholar] [CrossRef]
  123. Yu, M.; Masrur, A.; Blaszczak-Boxe, C. Predicting hourly PM2.5 concentrations in wildfire-prone areas using a SpatioTemporal Transformer model. Sci. Total Environ. 2023, 860, 160446. [Google Scholar] [CrossRef]
  124. Yi, Z.; Zhang, H.; Tan, P.; Gong, M. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  125. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  126. Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv 2018, arXiv:1809.11096. [Google Scholar] [CrossRef]
  127. Yoon, J.; Jarrett, D.; van der Schaar, M. Time-series Generative Adversarial Networks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F.d., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  128. Liao, S.; Ni, H.; Szpruch, L.; Wiese, M.; Sabate-Vidales, M.; Xiao, B. Conditional Sig-Wasserstein GANs for Time Series Generation. arXiv 2020, arXiv:2006.05421. [Google Scholar] [CrossRef]
  129. Li, X.; Metsis, V.; Wang, H.; Ngu, A.H.H. TTS-GAN: A Transformer-Based Time-Series Generative Adversarial Network. In Artificial Intelligence in Medicine; Michalowski, M., Abidi, S.S.R., Abidi, S., Eds.; Springer International Publishing: Cham, Switzerland, 2022; Volume 13263, pp. 133–143. [Google Scholar]
  130. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
  131. Luo, C. Understanding Diffusion Models: A Unified Perspective. arXiv 2022, arXiv:2208.11970. [Google Scholar] [CrossRef]
  132. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  133. Yuan, H.; Zhou, S.; Yu, S. EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models. arXiv 2023, arXiv:2303.05656. [Google Scholar]
  134. Rühling Cachay, S.; Zhao, B.; Joren, H.; Yu, R. DYffusion: A Dynamics-informed Diffusion Model for Spatiotemporal Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 45259–45287. [Google Scholar]
  135. Han, X.; Zheng, H.; Zhou, M. CARD: Classification and Regression Diffusion Models. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 18100–18115. [Google Scholar]
  136. Awasthi, A.; Ly, S.T.; Nizam, J.; Mehta, V.; Ahmad, S.; Nemani, R.; Prasad, S.; Nguyen, H.V. Anomaly Detection in Satellite Videos Using Diffusion Models. In Proceedings of the 2024 IEEE 26th International Workshop on Multimedia Signal Processing (MMSP), West Lafayette, IN, USA, 2–4 October 2024; pp. 1–6. [Google Scholar] [CrossRef]
  137. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar] [CrossRef]
  138. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  139. Ye, X.; Bilodeau, G.A. STDiff: Spatio-Temporal Diffusion for Continuous Stochastic Video Prediction. Proc. Aaai Conf. Artif. Intell. 2024, 38, 6666–6674. [Google Scholar] [CrossRef]
  140. Zhao, Z.; Dong, X.; Wang, Y.; Hu, C. Advancing Realistic Precipitation Nowcasting with a Spatiotemporal Transformer-Based Denoising Diffusion Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  141. Zou, X.; Li, K.; Xing, J.; Zhang, Y.; Wang, S.; Jin, L.; Tao, P. DiffCR: A Fast Conditional Diffusion Framework for Cloud Removal From Optical Satellite Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  142. Liu, H.; Liu, J.; Hu, T.; Ma, H. Spatio-Temporal Probabilistic Forecasting of Wind Speed Using Transformer-Based Diffusion Models. IEEE Trans. Sustain. Energy 2025, 1–13. [Google Scholar] [CrossRef]
  143. Yao, S.; Zhang, X.; Liu, X.; Liu, M.; Cui, Z. STDD: Spatio-Temporal Dual Diffusion for Video Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 12575–12584. [Google Scholar]
  144. Anderson, B.D. Reverse-time diffusion equation models. Stoch. Processes Their Appl. 1982, 12, 313–326. [Google Scholar] [CrossRef]
  145. Lipman, Y.; Chen, R.T.Q.; Ben-Hamu, H.; Nickel, M.; Le, M. Flow Matching for Generative Modeling. arXiv 2022, arXiv:2210.02747. [Google Scholar] [CrossRef]
  146. Lipman, Y.; Havasi, M.; Holderrieth, P.; Shaul, N.; Le, M.; Karrer, B.; Chen, R.T.Q.; Lopez-Paz, D.; Ben-Hamu, H.; Gat, I. Flow Matching Guide and Code. arXiv 2024, arXiv:2412.06264. [Google Scholar] [CrossRef]
  147. Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models. Mach. Intell. Res. 2025, 22, 730–751. [Google Scholar] [CrossRef]
  148. Holderrieth, P.; Havasi, M.; Yim, J.; Shaul, N.; Gat, I.; Jaakkola, T.; Karrer, B.; Chen, R.T.Q.; Lipman, Y. Generator Matching: Generative modeling with arbitrary Markov processes. In Proceedings of the International Conference on Representation Learning, Singapore, 24–28 April 2025; Volume 2025, pp. 52153–52219. [Google Scholar]
  149. Peebles, W.; Xie, S. Scalable Diffusion Models with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023. [Google Scholar] [CrossRef]
  150. Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; Volume 2, pp. 729–734. [Google Scholar] [CrossRef]
  151. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 2009, 20, 61–80. [Google Scholar] [CrossRef] [PubMed]
  152. Xu, L.; Chen, N.; Yang, C.; Yu, H.; Chen, Z. Quantifying the uncertainty of precipitation forecasting using probabilistic deep learning. Hydrol. Earth Syst. Sci. 2022, 26, 2923–2938. [Google Scholar] [CrossRef]
  153. Casolaro, A.; Capone, V.; Camastra, F. Predicting ground-level nitrogen dioxide concentrations using the BaYesian attention-based deep neural network. Ecol. Inform. 2025, 87, 103097. [Google Scholar] [CrossRef]
  154. Kapoor, A.; Negi, A.; Marshall, L.; Chandra, R. Cyclone trajectory and intensity prediction with uncertainty quantification using variational recurrent neural networks. Environ. Model. Softw. 2023, 162, 105654. [Google Scholar] [CrossRef]
  155. MacKay, D.J.C. A Practical Bayesian Framework for Backpropagation Networks. Neural Comput. 1992, 4, 448–472. [Google Scholar] [CrossRef]
  156. Raissi, M.; Perdikaris, P.; Karniadakis, G. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707. [Google Scholar] [CrossRef]
  157. Hess, P.; Drüke, M.; Petri, S.; Strnad, F.M.; Boers, N. Physically constrained generative adversarial networks for improving precipitation fields from Earth system models. Nat. Mach. Intell. 2022, 4, 828–839. [Google Scholar] [CrossRef]
  158. Harder, P.; Hernandez-Garcia, A.; Ramesh, V.; Yang, Q.; Sattegeri, P.; Szwarcman, D.; Watson, C.; Rolnick, D. Hard-Constrained Deep Learning for Climate Downscaling. J. Mach. Learn. Res. 2023, 24, 1–40. [Google Scholar]
  159. Verdone, A.; Scardapane, S.; Panella, M. Explainable Spatio-Temporal Graph Neural Networks for multi-site photovoltaic energy production. Appl. Energy 2024, 353, 122151. [Google Scholar] [CrossRef]
  160. Peters, J.; Janzing, D.; Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms; Adaptive computation and machine learning series; The MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
  161. Awais, M.; Naseer, M.; Khan, S.; Anwer, R.M.; Cholakkal, H.; Shah, M.; Yang, M.H.; Khan, F.S. Foundation Models Defining a New Era in Vision: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2245–2264. [Google Scholar] [CrossRef] [PubMed]
  162. Bodnar, C.; Bruinsma, W.P.; Lucic, A.; Stanley, M.; Allen, A.; Brandstetter, J.; Garvan, P.; Riechert, M.; Weyn, J.A.; Dong, H.; et al. A foundation model for the Earth system. Nature 2025, 641, 1180–1187. [Google Scholar] [CrossRef]
  163. Szwarcman, D.; Roy, S.; Fraccaro, P.; Gislason, T.E.; Blumenstiel, B.; Ghosal, R.; de Oliveira, P.H.; Almeida, J.L.d.S.; Sedona, R.; Kang, Y.; et al. Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications. arXiv 2024, arXiv:2412.02732. [Google Scholar] [CrossRef]
  164. Clay. Clay Foundation Model. 2025. Available online: https://madewithclay.org/ (accessed on 18 October 2025).
  165. Xiong, Z.; Wang, Y.; Zhang, F.; Stewart, A.J.; Hanna, J.; Borth, D.; Papoutsis, I.; Saux, B.L.; Camps-Valls, G.; Zhu, X.X. Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation. arXiv 2024, arXiv:2403.15356. [Google Scholar] [CrossRef]
  166. Ebel, P.; Xu, Y.; Schmitt, M.; Zhu, X.X. SEN12MS-CR-TS: A Remote-Sensing Data Set for Multimodal Multitemporal Cloud Removal. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  167. Fang, L.; Xiang, W.; Pan, S.; Salim, F.D.; Chen, Y.P.P. Spatiotemporal Pretrained Large Language Model for Forecasting with Missing Values. IEEE Internet Things J. 2025, 12, 13838–13850. [Google Scholar] [CrossRef]
  168. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  169. Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 2022, 23, 1–39. [Google Scholar]
Figure 1. (a) An example of regularly spaced spatio-temporal data. (b) An example of irregularly spaced spatio-temporal data.
Figure 2. An example application of 2D convolution to a spatio-temporal input. Each coloured block in the convolution input tensor indicates a different application of the kernel. The resulting output is highlighted with the same colour.
Figure 3. An example application of 3D convolution to a spatio-temporal input. Each coloured block in the convolution input tensor indicates a different application of the kernel. The resulting output is highlighted with the same colour.
Figure 4. Example of a ConvLSTM in which matrix products, typical of fully connected RNNs, are substituted with convolutions. Terms h_t and z_t denote, at time t, the hidden state and the input tensor of the ConvLSTM, respectively.
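As a complement to Figure 4, the sketch below shows how the four gate pre-activations of a ConvLSTM cell can be computed with a single convolution over the concatenation of the input tensor z_t and the previous hidden state h_(t-1), in place of the matrix products of a fully connected LSTM. A PyTorch implementation is assumed; channel sizes and the toy usage loop are illustrative only.

```python
# Minimal ConvLSTM cell sketch (PyTorch assumed). Hidden and cell states are
# feature maps, so spatial structure is preserved across time steps.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, z_t, state):
        h_prev, c_prev = state
        gates = self.gates(torch.cat([z_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c_t = f * c_prev + i * g          # cell state keeps the spatial layout
        h_t = o * torch.tanh(c_t)         # hidden state is a feature map
        return h_t, c_t

# Toy usage: 4 frames with 1 channel on a 32x32 grid, 8 hidden channels.
cell = ConvLSTMCell(in_channels=1, hidden_channels=8)
h = torch.zeros(1, 8, 32, 32)
c = torch.zeros(1, 8, 32, 32)
for t in range(4):
    z_t = torch.randn(1, 1, 32, 32)
    h, c = cell(z_t, (h, c))
print(h.shape)  # (1, 8, 32, 32)
```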
Figure 5. Depiction of a video vision transformer pipeline inspired by ViViT.
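A rough sketch of the pipeline depicted in Figure 5, assuming PyTorch and purely illustrative sizes: non-overlapping spatio-temporal tubelets are embedded with a 3D convolution, flattened into a token sequence and processed by a standard Transformer encoder. Positional embeddings and the factorised attention variants of ViViT are omitted for brevity, so this is not the original ViViT architecture.

```python
# Minimal video-vision-transformer sketch (PyTorch assumed), loosely inspired
# by the ViViT-style pipeline of Figure 5. All sizes are illustrative.
import torch
import torch.nn as nn

class TinyVideoTransformer(nn.Module):
    def __init__(self, in_channels=3, embed_dim=128, num_classes=10,
                 tubelet=(2, 16, 16), depth=4, heads=4):
        super().__init__()
        # Tubelet embedding: stride = kernel size gives non-overlapping patches.
        self.embed = nn.Conv3d(in_channels, embed_dim,
                               kernel_size=tubelet, stride=tubelet)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, video):                       # (B, C, T, H, W)
        tokens = self.embed(video)                  # (B, D, T', H', W')
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, T'*H'*W', D)
        tokens = self.encoder(tokens)               # global space-time attention
        return self.head(tokens.mean(dim=1))        # pooled class logits

model = TinyVideoTransformer()
clip = torch.randn(2, 3, 8, 64, 64)  # 2 clips, 8 frames of 64x64 RGB
print(model(clip).shape)             # (2, 10)
```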
Figure 6. Distribution of reviewed articles with respect to reviewed architectures.
Figure 7. Distribution of reviewed articles with respect to the applicative domains.
Figure 8. Distribution of reviewed articles with respect to architectures and applicative domains. Each bar represents the percentage of articles using a particular architecture in a particular domain.
Table 1. List of papers where CNNs are used, indicating the type of convolution employed.
Ref. | Year | Type | Applications | Datasets
[50] | 2010 | 2D | Video Processing; Action Recognition | NORB Synthetic; KTH Action; Hollywood2
[44] | 2013 | 3D | Action Recognition | Trecvid
[45] | 2014 | 3D | Video Classification; Action Recognition | Sports-1M; UCF-101
[46] | 2015 | 3D | Video Processing; Action Recognition; Scene Recognition; Object Recognition | Sports-1M; UCF-101; YUPENN; Maryland; Egocentric
[39] | 2018 | 2D | Crowd Flow Forecasting | TaxiBJ; BikeNYC
[40] | 2019 | 2D | El Niño Index Prediction | CMIP5; SST Reanalysis Data
[49] | 2019 | 2D | Satellite Image Missing Data Reconstruction | FY-2G LST; MSG-SEVIRI LST
[20] | 2019 | 3D | Traffic Forecasting | TaxiBJ; BikeNYC
[41] | 2019 | 2D | Precipitation Forecasting | DWD RY Product
[51] | 2019 | 2D | Action Recognition | Kinetics; AVA
[42] | 2020 | 2D | Sea Surface Temperature Modelling | CMEMS SST
[43] | 2021 | 2D | Sea Ice Forecasting | EUMETSAT OSI-SAF SIC
[52] | 2022 | 2D | Video Prediction | MovingMNIST; TaxiBJ; Human3.6M; KTH
[18] | 2023 | 3D | Cloud Removal | SEN12MSCR; SEN12MS-CR-TS
[25] | 2024 | 2D | Solar Potential Forecasting | Custom Datasets
Table 2. List of publications based on RNNs along with the type of recurrent unit they use.
Ref. | Year | Type | Applications | Datasets
[65] | 2015 | LSTM | Human Motion Prediction | Human3.6M
[66] | 2015 | LSTM | Video Representation Learning; Video Action Recognition | Sports-1M; UCF101; HMDB51
[63] | 2017 | ESN | Sea Surface Temperature Modelling | NOAA ERSST
[67] | 2017 | RNN; LSTM | Land Cover Prediction | MODIS; RSPO; Tree Plantation
[68] | 2017 | LSTM | Land Cover Prediction | MODIS; RSPO; Tree Plantation
[69] | 2018 | LSTM | Vegetation Dynamics Forecasting | MODIS NDVI
[70] | 2018 | LSTM; GRU | Land Cover Classification | Sentinel-1A/1B SAR
[64] | 2019 | ESN | Soil Moisture Forecasting | NOAA CPC GMSM
[71] | 2019 | Bayesian RNN | Sea Surface Temperature Modelling | Lorenz-96; NOAA ERSST
[72] | 2020 | LSTM; GRU; ESN | Spatio-temporal Forecasting | Lorenz-96; Kuramoto–Sivashinsky
[21] | 2020 | LSTM | Traffic Forecasting; Missing Data Imputation | Loop-Sea; PeMS
[73] | 2022 | LSTM | Vegetation Health Forecasting | MODIS NDVI
[74] | 2022 | LSTM | Crop Yield Prediction | USDA NASS
[24] | 2022 | ESN | Wind Power Forecasting | WRF Simulation Data
Table 3. Articles proposing hybrid convolutional–recurrent models.
Ref. | Year | Type | Applications | Datasets
[12] | 2015 | LSTM | Precipitation Nowcasting | MovingMNIST; Radar Echo
[79] | 2017 | GRU | Precipitation Nowcasting | MovingMNIST++; HKO-7
[80] | 2017 | LSTM | Video Prediction | MovingMNIST; KTH; Radar Echo
[82] | 2018 | LSTM | Sea Surface Temperature Modelling | NOAA OISST; NASA AVHRR
[81] | 2019 | LSTM | Spatio-temporal Forecasting | MovingMNIST; TaxiBJ; Human3.6M; Radar Echo
[75] | 2019 | GRU | Land Cover Classification | Sentinel-2 Data
[76] | 2019 | LSTM | Urban Land Cover Classification | Sentinel-2 Data
[83] | 2019 | GRU | Land Cover Classification | Sentinel-1 Data; Sentinel-2 Data
[1] | 2019 | LSTM | Sea Surface Temperature Modelling | NOAA OISST
[16] | 2019 | RNN | Remote Sensing Change Detection | Landsat ETM+
[84] | 2021 | LSTM | Urban Expansion Prediction | SPOT Satellite Data
[85] | 2021 | Various | Video Prediction; Precipitation Nowcasting | Human3.6M; Shanghai Precipitation Data; MovingMNIST
[77] | 2022 | LSTM | Crop Yield Prediction | MODIS; Landsat8; Sentinel-2
[86] | 2022 | LSTM | NDVI Forecasting | Sentinel-2; ERA5; SMAP; SRTM
[14] | 2023 | LSTM | Spatio-temporal Forecasting | MovingMNIST; KTH; Radar Echo; Traffic4Cast; BAIR
Table 4. Transformer-related publications for video processing, along with their publication year, applications and the datasets used.
Ref. | Year | Applications | Datasets
[98] | 2017 | Video Super-Resolution | CDVL
[26] | 2019 | Video Action Recognition | AVA
[99] | 2020 | Video Inpainting | YouTube-VOS; DAVIS
[94] | 2021 | Video Action Recognition | Kinetics; MiT
[91] | 2021 | Video Action Recognition | Kinetics; Something-Something; Diving-48
[27] | 2021 | Video Action Recognition; Video Classification | Kinetics; Something-Something; Epic-Kitchens; MiT
[100] | 2021 | Object Tracking | LaSOT; GOT-10K; COCO2017; TrackingNet
[96] | 2022 | Object Tracking | MOT17; MOTS20
[92] | 2022 | Video Classification | Kinetics; Something-Something; Epic-Kitchens; MiT
[93] | 2022 | Video Classification | Kinetics; Something-Something
[101] | 2022 | Video Action Recognition | Kinetics; Something-Something
[28] | 2022 | Video Classification | Kinetics; Something-Something
[102] | 2022 | Video Captioning | MSVD; YouCookII; MSRVTT; TVC; Vatex
[97] | 2022 | Video Representation Learning; Video Classification | Instagram Video Data; Kinetics
[103] | 2023 | Deep Fake Video Detection | FaceForensics++; DeepFakeDetection; Celeb-DF-v2; DeeperForensics-1.0; WildDeepfake
[104] | 2023 | Video Depth Estimation | dVPN; SCARED
[105] | 2023 | Video Object Segmentation | YouTube-VOS; DAVIS
[106] | 2023 | Video Summarization | SumMe; TVSum
[107] | 2023 | Video Prediction | BAIR; KITTI; RoboNet
[108] | 2024 | Video Super-Resolution; Video Deblurring; Video Denoising | Multiple datasets
[109] | 2024 | Video Action Recognition | AVA; UCF101; Epic-Kitchens
[110] | 2024 | Object Tracking | LaSOT; GOT-10K; UAV123; NfS; OTB2015; VOT2018; TempleColor128
[111] | 2024 | Video Action Recognition | Kinetics; Something-Something
Table 5. Transformer-related publications for spatio-temporal modelling in heterogeneous domains.
Ref. | Year | Applications | Datasets
[117] | 2020 | Crop Classification | Sentinel-2; Landsat-8
[118] | 2021 | 3D Human Motion Prediction | Human3.6M; AMASS
[119] | 2022 | Remote Sensing Change Detection | LEVIR; WHU-CD
[17] | 2022 | Remote Sensing Super Resolution | Jilin-1 Custom
[13] | 2022 | Earth Systems Forecasting | MovingMNIST; SEVIR; ICAR-ENSO
[120] | 2022 | Multivariate Spatial Time Series Forecasting | PeMS; Electricity; Traffic
[22] | 2022 | Traffic Forecasting | METR-LA; Urban-BJ; Ring-BJ
[121] | 2022 | Remote Sensing Change Detection | CDD; WHU-CD; OSCD; HRSCD
[122] | 2022 | Remote Sensing Change Detection | Farmland-CD; Barbara-CD; BayArea-CD
[123] | 2023 | Spatio-temporal PM2.5 Forecasting | Custom EPA AQS
[116] | 2023 | 3D Human Motion Prediction; Traffic Forecasting; Action Recognition | MovingMNIST; Human3.6M; TaxiBJ; KTH
[112] | 2023 | Crop Yield Prediction | USDA Crop Data; HRRR Data; Sentinel-2
[23] | 2024 | Traffic Forecasting | PeMS; Zhengzhou
[15] | 2024 | Vegetation Forecasting | GreenEarthNet
Table 6. Publications for spatio-temporal prediction using diffusion models.
Ref. | Year | Applications | Datasets
[134] | 2023 | Forecasting | SST; Navier–Stokes; Spring-Mesh
[136] | 2024 | Anomaly Detection | GOES-16; GOES-17
[139] | 2024 | Video Prediction | KITTI; Cityscapes; KTH; BAIR; MovingMNIST
[140] | 2024 | Precipitation Nowcasting | SEVIR
[141] | 2024 | Cloud Removal | Sen2MTCNew; WHUS2-CRv; SEN12MS-CR
[142] | 2025 | Wind Speed Prediction | Wind Toolkit Data V2
[19] | 2025 | Satellite Image Inpainting | Massachusetts Roads; DeepGlobe 2018
[29] | 2025 | Video Generation | RealEstate; ACID; DL3DV
[143] | 2025 | Video Generation | Sky Time-Lapse; UCF101; MHAD; WebVid