Consider a dynamic system that generates
T measurements at predetermined times. Each measurement can be correlated spatially to
size grid data, denoted as
. The purpose of spatiotemporal prediction is to generate sequence images
of the most probable future time period from sequence images
of the historical time period, as shown in Equation (
1) [
10]:
Currently, convolution-based recurrent neural networks and convolutional neural networks are the most widely used models for spatiotemporal prediction. The convolution-based recurrent neural network technique, as illustrated in the left part of
Figure 2, typically uses LSTM as the core, with the input
, the hidden unit
, and the memory unit
from the prior moment generating the output
at the next moment. The efficiency of the overall model prediction process is determined by the input and output sequence length. Combining this fixed-order structure with convolution can effectively express spatiotemporal information since it has sufficient timing on its own. ConvLSTM [
10], the first recurrent neural network based on convolutions, uses a convolution structure in place of LSTM’s fully connected structure to capture temporal and spatial information. Since the convolutional recursive structure in the model of ConvLSTM is position invariant, a trajectory GRU (TrajGRU) model was proposed [
11], which can actively learn the position change structure of repeated connections. According to PredRNN [
19], the memory state of spatiotemporal prediction learning should be remembered simultaneously in a single memory pool rather than being restricted to each LSTM unit as it currently is. As a result, a novel spatiotemporal LSTM (ST-LSTM) unit was developed, which can extract and recall both spatial and temporal information at the same time. Since then, a new phase of development for the convolution-based recurrent neural network has begun. To address the issue that gradient disappearance is likely to occur during the training of the PredRNN model, PredRNN++ [
12] proposes to combine the Gradient Highway Unit (GHU) structure and the Causal LSTM structure connection. The spatiotemporal sequence information is made up of stationary and non-stationary information, according to MIM [
20] analysis. Deep learning networks are excellent at predicting stationary data, but struggle to do so with non-stationary data due to their irregular nature. Afterward, MIM structures are stacked to mine high-order, non-stationary information. It is notable that deterministic terms, time-variable polynomials, and zero-mean random terms could all be used to deconstruct non-stationary information. E3D-LSTM [
21] combines 3D convolution [
22] and RNN to predict spatiotemporal sequences. SA-convlstm [
23] was used to discover that the short-term local spatial information can only be effectively used by existing model methods, and the self-attention memory (SAM) was proposed based on the self-attention mechanism, which nested the current structure with SAM to obtain long-term global spatial information. CrevNet [
24] uses reversible neural networks based on 3D convolution to encode and decode the input, with PredRNN serving as the prediction unit for spatiotemporal prediction. PredRNNv2 [
25] asserts that a pair of memory units in PredRNN is redundant, proposes a memory decoupling loss function, and proposes a reverse sampling strategy to force the model to perform long-term spatiotemporal learning from the context framework. Movement in the real world may be separated into two types of movement: motion trends and transitory variations in space and time, which is something that MotionRNN takes into consideration [
15]. On this basis, the modeling of motion trends and transient change using the MotionRNN model is presented. The aforementioned models, which are built on the convolution-based recurrent neural network, can only predict data once every time and cannot be trained and forecasted in parallel. As the duration of the prediction sequence increases, the total forecast impact will deviate due to the mistake in each individual prediction. However, past spatiotemporal information cannot be used to make predictions straight; rather, it can only be transformed into spatiotemporal properties that have a significant impact on future predictions by passing data between fixed-scale memory units at various times.
Spatiotemporal sequence modeling is a method based on pure convolution with a U-net structure as a benchmark. The U-net structure is presented first in the field of medical image segmentation [
16,
26,
27]. Due to its great extensibility, researchers in the field of spatiotemporal prediction have found several ways to improve the U-net structure’s capacity for extracting spatial information. Meanwhile, information about the time dimension is integrated into the channel dimension to determine the dependent relationship of the time dimension. As shown on the right of
Figure 2, the input of the method is the historical spatiotemporal sequence
, where
T represents the length of the input sequence, and the output is the predicted spatiotemporal sequence
, where
represents the length of the output sequence. SE-ResUnet [
28] develops a nowcasting model that is completely based on convolutional neural networks, outperforming the standard model’s performance at the time-convolution-based recurrent neural networks by combining the benefits of U-net, Squeeze-and-Excitation, and residual networks. Due to the use of separable convolution and the addition of an attention mechanism to the U-net, SmaAT-UNet [
29] uses just a fourth of the trainable parameters and achieves prediction performance that is on par with other test models. SimVP [
30] suggested a convolution-only neural network model that was both straightforward and effective. Spatiotemporal information could be adequately modeled using the Inception module and group convolution, and state-of-the-art (SOTA) results have been produced on several datasets. The end-to-end structure of a convolutional spatiotemporal prediction network, in contrast to the stacked structure of a convolution-based recurrent neural network, is better able to extract spatiotemporal distribution information and spatiotemporal motion information at all moments, preventing information loss throughout continuous transmission. However, at the moment, these models frequently disregard the precise modeling of the temporal dimension and instead concentrate on the feature modeling of the spatial dimension, potentially missing the spatiotemporal information at various time scales.
Inspired by the construction model of SimVP, this paper constructs an end-to-end spatiotemporal prediction network based on 3D convolution. In contrast to other predictive learning techniques, the proposed model utilizes 3D separated convolution to describe complicated spatiotemporal motion states at various time scales and spatial scales and dynamically develops the relationship between the spatial and temporal dimensions. The method proposed in this paper effectively enhances the capabilities of pure convolution-based spatiotemporal sequence modeling methods to represent temporal dimension information by absorbing their advantages.