Article

ViViT-Prob: A Radar Echo Extrapolation Model Based on Video Vision Transformer and Spatiotemporal Sparse Attention

1 School of Information Engineering, Jiangsu Open University, Nanjing 210044, China
2 School of Electronics and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
3 Research and Development Department, Beijing Wenze Zhiyuan Information Technology Co., Beijing 100000, China
4 School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 210044, China
5 School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
6 Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science and Technology, Nanjing 210044, China
7 Data Intelligence Development Center, Geely Automobile Research Institute Co., Ltd., Ningbo 315336, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(12), 1966; https://doi.org/10.3390/rs17121966
Submission received: 15 April 2025 / Revised: 25 May 2025 / Accepted: 3 June 2025 / Published: 6 June 2025

Abstract

Weather radar, as a crucial component of remote sensing data, plays a vital role in convective weather forecasting through radar echo extrapolation techniques. To address the limitations of existing deep learning methods in radar echo extrapolation, this paper proposes a radar echo extrapolation model based on the video vision transformer and spatiotemporal sparse attention (ViViT-Prob). The model takes historical sequences as input and initially maps them into a fixed-dimensional vector space through 3D convolutional patch encoding. Subsequently, a multi-head spatiotemporal fusion module with sparse attention encodes these vectors, effectively capturing spatiotemporal relationships between different regions in the sequences. The sparse constraint enables better utilization of the data's structural information, enhanced focus on critical regions, and reduced computational complexity. Finally, a parallel output decoder generates predictions for all time steps simultaneously, which are then mapped back to the prediction space through a deconvolution module to reconstruct high-resolution images. Our experimental results on the Moving MNIST dataset and a real radar echo dataset demonstrate that the proposed model achieves superior performance in spatiotemporal sequence prediction and improves prediction accuracy while maintaining structural consistency in radar echo extrapolation tasks, providing an effective solution for short-term precipitation forecasting.

1. Introduction

Precipitation, as one of the fundamental components of the water cycle, exerts a significant influence on the global climate system. Monitoring and forecasting of precipitation are critical for agricultural, industrial, and energy production [1], as well as essential tasks for maintaining societal safety. Therefore, accurate precipitation forecasting is of great importance for mitigating the impacts of adverse weather events, safeguarding life and property, and promoting sustainable development. Remote sensing data serve as a vital source for observing meteorological phenomena, among which weather radar is particularly effective for precipitation observation [2]. In precipitation forecasting, radar echo maps hold substantial application value, as their analysis enables the extraction of precipitation type, tropospheric height, wind direction and speed, thereby providing key constraints for numerical modeling and physical process studies in precipitation forecasting [3]. Furthermore, radar echo series can be utilized to monitor the movement trends and developmental stages of precipitation systems, making them indispensable tools for precipitation monitoring and forecasting. However, the spatiotemporal characteristics of precipitation evolution exhibit considerable uncertainty [4], which poses significant challenges to precise modeling and prediction.
Traditional methods for radar echo extrapolation include centroid tracking, cross-correlation, and the optical flow method [5,6,7,8]. The centroid tracking method operates by monitoring the centroid position of the radar echo core and estimating precipitation movement and development based on its displacement. However, this approach only focuses on the centroid point, making it ineffective for tracking echoes undergoing splitting or merging. In operational practice, the cross-correlation method (TREC) is predominantly employed. This technique determines future echo positions and intensities by computing cross-correlation coefficients between echoes at different ranges, thereby enabling echo extrapolation. Compared to centroid tracking, TREC exhibits improved accuracy and stability, yet it suffers from echo distortion in long-term forecasts and discontinuous divergence artifacts at echo boundaries. To address these limitations, Li et al. [9] proposed an enhanced cross-correlation approach (COTREC) that incorporates a horizontal divergence-free constraint to satisfy the continuity equation, resulting in smoother and more coherent extrapolated echoes. In computer vision applications, optical flow methods [10] are commonly adopted for radar echo prediction, which estimate pixel-wise motion by analyzing intensity variations across consecutive image sequences. Several researchers [11,12] have implemented diverse global optical flow methods to develop the SWIRLS nowcasting system for operational real-time forecasting in Hong Kong. Although these methods have made some progress, large computational costs and poor generalization ability prevent them from achieving better results in nowcasting.
In recent years, with the advancement of artificial intelligence technologies, researchers have explored the application of deep learning to precipitation forecasting based on radar echo extrapolation. Deep learning can compensate for the shortcomings of traditional algorithms to a certain extent and can adapt to complex and changeable environments [2]. This approach treats radar echo maps at each time step as matrices and transforms the problem into a spatiotemporal sequence prediction task by forecasting future sequences from historical observations. Convolutional neural networks (CNNs) [13,14] autonomously learn feature representations from input data through their inherent architecture, enabling the extraction of abstract semantic features at deeper layers. Consequently, Ayzel et al. [15] employed CNNs for rainfall prediction based on radar echo images, demonstrating that the model could capture the complex nature of short-term precipitation field evolution while matching the performance of state-of-the-art optical flow methods. Building upon this, Ayzel et al. [16] developed RainNet, a deep learning model integrating concepts from the U-Net and SegNet families of deep learning models (which were originally designed for segmentation tasks and are known for their efficient memory usage during upsampling) [17,18], utilizing quality-controlled weather radar data from the German Meteorological Service to predict continuous precipitation intensity. However, conventional convolutional architectures primarily focus on spatial features while neglecting temporal dependencies. Several researchers have explored recurrent neural networks (RNNs) [19], including long short-term memory networks (LSTMs) [20], gated recurrent units (GRUs) [21], and temporal convolutional networks (TCNs) [22], to analyze temporal correlations in meteorological data and capture precipitation process characteristics. Effective precipitation forecasting requires consideration of both temporal dependencies across historical and current states and spatial interactions among neighboring regions, making the identification and extraction of dynamic spatiotemporal precipitation characteristics a significant research challenge [23]. To address this problem, Shi et al. [24,25] proposed ConvLSTM and TrajGRU to handle spatially and temporally correlated data, primarily for radar echo prediction and video forecasting. Wang et al. [26] combined RNNs with CNNs to develop PredRNN, whose architecture centers on a novel spatiotemporal LSTM (ST-LSTM) capable of simultaneously extracting and preserving spatial and temporal representations. However, these networks rely on RNN-based structures, and as the number of time steps increases, their ability to propagate information gradually diminishes, leading to long-range dependency issues. Furthermore, although CNNs effectively capture spatial correlations, the inherent limitations of recurrent structures in RNNs may reduce the efficiency of spatial information propagation, potentially resulting in spatial information loss.
The transformer is an attention mechanism-based model [27] that has achieved remarkable success in natural language processing. Compared to traditional RNN architectures, the transformer possesses parallel processing capabilities, extracting features from each position in a sequence simultaneously through self-attention mechanisms, thereby significantly reducing computational time. Moreover, the self-attention mechanism enables each position to consider all elements in the input sequence rather than just local elements, granting the model global perception and effectively handling long-range dependencies in sequences while overcoming gradient vanishing and explosion issues inherent in RNNs. Additionally, transformer models require fewer parameters, demonstrate higher efficiency during both training and inference, and exhibit superior generalization capabilities compared to RNNs. Dosovitskiy et al. [28] proposed a novel vision recognition model (vision transformer, ViT) based on the transformer architecture, achieving state-of-the-art performance on multiple public datasets. Building on these advancements, Arnab et al. [29] introduced a transformer-based video classification model (video vision transformer, ViViT), inspired by recent breakthroughs in image classification. The concept of the transformer has further developed the application of deep learning in radar echo extrapolation, showing strong potential and advantages.
Although the application of deep learning technologies in radar echo extrapolation has made significant progress, some limitations still exist in practical applications [30]. The evolution of atmospheric systems exhibits complex nonlinear dynamics with inherent uncertainties and chaotic properties, making radar echo extrapolation substantially more challenging than conventional spatiotemporal sequence prediction tasks [31]. While recurrent neural network architectures have achieved notable progress in sequential data modeling, they remain fundamentally constrained by error accumulation mechanisms during extended-range forecasting. This limitation manifests as progressively amplified prediction biases across successive time steps, severely degrading temporal extrapolation accuracy. Moreover, these models demonstrate suboptimal performance in both extracting spatiotemporal correlations through structural information mining and dynamically focusing on critical regions, presenting two fundamental challenges in operational nowcasting systems.
To address these challenges, we propose ViViT-Prob for spatiotemporal sequence prediction and radar echo extrapolation. Built on the transformer architecture, ViViT-Prob integrates 3D patch encoding, parallel decoding, and a spatiotemporal attention module. In addition, sparse attention replaces conventional self-attention to further improve model efficiency. The model is trained and tested on the Moving MNIST dataset and a real radar echo dataset.
The paper is structured as follows. Section 2 introduces the datasets. Section 3 presents the model framework, loss function, and evaluation methods. Section 4 presents the experimental results, including results on two different datasets and ablation experiments. Section 5 discusses and analyzes the findings. Section 6 concludes the study and outlines potential avenues for future research.

2. Data

To evaluate the performance of the proposed model in radar echo extrapolation task, we utilize a publicly available spatiotemporal sequence prediction dataset and a real radar echo dataset. The details of each dataset are described as follows.

2.1. Moving MNIST Dataset

The Moving MNIST dataset is a widely used benchmark for evaluating spatiotemporal sequence prediction models [32]. In the Moving MNIST dataset, two digits are randomly selected from the MNIST dataset and placed within a 64 × 64 pixel frame. These digits move independently within the frame, with their trajectories determined by random initial velocities and directions. The movement of handwritten digits within a sequence simulates dynamic spatial and temporal patterns, which are analogous to the evolving trends of radar echoes in meteorological data. By using the Moving MNIST dataset, we can effectively evaluate the ability of models to capture and predict spatiotemporal dynamics, which is essential for radar echo extrapolation.
The dataset comprises a total of 20,000 frames, with 10,000 frames allocated to the training set and 10,000 frames to the test set. In this experiment, the model is tasked with predicting the next 10 frames of the image sequence based on the past 10 frames as input.

2.2. Radar Echo Dataset

The data used for our scientific research and project studies were obtained from the National Meteorological Science Data Center and the China Meteorological Data website (https://data.cma.cn/). Given that our project and research focus involve artificial intelligence and large language models, the provided radar echo data has undergone specific processing to ensure data security and confidentiality. Sensitive information regarding radar location, model type, and timestamps has been removed, with only pre-processed composite base reflectivity data being made available after transformation.
The data has undergone basic quality control and network mosaicking, including the removal of electromagnetic interference clutter and ground clutter, covering the entire Jiangsu Province region. The data consists of time series generated from radar echoes, reflecting the base reflectivity factor at 3 km altitude. The weather radar reflectivity factor is influenced by various physical characteristics of hydrometeors in clouds, including phase, size, and number concentration. The base reflectivity factor is colored according to intensity, with visualization results shown in Figure 1.
The data values range from 0 to 70 dBZ, with a horizontal resolution of 0.01° (approximately 1 km) and a temporal resolution of 6 min. Each image has a grid size of 480 × 560 pixels. The data span June to September of each year from 2016 to 2019. Theoretically, the dataset should contain 70,272 entries; however, due to equipment maintenance issues and data transmission losses, the actual dataset is slightly smaller than this theoretical value. For the experiments, each frame was resized to 128 × 128 pixels. The input data were read as images and converted numerically to obtain the base reflectivity matrix. The model was provided with a sequence of radar echo images at half-hour intervals over the past two hours to predict the radar echo conditions at half-hour intervals over the next two hours. All sequence samples were divided into two parts, with 80% serving as the training set and the remaining 20% as the test set.
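The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the half-hour spacing is obtained by taking every fifth 6-min frame, and the input/output lengths of four frames each (two hours at half-hour intervals) as well as the use of OpenCV resizing are assumptions.

```python
import numpy as np
import cv2

def build_sequences(frames, stride=5, in_len=4, out_len=4, size=(128, 128)):
    # frames: list of (480, 560) composite reflectivity arrays at 6-min resolution
    resized = [cv2.resize(f, size, interpolation=cv2.INTER_LINEAR) for f in frames]
    sampled = resized[::stride]                               # every 5th frame -> 30-min spacing
    samples = []
    for i in range(len(sampled) - in_len - out_len + 1):
        x = np.stack(sampled[i:i + in_len])                   # past ~2 h as model input
        y = np.stack(sampled[i + in_len:i + in_len + out_len])  # next ~2 h as target
        samples.append((x, y))
    return samples
```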
As shown in Figure 2, the distribution of maximum values in the training and test sets is illustrated. The horizontal axis represents the maximum value within a single sample, while the vertical axis represents the data density. The blue curve corresponds to the distribution of the training set, and the orange curve corresponds to the distribution of the test set. From the figure, it can be observed that the maximum values in the training set are primarily concentrated within the range [35, 55], with the majority of samples falling within [40, 50]. The distribution of maximum values in the test set is consistent with that of the training set, indicating a similar statistical pattern between the two sets.

3. Method

3.1. Network Structure

As illustrated in Figure 3, the ViViT-Prob model is built upon the transformer architecture. Initially, the model employs patch encoding based on 3D convolution to map historical sequences into a fixed-dimensional vector space. Subsequently, a sparse attention module based on multi-head spatiotemporal fusion is utilized to encode these vectors, enabling the model to learn the spatiotemporal relationships among different regions within the image sequences. Finally, a parallel output decoder generates predictions for all time steps simultaneously, which are then mapped back to the prediction space through a deconvolution module to reconstruct high-resolution images. The detailed structure of the model is described in the following sections.

3.1.1. Transformer Structure

The transformer network is composed of an encoder and a decoder. Its core components include positional encoding (PE) to capture the order of sequences, together with multi-head attention (MHA) and feed-forward networks (FFN) for computation, as shown in Figure 4.
The encoder is primarily utilized for encoding the input sequence and typically consists of MHA, add & norm modules, and FFN. The MHA enables each position in the sequence to attend to relevant information, while the add & norm operation combines residual connections with layer normalization, which are employed to adjust the distribution of data and enhance training efficacy. The FFN, implemented as a fully connected neural network, is responsible for computing the output. The decoder shares a similar architecture with the encoder and is primarily responsible for decoding the extracted features to generate the final prediction output.
MHA is a key module in the transformer, used to compute attention scores that indicate how strongly the current vector attends to the other vectors. The module is composed of multiple self-attention layers. For each vector, a query Q, key K, and value V are computed by linear transformation. Attention scores are then obtained by taking the dot product of the query with the keys of all input vectors, and the context information is obtained as the weighted sum of the values using these attention scores. In this way, the encoding information of different subspaces can be fully exploited to enhance the expressive ability of the model. The calculation formula is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
where Q, K, and V denote the query, key, and value, respectively, and $\sqrt{d_k}$ is the scaling factor that prevents the inner product of Q and K from becoming excessively large.
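To make the computation concrete, the following is a minimal PyTorch sketch of the scaled dot-product attention described above, following the standard formulation of [27]; the tensor layout with batch and head dimensions is an illustrative assumption rather than the authors' implementation.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (batch, heads, seq_len, d_k)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # dot products scaled by sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)            # attention of each query over all keys
    return weights @ V                                 # weighted sum of values (context)
```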

3.1.2. Patch Coding Based on 3D Convolution

As shown in Figure 5, this paper adapts the patch coding operation of the ViT model and extends it into three-dimensional space. Through 3D convolution, several blocks of spatiotemporal information are extracted from the input sequence to form spatiotemporal features of dimension d. For an input sequence $X \in \mathbb{R}^{H \times W \times T}$, a 3D convolution layer with kernel size $(h \times w \times t)$ and stride $(h \times w \times t)$ is used to extract $n_h = H/h$, $n_w = W/w$, and $n_t = T/t$ non-overlapping blocks along the height, width, and time dimensions, respectively, finally yielding $N = n_t \times n_h \times n_w$ spatiotemporal features of dimension d. The computational cost depends on the block size: a smaller block produces more tokens and therefore increases the cost of computation. Compared with patch coding based solely on 2D convolution, 3D convolution operates on the temporal and spatial dimensions simultaneously, which is more conducive to fusing spatiotemporal information from different frames and thus better captures the features of the input sequence.
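A minimal PyTorch sketch of this 3D-convolutional patch (tubelet) encoding is given below; the channel count, embedding dimension d, and patch size (t, h, w) are illustrative assumptions, and the non-overlapping extraction is realized by setting the stride equal to the kernel size.

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Map an input sequence (B, C, T, H, W) to N = (T/t)(H/h)(W/w) tokens of dimension d."""
    def __init__(self, in_channels=1, d=256, patch=(2, 16, 16)):
        super().__init__()
        # kernel_size == stride -> non-overlapping spatiotemporal blocks
        self.proj = nn.Conv3d(in_channels, d, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (B, C, T, H, W)
        x = self.proj(x)                        # (B, d, T/t, H/h, W/w)
        return x.flatten(2).transpose(1, 2)     # (B, N, d) token sequence


# Example: 10 input frames of 128 x 128 radar images
tokens = PatchEmbed3D()(torch.randn(2, 1, 10, 128, 128))  # -> (2, 5*8*8, 256)
```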

3.1.3. Sparse Attention Module Based on Multi-Head Spatiotemporal Fusion

As shown in Figure 6, we use different attention heads to compute temporal and spatial attention separately [29]. Initially, the spatial and temporal dimensions are combined as $N = n_t \times n_h \times n_w$. We construct the corresponding temporal keys and values $K_t, V_t \in \mathbb{R}^{n_t \times d}$ as well as spatial keys and values $K_s, V_s \in \mathbb{R}^{n_h \cdot n_w \times d}$. For half of the attention heads, spatial attention is computed as $A_s = \mathrm{Attention}(Q, K_s, V_s)$, while for the other half, temporal attention is computed as $A_t = \mathrm{Attention}(Q, K_t, V_t)$. The attention heads are then merged through a linear layer, assigning attention weights to each patch in both the spatial and temporal dimensions. The calculation is as follows:
$$A = W^{O}\,\mathrm{Concat}(A_s, A_t)$$
where $A_s$ and $A_t$ denote the attention outputs in the spatial and temporal dimensions, respectively, $\mathrm{Concat}(\cdot)$ denotes the concatenation operation, and $W^{O}$ is the linear fusion layer applied to the concatenated heads.
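The sketch below illustrates the split-head spatial/temporal attention and the fused output $W^{O}\,\mathrm{Concat}(A_s, A_t)$. It is an approximation under stated assumptions: for simplicity it uses two separate nn.MultiheadAttention modules (one per branch) rather than splitting the heads of a single module, and the time-major token ordering (time index varying slowest) is an implementation choice, not taken from the paper.

```python
import torch
import torch.nn as nn

class SpatioTemporalMHA(nn.Module):
    """Half the heads attend over space (within a frame), half over time (per location)."""
    def __init__(self, d=256, num_heads=8):
        super().__init__()
        assert num_heads % 2 == 0
        self.spatial = nn.MultiheadAttention(d, num_heads // 2, batch_first=True)
        self.temporal = nn.MultiheadAttention(d, num_heads // 2, batch_first=True)
        self.out = nn.Linear(2 * d, d)          # W^O fusing the two branches

    def forward(self, x, n_t, n_s):             # x: (B, n_t*n_s, d), time-major token order
        B, _, d = x.shape
        xs = x.reshape(B * n_t, n_s, d)          # group tokens by frame
        a_s, _ = self.spatial(xs, xs, xs)        # spatial attention A_s
        xt = x.reshape(B, n_t, n_s, d).transpose(1, 2).reshape(B * n_s, n_t, d)
        a_t, _ = self.temporal(xt, xt, xt)       # temporal attention A_t
        a_s = a_s.reshape(B, n_t * n_s, d)
        a_t = a_t.reshape(B, n_s, n_t, d).transpose(1, 2).reshape(B, n_t * n_s, d)
        return self.out(torch.cat([a_s, a_t], dim=-1))  # fused output
```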
Self-attention in traditional transformers suffers from high computational complexity, requiring $O(L_Q L_K)$ memory and incurring quadratic dot-product computation costs, which significantly limits its predictive capabilities. Furthermore, traditional self-attention scores exhibit a long-tailed distribution: only a small fraction of dot products contribute significantly to the attention, while the majority can be neglected, as illustrated in Figure 7. This suggests that the probability matrix is not necessarily sharply peaked with an obvious gap and bias and may instead be close to a uniform distribution. Therefore, inspired by the Informer model [33], this paper introduces the ProbSparse self-attention mechanism. ProbSparse self-attention is a novel self-attention computation method that improves upon traditional self-attention by adopting a probabilistically sparse approach. Compared to conventional self-attention, ProbSparse self-attention focuses only on a small number of key features, thereby enhancing the efficiency of attention computation, accelerating training, and reducing the risk of overfitting. The calculation is as follows:
$$A_{\mathrm{prob}}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{\bar{Q} K^{T}}{\sqrt{d}}\right) V$$
Under the long-tailed distribution, only $U = L_Q \ln L_K$ randomly sampled dot-product pairs are required to estimate the sparsity of the query vectors. The top-$u$ query vectors are then selected to form $\bar{Q}$, a sparse matrix of the same size as Q, where $u = c \ln L_Q$ and c is the sampling factor. Finally, the mean of the value vectors V replaces the outputs of the remaining, approximately uniformly distributed queries, thereby obtaining the self-attention feature map. This reduces the memory requirement to $O(L_Q \ln L_K)$.
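A simplified, single-head sketch of this ProbSparse selection is shown below, following the Informer formulation [33]; the sampling factor c, the use of random key sampling with replacement, and replacing "lazy" queries with the mean of V are assumptions of this illustration rather than the authors' exact implementation.

```python
import math
import torch

def probsparse_attention(Q, K, V, c=5):
    # Q, K, V: (B, L, d); simplified single-head sketch of ProbSparse self-attention
    B, L_Q, d = Q.shape
    L_K = K.size(1)
    u = min(L_Q, int(c * math.log(L_Q)))           # number of "active" queries (c ln L_Q)
    U = min(L_K, int(c * math.log(L_K)))           # sampled keys per query

    # 1) Sample keys and estimate each query's sparsity score M(q) = max - mean
    idx = torch.randint(0, L_K, (U,), device=Q.device)
    scores_sample = Q @ K[:, idx].transpose(-2, -1) / math.sqrt(d)    # (B, L_Q, U)
    sparsity = scores_sample.max(-1).values - scores_sample.mean(-1)  # (B, L_Q)

    # 2) Keep the top-u queries (Q_bar) and attend them to all keys
    top = sparsity.topk(u, dim=-1).indices                            # (B, u)
    Q_bar = torch.gather(Q, 1, top.unsqueeze(-1).expand(-1, -1, d))
    attn = torch.softmax(Q_bar @ K.transpose(-2, -1) / math.sqrt(d), dim=-1)

    # 3) Lazy queries get the mean of V; active queries get their attention output
    out = V.mean(dim=1, keepdim=True).expand(B, L_Q, d).clone()
    out.scatter_(1, top.unsqueeze(-1).expand(-1, -1, d), attn @ V)
    return out
```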

3.1.4. Parallel Decoding with Temporal-Spatial Relationships

The decoding process in transformer models utilizes the output of the encoder and the predictions of the current word to generate the next word. As shown in Figure 8, the decoding process begins with a known target language sequence and generates a probability distribution through a prediction network to produce the next word. The model iteratively repeats the decoding steps to ultimately generate the complete target language sequence. Each generated word is appended to the end of the target language sequence and serves as the context for the next word generation. The decoding process terminates when the generated word matches the end-of-sequence token, and the final target language sequence is output as the result. However, due to the step-by-step decoding approach, this method suffers from prolonged dynamic decoding time and the issue of error accumulation.
Inspired by the concept of generative inference, this paper avoids using specific tokens as markers. Instead, the last frame of the input sequence is sampled as a long sequence and fed into the decoder. This allows the network to learn the spatiotemporal relationships between past and current time steps and map these relationships to future time steps, thereby decoding the predicted spatiotemporal sequence. For example, as illustrated in Figure 9, when predicting the next four frames using the past four frames, the known four-frame sequence is input into the encoder, and the fourth frame is stacked and fed into the decoder. The model then outputs the predictions for frames 5 to 8 in a single forward pass. Unlike traditional step-by-step decoding, this approach does not require iterative operations but outputs predictions for all time steps simultaneously. This method significantly reduces computation time and mitigates the issue of error accumulation.
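This single-pass decoding can be sketched as follows; `model.encode` and `model.decode` are hypothetical interfaces standing in for the encoder and the parallel decoder described above.

```python
import torch

def parallel_decode(model, x_in, out_steps):
    # x_in: (B, C, T_in, H, W) observed radar sequence
    memory = model.encode(x_in)                  # encoder features of the past frames
    last = x_in[:, :, -1:]                       # last observed frame, shape (B, C, 1, H, W)
    dec_in = last.repeat(1, 1, out_steps, 1, 1)  # stack it T_out times as decoder input
    return model.decode(dec_in, memory)          # all future frames in one forward pass
```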

3.2. Loss Function

To train the proposed model and optimize its parameters, this paper employs focal frequency loss (FFL) and mean square error (MSE) to define the loss function. FFL is used to minimize the discrepancy between the predicted and ground truth images in the frequency domain. Its core idea is to reduce the weight of easily synthesized frequency components, allowing the model to adaptively focus on the more challenging frequency components. The calculation is as follows:
$$\mathrm{FFL} = \frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1} w(u,v)\,\bigl| F_r(u,v) - F_f(u,v) \bigr|^{2}$$
where $F_r(u,v)$ and $F_f(u,v)$ denote the frequency-domain representations of the ground truth and predicted images, respectively; $w(u,v)$ denotes the spectral weight matrix; M and N are the image height and width in pixels; and $(u,v)$ are the coordinates in the frequency domain.
The MSE is a commonly used metric to measure the average squared difference between the predicted values and the ground truth values.
The overall loss function is defined as follows, where α represents the weight of FFL in the total loss:
$$\mathrm{Loss} = \mathrm{MSE} + \alpha \cdot \mathrm{FFL}$$
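A compact sketch of this combined objective is given below. The spectral weighting follows the focal frequency loss idea of down-weighting easily synthesized frequencies, but the exact weight normalization and the value of α here are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def focal_frequency_loss(pred, target):
    # pred, target: (B, 1, H, W); compare images in the 2-D frequency domain
    Fp, Ft = torch.fft.fft2(pred), torch.fft.fft2(target)
    diff = (Fp - Ft).abs() ** 2                           # |F_f(u,v) - F_r(u,v)|^2
    w = diff.detach().sqrt()                              # larger-error frequencies weigh more
    w = w / (w.amax(dim=(-2, -1), keepdim=True) + 1e-8)   # normalize weights to [0, 1]
    return (w * diff).mean()

def total_loss(pred, target, alpha=0.1):                  # alpha: FFL weight (assumed value)
    return F.mse_loss(pred, target) + alpha * focal_frequency_loss(pred, target)
```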

3.3. Evaluation Metrics

To effectively evaluate the performance of the proposed model across various datasets, this paper employs the following metrics: mean absolute error (MAE) and mean squared error (MSE) to quantify the error between predicted and observed values; structural similarity index measure (SSIM) to assess image quality [34]; and critical success index (CSI), probability of detection (POD), and false alarm ratio (FAR) to evaluate forecasting accuracy [24].
MAE and MSE measure the error between predicted and true values; smaller values indicate better prediction performance. SSIM evaluates image quality by considering contrast, luminance, structural information, and human visual sensitivity. Its value ranges from −1 to 1, with values closer to 1 indicating higher similarity between images. CSI assesses the accuracy of predicting precipitation above a certain threshold; higher CSI values indicate better prediction performance. POD and FAR range from 0 to 1; a higher POD and a lower FAR indicate higher forecasting accuracy. The specific calculation methods are as follows:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|$$
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^{2}$$
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^{2} + \mu_y^{2} + C_1)(\sigma_x^{2} + \sigma_y^{2} + C_2)}$$
$$\mathrm{CSI} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP}}$$
$$\mathrm{POD} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
$$\mathrm{FAR} = \frac{\mathrm{FP}}{\mathrm{TP} + \mathrm{FP}}$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and N is the total number of samples. $\mu_x$ and $\mu_y$ are the means of x and y, $\sigma_x^{2}$ and $\sigma_y^{2}$ are their variances, $\sigma_{xy}$ is their covariance, and $C_1$ and $C_2$ are constants to stabilize the division. TP (true positives) are correctly predicted events, FN (false negatives) are missed events, and FP (false positives) are incorrectly predicted events.
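For the categorical scores, a minimal sketch of how CSI, POD, and FAR can be computed from a threshold on reflectivity (or rain rate) is shown below; the threshold values per precipitation level are assumptions, not the paper's configuration.

```python
import numpy as np

def categorical_scores(pred, obs, threshold):
    # pred, obs: arrays of predicted/observed reflectivity (dBZ) or rain rate
    p, o = pred >= threshold, obs >= threshold
    TP = np.sum(p & o)     # hits
    FP = np.sum(p & ~o)    # false alarms
    FN = np.sum(~p & o)    # misses
    csi = TP / (TP + FN + FP + 1e-8)
    pod = TP / (TP + FN + 1e-8)
    far = FP / (TP + FP + 1e-8)
    return csi, pod, far
```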

4. Experiments and Analysis

To train the proposed model, this paper employs Xavier initialization [35] for model parameters to accelerate convergence. During training, the Adam optimizer is utilized to automatically adjust the learning rate for each parameter, which effectively mitigates issues such as gradient explosion and vanishing gradients, thereby facilitating faster convergence to the optimal solution. All experiments are conducted using the PyTorch framework (version 1.8.0) and CUDA (version 10.2), with training and testing performed on an NVIDIA GeForce RTX 2080 GPU.
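A minimal sketch of this training setup is shown below; the learning rate is a placeholder value, since it is not specified here, and the helper applies to any nn.Module rather than reproducing the authors' training script.

```python
import torch
import torch.nn as nn

def xavier_init(m):
    # Xavier initialization for linear and (de)convolutional layers
    if isinstance(m, (nn.Linear, nn.Conv3d, nn.ConvTranspose3d)):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def setup_training(model, lr=1e-3):      # lr is an assumed value, not from the paper
    model.apply(xavier_init)             # apply Xavier initialization to every submodule
    return torch.optim.Adam(model.parameters(), lr=lr)
```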
To analyze the performance of the proposed model, this paper compares it with several models commonly used in spatiotemporal sequence prediction and weather forecasting, including: RNN-based spatiotemporal sequence prediction models (ConvLSTM [24], PredRNN [26], CausalLSTM [36], E3D-LSTM [37], MIM [38], SA-ConvLSTM [39] and STAE [40]), transformer-based spatiotemporal sequence prediction model (VPTR [41]), and radar echo extrapolation models (OpticalFlow [8] and RainNet [16]). A description of the specific model is shown in Table 1.

4.1. Moving MNIST Experiments

First, we compare the proposed model with state-of-the-art spatiotemporal sequence prediction models on the publicly available Moving MNIST dataset to evaluate their ability to capture and predict spatiotemporal dynamics.
Table 2 presents the experimental results of different models on the Moving MNIST dataset, with the optimal and suboptimal results highlighted in bold and underlined, respectively. The results of the comparison models are obtained from published papers, and * indicates that the specific metric was not provided in the original paper.
On the Moving MNIST dataset, the proposed model achieves the best performance in terms of MSE, outperforming the second-best model, STAE, by 16.2%. Additionally, the proposed model achieves the second-best results in SSIM and MAE, with gaps of only 0.6% and 3.1%, respectively, compared to the best-performing models.
These experiments demonstrate that the proposed model achieves smaller prediction errors and higher image similarity on the public spatiotemporal sequence prediction dataset, leading on most evaluation metrics and showing competitive performance compared to state-of-the-art RNN-based models.
To better illustrate the comparative performance of different models on the Moving MNIST dataset, we selected two representative test samples exhibiting digit overlap phenomena for visualization. The evaluated models include two classical RNN-based approaches (ConvLSTM and PredRNN) and two transformer-based architectures (VPTR-NAR and our proposed ViViT-Prob).
Figure 10 presents the comparative prediction results of different models on the Moving MNIST dataset, with both input and output sequences displayed at two-step intervals. In Figure 10a, where digit overlap occurs at the transition between the input (final frames) and output (initial frames), all four models accurately predict the overlapping digits at step 11. However, from step 13 onward, ConvLSTM, PredRNN, and VPTR-NAR fail to properly separate the irregular light-colored digit “6” in the lower-right region. The outputs of ConvLSTM become blurred, with digit intensities converging, while the outputs of PredRNN maintain better sharpness and intensity contrast but still show slight structural distortions. VPTR-NAR exhibits similar limitations, with more severe feature loss in irregular digits. The proposed model demonstrates superior performance in preserving the edge details of irregular digits, maintaining the shape fidelity of regular digits, and correctly representing intensity differences between overlapping digits.
Figure 10b shows that all models achieve satisfactory predictions in early non-overlapping steps (steps 11–13). However, when digit overlap emerges at step 15, only PredRNN and the proposed model successfully capture the interaction between digits “1” and “8”. By step 17, the outputs of PredRNN show deformation in digit “8”, while the proposed model preserves the “8” shape but with slight blurring in digit “1”. At the final step 19, ViViT-Prob reconstructs both digits with remarkable clarity, outperforming VPTR-NAR and PredRNN which exhibit severe distortions in digit “8”, while the outputs of ConvLSTM become unrecognizable.
The experimental results on the Moving MNIST dataset provide an important basis for assessing the applicability of the proposed model to radar echo extrapolation. The model accurately captures nonlinear interaction behaviors such as the crossing and merging of multiple digits during motion, which directly corresponds to the overlapping and merging of echoes from different precipitation systems (for example, squall line merging). Meanwhile, the model's sustained prediction of digit trajectories verifies its ability to handle long-term dependencies in spatiotemporal sequences, which is consistent with the requirement of maintaining the spatiotemporal continuity of precipitation systems in radar echo extrapolation. Together, these results show that the capabilities obtained on the Moving MNIST dataset, such as multi-scale feature preservation and nonlinear motion capture, can be transferred effectively to the radar echo extrapolation task, which shares similar physical characteristics, and they provide strong evidence for the reliability of the model in practical meteorological applications.
Figure 11 illustrates the visualization obtained using class activation mapping (CAM) [42] during the decoding process. CAM represents a weighted linear sum of visual patterns at different spatial locations, and by simply upsampling the class activation map to the size of the given image, it highlights the most relevant regions of the image. As shown in Figure 11, the model assigns higher attention to the areas where digits appear, including the edges of the digits and the overlapping regions between two digits. This focused attention results in clear edges, small overall errors, and high image similarity in the predicted images.
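For reference, a generic sketch of how such an activation map can be computed and upsampled to the image size is given below, following the CAM idea of [42]; how the per-channel weights are obtained from the model is an assumption of this illustration.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, weights, out_size):
    # features: (C, h, w) feature maps; weights: (C,) per-channel importance weights
    cam = torch.einsum("c,chw->hw", weights, features)         # weighted sum over channels
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
    return F.interpolate(cam[None, None], size=out_size,
                         mode="bilinear", align_corners=False)[0, 0]
```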

4.2. Radar Echo Dataset Experiments

Next, we conduct experiments on a real radar echo dataset to verify the predictive ability for echo development.
Table 3 presents a comparative analysis of the proposed model against OpticalFlow, RainNet, and two spatiotemporal sequence models used in operational meteorological forecasting (ConvLSTM and PredRNN) on the radar echo dataset. All comparison models were tested using their original publicly released code. The results demonstrate that the proposed model achieves lower prediction errors and superior image similarity, outperforming the second-best model by 21.4%, 6.9%, and 3.8% across the key metrics. This confirms the model’s robust predictive capability for radar echo extrapolation tasks.
For quantitative precipitation forecasts (QPFs), we converted radar reflectivity to precipitation intensity following established methodologies [8,16,25]. While the maximum reflectivity values in our dataset predominantly ranged between 40–50 dBZ, the spatial distribution and average reflectivity values resulted in most grid points corresponding to light-to-moderate rainfall intensity, with limited samples of heavy rainfall. Consequently, our evaluation focused on three precipitation levels: no-rain, light rain, and moderate rain.
The reflectivity–precipitation relationship varies significantly depending on precipitation type, seasonality, and regional characteristics [43]. As this conversion is not our primary research focus, we employed generalized coefficients which may introduce some quantification errors.
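As an illustration, a commonly used Z–R conversion with generalized (Marshall–Palmer-type) coefficients is sketched below; the coefficients a = 200 and b = 1.6 are common defaults and are not necessarily the values used in this study.

```python
import numpy as np

def reflectivity_to_rainrate(dbz, a=200.0, b=1.6):
    # Z = a * R^b; invert to obtain rain rate R (mm/h) from reflectivity in dBZ
    z = 10.0 ** (np.asarray(dbz) / 10.0)   # dBZ -> linear reflectivity factor Z
    return (z / a) ** (1.0 / b)
```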
Figure 12 presents a comparative analysis of the inversion results across different precipitation intensity levels, where higher CSI and POD scores coupled with lower FAR values indicate better performance. Most models perform well for no-rain conditions, but their effectiveness diminishes significantly with increasing precipitation intensity, particularly for ConvLSTM and OpticalFlow. RainNet and PredRNN achieve suboptimal CSI and POD scores across multiple intensity levels; however, RainNet exhibits a higher FAR, indicating more frequent false alarms. In contrast, while PredRNN shows moderate performance for light rain and below, it achieves competitive FAR and CSI scores for moderate rain, suggesting that its predictions for mid-level precipitation exhibit smaller echo value errors and more accurate spatial coverage. Comparatively, the proposed model consistently outperforms other models across multiple metrics in all three precipitation levels. However, due to the predominance of no-rain grid points in the dataset, some underestimation of precipitation intensity persists. These experimental results demonstrate that the proposed model generates reasonable echo predictions, providing reliable auxiliary reference for precipitation forecasting and effectively reflecting expected rainfall intensity within two hours.
Figure 13, Figure 14 and Figure 15 present the prediction results of the models on the radar echo dataset, where both the input and output sequences are displayed at 1-step intervals. The interval between each step is 30 min.
Figure 13a illustrates the gradual movement of a large area of low-intensity radar echoes toward the upper right. We use red circles and orange boxes to mark the range of echo above 25 dBZ and overall echo. Initially, the ConvLSTM successfully predicted the presence of echoes exceeding 30 dBZ, albeit with positional deviations, and failed to capture the movement direction of the echoes over time. The ConvLSTM displayed echoes above 25 dBZ within the red circles, whereas the target mostly showed 20–25 dBZ echoes with only a few areas exceeding 25 dBZ. In contrast, PredRNN provided more accurate predictions for the movement trends of echoes above 25 dBZ, though the contours of echoes exceeding 5 dBZ in the final frame appeared blurred and spatially overestimated. RainNet produced smoother overall echo boundaries but exhibited a tendency to overestimate echo intensities. Although RainNet showed yellow echoes within the red circles, the corresponding positions in the target exhibit almost no yellow echoes, except for sporadic yellow echoes in the last frame where the echo coverage is significantly smaller than predictions. Additionally, RainNet exhibited a noticeable void in the lower-left corner of the orange bounding box. In the last column of images, the left side of the orange box shows echo intensities above 25 dBZ, while the corresponding target area registers below 5 dBZ. ConvLSTM predicted larger areas but with more diffuse and coarser edges. The proposed model initially correctly predicted the positions of three 25 dBZ echo regions and broadly simulated their movement direction over time. In the final frame, the contours of echoes above 10 dBZ closely matched the actual observations, though some scattered echoes were missed, and low-intensity echo coverage was overestimated.
In summary, although both RainNet and ConvLSTM models produce forecasts with stronger echo intensities, these do not correspond to high-intensity echoes in the target data. Compared to the target, both models overestimate echo intensity and coverage, resulting in increased false positives (FP). Since FP appears in the denominator when calculating the critical success index (CSI), this explains the relatively low CSI scores for both models. Furthermore, the FP rate also affects the false alarm ratio (FAR), as evident in Figure 12 where both models show elevated FAR values, particularly for the moderate category.
Figure 13b shows the movement of a moderate echo exceeding 35 dBZ. Both ConvLSTM and PredRNN failed to accurately predict the echo intensity, capturing only the echo contours. RainNet provided more accurate intensity forecasts but exhibited positional deviations for moderate-intensity echoes. In comparison, the proposed model demonstrated relatively accurate predictions of echo intensity and position in the early stages, though it underestimated the intensity by one level in the later stages, with minor low-intensity clutter present in the predictions. Overall, the proposed model exhibits robust feature extraction and analysis capabilities for moderate-to-strong echoes, partially capturing echo movement trends, which could provide support for operational meteorological forecasting.
Figure 14a shows the motion of a large-scale moderate intensity echo. Overall, the prediction strength of ConvLSTM is relatively low, with all prediction results being less than 30 dBZ. Both PredRNN and RainNet successfully predicted some echoes reaching 30 dBZ, but their coverage area gradually diminished over time, resulting in a significant underestimation in the end. Only the proposed model predicted the part with intensity higher than 35 dBZ, and the underestimation was lighter compared to other models. For the part in the red box with an intensity greater than 35 dBZ, compared to other models, the proposed model can better focus on this part and predict the echo intensity closer to the ground truth.
Figure 14b shows the process of a medium intensity echo moving from left to middle, with the echo intensity center marked by a red box. The prediction results of ConvLSTM exhibit an overall undersized coverage area. While RainNet produces predictions with a spatial extent closer to observations, the intensities are underestimated. PredRNN successfully captures some orange echoes, but the proposed model demonstrates superior performance in predicting echoes exceeding 35 dBZ. Specifically, for the red-boxed region, the proposed model achieves more accurate predictions in terms of both intensity and spatial coverage, although some underestimation persists in the lower-left corner.
Figure 15a shows the process of both upper and lower echoes moving simultaneously. For the echo above, the proposed model predicted the intensity and range of the echo higher than 35 dBZ well in the first two frames, while other models significantly underestimated it. All models showed underestimation in the end. For the echo below, only the proposed model predicted the red echo, whereas other models failed to capture echoes above 30 dBZ.
Figure 15b shows the process of the echo moving upwards. The proposed model better demonstrates the intensity and range of echoes and is more accurate in predicting yellow and orange echoes. However, other models only captured the general spatial patterns of the echoes while underestimating echo intensities to varying degrees.

4.3. Ablation Experiments

To verify the impact of each module on the predictive performance of the proposed model, we conducted ablation experiments on the radar echo dataset. The effects of different encoder–decoder configurations and attention mechanisms on the prediction results were tested, as shown in Table 4. In the table, * denotes direct output without a decoder module, and 3DCNN * denotes a randomly initialized matrix used as the decoder input. From Table 4, it can be seen that the model without a decoder yields the poorest predictions. The model with the 3D CNN encoding–decoding strategy performs better than the simple patch strategy. Furthermore, the input initialization of the decoder proved crucial for final prediction quality: the model using the last observed frame as input performs better than random initialization. We also evaluated three multi-head attention variants: self-attention, cross-attention, and sparse attention. The results show that the model using sparse attention achieves optimal performance. These systematic experiments provide conclusive evidence that each architectural component in the proposed model contributes positively to prediction performance, with the 3D CNN processing, proper decoder initialization, and sparse attention mechanism collectively yielding significant improvements.

5. Discussion

Radar echo extrapolation is regarded as a typical spatiotemporal sequence prediction task in the field of computer vision, and the Moving MNIST dataset serves as one of the most authoritative public benchmarks for this task. The validation of its dynamic characteristics provides crucial reference value for subsequent real-world data modeling. Although Moving MNIST is an idealized dataset, the spatiotemporal dynamics it simulates (such as object displacement, deformation, and interaction) share fundamental similarities at the modeling level with the evolution of precipitation systems observed in weather radar echoes (including storm movement, intensity variation, and multi-system interactions). Through these experiments, we specifically validated our model's capability to capture complex spatiotemporal dependencies, which constitutes the critical aspect of radar echo extrapolation.
Experimental results indicate that compared to transformer-based models, RNN-based prediction models, benefiting from their earlier development and longer research history, exhibit a larger number of published variants with faster iterative improvements.
We found that our model achieved a superior MSE index on the Moving MNIST dataset compared with RNN-based spatiotemporal sequence prediction models, with a 16.2% improvement over the second-best model. It obtained the second-best values in SSIM and MAE, with differences of 0.6% and 3.1% from the optimal model, respectively. These results demonstrate that the proposed model better captures long-term spatiotemporal dynamics, which is essential for radar echo extrapolation. In real-world radar echo data experiments, our model enhances prediction performance through a spatiotemporal attention module and 3D convolutional patch encoding. These innovative designs enable the model to better handle small-scale variations and motion uncertainties in radar echoes. Experimental results on real radar data show that our approach achieves improvements of 21.4% in MSE and 3.8% in SSIM compared to the PredRNN baseline. We have further validated the contributions of each component through comprehensive ablation studies. Therefore, in the visualization shown in Figure 13, Figure 14 and Figure 15, our model successfully identified the locations of echoes at the beginning and roughly simulated the movement direction of the echoes over time. The experimental results indicate that our model has a good ability to extract and analyze the characteristics of moderate and strong echoes, which reflect the movement trend of the echoes and provide auxiliary effects for relevant meteorological forecasting services.
The proposed model employs an end-to-end approach for radar echo extrapolation and can assist in short-term precipitation prediction to a certain extent, providing early warnings for urban flood control, power supply, transportation, and other applications, demonstrating strong potential for smart city development. In future research, we plan to incorporate additional meteorological factors during extrapolation to explore their relationships with precipitation, thereby further improving its performance in precipitation forecasting.

6. Conclusions

In this paper, we present a video vision transformer-based model with spatiotemporal sparse attention (ViViT-Prob) for spatiotemporal sequence prediction and radar echo extrapolation, particularly for precipitation nowcasting. The model takes historical sequences as input and initially maps them into a fixed-dimensional vector space through 3D convolutional patch encoding. Subsequently, a multi-head spatiotemporal fusion module with sparse attention encodes these vectors, effectively capturing spatiotemporal relationships between different regions in the image sequences. The sparse constraint enhances the utilization of structural information in the data, prioritizes focus on critical regions, and reduces computational complexity. Finally, a parallel output decoder generates predictions for all time steps, which are then mapped back to the prediction space through a deconvolution module to reconstruct high-resolution images. Experimental results demonstrate that the proposed model achieves superior performance in spatiotemporal sequence prediction and radar echo extrapolation, suggesting its potential for assisting operational meteorological applications.
Our future research will focus on integrating data from more sources by processing high-altitude data and ground data simultaneously. By integrating the spatiotemporal characteristics of both data types, the model can uncover deeper latent spatiotemporal features and improve performance in radar echo extrapolation and short-term precipitation forecasting.

Author Contributions

Conceptualization, Y.Q. and B.L.; data curation, W.X.; formal analysis, Y.Q.; funding acquisition, Z.L.; methodology, Y.Q.; project administration, Z.L. and L.S.; resources, Z.L.; software, Y.Q. and B.L.; supervision, Z.L. and L.S.; validation, Y.Q. and W.X.; visualization, Y.C.; writing—original draft, Y.Q.; writing—review and editing, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Zhejiang Provincial Natural Science Foundation Project (No. LZJMD25D050002) and the Key Program of the National Natural Science Foundation of China (U20B2061).

Data Availability Statement

The radar echo dataset acquired by National Meteorological Information Center is openly available in the official website of National Meteorological Information Center at https://data.cma.cn/. Restrictions apply to the availability of the Radar dataset.

Conflicts of Interest

Author Wenrui Xiong was employed by Beijing Wenze Zhiyuan Information Technology Co. She participated in data curation and validation in the study. The company's role was to provide information services. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Hartman, B.; Cutler, H.; Shields, M.; Turner, D. The economic effects of improved precipitation forecasts in the United States due to better commuting decisions. Growth Change 2021, 52, 2149–2171. [Google Scholar] [CrossRef]
  2. Niu, X.; Zhang, L.; Wang, C.; Shen, K.; Tian, W.; Liao, B. A Generative Adversarial and Spatiotemporal Differential Fusion Method in Radar Echo Extrapolation. Remote Sens. 2023, 15, 5329. [Google Scholar] [CrossRef]
  3. Huang, X.; Chen, G.; Zhao, K.; Li, W.; Huang, J.; Zhao, L.; Fang, J. Improved Nowcasting of Short-Time Heavy Precipitation and Thunderstorm Gale Based on Vertical Profile Characteristics of Dual-Polarization Radar. Meteorol. Mon. 2024, 50, 1519–1530. [Google Scholar] [CrossRef]
  4. Imhoff, R.O.; Brauer, C.; Overeem, A.; Weerts, A.H.; Uijlenhoet, R. Spatial and temporal evaluation of radar rainfall nowcasting techniques on 1,533 events. Water Resour. Res. 2020, 56, e2019WR026723. [Google Scholar] [CrossRef]
  5. Wu, J.; Chen, M.; Qin, R.; Gao, F.; Song, L. The variational echo tracking method and its application in convective storm nowcasting. In Proceedings of the EGU General Assembly Conference, Vienna, Austria, 3–8 May 2020; p. 1293. [Google Scholar]
  6. Zhang, W.; Fang, B. The Study of Severe Convection Weather Forecast by the Method of Tracking Echo Centroids. Meteorol. Mon. 1995, 21, 13–18. [Google Scholar] [CrossRef]
  7. Cao, W.; Chen, M.; Gao, F.; Cheng, C.; Qin, R.; Wu, J.; Zhong, J. A vector blending study based on object-based tracking vectors and cross correlation tracking vectors. Acta Meteorol. Sin. 2019, 77, 1015–1027. [Google Scholar] [CrossRef]
  8. Ayzel, G.; Heistermann, M.; Winterrath, T. Optical flow models as an open benchmark for radar-based precipitation nowcasting (rainymotion v0.1). Geosci. Model Dev. 2019, 12, 1387–1402. [Google Scholar] [CrossRef]
  9. Li, L.; Schmid, W.; Joss, J. Nowcasting of motion and growth of precipitation with radar over a complex orography. J. Appl. Meteorol. Climatol. 1995, 34, 1286–1300. [Google Scholar] [CrossRef]
  10. Horn, B.K.; Schunck, B.G. Determining optical flow. Artif. Intell. 1981, 17, 185–203. [Google Scholar] [CrossRef]
  11. Liu, Y.; Xi, D.G.; Li, Z.L.; Hong, Y. A new methodology for pixel-quantitative precipitation nowcasting using a pyramid Lucas Kanade optical flow approach. J. Hydrol. 2015, 529, 354–364. [Google Scholar] [CrossRef]
  12. Cheung, P.; Yeung, H. Application of optical-flow technique to significant convection nowcast for terminal areas in Hong Kong. In Proceedings of the 3rd WMO International Symposium on Nowcasting and Very Short-Range Forecasting (WSN12), Rio de Janeiro, Brazil, 6–10 August 2012; pp. 6–10. [Google Scholar]
  13. Ajit, A.; Acharya, K.; Samanta, A. A review of convolutional neural networks. In Proceedings of the 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE), Vellore, India, 24–25 February 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–5. [Google Scholar]
  14. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef] [PubMed]
  15. Ayzel, G.; Heistermann, M.; Sorokin, A.; Nikitin, O.; Lukyanova, O. All convolutional neural networks for radar-based precipitation nowcasting. Procedia Comput. Sci. 2019, 150, 186–192. [Google Scholar] [CrossRef]
  16. Ayzel, G.; Scheffer, T.; Heistermann, M. RainNet v1.0: A convolutional neural network for radar-based precipitation nowcasting. Geosci. Model Dev. 2020, 13, 2631–2644. [Google Scholar] [CrossRef]
  17. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  18. Weng, L.; Xu, Y.; Xia, M.; Zhang, Y.; Liu, J.; Xu, Y. Water Areas Segmentation from Remote Sensing Images Using a Separable Residual SegNet Network. ISPRS Int. J. Geo-Inf. 2020, 9, 256. [Google Scholar] [CrossRef]
  19. Salehinejad, H.; Sankar, S.; Barfett, J.; Colak, E.; Valaee, S. Recent advances in recurrent neural networks. arXiv 2017, arXiv:1801.01078. [Google Scholar]
  20. Lindemann, B.; Müller, T.; Vietz, H.; Jazdi, N.; Weyrich, M. A survey on long short-term memory networks for time series prediction. Procedia CIRP 2021, 99, 650–655. [Google Scholar] [CrossRef]
  21. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  22. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 156–165. [Google Scholar]
  23. Atluri, G.; Karpatne, A.; Kumar, V. Spatio-temporal data mining: A survey of problems and methods. ACM Comput. Surv. (CSUR) 2018, 51, 1–41. [Google Scholar] [CrossRef]
  24. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 1, pp. 802–810. [Google Scholar]
  25. Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Deep learning for precipitation nowcasting: A benchmark and a new model. Advances in Neural Information Processing Systems. arXiv 2017, arXiv:1706.03458. [Google Scholar]
  26. Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 879–888. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  29. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846. [Google Scholar]
  30. He, G.; Wu, W.; Han, J.; Luo, J.; Lei, L. EOST-LSTM: Long Short-Term Memory Model Combined with Attention Module and Full-Dimensional Dynamic Convolution Module. Remote Sens. 2025, 17, 1103. [Google Scholar] [CrossRef]
  31. Ji, C.; Xu, Y. trajPredRNN+: A new approach for precipitation nowcasting with weather radar echo images based on deep learning. Heliyon 2024, 10, e36134. [Google Scholar] [CrossRef]
  32. Srivastava, N.; Mansimov, E.; Salakhudinov, R. Unsupervised learning of video representations using lstms. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 843–852. [Google Scholar]
  33. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
  34. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  35. Sun, W.; Su, F.; Wang, L. Improving deep neural networks with multi-layer maxout networks and a novel initialization method. Neurocomputing 2018, 278, 34–40. [Google Scholar] [CrossRef]
  36. Wang, Y.; Gao, Z.; Long, M.; Wang, J.; Philip, S.Y. Predrnn++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5123–5132. [Google Scholar]
  37. Wang, Y.; Jiang, L.; Yang, M.H.; Li, L.J.; Long, M.; Li, F. Eidetic 3D LSTM: A model for video prediction and beyond. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  38. Wang, Y.; Zhang, J.; Zhu, H.; Long, M.; Wang, J.; Yu, P.S. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9154–9162. [Google Scholar]
  39. Lin, Z.; Li, M.; Zheng, Z.; Cheng, Y.; Yuan, C. Self-attention convlstm for spatiotemporal prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11531–11538. [Google Scholar]
  40. Chang, Z.; Zhang, X.; Wang, S.; Ma, S.; Ye, Y.; Gao, W. Stae: A spatiotemporal auto-encoder for high-resolution video prediction. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  41. Ye, X.; Bilodeau, G.A. Video prediction by efficient transformers. Image Vis. Comput. 2023, 130, 104612. [Google Scholar] [CrossRef]
  42. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
  43. Wang, G.; Liu, L.; Ding, Y. Improvement of radar quantitative precipitation estimation based on real-time adjustments to ZR relationships and inverse distance weighting correction schemes. Adv. Atmos. Sci. 2012, 29, 575–584. [Google Scholar] [CrossRef]
Figure 1. Radar echo map colored by intensity.
Figure 2. Distribution of the maximum echo value in a single sample of the training and test sets.
Figure 3. The overall structure of ViViT-Prob.
Figure 4. The structure of the transformer [27].
Figure 5. (a) Two-dimensional patch coding in ViT [28]; (b) two-dimensional patch coding applied directly to the video sequence; (c) patch coding based on 3D convolution.
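As a rough companion to panel (c), the sketch below shows how a 3D convolution whose kernel and stride equal the tubelet size maps a frame sequence to a flat sequence of tokens. The tubelet size, embedding dimension, and tensor shapes are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of tubelet (3D) patch embedding as in Figure 5c.
# Tubelet size, embedding dim, and input shape are illustrative assumptions.
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    def __init__(self, in_channels=1, embed_dim=256, tubelet=(2, 16, 16)):
        super().__init__()
        # A 3D convolution with kernel and stride equal to the tubelet size
        # maps each non-overlapping (t, h, w) block to one token.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet, stride=tubelet)

    def forward(self, x):                    # x: (B, C, T, H, W)
        x = self.proj(x)                     # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, N_tokens, D)

tokens = TubeletEmbedding()(torch.randn(2, 1, 10, 128, 128))
print(tokens.shape)                          # torch.Size([2, 320, 256])
```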
Figure 6. Multi-head attention based on spatiotemporal fusion.
Figure 7. (a) Two distribution plots of self-attention; (b) visualization of attention in the actual model; (c) density comparison of attention scores in the actual model.
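For readers unfamiliar with the sparse attention visualized here, the sketch below follows the ProbSparse idea from Informer [33]: only queries whose sampled score distribution deviates strongly from uniform receive full attention, while the remaining queries fall back to the mean of the values. Tensor shapes, the sampling factor, and the top-u rule are simplified assumptions, not the paper's exact formulation.

```python
# Sketch of ProbSparse-style query selection (cf. Informer [33]); shapes,
# sampling factor, and the top-u rule are simplified assumptions.
import math
import torch

def probsparse_attention(Q, K, V, factor=5):
    B, H, L, d = Q.shape
    u = min(L, int(factor * math.ceil(math.log(L))))
    # Estimate each query's "activeness" from a random subset of keys.
    idx = torch.randint(0, L, (u,))
    scores = Q @ K[:, :, idx].transpose(-2, -1) / math.sqrt(d)   # (B, H, L, u)
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)   # (B, H, L)
    top_q = sparsity.topk(u, dim=-1).indices                     # (B, H, u)
    # Lazy queries simply return the mean of the values.
    out = V.mean(dim=2, keepdim=True).expand(B, H, L, d).clone()
    # Active queries get the full attention computation.
    q_idx = top_q.unsqueeze(-1).expand(-1, -1, -1, d)            # (B, H, u, d)
    Q_sel = torch.gather(Q, 2, q_idx)
    attn = torch.softmax(Q_sel @ K.transpose(-2, -1) / math.sqrt(d), dim=-1)
    out.scatter_(2, q_idx, attn @ V)
    return out

out = probsparse_attention(torch.randn(2, 8, 64, 32),
                           torch.randn(2, 8, 64, 32),
                           torch.randn(2, 8, 64, 32))
```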
Figure 8. Step-by-step decoding process in traditional transformers.
Figure 9. Parallel decoding with spatiotemporal relationships.
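The practical difference between Figures 8 and 9 is that the decoder in Figure 9 receives a learned query for every future position and produces all time steps in one forward pass, rather than feeding each prediction back in step by step. The sketch below illustrates this with a stock PyTorch decoder; the query embedding, layer sizes, and token counts are illustrative assumptions.

```python
# Sketch of parallel (non-autoregressive) decoding as in Figure 9: learned
# queries for all future positions are decoded in a single forward pass.
# Dimensions, token counts, and the query embedding are assumptions.
import torch
import torch.nn as nn

d_model, n_future, tokens_per_frame = 256, 10, 64

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4)

memory = torch.randn(2, 320, d_model)          # encoded input tokens (B, N, D)

# One learned query per future token position; no step-by-step feedback loop.
queries = nn.Parameter(torch.randn(n_future * tokens_per_frame, d_model))
out = decoder(queries.unsqueeze(0).expand(2, -1, -1), memory)
print(out.shape)                               # (2, 640, 256), reshaped to frames later
```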
Figure 10. Comparison of the prediction results of various models on the Moving MNIST dataset. (a,b) show two different samples; from top to bottom: input, target, and the results of ViViT-Prob, VPTR-NAR, ConvLSTM, and PredRNN.
Figure 11. Visualization using class activation mapping (CAM) on the Moving MNIST dataset. Red indicates the regions the model focuses on during decoding.
Figure 12. Comparison of the inversion result scores at the no-rain, light rain, and moderate rain levels. The x-axis shows the different models and the y-axis shows the score values. (a) CSI at the three precipitation levels. (b) FAR at the three precipitation levels. (c) POD at the three precipitation levels.
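For reference, the three scores in Figure 12 are the standard categorical metrics computed from hit, false-alarm, and miss counts after thresholding the predicted and observed fields at each precipitation level; the sketch below shows the formulas, with the threshold for each level left as an assumed parameter rather than the paper's exact setting.

```python
# Standard categorical scores (CSI, FAR, POD) from a 2x2 contingency table,
# as used in Figure 12. The threshold for each precipitation level is an
# assumed parameter.
import numpy as np

def categorical_scores(pred, obs, threshold):
    p, o = pred >= threshold, obs >= threshold
    hits = np.sum(p & o)                 # predicted rain, observed rain
    false_alarms = np.sum(p & ~o)        # predicted rain, no rain observed
    misses = np.sum(~p & o)              # rain observed but not predicted
    eps = 1e-9
    csi = hits / (hits + false_alarms + misses + eps)   # critical success index
    far = false_alarms / (hits + false_alarms + eps)    # false alarm ratio
    pod = hits / (hits + misses + eps)                  # probability of detection
    return csi, far, pod
```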
Figure 13. Comparison of the prediction results of various models on the radar echo dataset. (a,b) show two different samples; from top to bottom: input, target, and the results of ViViT-Prob, ConvLSTM, PredRNN, and RainNet.
Figure 14. Comparison of the prediction results of various models on the radar echo dataset. (a,b) show two different samples; from top to bottom: input, target, and the results of ViViT-Prob, ConvLSTM, PredRNN, and RainNet.
Figure 15. Comparison of the prediction results of various models on the radar echo dataset. (a,b) show two different samples; from top to bottom: input, target, and the results of ViViT-Prob, ConvLSTM, PredRNN, and RainNet.
Table 1. Comparison models used in the experiments; these models are commonly used in spatiotemporal sequence prediction and weather forecasting.
Models | Description
ConvLSTM | A classic spatiotemporal sequence prediction network that uses convolutional operations on both the network states and the input data to capture spatial features, effectively modeling temporal and spatial correlations (a minimal cell sketch is given after this table).
PredRNN | A recurrent neural network that employs a unified memory pool to store spatial representations and temporal changes. The hidden states are no longer confined within individual LSTM cells but can propagate in both horizontal and vertical directions.
CausalLSTM | A spatiotemporal memory unit that enhances feature amplification through additional nonlinear operations while maintaining hierarchical invariance, enabling better capture of short-term dynamic changes.
E3D-LSTM | A spatiotemporal sequence prediction model that integrates LSTM with 3D convolutions and incorporates self-attention mechanisms, demonstrating strong capabilities in predicting multidimensional spatiotemporal data.
MIM | A neural network prediction model designed to learn high-order non-stationarity in spatiotemporal dynamics. It combines historical information with the current state to predict future states, capturing complex relationships in spatiotemporal data.
SA-ConvLSTM | A ConvLSTM model enhanced with self-attention mechanisms. It dynamically adjusts information flow within the network to better learn complex patterns in spatiotemporal data.
STAE | A spatiotemporal sequence prediction model that utilizes temporal attention mechanisms to weight time information, thereby improving prediction performance.
VPTR | A transformer-based video prediction model, available in three variants: fully autoregressive (VPTR-FAR), partially autoregressive (VPTR-PAR), and non-autoregressive (VPTR-NAR).
OpticalFlow | A radar echo extrapolation model based on optical flow methods. It tracks the motion trends of precipitation features from a series of radar echo images and infers the precipitation field at the next time step.
RainNet | A deep learning-based radar echo extrapolation model that uses quality-controlled weather radar data provided by the German Meteorological Service to predict persistent echoes and precipitation intensity.
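To make the ConvLSTM entry above concrete, here is a minimal cell sketch in which all four gates are computed with a single convolution over the concatenated input and hidden state, so the recurrent state keeps its spatial layout; the channel sizes and kernel size are illustrative assumptions.

```python
# Minimal ConvLSTM cell sketch (cf. [24]): gates are computed with
# convolutions instead of matrix products, so the hidden state keeps its
# spatial layout. Channel sizes and kernel size are illustrative assumptions.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):                    # x: (B, C_in, H, W)
        i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(1, 64)
h = c = torch.zeros(2, 64, 128, 128)
h, c = cell(torch.randn(2, 1, 128, 128), h, c)
```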
Table 2. Performance comparison on the Moving MNIST dataset. The optimal (or suboptimal) results are marked in bold (or underlined). The results of the comparison models are taken from their published papers; * indicates that the metric was not reported in the original paper.
Models | MSE↓ | MAE↓ | SSIM↑
RNN-Based Models:
ConvLSTM | 103.3 | 182.9 | 0.707
PredRNN | 56.8 | 126.1 | 0.867
CausalLSTM | 46.5 | 106.8 | 0.898
MIM | 44.2 | 101.1 | 0.910
E3D-LSTM | 41.3 | 86.4 | 0.910
SA-ConvLSTM | 43.9 | 94.7 | 0.913
STAE | 35.2 | * | 0.929
Transformer-Based Models:
VPTR-FAR | 107.2 | * | 0.844
VPTR-PAR | 93.2 | * | 0.859
VPTR-NAR | 63.6 | * | 0.882
Ours (ViViT-Prob) | 29.1 | 89.1 | 0.923
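As a usage note for the metric columns above: MSE and MAE are averaged over the predicted frames and SSIM follows [34]. The sketch below uses scikit-image's SSIM implementation; the array shapes, value range, and averaging convention are assumptions, and the normalization used for the published Moving MNIST numbers varies between papers.

```python
# Sketch of the three reported metrics (MSE, MAE, SSIM [34]) for one predicted
# sequence. Shapes, value range, and averaging convention are assumptions.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def sequence_metrics(pred, target):          # both (T, H, W), values in [0, 1]
    mse = np.mean((pred - target) ** 2)
    mae = np.mean(np.abs(pred - target))
    ssim_score = np.mean([ssim(p, t, data_range=1.0)
                          for p, t in zip(pred, target)])
    return mse, mae, ssim_score
```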
Table 3. Performance comparison on the radar echo dataset. The optimal (or suboptimal) results are marked in bold (or underlined).
Models | MSE↓ | MAE↓ | SSIM↑
Radar Echo Extrapolation Models:
OpticalFlow | 87.8 | 1.745 | 0.604
RainNet | 84.7 | 2.335 | 0.572
Spatiotemporal Sequence Models:
ConvLSTM | 100.6 | 1.719 | 0.613
PredRNN | 74.3 | 1.657 | 0.717
Ours (ViViT-Prob) | 58.4 | 1.542 | 0.744
Table 4. Ablation experiments on each module of the model. * indicates that the output is produced directly without a decoder module; 3DCNN * indicates that a randomly initialized matrix is used as the input to the decoder.
Encoder | Decoder | Attention | MSE↓ | MAE↓ | SSIM↑
Patch | * | Multihead Self-Attention | 78.7 | 1.736 | 0.623
Patch | Patch | Multihead Self-Attention | 72.9 | 1.712 | 0.646
Patch | Patch | Multihead Probsparse Self-Attention | 67.3 | 1.637 | 0.682
3DCNN | 3DCNN * | Multihead Self-Attention | 70.4 | 1.692 | 0.667
3DCNN | 3DCNN | Multihead Self-Attention | 64.8 | 1.583 | 0.727
3DCNN | 3DCNN | Multihead Cross-Attention | 68.2 | 1.624 | 0.703
3DCNN | 3DCNN | Multihead Probsparse Self-Attention | 58.4 | 1.542 | 0.744