Short-Term Precipitation Forecast Based on Diffusion Spatiotemporal Network

Dong, Zanqiang; Yang, Zhaofeng; Yu, Wenbin; Qian, Hongjie; Fan, Yanfeng; Zhu, Konglin; Liu, Gaoping

doi:10.3390/rs18101574


by Zanqiang Dong ¹, Zhaofeng Yang ²,³, Wenbin Yu ²,³,⁴,*, Hongjie Qian ²,³, Yanfeng Fan ²,³, Konglin Zhu ⁵ and Gaoping Liu ⁶

¹ School of Computer Science, Zhengzhou University of Aeronautics, Zhengzhou 450046, China
² School of Software, Nanjing University of Information Science and Technology, Nanjing 210044, China
³ Nanjing University of Information Science and Technology, Wuxi Institute of Technology, Wuxi 214000, China
⁴ Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), Nanjing University of Information Science and Technology, Nanjing 210044, China
⁵ The Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA
⁶ Anhui Provincial Meteorological Information Centre, Hefei 230031, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1574; https://doi.org/10.3390/rs18101574

Submission received: 14 February 2026 / Revised: 30 April 2026 / Accepted: 8 May 2026 / Published: 14 May 2026

(This article belongs to the Special Issue AI Applications to Remote Sensing of Cloud and Precipitation: Monitoring, Modeling, and Prediction)


Highlights

What are the main findings?

A ViT-modulated diffusion spatiotemporal prediction network (VSTPN) is proposed, combining recurrent prediction with ViT-conditioned diffusion refinement for short-term radar-based precipitation forecasting on HKO-7.
VSTPN achieves the best MSE and SSIM among tested baselines, and exhibits the best average rank across reported metrics while improving CSI, HSS, and POD.

What are the implications of the main findings?

The model demonstrates high robustness, maintaining the highest CSI/POD retention when scaling from moderate (dBZ ≥ 20) to severe (dBZ ≥ 40) precipitation intensities.
The results at dBZ ≥ 40 reveal a specific detection–false alarm trade-off: while VSTPN significantly improves CSI, HSS, and POD, its FAR is slightly higher than that of ETCJ-PredNet.

Abstract

Short-term precipitation forecasting is essential for disaster prevention, urban management, and weather-sensitive decision making, yet radar-based nowcasting remains challenging because precipitation systems evolve nonlinearly and high-frequency echo structures are easily over-smoothed by deterministic sequence models. This paper proposes a ViT-modulated diffusion spatiotemporal prediction network (VSTPN) that cascades a spatiotemporal prediction module with a ViT-conditioned diffusion refinement module. The spatiotemporal module models the temporal evolution of radar echoes, whereas the ViT-Diffusion module uses global contextual features as conditional guidance during iterative denoising to refine spatial structures. Experiments on the HKO-7 benchmark show that VSTPN achieves lower MSE and higher SSIM than the tested baselines and improves CSI, HSS, and POD at the evaluated reflectivity thresholds. At the 40 dBZ threshold, the model improves CSI, HSS, and POD, while its FAR is slightly higher than that of ETCJ-PredNet, indicating a recall–false alarm trade-off for intense echoes. Additional post-hoc diagnostic analyses of relative gains, metric consistency, threshold sensitivity, and component effect sizes further support the stability of the reported improvements under the current experimental protocol. The results suggest that coupling spatiotemporal sequence modeling with diffusion-based radar echo refinement is a feasible direction for short-term precipitation forecasting; nevertheless, probabilistic uncertainty evaluation, multi-domain validation, and additional generative-quality metrics remain important directions for future work.

Keywords:

spatiotemporal prediction network; diffusion model; ViT; rainfall; deep learning

1. Introduction

Short-term precipitation forecasting, also referred to as precipitation nowcasting, aims to predict the location, extent, and intensity of rainfall within the next few hours. Accurate nowcasting is important for flood warning, urban drainage management, aviation safety, traffic control, and agricultural production. The task is difficult because radar echoes are governed by multiscale atmospheric processes, including advection, growth and decay of convective cells, orographic effects, and rapid intensity changes. These processes make future radar fields highly nonlinear, spatially heterogeneous, and uncertain.

Traditional precipitation forecasting methods can be broadly grouped into statistical modeling, numerical weather prediction, and radar echo extrapolation. Statistical methods use historical meteorological variables, such as temperature, humidity, pressure, wind direction, and wind speed, to build regression or probabilistic relationships. These methods are relatively interpretable and computationally efficient, but their extrapolation ability is limited when the atmospheric state deviates from the historical distribution. Numerical weather prediction (NWP) solves dynamical and thermodynamical equations under physical constraints and remains a central technology in modern meteorology [1]. Early ensemble forecasting laid an important foundation for probabilistic numerical prediction [2], while subsequent studies improved cloud initialization and precipitation verification methods [3,4,5]. Recent NWP-related work has also explored satellite retrieval, cloud deployment of the Weather Research and Forecasting model, and radar data assimilation for heavy-rainfall events [6,7,8]. Nevertheless, for very short lead times, NWP may suffer from spin-up time, coarse resolution, and data assimilation limitations, especially in rapidly developing convective events.

Radar echo extrapolation provides a fast and operationally attractive alternative because radar observations offer high-frequency measurements of precipitation structures. Early work on radar-based echo motion estimation dates back to Ligda’s radar tracking study [9]. Cross-correlation and tracking–radar echo methods were later developed to estimate echo displacement [10], and COTREC-type approaches introduced continuity constraints to correct noisy or inconsistent motion vectors [11]. Probabilistic extrapolation systems such as STEPS combine radar extrapolation with downscaled NWP to produce ensemble precipitation forecasts [12]. Object-based cell tracking [13], optical-flow-based nowcasting [14,15,16,17], and satellite-based convection tracking methods such as Cb-TRAM [18] further improved motion estimation. However, extrapolation methods are still limited when precipitation systems grow, decay, split, or merge because the future evolution cannot be explained by advection alone.

Deep learning has been increasingly used to learn nonlinear radar echo evolution directly from data. Early neural network studies in hydrology and meteorology demonstrated the potential of data-driven rainfall and runoff prediction [19,20,21,22]. In precipitation nowcasting, ConvLSTM introduced convolutional recurrent units for spatiotemporal sequence prediction [23]; TrajGRU improved recurrent connections by learning location-variant motion structures [24]; PredRNN and PredRNN-v2 strengthened memory flow for long-term spatiotemporal dependency modeling [25,26]; and recent radar or satellite nowcasting models such as SATcast and ETCJ-PredNet further exploit temporal attention, multi-source information, and jump connections [27,28]. These methods provide strong deterministic baselines, but they often produce overly smooth predictions at longer lead times because pixelwise regression losses average over multiple plausible futures.

Recent transformer and generative approaches have attempted to address these limitations from different perspectives. DGMR formulates radar nowcasting as probabilistic generation and emphasizes ensemble forecast quality [29]. NowcastNet combines physical evolution and conditional generative learning for extreme precipitation nowcasting [30]. EarthFormer explores space–time Transformer blocks for Earth system forecasting [31], whereas PreDiff uses latent diffusion models and knowledge alignment for probabilistic precipitation nowcasting [32]. These studies indicate that global context modeling and generative refinement are important for preserving multiscale precipitation structures. They also show that deterministic pixelwise metrics alone are insufficient for fully evaluating generative nowcasting systems.

Motivated by these developments, this paper proposes a ViT-modulated diffusion spatiotemporal prediction network (VSTPN). The model first uses a spatiotemporal prediction module to estimate the coarse future radar sequence and then applies a ViT-conditioned diffusion module to refine spatial details under global contextual guidance. The intended contribution is not to claim that a cascaded architecture is universally superior to all end-to-end generative nowcasting models, but to evaluate whether a diffusion-based refinement stage can improve radar echo clarity and deterministic nowcasting scores when coupled with a recurrent spatiotemporal predictor. The main contributions are summarized as follows:

A cascaded VSTPN framework is formulated for short-term radar echo prediction, combining recurrent spatiotemporal modeling with ViT-conditioned diffusion refinement.
The ViT-Diffusion module is described as a conditional denoising network in which ViT-derived global contextual tokens guide U-Net feature reconstruction through attention-based modulation.
Experiments on the HKO-7 benchmark compare VSTPN with representative recurrent and attention-based baselines under deterministic and threshold-based metrics, while the limitations of single-domain evaluation, deterministic scoring, and unreported probabilistic uncertainty are explicitly discussed.
Additional diagnostic experiments summarize relative gains over the strongest baseline, threshold-wise POD–FAR behavior, cross-metric robustness, and component effect sizes using the reported test-set metrics.

2. Materials and Methods

2.1. Problem Formulation

Let $X_{1:m} = \{X_{1}, \dots, X_{m}\}$ denote the observed radar echo sequence and let $Y_{1:n} = \{Y_{1}, \dots, Y_{n}\}$ denote the future sequence to be predicted, where each frame $X_{i}, Y_{j} \in \mathbb{R}^{H \times W \times C}$. The objective of radar-based nowcasting is to learn a mapping

$$\hat{Y}_{1:n} = F_{\Theta}(X_{1:m}), \tag{1}$$

where $\hat{Y}_{1:n}$ is the predicted future radar sequence, and $\Theta$ represents all learnable parameters. In the proposed cascade, this mapping is decomposed into a spatiotemporal prediction stage and a diffusion refinement stage:

$$\tilde{Y}_{1:n} = S_{\phi}(X_{1:m}), \qquad \hat{Y}_{1:n} = D_{\theta}(\tilde{Y}_{1:n}, X_{1:m}), \tag{2}$$

where $S_{\phi}$ produces a coarse deterministic forecast and $D_{\theta}$ refines the spatial structure through ViT-conditioned diffusion denoising. This design separates motion-oriented sequence prediction from image-generation-oriented refinement. The separation is useful in the present experimental setting, but it should be interpreted as a modeling choice rather than proof that cascading is always preferable to an end-to-end diffusion forecaster.

2.2. Spatiotemporal Prediction Module

The spatiotemporal module is designed to capture temporal dependencies, echo displacement, and multiscale spatial structures before diffusion refinement. For the $l$-th recurrent layer at time $t$, the hidden and cell states are denoted by $H_{t}^{l}$ and $C_{t}^{l}$. To reduce the information isolation problem of stacked recurrent units, the module aggregates the lower-layer state, the previous state of the current layer, and a temporal attention summary. A self-attention summary over historical hidden states, following scaled dot-product attention in Transformer models [33], can be expressed as

$$A_{t,\tau}^{l} = \mathrm{softmax}\left(\frac{(W_{q}^{l} H_{t-1}^{l})\,(W_{k}^{l} H_{\tau}^{l})^{\top}}{\sqrt{d}}\right), \qquad \bar{H}_{t}^{l} = \sum_{\tau < t} A_{t,\tau}^{l} W_{v}^{l} H_{\tau}^{l}, \tag{3}$$

where $W_{q}^{l}$, $W_{k}^{l}$, and $W_{v}^{l}$ are learnable projections, and $d$ is the attention dimension. The recurrent update is then written as

$$Z_{t}^{l} = [H_{t}^{l-1}, H_{t-1}^{l}, C_{t}^{l-1}, \bar{H}_{t}^{l}], \tag{4}$$

$$\begin{aligned}
i_{t}^{l} &= \sigma(W_{i}^{l} * Z_{t}^{l} + b_{i}^{l}), & f_{t}^{l} &= \sigma(W_{f}^{l} * Z_{t}^{l} + b_{f}^{l}), \\
o_{t}^{l} &= \sigma(W_{o}^{l} * Z_{t}^{l} + b_{o}^{l}), & g_{t}^{l} &= \tanh(W_{g}^{l} * Z_{t}^{l} + b_{g}^{l}), \\
C_{t}^{l} &= f_{t}^{l} \odot C_{t-1}^{l} + i_{t}^{l} \odot g_{t}^{l}, & H_{t}^{l} &= o_{t}^{l} \odot \tanh(C_{t}^{l}),
\end{aligned} \tag{5}$$

where $*$ denotes convolution, $\odot$ denotes element-wise multiplication, and $\sigma(\cdot)$ is the sigmoid activation. Jump connections are used to pass shallow spatial features to deeper prediction layers, thereby preserving boundary information and mitigating gradient attenuation. The output of this module, $\tilde{Y}_{1:n}$, serves as the coarse prediction to be refined by ViT-Diffusion.
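The temporal attention summary of Eq. (3) can be illustrated with a minimal sketch. It is not the authors' implementation: the projections $W_{q}^{l}$, $W_{k}^{l}$, and $W_{v}^{l}$ are assumed to be identity maps, hidden states are treated as flattened vectors rather than convolutional feature maps, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_summary(history, query):
    """Return the weights A_{t,tau} and the summary H-bar_t of Eq. (3).

    history: hidden vectors H_tau for tau < t (each a list of floats).
    query:   the previous hidden vector H_{t-1}.
    Assumption: W_q, W_k, and W_v are identity maps, and hidden states are
    flattened vectors rather than convolutional feature maps.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, h)) / math.sqrt(d) for h in history
    ]
    weights = softmax(scores)
    summary = [
        sum(w * h[j] for w, h in zip(weights, history)) for j in range(d)
    ]
    return weights, summary


# Three toy hidden states; the most recent one also serves as the query.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, summary = temporal_attention_summary(history, query=history[-1])
```

In this toy case the most recent state receives the largest weight because it is most similar to the query, which is the behavior the summary $\bar{H}_{t}^{l}$ is meant to capture before it is concatenated into $Z_{t}^{l}$.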

2.3. ViT-Diffusion

As a classical encoder–decoder architecture, U-Net [34] is used as the denoising backbone because it combines hierarchical feature extraction with skip-connected reconstruction. As shown in Figure 1, the encoder progressively compresses the radar field to extract multi-level features, and the decoder restores spatial resolution by fusing low-level textural information and high-level semantic information.

The U-Net configuration follows the channel layout shown in Figure 1. The encoder uses 3 × 3 convolutions and 2 × 2 max-pooling with channel widths of 16, 32, 64, 128, and 256. The decoder is symmetric and uses 2 × 2 upsampling, skip/block-copy connections, and 1 × 1 projection to recover a one-channel radar echo field. Dropout is applied at progressively deeper levels, as indicated in the architecture diagram, to reduce overfitting during feature reconstruction. These details make the denoising backbone explicit and distinguish the implemented module from a generic DDPM template.

To introduce global contextual guidance, the radar frame is divided into non-overlapping patches of size $P \times P$ (with $P = 16$ in the reported implementation), following the patch-token design of the Vision Transformer [35]. For an input radar image $X$, the patch sequence is embedded as

$$e_{i} = W_{p}\,\mathrm{vec}(p_{i}) + e_{i}^{\mathrm{pos}}, \qquad i = 1, \dots, N, \tag{6}$$

where $p_{i}$ is the $i$-th image patch, $N = HW / P^{2}$ is the number of patches, $W_{p}$ is the patch-projection matrix, and $e_{i}^{\mathrm{pos}}$ is the positional embedding. The ViT encoder produces contextual tokens

$$G = \mathrm{ViT}(e_{1}, \dots, e_{N}). \tag{7}$$

These tokens are linearly projected and used as conditional key–value features inside the U-Net attention blocks:

$$K_{c} = W_{K} G, \qquad V_{c} = W_{V} G. \tag{8}$$

For an intermediate U-Net feature map $F$, the query is $Q = W_{Q} F$, and the ViT-conditioned cross-attention update is

$$\mathrm{CA}(F, G) = \mathrm{softmax}\left(\frac{Q K_{c}^{\top}}{\sqrt{d_{a}}}\right) V_{c}, \qquad F' = F + W_{O}\,\mathrm{CA}(F, G), \tag{9}$$

where $d_{a}$ is the attention dimension, and $W_{O}$ is the output projection. In this way, the ViT branch provides a global meteorological context, while the U-Net branch reconstructs local echo structures.
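The patch count $N = HW / P^{2}$ and the residual cross-attention update of Eq. (9) can be sketched as follows. This is an illustrative sketch only: $W_{Q}$, $W_{K}$, $W_{V}$, and $W_{O}$ are assumed to be identity maps, the 480 × 480 frame size is an assumption (the frame size is not stated in this section), and the function names are hypothetical.

```python
import math


def num_patches(height, width, patch=16):
    """N = H * W / P^2 for non-overlapping P x P patches."""
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_update(features, tokens):
    """Apply F' = F + CA(F, G) from Eq. (9), one query row at a time.

    features: U-Net query vectors, one per spatial location.
    tokens:   ViT contextual tokens G, one per patch.
    Assumption: W_Q, W_K, W_V, and W_O are identity maps, so feature and
    token vectors share the same dimension d_a.
    """
    d_a = len(tokens[0])
    updated = []
    for f in features:
        scores = [
            sum(a * b for a, b in zip(f, g)) / math.sqrt(d_a) for g in tokens
        ]
        weights = softmax(scores)
        attended = [
            sum(w * g[j] for w, g in zip(weights, tokens)) for j in range(d_a)
        ]
        updated.append([x + a for x, a in zip(f, attended)])
    return updated


patch_count = num_patches(480, 480)  # assumed 480 x 480 frame, P = 16
tokens = [[1.0, 2.0], [1.0, 2.0]]    # two identical toy context tokens
features = [[0.0, 0.0], [0.5, 0.5]]  # two toy U-Net query vectors
updated = cross_attention_update(features, tokens)
```

Because the two toy tokens are identical, every query attends to the same context vector, so each feature row is simply shifted by that vector. With learned projections the shift would differ per location.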

To make these interactions within the denoising backbone explicit, Figure 2 illustrates the overall architecture of the ViT-enhanced U-Net noise predictor. The network integrates the noisy radar field, the temporal embedding of the current diffusion time step, and the ViT-extracted contextual features to jointly estimate the Gaussian noise component. The integration mechanism is further detailed in Figure 3, which presents the internal residual and cross-attention blocks of the network. Within these blocks, the ViT-derived contextual tokens act as the key–value conditions that modulate the U-Net query features, so that global meteorological patterns guide the local spatial reconstruction.

Figure 4 illustrates the ViT-Diffusion process. The implemented diffusion chain follows the DDPM formulation [36] and uses $T = 1000$ time steps, consistent with the denoising schedule illustrated in the architecture diagram. In the forward process, Gaussian noise is gradually added to a clean radar frame $x_{0}$:

$$q(x_{t} \mid x_{0}) = \mathcal{N}\left(\sqrt{\bar{\alpha}_{t}}\, x_{0},\ (1 - \bar{\alpha}_{t}) I\right), \tag{10}$$

which can be reparameterized as

$$x_{t} = \sqrt{\bar{\alpha}_{t}}\, x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \tag{11}$$

where $\bar{\alpha}_{t} = \prod_{s=1}^{t} (1 - \beta_{s})$, and $\beta_{s}$ is the noise schedule. The denoising network $\epsilon_{\theta}$ predicts the injected noise conditioned on the diffusion step and ViT context $G$:

$$L_{\mathrm{diff}} = \mathbb{E}_{x_{0}, t, \epsilon}\left[\lVert \epsilon - \epsilon_{\theta}(x_{t}, t, G) \rVert_{2}^{2}\right]. \tag{12}$$

The reverse update is written as

$$p_{\theta}(x_{t-1} \mid x_{t}, G) = \mathcal{N}\left(\mu_{\theta}(x_{t}, t, G),\ \sigma_{t}^{2} I\right), \tag{13}$$

with

$$\mu_{\theta}(x_{t}, t, G) = \frac{1}{\sqrt{\alpha_{t}}}\left(x_{t} - \frac{\beta_{t}}{\sqrt{1 - \bar{\alpha}_{t}}}\, \epsilon_{\theta}(x_{t}, t, G)\right), \tag{14}$$

where $\alpha_{t} = 1 - \beta_{t}$.
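The forward sampling of Eq. (11) and the reverse mean of Eq. (14) can be sketched in a few lines. The noise schedule is not specified in the text, so the linear schedule from $10^{-4}$ to $0.02$ used below is an assumption borrowed from the common DDPM default, and the true noise stands in for the prediction of $\epsilon_{\theta}$. All function names are hypothetical.

```python
import math
import random


def linear_beta_schedule(steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear schedule beta_1..beta_T (not stated in the paper)."""
    return [
        beta_start + (beta_end - beta_start) * s / (steps - 1)
        for s in range(steps)
    ]


def cumulative_alpha_bar(betas):
    """alpha-bar_t = prod_{s<=t} (1 - beta_s), stored at index t - 1."""
    values, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        values.append(product)
    return values


def forward_sample(x0, t, alpha_bar, rng):
    """Eq. (11): draw x_t from x_0 at a 1-indexed step t; also return eps."""
    a = alpha_bar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def reverse_mean(xt, eps_pred, t, betas, alpha_bar):
    """Eq. (14): posterior mean given a noise estimate eps_pred."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    coef = beta_t / math.sqrt(1.0 - alpha_bar[t - 1])
    return [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(xt, eps_pred)]


betas = linear_beta_schedule()
alpha_bar = cumulative_alpha_bar(betas)
rng = random.Random(0)

x0 = [0.2, -0.5, 1.0]                       # toy "radar frame" of 3 pixels
x1, eps = forward_sample(x0, 1, alpha_bar, rng)
# With the true noise as the estimate, the step-1 mean reproduces x_0.
recovered = reverse_mean(x1, eps, 1, betas, alpha_bar)
```

At $t = 1$, $\bar{\alpha}_{1} = \alpha_{1}$, so $\beta_{1} / \sqrt{1 - \bar{\alpha}_{1}} = \sqrt{1 - \bar{\alpha}_{1}}$ and Eq. (14) inverts Eq. (11) when the noise estimate is exact. This gives a simple consistency check between the two equations.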

Figure 2. ViT-enhanced U-Net noise predictor. The noisy radar field, diffusion time step, and ViT contextual features are jointly used to predict the Gaussian noise component.

Figure 3. Internal residual and cross-attention blocks of the ViT-enhanced U-Net. ViT-derived contextual tokens provide key–value conditions for U-Net query features.

Figure 4. ViT-Diffusion pipeline. The ViT encoder estimates global context from radar echoes, and the diffusion decoder performs iterative denoising under this context.

2.4. Overall VSTPN Framework and Training Protocol

The complete VSTPN architecture is shown in Figure 5. Module A is the spatiotemporal prediction module, which estimates the coarse future radar sequence. Module B is the ViT-Diffusion module, which refines the coarse prediction by denoising under global ViT guidance. The HKO-7 evaluation uses ten historical radar frames as input and predicts the following ten frames. In the experiments reported in this paper, the cascade follows a two-stage training-and-refinement protocol: the spatiotemporal predictor is optimized for sequence prediction, the ViT-Diffusion module is optimized for conditional denoising on HKO-7 radar images, and the diffusion module is then used to refine the output of the spatiotemporal predictor. The corresponding losses are

$$L_{\mathrm{seq}} = \frac{1}{n} \sum_{t=1}^{n} \lVert Y_{t} - \tilde{Y}_{t} \rVert_{2}^{2}, \qquad L_{\mathrm{diff}} = \mathbb{E}\left[\lVert \epsilon - \epsilon_{\theta}(x_{t}, t, G) \rVert_{2}^{2}\right]. \tag{15}$$

This protocol provides a practical way to combine deterministic motion prediction and diffusion-based image refinement, but it does not replace a dedicated comparison with an end-to-end diffusion forecaster. Such a comparison is therefore reserved for future work rather than claimed as established evidence in this study.

All models were implemented using the PyTorch v2.1 framework and trained on an NVIDIA A100 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 40 GB of memory.

2.5. Algorithmic Summary

Step 1: Input the historical radar sequence $X_{1:m}$ and extract temporal features with the spatiotemporal prediction module.
Step 2: Generate the coarse future sequence $\tilde{Y}_{1:n}$ through recurrent prediction and jump-connected feature fusion.
Step 3: Partition radar frames into $16 \times 16$ patches and extract ViT contextual tokens $G$.
Step 4: Inject $G$ into U-Net denoising blocks through cross-attention and predict the noise term at each diffusion step.
Step 5: Iteratively denoise the coarse prediction to obtain the refined forecast $\hat{Y}_{1:n}$.
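The control flow of these steps, which mirrors the decomposition in Eq. (2), can be sketched as a two-stage function. The sketch shows only the cascade structure: `persistence_predictor` and `clipping_refiner` are hypothetical stand-ins for $S_{\phi}$ and $D_{\theta}$ and do not reproduce the recurrent predictor or the ViT-conditioned denoiser.

```python
from typing import Callable, List

Frame = List[List[float]]
FrameSeq = List[Frame]


def cascade_forecast(
    history: FrameSeq,
    predict: Callable[[FrameSeq], FrameSeq],
    refine: Callable[[FrameSeq, FrameSeq], FrameSeq],
) -> FrameSeq:
    """Two-stage forecast: coarse prediction, then conditional refinement."""
    coarse = predict(history)       # Steps 1-2: S_phi(X_{1:m})
    return refine(coarse, history)  # Steps 3-5: D_theta(Y-tilde, X_{1:m})


def persistence_predictor(history: FrameSeq, horizon: int = 10) -> FrameSeq:
    """Hypothetical stand-in for S_phi: repeat the last observed frame."""
    last = history[-1]
    return [[row[:] for row in last] for _ in range(horizon)]


def clipping_refiner(coarse: FrameSeq, history: FrameSeq) -> FrameSeq:
    """Hypothetical stand-in for D_theta: clip each value to [0, 1].

    A real refiner would extract ViT tokens and run iterative denoising;
    the history argument is kept only to match the signature of Eq. (2).
    """
    return [
        [[min(1.0, max(0.0, v)) for v in row] for row in frame]
        for frame in coarse
    ]


# Ten toy 2 x 2 input frames, matching the ten-in, ten-out HKO-7 setting.
history = [[[0.1 * k, 1.2], [-0.3, 0.5]] for k in range(10)]
forecast = cascade_forecast(history, persistence_predictor, clipping_refiner)
```

Keeping the two stages behind separate callables reflects the two-stage protocol described in Section 2.4, in which the predictor and the refiner are optimized separately before being chained.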

3. Performance Analysis and Evaluation

3.1. Qualitative Comparison of Generative Models

To examine the radar image generation behavior of the proposed ViT-Diffusion module, GAN [37], VAE [38], and ViT-Diffusion outputs are compared visually. This comparison is intended as an illustrative qualitative analysis rather than a complete distribution-level evaluation of generative fidelity. Standard generative metrics such as FID [39], LPIPS [40], or spectral/texture–distance measures are not reported in the present study and are therefore not used to support claims about distributional superiority.

As shown in Figure 6, all three generative models gradually learn the main radar echo patterns during training. The GAN results tend to show stronger local contrast but may contain unstable artifacts, whereas the VAE results are smoother because of latent mean approximation. The ViT-Diffusion outputs preserve clearer echo boundaries in the displayed examples and show fewer visually apparent artifacts. Because this evidence is qualitative, the conclusion is phrased conservatively—the ViT-Diffusion module appears visually suitable for radar echo refinement in the shown cases, but additional distributional and texture-based metrics are required before making a general claim about generative image quality superiority.

3.2. Quantitative Analysis

The quantitative evaluation uses six metrics: mean squared error (MSE), structural similarity index (SSIM), critical success index (CSI), Heidke skill score (HSS), probability of detection (POD), and false alarm ratio (FAR). MSE and SSIM are deterministic pixel-wise or structural metrics; SSIM follows the structural similarity image quality formulation of Wang et al. [41]. These metrics are useful for comparison with previous HKO-7 studies but may penalize physically plausible forecasts that are spatially displaced. CSI, HSS, POD, and FAR evaluate thresholded precipitation events and are therefore complementary to MSE and SSIM. Since the current evaluation reports one deterministic forecast per input sequence, probability-oriented metrics such as CRPS and other proper scoring rules [42], diagnostic reliability analysis [43], rank histograms [44], and ensemble spread–skill analysis are not included.
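For reference, the four threshold-based scores can be computed from a 2 × 2 contingency table as in the sketch below. It uses the standard verification definitions. How the study aggregates pixels across frames and sequences is not described here, so the flat pixel lists and the function names are assumptions.

```python
def contingency_counts(pred, obs, threshold):
    """Count hits, false alarms, misses, and correct negatives."""
    tp = fp = fn = tn = 0
    for p, o in zip(pred, obs):
        predicted, observed = p >= threshold, o >= threshold
        if predicted and observed:
            tp += 1
        elif predicted:
            fp += 1
        elif observed:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_ratio(numerator, denominator):
    """Return NaN instead of raising when a score is undefined."""
    return numerator / denominator if denominator else float("nan")


def threshold_scores(tp, fp, fn, tn):
    """Standard CSI, HSS, POD, and FAR from contingency counts."""
    hss_den = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    return {
        "CSI": safe_ratio(tp, tp + fp + fn),
        "HSS": safe_ratio(2 * (tp * tn - fp * fn), hss_den),
        "POD": safe_ratio(tp, tp + fn),
        "FAR": safe_ratio(fp, tp + fp),
    }


# Toy reflectivity values in dBZ, flattened to one list per field.
pred = [45, 35, 10, 25, 5, 41]
obs = [42, 20, 12, 30, 0, 15]
counts = contingency_counts(pred, obs, threshold=20)
scores = threshold_scores(*counts)
```

In this toy case there are three hits, one false alarm, no misses, and two correct negatives, so POD is 1.0 while FAR is 0.25. The pair illustrates why POD and FAR are read together rather than in isolation.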

All tested models are trained and evaluated under the same HKO-7 experimental setting used in this study [24]. HKO-7 is a single-region radar benchmark centered on Hong Kong; therefore, the results primarily demonstrate performance within this domain and should not be interpreted as proof of cross-climate or cross-radar generalization. The baseline set includes ConvLSTM, TrajGRU, PredRNN, PredRNN-v2, SATcast, and ETCJ-PredNet, following the original model definitions and reported settings where applicable [23,24,25,26,27,28]. In the comparison and component analysis tables, superscript author–year citations in the model column indicate the original method papers. Table 1 summarizes MSE and SSIM for these baselines and VSTPN.

The results in Table 1 show that VSTPN obtains the lowest MSE and highest SSIM among the tested models. Compared with ConvLSTM and TrajGRU, VSTPN substantially reduces pixel-wise error and improves structural similarity. Compared with PredRNN-v2, VSTPN reduces MSE by 52.1 and increases SSIM by 0.033. Compared with ETCJ-PredNet, the strongest baseline in this table, VSTPN reduces MSE by 23.8 and increases SSIM by 0.011. These results support the usefulness of diffusion-based refinement under deterministic HKO-7 metrics, although they do not by themselves establish probabilistic forecast superiority.

Figure 7 reports the variation in MSE and SSIM over prediction lead time. The VSTPN curve remains competitive over the evaluated horizon, suggesting that the diffusion refinement stage helps preserve structural information when recurrent prediction errors accumulate. Both panels use the same model ordering, so the legend in the MSE panel also applies to the SSIM panel.

Table 2, Table 3 and Table 4 show the threshold-based evaluation at dBZ ≥ 20, 30, and 40. As the threshold increases, positive samples become sparser, and the task becomes more sensitive to small spatial and intensity errors. Therefore, differences at dBZ ≥ 40 should be interpreted with caution unless the number of positive samples and statistical confidence intervals are also reported.

At dBZ ≥ 20, VSTPN achieves the best performance among the seven evaluated models across all four threshold-based metrics. Compared with ETCJ-PredNet, VSTPN improves CSI from 0.725 to 0.731, HSS from 0.699 to 0.722, and POD from 0.870 to 0.875, while slightly reducing FAR from 0.187 to 0.185. These results indicate that the proposed model improves both precipitation detection accuracy and false alarm control under the relatively low reflectivity threshold.

At dBZ ≥ 30, VSTPN also outperforms ETCJ-PredNet on all four threshold-based metrics. Specifically, CSI increases from 0.577 to 0.591, HSS from 0.506 to 0.521, and POD from 0.717 to 0.722, while FAR decreases from 0.252 to 0.247. This suggests that VSTPN maintains better detection capability and lower false alarm tendency under a moderately stronger precipitation threshold.

At dBZ ≥ 40, VSTPN obtains higher CSI, HSS, and POD than ETCJ-PredNet, with CSI increasing from 0.331 to 0.342, HSS from 0.353 to 0.361, and POD from 0.514 to 0.521. However, its FAR is slightly higher than that of ETCJ-PredNet, increasing from 0.518 to 0.523. This result indicates that the proposed model detects more intense echoes but also introduces a small increase in false alarms at the highest threshold. Therefore, the performance at dBZ ≥ 40 should be interpreted as a detection–false alarm trade-off rather than an unqualified improvement in all extreme precipitation metrics.

3.3. Qualitative Nowcasting Analysis

Figure 8 and Figure 9 compare the predicted radar sequences produced by the evaluated models. The input frames provide the historical context, the ground-truth frames show the observed future sequence, and the white regions correspond to stronger radar echoes.

The displayed cases suggest that ConvLSTM has difficulty preserving both echo appearance and motion trajectory over longer lead times. Several recurrent baselines retain the approximate echo envelope in early frames but gradually lose fine structures. VSTPN preserves the main echo contour and produces visually sharper structures in the shown examples, especially in high-reflectivity regions. These qualitative results are consistent with the improved SSIM and threshold-based detection scores. However, the examples should be regarded as case studies rather than proof of general visual superiority. Randomly sampled cases, representative failure cases, and quantitative texture or spectral metrics would further strengthen the evaluation.

3.4. Component Analysis

To examine the contribution of individual components, this section reports component-level experiments for the temporal attention mechanism (ST), jump connection strategy (JC), and ViT-Diffusion module (ViTDiff). The tables should be interpreted as incremental component analyses derived from independently trained configurations, not as a strict leave-one-out ablation starting from the final VSTPN checkpoint. This distinction is important because the full-configuration CSI values in Table 5, Table 6 and Table 7 differ slightly from the final VSTPN score in Table 2; the differences arise from separate training runs and component-specific settings.

Table 5 reports the component analysis for the temporal attention module at dBZ ≥ 20. Adding temporal attention improves CSI from 0.701 to 0.717 when JC and ViTDiff are retained, suggesting that temporal attention helps capture precipitation evolution.

Table 6 reports the component analysis for jump connections. With ST and ViTDiff retained, adding JC improves CSI from 0.698 to 0.714 and reduces FAR from 0.211 to 0.197, indicating that cross-layer feature propagation helps preserve spatial details.

Table 7 reports the component analysis for ViTDiff. With ST and JC retained, adding ViTDiff improves CSI from 0.712 to 0.718 and increases POD from 0.875 to 0.887, suggesting that diffusion refinement helps recover detectable echo structures. The gain is smaller than that of the recurrent components, so the result should be described as complementary rather than dominant.

Overall, the component analysis indicates that temporal attention, jump connections, and ViTDiff each contribute to the tested deterministic and threshold-based scores. However, because the current design is incremental rather than a strict leave-one-out ablation of the final VSTPN checkpoint, the results should be interpreted as supportive evidence rather than complete causal attribution.

3.5. Additional Diagnostic Experiments

To further strengthen the quantitative evidence, we added four diagnostic analyses based on the numerical outputs already reported in Tables 1–7. These diagnostics do not use additional training data or unreported model predictions; instead, they re-analyze the existing test-set metrics from complementary perspectives. This design makes the reported improvements easier to audit because MSE/SSIM, CSI/HSS, POD, and FAR represent different operational requirements in precipitation nowcasting.

First, we computed the relative improvement of VSTPN over ETCJ-PredNet, which is the strongest baseline in the main comparison tables. For a metric $k$, the relative gain is defined as

$$G_{k} = \begin{cases} (B_{k} - O_{k}) / B_{k} \times 100\%, & \text{if lower values are better}, \\ (O_{k} - B_{k}) / B_{k} \times 100\%, & \text{if higher values are better}, \end{cases} \tag{16}$$

where $O_{k}$ denotes the VSTPN value and $B_{k}$ denotes the ETCJ-PredNet value. As shown in Figure 10, VSTPN reduces MSE by 8.5% and improves SSIM by 1.3% relative to ETCJ-PredNet. For event-based metrics, the largest relative gains appear in HSS at dBZ ≥ 20 and CSI at dBZ ≥ 40. The negative value of FAR at dBZ ≥ 40 is retained in the figure, confirming that the high-threshold result should be interpreted as a detection–false alarm trade-off rather than an unconditional improvement.
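The relative gain of Eq. (16) can be checked directly against the threshold-based values quoted in Section 3.2. The sketch below uses only numbers stated in the text; the function name and the dictionary layout are illustrative choices.

```python
def relative_gain(ours, baseline, lower_is_better=False):
    """Eq. (16): percentage gain of `ours` over `baseline`."""
    diff = baseline - ours if lower_is_better else ours - baseline
    return diff / baseline * 100.0


# (VSTPN value, ETCJ-PredNet value, lower_is_better), as quoted in the text.
reported = {
    "CSI@20": (0.731, 0.725, False),
    "HSS@20": (0.722, 0.699, False),
    "FAR@20": (0.185, 0.187, True),
    "CSI@40": (0.342, 0.331, False),
    "FAR@40": (0.523, 0.518, True),
}

gains = {
    name: relative_gain(ours, base, lower)
    for name, (ours, base, lower) in reported.items()
}
```

With these inputs the gains for HSS at dBZ ≥ 20 and CSI at dBZ ≥ 40 are both about 3.3%, while the FAR gain at dBZ ≥ 40 is about −1%, in line with the sign convention that a negative value marks a degradation.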

Second, we computed a cross-metric robustness summary. Each metric column was first min–max normalized across models, with MSE and FAR reversed so that a larger normalized value always indicates better performance. The composite normalized skill is the average of these normalized scores over 14 reported metrics, including MSE, SSIM, and CSI/HSS/POD/FAR at the three reflectivity thresholds. We also computed the average rank over the same metrics and the retention ratios of CSI and POD from dBZ ≥ 20 to dBZ ≥ 40. As shown in Table 8, VSTPN has the highest composite skill, the best average rank, and the highest CSI/POD retention ratios. These results indicate that the reported advantage is not driven by a single metric column and does not disappear when the verification threshold increases.
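The normalization and retention computations can be sketched as follows. Only the ETCJ-PredNet and VSTPN values quoted in Section 3.2 are used, so the normalized scores below are two-model illustrations rather than the seven-model values behind Table 8. The tie-handling rule and the function names are assumptions.

```python
def min_max_normalize(values, lower_is_better=False):
    """Scale one metric column to [0, 1] so that larger is always better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: if all models tie, each receives full credit.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if lower_is_better else scaled


def retention_ratio(score_high_threshold, score_low_threshold):
    """Fraction of a score kept when moving from dBZ >= 20 to dBZ >= 40."""
    return score_high_threshold / score_low_threshold


# Column order: [ETCJ-PredNet, VSTPN]; values as quoted in Section 3.2.
far_40 = min_max_normalize([0.518, 0.523], lower_is_better=True)
csi_40 = min_max_normalize([0.331, 0.342])

csi_retention = {
    "ETCJ-PredNet": retention_ratio(0.331, 0.725),
    "VSTPN": retention_ratio(0.342, 0.731),
}
pod_retention = {
    "ETCJ-PredNet": retention_ratio(0.514, 0.870),
    "VSTPN": retention_ratio(0.521, 0.875),
}
```

For these two models the CSI retention is about 0.457 for ETCJ-PredNet and 0.468 for VSTPN, and the POD retention is about 0.591 and 0.595, respectively. A composite skill would average such normalized columns over all 14 metrics.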

Third, we visualized the POD–FAR relationship across reflectivity thresholds. The diagnostic in Figure 11 shows the strongest baselines and VSTPN at dBZ thresholds of 20, 30, and 40. The upper-left region indicates higher detection probability and lower false alarm ratio. VSTPN remains near the leading edge of the trade-off curve at dBZ ≥ 20 and dBZ ≥ 30. At dBZ ≥ 40, VSTPN shifts slightly upward but also slightly rightward compared with ETCJ-PredNet, which means that it detects more intense echoes at the cost of a small FAR increase. This visualization gives a more transparent interpretation of the extreme precipitation metrics than a single table row.

Finally, we converted the component-analysis tables into absolute effect-size diagnostics. Figure 12 shows the changes in CSI, HSS, POD, and FAR reduction when temporal attention, jump connections, or ViT-Diffusion is added under the reported component-analysis protocol. Temporal attention and jump connections provide the largest CSI gains, whereas ViT-Diffusion contributes a smaller but consistent improvement in CSI, HSS, POD, and FAR. This supports the interpretation that the recurrent module and the diffusion refinement module play complementary roles: the recurrent components improve spatiotemporal event localization, while the diffusion module helps recover detectable echo structures.

4. Limitations and Future Work

The present study has several limitations that should be considered when interpreting the results. The additional diagnostic experiments improve transparency by summarizing the reported metrics from several angles, but they are still aggregate-metric analyses and do not replace sample-level significance testing, uncertainty verification, or validation on new datasets. First, the evaluation is conducted only on HKO-7, which is a single geographic radar domain. Additional datasets, such as MRMS or European radar composites, are needed to assess cross-region, cross-climate, and cross-radar generalization. Second, the current evidence does not include an end-to-end diffusion baseline that directly maps historical radar sequences to future frames. Therefore, the reported results support the usefulness of the proposed cascade under the tested protocol but do not prove that cascading is intrinsically superior to a unified generative forecaster. Third, although diffusion models are probabilistic by design, the present evaluation reports deterministic outputs and does not include ensemble spread, prediction intervals, uncertainty maps, CRPS, rank histograms, or reliability diagrams [42,43,44]. Fourth, the generative model comparison is mainly qualitative; future work should include distributional, perceptual, texture, and spectral metrics. Fifth, although the additional diagnostic experiments summarize relative gains, metric consistency, and threshold sensitivity, the high-threshold evaluation would still be more statistically rigorous if the number of positive samples and significance tests were reported for each dBZ threshold. Finally, diffusion inference can be computationally expensive because it requires iterative denoising. Although the architecture diagram specifies the U-Net channel layout, the 16 × 16 ViT patching strategy, and the T = 1000 diffusion schedule, a fully reproducible operational report should also include optimizer settings, learning rate schedule, batch size, hardware, latency, parameter count, GPU memory, and training cost. Future operational implementations should additionally consider acceleration strategies such as DDIM sampling [45], reduced-step diffusion, or consistency distillation [46].
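As an illustration of the probabilistic verification mentioned above, the sketch below estimates CRPS from a finite ensemble using the kernel form CRPS = E|X − y| − ½ E|X − X′|, which is discussed in the scoring-rule literature cited as [42]. It is a generic example and not part of the reported evaluation: the ensemble would have to come from repeated diffusion sampling, which this study does not report, and the names `ensemble_crps` and `mean_crps` are hypothetical.

```python
def ensemble_crps(members, observation):
    """CRPS of an ensemble at one grid point, using the empirical kernel form.

    members: list of forecast values for one pixel (one per ensemble member).
    observation: the verifying value for that pixel.
    """
    m = len(members)
    if m == 0:
        raise ValueError("ensemble must contain at least one member")
    # Mean absolute error of the members against the observation.
    error_term = sum(abs(x - observation) for x in members) / m
    # Half the mean absolute difference between all member pairs.
    spread_term = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return error_term - spread_term


def mean_crps(ensemble_fields, observed_field):
    """Average pixel-wise CRPS over a flattened field.

    ensemble_fields: list of flattened fields, one per ensemble member.
    observed_field: flattened observation of the same length.
    """
    n_pixels = len(observed_field)
    total = 0.0
    for i in range(n_pixels):
        total += ensemble_crps([field[i] for field in ensemble_fields], observed_field[i])
    return total / n_pixels
```

For a single-member ensemble the spread term vanishes and the score reduces to the absolute error, so a deterministic forecast can be scored on the same scale as an ensemble.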

5. Conclusions

This paper presents VSTPN, a cascaded framework for short-term precipitation forecasting that integrates spatiotemporal sequence modeling and ViT-conditioned diffusion refinement. The spatiotemporal module captures radar echo evolution from historical frames, while the ViT-Diffusion module uses global contextual tokens to guide U-Net-based denoising and spatial detail reconstruction. On the HKO-7 benchmark, VSTPN obtains an MSE of 256.5 and an SSIM of 0.844, outperforming the tested deterministic baselines. Threshold-based evaluation further shows improved CSI, HSS, and POD at dBZ ≥ 20, 30, and 40. The additional diagnostic experiments show that these gains are consistent across the reported metrics, that VSTPN has the highest CSI and POD retention ratios from dBZ ≥ 20 to dBZ ≥ 40, and that the strongest high-threshold result should be interpreted as improved detection with a slight FAR penalty. The component analysis indicates that temporal attention, jump connections, and ViT-Diffusion refinement each contribute to the final performance under the current experimental protocol.

The findings suggest that diffusion-based refinement can mitigate part of the smoothing problem observed in recurrent precipitation nowcasting models. At the same time, the conclusions should be interpreted within the scope of the reported experiments. The study does not yet provide multi-domain validation, probabilistic uncertainty scoring, statistical significance testing for rare high-threshold events, or a direct end-to-end diffusion baseline. Addressing these issues will be necessary for establishing the operational robustness and scientific generality of VSTPN in future work.

Author Contributions

Conceptualization, Z.D.; methodology, Z.D.; software, Z.Y.; validation, W.Y., H.Q. and K.Z.; formal analysis, Z.Y.; investigation, G.L.; resources, W.Y.; data curation, H.Q.; writing—original draft preparation, Z.Y.; writing—review and editing, W.Y.; visualization, H.Q.; supervision, Y.F.; project administration, K.Z.; funding acquisition, W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of China, via grant number 62473201, and the Basic Research Program of Jiangsu, via grant number BK20231142.

Data Availability Statement

The HKO-7 benchmark data can be requested from the Hong Kong Observatory according to its data-use policy. The processed experimental outputs and implementation details used in this study are available from the corresponding author upon reasonable request.

Acknowledgments

We gratefully acknowledge the financial support received from the National Natural Science Foundation of China and the Natural Science Foundation of Jiangsu Province. We would also like to thank Nanjing University of Information Science and Technology for their administrative and technical support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bauer, P.; Thorpe, A.; Brunet, G. The quiet revolution of numerical weather prediction. Nature 2015, 525, 47–55.
2. Epstein, E.S. Stochastic dynamic prediction. Tellus 1969, 21, 739–759.
3. Huang, X.Y.; Sundqvist, H. Initialization of cloud water content and cloud cover for numerical prediction models. Mon. Weather Rev. 1993, 121, 2719–2726.
4. Jiménez, P.A.; Dudhia, J.; Thompson, G.; Lee, J.A.; Brummet, T. Improving the cloud initialization in WRF-Solar with enhanced short-range forecasting functionality: The MAD-WRF model. Sol. Energy 2022, 239, 221–233.
5. Hamill, T.M. Hypothesis tests for evaluating numerical precipitation forecasts. Weather Forecast. 1999, 14, 155–167.
6. Surussavadee, C.; Staelin, D.H. Global millimeter-wave precipitation retrievals trained with a cloud-resolving numerical weather prediction model, Part I: Retrieval design. IEEE Trans. Geosci. Remote Sens. 2008, 46, 99–108.
7. Powers, J.G.; Werner, K.K.; Gill, D.O.; Lin, Y.L.; Schumacher, R.S. Cloud computing efforts for the Weather Research and Forecasting model. Bull. Am. Meteorol. Soc. 2021, 102, E1261–E1274.
8. He, Z.; Ye, J.; Li, Z.; Lin, C.; Song, L. Impacts of radar data assimilation on the forecast of “12.8” extreme rainstorm in Central China (2021). Atmosphere 2023, 14, 1722.
9. Ligda, M.G.H. The Horizontal Motion of Small Precipitation Areas as Observed by Radar; Technical Report 21; Department of Meteorology, Massachusetts Institute of Technology: Cambridge, MA, USA, 1953.
10. Rinehart, R.E.; Garvey, E.T. Three-dimensional storm motion detection by conventional weather radar. Nature 1978, 273, 287–289.
11. Li, L.; Schmid, W.; Joss, J. Nowcasting of motion and growth of precipitation with radar over a complex orography. J. Appl. Meteorol. 1995, 34, 1286–1300.
12. Bowler, N.E.; Pierce, C.E.; Seed, A.W. STEPS: A probabilistic precipitation forecasting scheme which merges an extrapolation nowcast with downscaled NWP. Q. J. R. Meteorol. Soc. 2006, 132, 2127–2155.
13. Crane, R.K. Automatic cell detection and tracking. IEEE Trans. Geosci. Electron. 1979, 17, 250–262.
14. Horn, B.K.P.; Schunck, B.G. Determining optical flow. Artif. Intell. 1981, 17, 185–203.
15. Bowler, N.E.H.; Pierce, C.E.; Seed, A.W. Development of a precipitation nowcasting algorithm based upon optical flow techniques. J. Hydrol. 2004, 288, 74–91.
16. Ayzel, G.; Heistermann, M.; Winterrath, T. Optical flow models as an open benchmark for radar-based precipitation nowcasting (rainymotion v0.1). Geosci. Model Dev. 2019, 12, 1387–1402.
17. Zhu, J.; Dai, J. A rain-type adaptive optical flow method and its application in tropical cyclone rainfall nowcasting. Front. Earth Sci. 2022, 16, 248–264.
18. Zinner, T.; Mannstein, H.; Tafferner, A. Cb-TRAM: Tracking and monitoring severe convection from onset over rapid development to mature phase using multi-channel Meteosat-8 SEVIRI data. Meteorol. Atmos. Phys. 2008, 101, 191–210.
19. French, M.N.; Krajewski, W.F.; Cuykendall, R.R. Rainfall forecasting in space and time using a neural network. J. Hydrol. 1992, 137, 1–31.
20. Hsu, K.l.; Gupta, H.V.; Sorooshian, S. Artificial neural network modeling of the rainfall-runoff process. Water Resour. Res. 1995, 31, 2517–2530.
21. Sajikumar, N.; Thandaveswara, B.S. A non-linear rainfall-runoff model using an artificial neural network. J. Hydrol. 1999, 216, 32–55.
22. Maier, H.R.; Dandy, G.C. Neural networks for the prediction and forecasting of water resources variables: A review of modelling issues and applications. Environ. Model. Softw. 2000, 15, 101–124.
23. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2015; Volume 28, pp. 802–810.
24. Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Deep learning for precipitation nowcasting: A benchmark and a new model. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2017; Volume 30, pp. 5622–5632.
25. Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2017; Volume 30, pp. 879–888.
26. Wang, Y.; Wu, H.; Zhang, J.; Gao, Z.; Wang, J.; Yu, P.S.; Long, M. PredRNN: A recurrent neural network for spatiotemporal predictive learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2208–2225.
27. Chen, H.; Zhong, X.; Zhai, Q.; Li, X.; Chan, Y.W.; Chan, P.W.; Yang, M.; Huang, Y.; Li, H.; Shi, X. Skillful short-term forecasting of clouds with a cascade diffusion model. J. Geophys. Res. Mach. Learn. Comput. 2026, 3, e2025JH000976.
28. Yu, W.; Fu, D.; Zhang, C.; Chen, Y.; Liu, A.X.; An, J. Enhanced precipitation nowcasting via temporal correlation attention mechanism and innovative jump connection strategy. Remote Sens. 2024, 16, 3757.
29. Ravuri, S.; Lenc, K.; Willson, M.; Kangin, D.; Lam, R.; Mirowski, P.; Fitzsimons, M.; Athanassiadou, M.; Kashem, S.; Madge, S.; et al. Skilful precipitation nowcasting using deep generative models of radar. Nature 2021, 597, 672–677.
30. Zhang, Y.; Long, M.; Chen, K.; Xing, L.; Jin, R.; Jordan, M.I.; Wang, J. Skilful nowcasting of extreme precipitation with NowcastNet. Nature 2023, 619, 526–532.
31. Gao, Z.; Shi, X.; Wang, H.; Zhu, Y.; Wang, Y.; Li, M.; Yeung, D.Y. Earthformer: Exploring space-time transformers for earth system forecasting. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2022; Volume 35, pp. 25390–25403.
32. Gao, Z.; Shi, X.; Han, B.; Wang, H.; Jin, X.; Maddix, D.; Zhu, Y.; Li, M.; Wang, Y. PreDiff: Precipitation nowcasting with latent diffusion models. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2023; Volume 36.
33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2017; Volume 30, pp. 6000–6010.
34. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
35. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
36. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2020; Volume 33, pp. 6840–6851.
37. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2014; Volume 27, pp. 2672–2680.
38. Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014.
39. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS: San Diego, CA, USA, 2017; Volume 30, pp. 6626–6637.
40. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 586–595.
41. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
42. Gneiting, T.; Raftery, A.E. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 2007, 102, 359–378.
43. Murphy, A.H.; Winkler, R.L. Diagnostic verification of probability forecasts. Int. J. Forecast. 1992, 7, 435–455.
44. Hamill, T.M. Interpretation of rank histograms for verifying ensemble forecasts. Mon. Weather Rev. 2001, 129, 550–560.
45. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
46. Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. Consistency models. In Proceedings of the 40th International Conference on Machine Learning; Proceedings of Machine Learning Research: Cambridge, MA, USA, 2023; Volume 202, pp. 32211–32252.

Figure 1. U-Net backbone used in ViT-Diffusion. The encoder extracts multiscale radar echo features, and the decoder reconstructs refined echo fields through skip connections.

Figure 5. Overall VSTPN architecture. Module A predicts the coarse future radar sequence; Module B refines the predicted frames with ViT-conditioned diffusion denoising.

Figure 6. Qualitative comparison of three generative models for radar echo image generation: (a) GAN, (b) VAE, and (c) ViT-Diffusion. The figure illustrates visual differences in boundary sharpness, noise suppression, and structural continuity.

Figure 7. Performance trends over prediction lead time on the HKO-7 dataset. The curves compare error and structural similarity changes among the evaluated models.

Figure 8. Illustrative HKO-7 prediction case 1. The figure compares the input sequence, ground truth, and predicted future radar echoes generated by the evaluated models.

Figure 9. Illustrative HKO-7 prediction case 2. The figure compares the temporal evolution and spatial continuity of predicted radar echoes across the evaluated models.

Figure 10. Relative improvement of VSTPN over ETCJ-PredNet computed from Table 1, Table 2, Table 3 and Table 4. Positive values indicate better VSTPN performance; FAR and MSE are converted so that positive values still mean improvement.

Figure 11. POD–FAR threshold sensitivity diagnostic for the strongest baselines and VSTPN. Curves connect dBZ thresholds of 20, 30, and 40; the upper-left region is preferable.

Figure 12. Component effect-size diagnostic at dBZ ≥ 20. Bars show absolute metric changes when each component is added under the reported component-analysis protocol; FAR is shown as FAR reduction.

Table 1. Performance comparison on the HKO-7 test set. The symbols ↑ and ↓ indicate that higher and lower values are preferred, respectively.

Model	MSE ↓	SSIM ↑
ConvLSTM [23]	592.1	0.669
TrajGRU [24]	442.6	0.691
PredRNN [25]	366.5	0.721
PredRNN-v2 [26]	308.6	0.811
SATcast [27]	309.4	0.824
ETCJ-PredNet [28]	280.3	0.833
VSTPN (ours)	256.5	0.844

Table 2. Comparison of CSI, HSS, POD, and FAR across seven networks (dBZ ≥ 20). The symbols ↑ and ↓ indicate that higher and lower values are preferred for the respective metrics.

Model	CSI ↑	HSS ↑	POD ↑	FAR ↓
ConvLSTM [23]	0.594	0.534	0.785	0.290
TrajGRU [24]	0.603	0.547	0.790	0.281
PredRNN [25]	0.651	0.608	0.830	0.249
PredRNN-v2 [26]	0.691	0.658	0.850	0.213
SATcast [27]	0.712	0.663	0.859	0.201
ETCJ-PredNet [28]	0.725	0.699	0.870	0.187
VSTPN (ours)	0.731	0.722	0.875	0.185

Table 3. Comparison of CSI, HSS, POD, and FAR across seven networks (dBZ ≥ 30). The symbols ↑ and ↓ indicate that higher and lower values are preferred for the respective metrics.

Model	CSI ↑	HSS ↑	POD ↑	FAR ↓
ConvLSTM [23]	0.431	0.321	0.601	0.396
TrajGRU [24]	0.484	0.394	0.648	0.344
PredRNN [25]	0.504	0.416	0.672	0.331
PredRNN-v2 [26]	0.543	0.464	0.695	0.287
SATcast [27]	0.551	0.473	0.704	0.268
ETCJ-PredNet [28]	0.577	0.506	0.717	0.252
VSTPN (ours)	0.591	0.521	0.722	0.247

Table 4. Comparison of CSI, HSS, POD, and FAR across seven networks (dBZ ≥ 40). The symbols ↑ and ↓ indicate that higher and lower values are preferred for the respective metrics.

Model	CSI ↑	HSS ↑	POD ↑	FAR ↓
ConvLSTM [23]	0.211	0.255	0.373	0.672
TrajGRU [24]	0.234	0.264	0.394	0.634
PredRNN [25]	0.248	0.279	0.416	0.618
PredRNN-v2 [26]	0.271	0.296	0.441	0.587
SATcast [27]	0.295	0.328	0.464	0.548
ETCJ-PredNet [28]	0.331	0.353	0.514	0.518
VSTPN (ours)	0.342	0.361	0.521	0.523

Table 5. Component analysis of the temporal attention module (dBZ ≥ 20). The symbols ↑ and ↓ denote the direction of improvement for each metric.

Model	CSI ↑	HSS ↑	POD ↑	FAR ↓
PredRNN-v2 [26]	0.691	0.658	0.850	0.213
w/o ST, w JC, w ViTDiff	0.701	0.682	0.871	0.203
w ST, w JC, w ViTDiff	0.717	0.692	0.882	0.191

Table 6. Component analysis of the jump connection module (dBZ ≥ 20). The symbols ↑ and ↓ denote the direction of improvement for each metric.

Model	CSI ↑	HSS ↑	POD ↑	FAR ↓
PredRNN-v2 [26]	0.691	0.658	0.850	0.213
w ST, w/o JC, w ViTDiff	0.698	0.677	0.864	0.211
w ST, w JC, w ViTDiff	0.714	0.685	0.876	0.197

Table 7. Component analysis of the ViTDiff module (dBZ ≥ 20). The symbols ↑ and ↓ indicate that higher and lower values are preferred, respectively.

Model	CSI ↑	HSS ↑	POD ↑	FAR ↓
PredRNN-v2 [26]	0.691	0.658	0.850	0.213
w ST, w JC, w/o ViTDiff	0.712	0.688	0.875	0.201
w ST, w JC, w ViTDiff	0.718	0.697	0.887	0.197

Table 8. Cross-metric diagnostic summary computed from Table 1, Table 2, Table 3 and Table 4. The composite skill is a min–max normalized aggregate and is used only as a consistency diagnostic. The symbols ↑ and ↓ indicate that higher and lower values are preferred, respectively.

Model	Composite Skill ↑	Avg. Rank ↓	CSI Retention ↑	POD Retention ↑
ETCJ-PredNet [28]	0.942	1.93	0.457	0.591
SATcast [27]	0.780	3.07	0.414	0.540
PredRNN-v2 [26]	0.661	3.93	0.392	0.519
PredRNN [25]	0.412	5.00	0.381	0.501
TrajGRU [24]	0.209	6.00	0.388	0.499
ConvLSTM [23]	0.000	7.00	0.355	0.475
VSTPN (ours)	0.998	1.07	0.468	0.595

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
