1. Introduction
Short-term precipitation forecasting, also referred to as precipitation nowcasting, aims to predict the location, extent, and intensity of rainfall within the next few hours. Accurate nowcasting is important for flood warning, urban drainage management, aviation safety, traffic control, and agricultural production. The task is difficult because radar echoes are governed by multiscale atmospheric processes, including advection, growth and decay of convective cells, orographic effects, and rapid intensity changes. These processes make future radar fields highly nonlinear, spatially heterogeneous, and uncertain.
Traditional precipitation forecasting methods can be broadly grouped into statistical modeling, numerical weather prediction, and radar echo extrapolation. Statistical methods use historical meteorological variables, such as temperature, humidity, pressure, wind direction, and wind speed, to build regression or probabilistic relationships. These methods are relatively interpretable and computationally efficient, but their extrapolation ability is limited when the atmospheric state deviates from the historical distribution. Numerical weather prediction (NWP) solves dynamical and thermodynamical equations under physical constraints and remains a central technology in modern meteorology [1]. Early ensemble forecasting laid an important foundation for probabilistic numerical prediction [2], while subsequent studies improved cloud initialization and precipitation verification methods [3,4,5]. Recent NWP-related work has also explored satellite retrieval, cloud deployment of the Weather Research and Forecasting model, and radar data assimilation for heavy-rainfall events [6,7,8]. Nevertheless, for very short lead times, NWP may suffer from spin-up time, coarse resolution, and data assimilation limitations, especially in rapidly developing convective events.
Radar echo extrapolation provides a fast and operationally attractive alternative because radar observations offer high-frequency measurements of precipitation structures. Early work on radar-based echo motion estimation dates back to Ligda’s radar tracking study [9]. Cross-correlation methods for tracking radar echoes were later developed to estimate echo displacement [10], and COTREC-type approaches introduced continuity constraints to correct noisy or inconsistent motion vectors [11]. Probabilistic extrapolation systems such as STEPS combine radar extrapolation with downscaled NWP to produce ensemble precipitation forecasts [12]. Object-based cell tracking [13], optical-flow-based nowcasting [14,15,16,17], and satellite-based convection tracking methods such as Cb-TRAM [18] further improved motion estimation. However, extrapolation methods are still limited when precipitation systems grow, decay, split, or merge because the future evolution cannot be explained by advection alone.
Deep learning has been increasingly used to learn nonlinear radar echo evolution directly from data. Early neural network studies in hydrology and meteorology demonstrated the potential of data-driven rainfall and runoff prediction [19,20,21,22]. In precipitation nowcasting, ConvLSTM introduced convolutional recurrent units for spatiotemporal sequence prediction [23]; TrajGRU improved recurrent connections by learning location-variant motion structures [24]; PredRNN and PredRNN-v2 strengthened memory flow for long-term spatiotemporal dependency modeling [25,26]; and recent radar or satellite nowcasting models such as SATcast and ETCJ-PredNet further exploit temporal attention, multi-source information, and jump connections [27,28]. These methods provide strong deterministic baselines, but they often produce overly smooth predictions at longer lead times because pixelwise regression losses average over multiple plausible futures.
Recent transformer and generative approaches have attempted to address these limitations from different perspectives. DGMR formulates radar nowcasting as probabilistic generation and emphasizes ensemble forecast quality [29]. NowcastNet combines physical evolution and conditional generative learning for extreme precipitation nowcasting [30]. EarthFormer explores space–time Transformer blocks for Earth system forecasting [31], whereas PreDiff uses latent diffusion models and knowledge alignment for probabilistic precipitation nowcasting [32]. These studies indicate that global context modeling and generative refinement are important for preserving multiscale precipitation structures. They also show that deterministic pixelwise metrics alone are insufficient for fully evaluating generative nowcasting systems.
Motivated by these developments, this paper proposes a ViT-modulated diffusion spatiotemporal prediction network (VSTPN). The model first uses a spatiotemporal prediction module to estimate the coarse future radar sequence and then applies a ViT-conditioned diffusion module to refine spatial details under global contextual guidance. The intended contribution is not to claim that a cascaded architecture is universally superior to all end-to-end generative nowcasting models, but to evaluate whether a diffusion-based refinement stage can improve radar echo clarity and deterministic nowcasting scores when coupled with a recurrent spatiotemporal predictor. The main contributions are summarized as follows:
A cascaded VSTPN framework is formulated for short-term radar echo prediction, combining recurrent spatiotemporal modeling with ViT-conditioned diffusion refinement.
The ViT-Diffusion module is described as a conditional denoising network in which ViT-derived global contextual tokens guide U-Net feature reconstruction through attention-based modulation.
Experiments on the HKO-7 benchmark compare VSTPN with representative recurrent and attention-based baselines under deterministic and threshold-based metrics, while the limitations of single-domain evaluation, deterministic scoring, and unreported probabilistic uncertainty are explicitly discussed.
Additional diagnostic experiments summarize relative gains over the strongest baseline, threshold-wise POD–FAR behavior, cross-metric robustness, and component effect sizes using the reported test-set metrics.
3. Performance Analysis and Evaluation
3.1. Qualitative Comparison of Generative Models
To examine the radar image generation behavior of the proposed ViT-Diffusion module, GAN [37], VAE [38], and ViT-Diffusion outputs are compared visually. This comparison is intended as an illustrative qualitative analysis rather than a complete distribution-level evaluation of generative fidelity. Standard generative metrics such as FID [39], LPIPS [40], or spectral/texture–distance measures are not reported in the present study and are therefore not used to support claims about distributional superiority.
As shown in Figure 6, all three generative models gradually learn the main radar echo patterns during training. The GAN results tend to show stronger local contrast but may contain unstable artifacts, whereas the VAE results are smoother because of latent mean approximation. The ViT-Diffusion outputs preserve clearer echo boundaries in the displayed examples and show fewer visually apparent artifacts. Because this evidence is qualitative, the conclusion is phrased conservatively: the ViT-Diffusion module appears visually suitable for radar echo refinement in the shown cases, but additional distributional and texture-based metrics are required before making a general claim about generative image quality superiority.
3.2. Quantitative Analysis
The quantitative evaluation uses six metrics: mean squared error (MSE), structural similarity index (SSIM), critical success index (CSI), Heidke skill score (HSS), probability of detection (POD), and false alarm ratio (FAR). MSE and SSIM are deterministic pixel-wise or structural metrics; SSIM follows the structural similarity image quality formulation of Wang et al. [41]. These metrics are useful for comparison with previous HKO-7 studies but may penalize physically plausible forecasts that are spatially displaced. CSI, HSS, POD, and FAR evaluate thresholded precipitation events and are therefore complementary to MSE and SSIM. Since the current evaluation reports one deterministic forecast per input sequence, probability-oriented metrics such as CRPS and other proper scoring rules [42], diagnostic reliability analysis [43], rank histograms [44], and ensemble spread–skill analysis are not included.
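As an illustration of how the four threshold-based scores relate to one another, the following minimal sketch computes them from a 2 × 2 contingency table. It uses the textbook verification definitions; the exact variant used in this study (for example, any pixel weighting or a differently scaled HSS in the benchmark code) is not stated here, so the function names and the toy reflectivity values below are assumptions for illustration only.

```python
def contingency_counts(forecast, observed, threshold):
    """Count hits, misses, false alarms, and correct negatives at a dBZ threshold."""
    hits = misses = false_alarms = correct_negatives = 0
    for f, o in zip(forecast, observed):
        f_event, o_event = f >= threshold, o >= threshold
        if f_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif f_event:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def _ratio(numerator, denominator):
    """Return NaN instead of raising when a score is undefined (no events)."""
    return numerator / denominator if denominator else float("nan")


def threshold_scores(hits, misses, false_alarms, correct_negatives):
    """Textbook POD, FAR, CSI, and HSS from contingency counts."""
    a, b, c, d = hits, false_alarms, misses, correct_negatives
    return {
        "POD": _ratio(a, a + c),          # fraction of observed events detected
        "FAR": _ratio(b, a + b),          # fraction of forecast events that were wrong
        "CSI": _ratio(a, a + b + c),      # hits over all forecast or observed events
        "HSS": _ratio(2 * (a * d - b * c),
                      (a + c) * (c + d) + (a + b) * (b + d)),
    }


# Toy reflectivity values in dBZ (hypothetical, not taken from HKO-7).
forecast = [25, 35, 10, 45, 5, 22]
observed = [30, 15, 12, 41, 21, 8]

counts = contingency_counts(forecast, observed, threshold=20)
scores = threshold_scores(*counts)
print(counts, scores)
```

Because POD ignores false alarms and FAR ignores misses, the two are usually read together, which is why CSI and HSS are reported alongside them.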
All tested models are trained and evaluated under the same HKO-7 experimental setting used in this study [24]. HKO-7 is a single-region radar benchmark centered on Hong Kong; therefore, the results primarily demonstrate performance within this domain and should not be interpreted as proof of cross-climate or cross-radar generalization. The baseline set includes ConvLSTM, TrajGRU, PredRNN, PredRNN-v2, SATcast, and ETCJ-PredNet, following the original model definitions and reported settings where applicable [23,24,25,26,27,28]. In the comparison and component analysis tables, superscript author–year citations in the model column indicate the original method papers.
Table 1 summarizes MSE and SSIM for these baselines and VSTPN.
The results in Table 1 show that VSTPN obtains the lowest MSE and highest SSIM among the tested models. Compared with ConvLSTM and TrajGRU, VSTPN substantially reduces pixel-wise error and improves structural similarity. Compared with PredRNN-v2, VSTPN reduces MSE by 52.1 and increases SSIM by 0.033. Compared with ETCJ-PredNet, the strongest baseline in this table, VSTPN reduces MSE by 23.8 and increases SSIM by 0.011. These results support the usefulness of diffusion-based refinement under deterministic HKO-7 metrics, although they do not by themselves establish probabilistic forecast superiority.
Figure 7 reports the variation in MSE and SSIM over prediction lead time. The VSTPN curve remains competitive over the evaluated horizon, suggesting that the diffusion refinement stage helps preserve structural information when recurrent prediction errors accumulate. Both panels use the same model ordering, so the legend in the MSE panel also applies to the SSIM panel.
Table 2, Table 3 and Table 4 show the threshold-based evaluation at dBZ ≥ 20, 30, and 40. As the threshold increases, positive samples become sparser, and the task becomes more sensitive to small spatial and intensity errors. Therefore, differences at dBZ ≥ 40 should be interpreted with caution unless the number of positive samples and statistical confidence intervals are also reported.
At dBZ ≥ 20, VSTPN achieves the best performance among the seven evaluated models across all four threshold-based metrics. Compared with ETCJ-PredNet, VSTPN improves CSI from 0.725 to 0.731, HSS from 0.699 to 0.722, and POD from 0.870 to 0.875, while slightly reducing FAR from 0.187 to 0.185. These results indicate that the proposed model improves both precipitation detection accuracy and false alarm control under the relatively low reflectivity threshold.
At dBZ ≥ 30, VSTPN also outperforms ETCJ-PredNet on all four threshold-based metrics. Specifically, CSI increases from 0.577 to 0.591, HSS from 0.506 to 0.521, and POD from 0.717 to 0.722, while FAR decreases from 0.252 to 0.247. This suggests that VSTPN maintains better detection capability and lower false alarm tendency under a moderately stronger precipitation threshold.
At dBZ ≥ 40, VSTPN obtains higher CSI, HSS, and POD than ETCJ-PredNet, with CSI increasing from 0.331 to 0.342, HSS from 0.353 to 0.361, and POD from 0.514 to 0.521. However, its FAR is slightly higher than that of ETCJ-PredNet, increasing from 0.518 to 0.523. This result indicates that the proposed model detects more intense echoes but also introduces a small increase in false alarms at the highest threshold. Therefore, the performance at dBZ ≥ 40 should be interpreted as a detection–false alarm trade-off rather than an unqualified improvement in all extreme precipitation metrics.
3.3. Qualitative Nowcasting Analysis
Figure 8 and Figure 9 compare the predicted radar sequences produced by the evaluated models. The input frames provide the historical context, the ground-truth frames show the observed future sequence, and the white regions correspond to stronger radar echoes.
The displayed cases suggest that ConvLSTM has difficulty preserving both echo appearance and motion trajectory over longer lead times. Several recurrent baselines retain the approximate echo envelope in early frames but gradually lose fine structures. VSTPN preserves the main echo contour and produces visually sharper structures in the shown examples, especially in high-reflectivity regions. These qualitative results are consistent with the improved SSIM and threshold-based detection scores. However, the examples should be regarded as case studies rather than proof of general visual superiority. Randomly sampled cases, representative failure cases, and quantitative texture or spectral metrics would further strengthen the evaluation.
3.4. Component Analysis
To examine the contribution of individual components, this section reports component-level experiments for the temporal attention mechanism (ST), jump connection strategy (JC), and ViT-Diffusion module (ViTDiff). The tables should be interpreted as incremental component analyses derived from independently trained configurations, not as a strict leave-one-out ablation starting from the final VSTPN checkpoint. This distinction is important because the full-configuration CSI values in Table 5, Table 6 and Table 7 differ slightly from the final VSTPN score in Table 2; the differences arise from separate training runs and component-specific settings.
Table 5 reports the component analysis for the temporal attention module at dBZ ≥ 20. Adding temporal attention improves CSI from 0.701 to 0.717 when JC and ViTDiff are retained, suggesting that temporal attention helps capture precipitation evolution.
Table 6 reports the component analysis for jump connections. With ST and ViTDiff retained, adding JC improves CSI from 0.698 to 0.714 and reduces FAR from 0.211 to 0.197, indicating that cross-layer feature propagation helps preserve spatial details.
Table 7 reports the component analysis for ViTDiff. With ST and JC retained, adding ViTDiff improves CSI from 0.712 to 0.718 and increases POD from 0.875 to 0.887, suggesting that diffusion refinement helps recover detectable echo structures. The gain is smaller than that of the recurrent components, so the result should be described as complementary rather than dominant.
Overall, the component analysis indicates that temporal attention, jump connections, and ViTDiff each contribute to the tested deterministic and threshold-based scores. However, because the current design is incremental rather than a strict leave-one-out ablation of the final VSTPN checkpoint, the results should be interpreted as supportive evidence rather than complete causal attribution.
3.5. Additional Diagnostic Experiments
To further strengthen the quantitative evidence, we added four diagnostic experiments based on the numerical outputs already reported in Tables 1–7. These diagnostics do not use additional training data or unreported model predictions; instead, they re-analyze the existing test-set metrics from complementary perspectives. This design makes the reported improvements easier to audit because MSE/SSIM, CSI/HSS, POD, and FAR represent different operational requirements in precipitation nowcasting.
First, we computed the relative improvement of VSTPN over ETCJ-PredNet, which is the strongest baseline in the main comparison tables. For a metric $k$, the relative gain is defined as

$$G_k = s_k \, \frac{M_k^{\mathrm{VSTPN}} - M_k^{\mathrm{ETCJ}}}{M_k^{\mathrm{ETCJ}}} \times 100\%,$$

where $M_k^{\mathrm{VSTPN}}$ denotes the VSTPN value and $M_k^{\mathrm{ETCJ}}$ denotes the ETCJ-PredNet value, with $s_k = -1$ for the lower-is-better metrics (MSE and FAR) and $s_k = +1$ otherwise, so that a positive gain always indicates an improvement. As shown in Figure 10, VSTPN reduces MSE by 8.5% and improves SSIM by 1.3% relative to ETCJ-PredNet. For event-based metrics, the largest relative gains appear in HSS at dBZ ≥ 20 and CSI at dBZ ≥ 40. The negative value of FAR at dBZ ≥ 40 is retained in the figure, confirming that the high-threshold result should be interpreted as a detection–false alarm trade-off rather than an unconditional improvement.
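The relative gains quoted above can be checked directly from the values reported in the text. The sketch below assumes that the gain is the percentage change relative to ETCJ-PredNet, with the sign flipped for lower-is-better metrics (MSE and FAR) so that positive always means better; the function name `relative_gain` is illustrative. The ETCJ-PredNet MSE and SSIM are derived from the reported VSTPN values and the reported differences.

```python
def relative_gain(model_value, baseline_value, higher_is_better=True):
    """Percentage change relative to the baseline; positive means the model is better."""
    change = (model_value - baseline_value) / baseline_value * 100.0
    return change if higher_is_better else -change


# (VSTPN value, ETCJ-PredNet value, higher_is_better), all taken from the text.
reported = {
    "MSE":    (256.5, 256.5 + 23.8, False),   # baseline derived from the reported difference
    "SSIM":   (0.844, 0.844 - 0.011, True),   # baseline derived from the reported difference
    "HSS@20": (0.722, 0.699, True),
    "CSI@40": (0.342, 0.331, True),
    "FAR@40": (0.523, 0.518, False),
}

gains = {name: relative_gain(v, b, hib) for name, (v, b, hib) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.2f}%")
```

Under this convention the FAR entry at dBZ ≥ 40 comes out negative, matching the trade-off described in the text.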
Second, we computed a cross-metric robustness summary. Each metric column was first min–max normalized across models, with MSE and FAR reversed so that a larger normalized value always indicates better performance. The composite normalized skill is the average of these normalized scores over 14 reported metrics, including MSE, SSIM, and CSI/HSS/POD/FAR at the three reflectivity thresholds. We also computed the average rank over the same metrics and the retention ratios of CSI and POD from dBZ ≥ 20 to dBZ ≥ 40. As shown in Table 8, VSTPN has the highest composite skill, the best average rank, and the highest CSI/POD retention ratios. These results indicate that the reported advantage is not driven by a single metric column and does not disappear when the verification threshold increases.
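A minimal sketch of the normalization and retention computations is given below. The three-model table uses made-up placeholder values (models A, B, C) purely to show the mechanics and is not data from this study. The retention ratio is assumed here to be the score at dBZ ≥ 40 divided by the score at dBZ ≥ 20; only the retention example uses values reported in the text.

```python
LOWER_IS_BETTER = {"MSE", "FAR"}


def min_max_normalize(values, reverse=False):
    """Scale values to [0, 1]; reverse so that 1 is best for lower-is-better metrics."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # assumed convention when all models tie
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if reverse else scaled


def composite_skill(table):
    """table maps metric name -> {model: value}; returns the mean normalized score per model."""
    models = list(next(iter(table.values())))
    totals = {m: 0.0 for m in models}
    for metric, column in table.items():
        reverse = metric.split("@")[0] in LOWER_IS_BETTER
        normalized = min_max_normalize([column[m] for m in models], reverse)
        for model, score in zip(models, normalized):
            totals[model] += score
    return {m: totals[m] / len(table) for m in models}


def retention_ratio(score_low_threshold, score_high_threshold):
    """Assumed definition: share of the low-threshold score kept at the high threshold."""
    return score_high_threshold / score_low_threshold


# Hypothetical placeholder values, not results from the paper.
toy_table = {
    "MSE":    {"A": 300.0, "B": 280.0, "C": 260.0},
    "CSI@20": {"A": 0.70, "B": 0.72, "C": 0.73},
    "FAR@20": {"A": 0.20, "B": 0.19, "C": 0.21},
}
skill = composite_skill(toy_table)

# CSI retention from dBZ >= 20 to dBZ >= 40, using values reported in the text.
vstpn_csi_retention = retention_ratio(0.731, 0.342)
etcj_csi_retention = retention_ratio(0.725, 0.331)
print(skill, vstpn_csi_retention, etcj_csi_retention)
```

In the toy table, model C wins two of three columns but does not have the highest composite score because it is worst on FAR, which illustrates why a composite can disagree with a single-metric ranking.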
Third, we visualized the POD–FAR relationship across reflectivity thresholds. The diagnostic in Figure 11 shows the strongest baselines and VSTPN at dBZ thresholds of 20, 30, and 40. The upper-left region indicates higher detection probability and lower false alarm ratio. VSTPN remains near the leading edge of the trade-off curve at dBZ ≥ 20 and dBZ ≥ 30. At dBZ ≥ 40, VSTPN shifts slightly upward but also slightly rightward compared with ETCJ-PredNet, which means that it detects more intense echoes at the cost of a small FAR increase. This visualization gives a more transparent interpretation of the extreme precipitation metrics than a single table row.
Finally, we converted the component-analysis tables into absolute effect-size diagnostics. Figure 12 shows the changes in CSI, HSS, POD, and FAR reduction when temporal attention, jump connections, or ViT-Diffusion is added under the reported component-analysis protocol. Temporal attention and jump connections provide the largest CSI gains, whereas ViT-Diffusion contributes a smaller but consistent improvement in CSI, HSS, POD, and FAR. This supports the interpretation that the recurrent module and the diffusion refinement module play complementary roles: the recurrent components improve spatiotemporal event localization, while the diffusion module helps recover detectable echo structures.
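The absolute effect sizes can be recomputed from the component-analysis values quoted in Section 3.4. The sketch below covers only the before/after pairs stated in the text (the remaining table entries are not reproduced here) and assumes that the FAR effect is expressed as a reduction, so that positive numbers always indicate an improvement.

```python
# (without component, with component) pairs quoted in Section 3.4.
reported_pairs = {
    "ST":      {"CSI": (0.701, 0.717)},
    "JC":      {"CSI": (0.698, 0.714), "FAR": (0.211, 0.197)},
    "ViTDiff": {"CSI": (0.712, 0.718), "POD": (0.875, 0.887)},
}


def effect_size(metric, without, with_component):
    """Absolute change when the component is added; FAR is reported as a reduction."""
    delta = with_component - without
    return -delta if metric == "FAR" else delta


effects = {
    component: {metric: round(effect_size(metric, *pair), 3)
                for metric, pair in metrics.items()}
    for component, metrics in reported_pairs.items()
}
print(effects)
```

The CSI effect of ViTDiff is less than half that of ST or JC in these quoted pairs, which is consistent with describing its role as complementary.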
4. Limitations and Future Work
The present study has several limitations that should be considered when interpreting the results. The additional diagnostic experiments improve transparency by summarizing the reported metrics from several angles, but they are still aggregate-metric analyses and do not replace sample-level significance testing, uncertainty verification, or validation on new datasets. First, the evaluation is conducted only on HKO-7, which is a single geographic radar domain. Additional datasets, such as MRMS or European radar composites, are needed to assess cross-region, cross-climate, and cross-radar generalization. Second, the current evidence does not include an end-to-end diffusion baseline that directly maps historical radar sequences to future frames. Therefore, the reported results support the usefulness of the proposed cascade under the tested protocol but do not prove that cascading is intrinsically superior to a unified generative forecaster. Third, although diffusion models are probabilistic by design, the present evaluation reports deterministic outputs and does not include ensemble spread, prediction intervals, uncertainty maps, CRPS, rank histograms, or reliability diagrams [42,43,44]. Fourth, the generative model comparison is mainly qualitative; future work should include distributional, perceptual, texture, and spectral metrics. Fifth, although the additional diagnostic experiments summarize relative gains, metric consistency, and threshold sensitivity, the high-threshold evaluation would still be more statistically rigorous if the number of positive samples and significance tests were reported for each dBZ threshold. Finally, diffusion inference can be computationally expensive because it requires iterative denoising. Although the architecture diagram specifies the U-Net channel layout, the 16 × 16 ViT patching strategy, and the diffusion schedule, a fully reproducible operational report should also include optimizer settings, learning rate schedule, batch size, hardware, latency, parameter count, GPU memory, and training cost. Future operational implementations should additionally consider acceleration strategies such as DDIM sampling [45], reduced-step diffusion, or consistency distillation [46].
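To make the cost argument concrete, the following sketch shows the general idea behind reduced-step deterministic DDIM sampling: the sampler visits only a strided subset of the training timesteps and applies the deterministic update (η = 0) between them. This is a generic illustration on a single scalar value, not the sampler used in VSTPN; the training length of 1000 steps, the 50 sampling steps, and the function names are hypothetical.

```python
import math


def strided_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced, descending subset of the training timesteps."""
    stride = num_train_steps // num_sample_steps
    return list(range(num_train_steps - 1, -1, -stride))[:num_sample_steps]


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) for a single value.

    alpha_bar_* are cumulative products of (1 - beta) at the current and
    the next (less noisy) timestep; eps_pred is the predicted noise.
    """
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (lower) noise level of the next timestep.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred


# Hypothetical schedule sizes: 1000 training steps, 50 sampling steps.
steps = strided_timesteps(1000, 50)

# Sanity check with a perfect noise prediction: stepping to alpha_bar = 1
# (no noise) should recover the clean value exactly.
x0_true, eps_true, alpha_bar = 2.0, 0.5, 0.5
x_t = math.sqrt(alpha_bar) * x0_true + math.sqrt(1.0 - alpha_bar) * eps_true
recovered = ddim_step(x_t, eps_true, alpha_bar, 1.0)
print(len(steps), steps[:3], recovered)
```

With these hypothetical sizes, the denoising network is evaluated 50 times per forecast instead of 1000, which is where the latency saving would come from; any effect on forecast skill would need to be measured separately.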
5. Conclusions
This paper presents VSTPN, a cascaded framework for short-term precipitation forecasting that integrates spatiotemporal sequence modeling and ViT-conditioned diffusion refinement. The spatiotemporal module captures radar echo evolution from historical frames, while the ViT-Diffusion module uses global contextual tokens to guide U-Net-based denoising and spatial detail reconstruction. On the HKO-7 benchmark, VSTPN obtains an MSE of 256.5 and an SSIM of 0.844, outperforming the tested deterministic baselines. Threshold-based evaluation further shows improved CSI, HSS, and POD at dBZ ≥ 20, 30, and 40. The additional diagnostic experiments show that these gains are consistent across the reported metrics, that VSTPN retains the highest CSI and POD ratios from dBZ ≥ 20 to dBZ ≥ 40, and that the result at the highest threshold (dBZ ≥ 40) should be interpreted as improved detection with a slight FAR penalty. The component analysis indicates that temporal attention, jump connections, and ViTDiff refinement contribute to the final performance under the current experimental protocol.
The findings suggest that diffusion-based refinement can mitigate part of the smoothing problem observed in recurrent precipitation nowcasting models. At the same time, the conclusions should be interpreted within the scope of the reported experiments. The study does not yet provide multi-domain validation, probabilistic uncertainty scoring, statistical significance testing for rare high-threshold events, or a direct end-to-end diffusion baseline. Addressing these issues will be necessary for establishing the operational robustness and scientific generality of VSTPN in future work.