3.1. Representative Cases
Figure 4 presents a convective organization case observed over Jiangsu Province from 14:06 to 16:00 Beijing Time on 6 September 2025. The figure compares real observations (a1–a6) with outputs from three sets of controlled experimental models. For models capable of predicting multi-layer radar echoes, the maximum vertical column reflectivity is displayed. Radar images are shown at 30 min intervals for clarity. Observations reveal that a new convective storm initiated south of Taizhou between 14:00 and 14:30, which subsequently merged with pre-existing storms to the west and organized into a linear convective system along the coastal area. For the first set of controlled experiments, the PredRNNV2 experiment failed to capture this convective initiation. NowcastNet predicted the initiation only around 15:30 (c3), which lagged considerably behind observations. In contrast, DIFF-3DRformer detected the convective initiation accurately and in a timely manner and reasonably predicted its organization into a linear system, although the predicted system intensity at 16:00 (d4) was weaker than observed. This case demonstrates that the proposed DIFF-3DRformer outperforms both PredRNNV2 and NowcastNet in forecasting convective initiation. For the second set of controlled experiments, Zh1_DIFF-3DRformer, Zh10_DIFF-3DRformer, and Zh19_DIFF-3DRformer successfully captured the development of this convective initiation. However, Zh1_DIFF-3DRformer underestimated the linear convective system over coastal southeastern China at 16:00 (d4), whereas Zh10_DIFF-3DRformer substantially mitigated this underestimation (e4). Zh19_DIFF-3DRformer yielded intensity and precipitation-region forecasts closest to the observations (h4). These results indicate that increasing the number of radar echo input levels can effectively alleviate the intensity bias in forecasts of strong convective systems.
For the third set, the Zh19_Transformer scheme exhibited progressively degraded morphology and displacement of the strong convective system with lead time (f1–f4). Incorporating additional physics constraints in Zh19_3DRformer improved the representation of strong echo morphology and intensity in the first hour (g1–g2) but still underestimated peak reflectivity by the second hour (g3–g4). Further augmenting the model with a diffusion model in Zh19_DIFF-3DRformer maintained accurate intensity and structural depiction of the convective system at later lead times. Overall, physics-constrained training enhances early-period convective echo forecasts, but performance still degrades with lead time; the addition of a diffusion model effectively extends the useful forecast lead time for strong convection.
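The composite display described above for the multi-layer models (the maximum vertical column reflectivity) can be illustrated with a minimal NumPy sketch. The array layout `(levels, ny, nx)`, the use of NaN for missing gates, and the function name `composite_reflectivity` are assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np

def composite_reflectivity(volume_dbz: np.ndarray) -> np.ndarray:
    """Collapse a (levels, ny, nx) reflectivity volume to its column maximum.

    Missing gates are assumed to be NaN (an assumption of this sketch).
    Columns with no valid gate at any level stay NaN in the output.
    """
    # Replace NaN by -inf so that it never wins the maximum.
    filled = np.where(np.isnan(volume_dbz), -np.inf, volume_dbz)
    col_max = filled.max(axis=0)
    # Columns that were NaN at every level come back as -inf; restore NaN.
    return np.where(np.isneginf(col_max), np.nan, col_max)

# Synthetic 3-level volume on a 2 x 2 grid (values in dBZ, illustrative only).
volume = np.array([
    [[10.0, np.nan], [20.0, 5.0]],
    [[30.0, np.nan], [15.0, 45.0]],
    [[25.0, np.nan], [np.nan, 40.0]],
])
composite = composite_reflectivity(volume)
print(composite)
```

For single-level models no such reduction is needed, since the single input level is already a composite field.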
Figure 5 presents a case study of a large-scale squall line over Jiangsu, China, from 19:36 to 21:30 Beijing Time on 16 July 2025. The figure compares the ground-truth observations (a1–a6) with outputs from three sets of controlled experimental models. Observations reveal that the convective system propagated south-southeastward, progressively intensifying and becoming more organized, eventually affecting the Nanjing area. The Zh1_PredRNNV2 model exhibits limited skill in squall-line forecasting, with the system’s intensity consistently underestimated and the bias becoming more pronounced in the later forecast hours (b3–b4). Although Zh1_NowcastNet captures the squall line’s location and morphology at 21:30 (c4), its predicted intensity is markedly weaker than that of the observations. Zh1_DIFF-3DRformer better predicts the squall line’s location in the later stages, yet the intensity remains underestimated. Overall, DIFF-3DRformer achieves the best performance among the three models for this case. For the second set of experiments, the Zh1_DIFF-3DRformer experiment, which utilized only a single echo level, failed to capture the intensity and spatial extent of the strong convection, producing a disorganized and significantly weakened squall line structure. In contrast, the Zh10_DIFF-3DRformer experiment, incorporating 10 echo levels, effectively captured the overall location and organization of the convective system, though it still underestimated the intensity and poorly delineated storm boundaries. The Zh19_DIFF-3DRformer experiment, using 19 input levels, successfully reproduced the linear convective organization process. By 21:30 (h4), it accurately predicted the intensity and position of the line-shaped convection north of Nanjing, demonstrating improved forecasting skill regarding the squall line’s propagation, areal expansion, and echo structure.
These results underscore that increasing the number of radar echo layers as input markedly enhances the prediction of morphological and intensity characteristics of severe convective echoes. For the third set, the squall-line position predicted by Zh19_Transformer is inaccurate at 21:30 (f4); Zh19_3DRformer improves the positional forecast, but its intensity remains biased low (g4). The Zh19_DIFF-3DRformer configuration yields squall-line location and intensity closest to the observations. These results indicate that incorporating physical constraints enhances the prediction of squall-line location, and further augmenting the model with a diffusion component improves the representation of intensity.
Figure 6 presents the observations (a1–a6) and the predictions from three sets of controlled experimental models during a spiral rainband event associated with a typhoon over Jiangsu Province from 16:36 to 18:30 BT on 1 August 2025. Observations (a4–a6) reveal a localized enhancement of strong convection in the Suzhou area starting at 17:30 BT. Zh1_PredRNNV2 and Zh1_NowcastNet exhibit limited skill in predicting the morphology of typhoon spiral rainbands, while Zh1_DIFF-3DRformer better reproduces the spiral-band structure but shows notable discrepancies in the placement of intense radar reflectivity cores relative to observations. These results indicate that, among the three models, Zh1_DIFF-3DRformer more accurately captures the structural characteristics of typhoon spiral rainbands. Zh1_DIFF-3DRformer, Zh10_DIFF-3DRformer, and Zh19_DIFF-3DRformer all reproduce the morphological characteristics of spiral rainbands reasonably well. Among them, the positions of convective storms predicted by Zh19_DIFF-3DRformer are closest to the observations, followed by those of Zh10_DIFF-3DRformer. These results indicate that increasing the number of radar echo input levels can improve the nowcasting accuracy of convective storm position, and that the improvement grows with the number of input levels. The Zh19_Transformer experiment captures the overall structure of the spiral rainband reasonably well at 17:00 (f1) and 17:30 (f2). However, as the forecast lead time increases, its ability to reproduce the rainband morphology deteriorates, and it fails to predict the short-term intense convection over Suzhou. In contrast, Zh19_3DRformer, which incorporates physical constraints into the Zh19_Transformer framework, shows a marked improvement in predicting the morphology of the spiral rainbands, although the convective intensity is notably underestimated compared to observations.
This suggests that introducing physically constrained neural operators enhances the model’s capability in capturing the structure of strong convective echoes. Furthermore, the Zh19_DIFF-3DRformer, which integrates a diffusion model into the Zh19_3DRformer, not only successfully captures the localized strong convection over Suzhou but also substantially mitigates the underestimation of convective intensity seen in the physical-constraint-only experiment.
In addition, because some experimental designs in the first and second sets of controlled experiments lacked forecast data for the 19 levels of reflectivity, the third set of controlled experiments was selected to demonstrate the model’s capability in capturing the three-dimensional evolution of storms. Vertical cross-sections of reflectivity factor forecasts are presented for the widespread squall line event on 16 July 2025 and the convective initiation case on 6 September 2025. To better visualize the three-dimensional storm development near the locations of convective initiation, vertical cross-sections along the transects defined by the two latitude–longitude points marked in Figure 7a and Figure 7b are shown for the 16 July and 6 September cases, respectively.
Figure 8 presents three-dimensional vertical cross-section forecasts from two representative cases. As shown in Figure 8a, from 19:36 to 19:48, all three models successfully capture the vertical structure of the convective storm. However, the Zh19_Transformer model forecasts the strong convective storm core with reflectivity above 50 dBZ only up to approximately 6000 m, whereas observations indicate that the storm core reaches beyond 9000 m. In contrast, the Zh19_3DRformer and Zh19_DIFF-3DRformer models produce storm core heights that align more closely with observations.
Figure 8b illustrates a rapidly intensifying convective storm with an ascending core. Observations reveal that starting from 14:42, the storm begins to strengthen, and its core height gradually exceeds 6000 m. The Zh19_Transformer model fails to capture this intensification and ascent. The Zh19_3DRformer model reproduces the vertical structure of the storm but with underestimated intensity and some displacement in the core location. Notably, the Zh19_DIFF-3DRformer model accurately captures the ascending motion of the storm, despite a slight positional offset in the core region.
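The storm-core height comparison used in the cross-section discussion (the highest level at which reflectivity stays above 50 dBZ) can be expressed as a short sketch. The function name `core_top_height`, the level heights, and the two profiles below are hypothetical values chosen for illustration; they are not data from Figure 8.

```python
import numpy as np

def core_top_height(profile_dbz, heights_m, threshold_dbz=50.0):
    """Return the greatest height (m) at which reflectivity >= threshold.

    profile_dbz and heights_m are 1-D sequences of equal length describing
    one vertical column. Returns None when no level reaches the threshold.
    """
    profile = np.asarray(profile_dbz, dtype=float)
    heights = np.asarray(heights_m, dtype=float)
    mask = profile >= threshold_dbz  # NaN gates compare as False
    if not mask.any():
        return None
    return float(heights[mask].max())

# Hypothetical level heights and column profiles (illustrative only).
heights = [1000.0, 3000.0, 6000.0, 9000.0, 12000.0]
observed_column = [52.0, 55.0, 54.0, 51.0, 30.0]
forecast_column = [51.0, 53.0, 50.0, 42.0, 25.0]

print(core_top_height(observed_column, heights))
print(core_top_height(forecast_column, heights))
```

Applied column by column along a transect, this gives a core-top curve that can be compared between forecast and observation.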
3.2. Quantitative Performance Evaluation
As illustrated in Figure 9, the root-mean-square error (RMSE) of the reflectivity factor for Zh1_NowcastNet (b) rises sharply starting from the 60th min, and after the 90th min, the overall RMSE exceeds 10 dBZ. In contrast, the RMSE values of Zh1_PredRNNV2 (a) and Zh1_DIFF-3DRformer (c) increase gradually with time steps, and the errors at the 120th min remain below 10 dBZ. This difference can be attributed to the training objectives: PredRNNV2 is trained with the mean squared error (MSE) as the loss function, whereas NowcastNet adopts generative adversarial network (GAN) training, whose loss function incorporates multiple terms such as cumulative consistency loss, adversarial loss, and pooling loss. For the DIFF-3DRformer model, the loss function includes both MSE loss and denoising loss. Furthermore, the mean deviation (MD, also referred to below as the mean bias) of all three models is less than 0, indicating that they all underestimate the echo intensity. Notably, the overall mean bias of DIFF-3DRformer is smaller in magnitude, which demonstrates the superiority of the DIFF-3DRformer model over the other two counterparts. For the second set, the RMSE values of Zh1_DIFF-3DRformer, Zh10_DIFF-3DRformer, and Zh19_DIFF-3DRformer are similar, with all errors remaining below 10 dBZ at the 120th min. From the perspective of the mean deviation, the MD of Zh19_DIFF-3DRformer is close to 0 before the 60th min, and its error grows more slowly with lead time than that of Zh1_DIFF-3DRformer and Zh10_DIFF-3DRformer. This result confirms the effectiveness of increasing the input layers of radar echoes in reducing forecast errors. The mean bias of Zh19_Transformer becomes less than 0 after the 72nd min, while that of Zh19_3DRformer remains greater than 0 throughout the forecast period. Specifically, the mean bias of Zh19_3DRformer is close to 0 in the first 60 min but turns negative in the later stage. This observation can be explained by the introduction of physical constraints, which may lead to overestimation of forecast results in some weak echo areas.
Additionally, the incorporation of the diffusion model results in a larger mean bias compared to the schemes trained without the diffusion model.
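The two error measures discussed above can be sketched as follows. The sketch assumes that MD is the mean of forecast minus observation, which matches the text's reading that a negative MD means underestimated echo intensity. The array layout `(lead_times, ny, nx)` and the function name `rmse_and_md` are assumptions for illustration.

```python
import numpy as np

def rmse_and_md(forecast_dbz: np.ndarray, observed_dbz: np.ndarray):
    """Per-lead-time RMSE and mean deviation (forecast minus observation).

    Both inputs have shape (lead_times, ny, nx) in dBZ. NaN pixels are
    ignored. A negative MD indicates underestimation of echo intensity.
    """
    diff = forecast_dbz - observed_dbz
    rmse = np.sqrt(np.nanmean(diff ** 2, axis=(1, 2)))
    md = np.nanmean(diff, axis=(1, 2))
    return rmse, md

# Synthetic example with two lead times on a 2 x 2 grid (illustrative only).
observed = np.full((2, 2, 2), 30.0)
forecast = observed.copy()
forecast[0] -= 2.0                                  # uniform 2 dBZ low bias
forecast[1] += np.array([[-6.0, -6.0], [0.0, 0.0]])  # low bias on half the grid

rmse, md = rmse_and_md(forecast, observed)
print(rmse, md)
```

At the second lead time the MD is -3 dBZ while the RMSE is about 4.24 dBZ, which illustrates why the two curves in such a figure need not move together.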
Figure 10 presents the variation in TS scores with forecast lead time during the test period (April–September 2025) for three sets of comparative experiments. As shown in Figure 10a, at the 25 dBZ threshold, Zh19_Transformer and Zh19_3DRformer achieve the best performance, with Zh19_3DRformer slightly outperforming Zh19_Transformer, followed by Zh19_DIFF-3DRformer. This is attributed to the fact that both Zh19_Transformer and Zh19_3DRformer adopt the mean squared error (MSE) as the loss function, which yields superior forecasting results for large-scale weak precipitation echoes. In addition, Zh1_DIFF-3DRformer, Zh10_DIFF-3DRformer, and Zh19_DIFF-3DRformer exhibit comparable forecasting performance within the first 36 min. However, after 36 min, the TS of Zh1_DIFF-3DRformer gradually becomes lower than that of Zh10_DIFF-3DRformer, while the TS of Zh10_DIFF-3DRformer also decreases progressively compared with Zh19_DIFF-3DRformer. The aforementioned results also indicate that increasing the input levels of radar echoes can extend the forecasting lead time of weak precipitation echoes. Across all time steps, the TS of Zh1_NowcastNet is consistently lower than those of Zh1_PredRNNV2 and Zh1_DIFF-3DRformer. Moreover, the TS of Zh1_PredRNNV2 becomes higher than that of Zh1_DIFF-3DRformer after 30 min. This is because the forecasting results of Zh1_PredRNNV2 gradually become blurred and averaged after 30 min, which enhances its performance in forecasting weak precipitation echoes. In contrast, Zh1_DIFF-3DRformer produces more refined forecasts, particularly demonstrating superior performance in predicting small-scale strong echoes.
As shown in Figure 10b, at the 35 dBZ threshold, starting from 18 min, the TS values of both Zh19_3DRformer and Zh19_DIFF-3DRformer are higher than those of other experimental schemes. This indicates that the introduction of physical constraints can improve the forecasting performance of convective storms. Furthermore, after 72 min, the TS of Zh19_DIFF-3DRformer gradually surpasses that of Zh19_3DRformer, which is due to the further integration of a diffusion model in Zh19_DIFF-3DRformer, effectively extending the forecasting lead time. Within the 0–72 min forecast range, the TS scores of Zh1_DIFF-3DRformer are significantly higher than those of the other two models. Beyond 72 min, however, Zh1_NowcastNet begins to outperform the others, which can be attributed to its use of generative ensemble forecasting. This approach enhances prediction stability and accuracy, allowing Zh1_NowcastNet to maintain reasonable skill in capturing the initiation and dissipation of storms beyond their typical lifecycle, thereby stabilizing TS scores.
Figure 10c displays TS scores for reflectivity ≥ 45 dBZ. At the 45 dBZ threshold, starting from 30 min, the TS of Zh19_DIFF-3DRformer is consistently superior to those of all other experimental schemes. Additionally, Zh19_DIFF-3DRformer and Zh19_3DRformer maintain higher TS values in the early forecasting stage. After 66 min, the TS values of Zh10_DIFF-3DRformer and Zh19_DIFF-3DRformer are higher than those of other models. This indicates that in the late forecasting stage, the introduction of physical constraints alone is insufficient to improve the forecasting performance of severe convective storms, and the integration of a diffusion model is necessary to extend the forecasting lead time.
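For reference, the threshold-based TS used in Figure 10 can be sketched with the standard contingency-table definition, TS = hits / (hits + misses + false alarms). This section does not restate the authors' exact formula, so the standard definition is an assumption here, and the function name `threat_score` and the sample grids are illustrative.

```python
import numpy as np

def threat_score(forecast_dbz, observed_dbz, threshold_dbz: float) -> float:
    """Threat score (critical success index) for reflectivity >= threshold.

    Returns NaN when neither field reaches the threshold anywhere.
    """
    f = np.asarray(forecast_dbz) >= threshold_dbz
    o = np.asarray(observed_dbz) >= threshold_dbz
    hits = np.sum(f & o)
    misses = np.sum(~f & o)
    false_alarms = np.sum(f & ~o)
    denom = hits + misses + false_alarms
    return float(hits / denom) if denom > 0 else float("nan")

# Synthetic 2 x 3 fields in dBZ (illustrative only).
observed = np.array([[20.0, 30.0, 40.0], [50.0, 36.0, 10.0]])
forecast = np.array([[26.0, 33.0, 30.0], [47.0, 20.0, 38.0]])

for thr in (25.0, 35.0, 45.0):
    print(thr, threat_score(forecast, observed, thr))
```

Because correct negatives do not enter the score, TS at high thresholds is dominated by a small number of pixels, which is one reason the 45 dBZ curves separate the schemes more strongly than the 25 dBZ curves.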
As illustrated in Figure 11, at the 25 dBZ threshold, Zh19_DIFF-3DRformer achieves the optimal performance within the first 66 min. In the later stage of the forecast, Zh1_PredRNNV2 exhibits the largest bias. At the 35 dBZ threshold, Zh19_Transformer, Zh19_3DRformer, and Zh19_DIFF-3DRformer demonstrate superior BIAS performance. This indicates that utilizing the maximum 19 levels of radar echo input can improve the BIAS score in the later forecast period. At the 45 dBZ threshold, after the 90th min, the BIAS score of Zh19_DIFF-3DRformer approaches 1, verifying its effectiveness in severe convective storm forecasting.
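The BIAS score is read above as ideal when it equals 1. A minimal sketch under the standard frequency-bias definition, BIAS = (hits + false alarms) / (hits + misses), is given below. That definition is assumed rather than quoted from this section, and the function name `frequency_bias` and the sample grids are illustrative.

```python
import numpy as np

def frequency_bias(forecast_dbz, observed_dbz, threshold_dbz: float) -> float:
    """Ratio of forecast to observed pixel counts at or above the threshold.

    Values below 1 mean the forecast echo area is too small; values above 1
    mean it is too large. Returns NaN when no observed pixel qualifies.
    """
    n_forecast = np.sum(np.asarray(forecast_dbz) >= threshold_dbz)
    n_observed = np.sum(np.asarray(observed_dbz) >= threshold_dbz)
    return float(n_forecast / n_observed) if n_observed > 0 else float("nan")

# Synthetic 2 x 3 fields in dBZ (illustrative only).
observed = np.array([[20.0, 30.0, 40.0], [50.0, 36.0, 10.0]])
forecast = np.array([[26.0, 33.0, 30.0], [47.0, 20.0, 38.0]])

for thr in (25.0, 35.0, 45.0):
    print(thr, frequency_bias(forecast, observed, thr))
```

A BIAS of 1 only says that the forecast and observed areas match in size, not that they overlap, so it is best read together with TS.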
It is evident from Figure 12 that the three experimental schemes adopting the diffusion model, namely Zh1_DIFF-3DRformer, Zh10_DIFF-3DRformer and Zh19_DIFF-3DRformer, yield better performance. This confirms that the introduction of the diffusion model can enhance the clarity of forecast results.
Figure 13 presents the time-series variation curves of Score for various experimental schemes. At the 25 dBZ and 35 dBZ thresholds, Zh19_DIFF-3DRformer outperforms other schemes in terms of Score after the 72nd min. Additionally, at the 35 dBZ threshold, Zh19_Transformer, Zh19_3DRformer, and Zh19_DIFF-3DRformer achieve higher scores in the later forecast stage, suggesting that increasing the number of radar echo input levels can improve the late-stage forecast performance. At the 45 dBZ threshold, Zh19_3DRformer consistently outperforms Zh19_Transformer, while Zh1_PredRNNV2 and Zh1_NowcastNet exhibit the worst metrics. This not only validates the effectiveness of the introduced physical constraints but also demonstrates that the proposed DIFF-3DRformer model is superior to the mainstream extrapolation models PredRNNV2 and NowcastNet.
In addition, the 0–2 h TS, Bias, FID and Score of the three comparative experimental schemes were evaluated. As illustrated in Figure 14a, Zh1_DIFF-3DRformer exhibits higher TS values for convective echoes (≥35 dBZ) at 0–2 h lead times compared to the PredRNNV2 and NowcastNet models. Notably, for intense convective echoes (≥45 dBZ), the TS of Zh1_DIFF-3DRformer is 0.0353 higher than that of NowcastNet, indicating a more pronounced advantage in forecasting extreme convective weather. In terms of BIAS (Figure 14b), the BIAS of NowcastNet deviates substantially from 1 at the ≥45 dBZ threshold, while PredRNNV2 shows BIAS values between 0.2 and 0.8 across all thresholds, suggesting a systematic underestimation of echo area. In contrast, DIFF-3DRformer demonstrates superior BIAS performance. Regarding the FID metric (Figure 14c), both NowcastNet and DIFF-3DRformer, which employ generative artificial intelligence techniques, yield sharper forecasts than PredRNNV2, with DIFF-3DRformer achieving the optimal FID score. The composite Score results (Figure 14d) reveal that DIFF-3DRformer outperforms NowcastNet by 30.3%, 44.8%, and 236.6% at the ≥25 dBZ, ≥35 dBZ, and ≥45 dBZ thresholds, respectively. These results demonstrate that, using single-layer composite reflectivity data as input, DIFF-3DRformer achieves overall superior scores compared to both PredRNNV2 and NowcastNet.
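The percentage gains quoted above are presumably relative differences with respect to the baseline score, that is, (candidate - baseline) / baseline × 100. This is an assumption, since the section does not state the formula. The sketch below uses hypothetical score values, not values read from Figure 14.

```python
def relative_improvement_pct(candidate: float, baseline: float) -> float:
    """Relative gain of `candidate` over `baseline`, in percent.

    Assumes a higher-is-better score and a non-zero baseline.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (candidate - baseline) / baseline * 100.0

# Hypothetical scores (illustrative only).
gain = relative_improvement_pct(candidate=0.30, baseline=0.20)
print(round(gain, 1))
```

Under this definition, very large percentages such as the one reported at the ≥45 dBZ threshold can arise from a small baseline score, so absolute differences are worth reading alongside the relative ones.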
As illustrated in Figure 15a, the Zh19_DIFF-3DRformer model exhibits higher TS scores for thresholds ≥35 dBZ and ≥45 dBZ at 0–2 h lead times compared to the Zh10_DIFF-3DRformer, and also outperforms the Zh1_DIFF-3DRformer. In terms of Bias score (Figure 15b), Zh19_DIFF-3DRformer, which utilizes 19 levels of radar reflectivity as input, shows Bias scores close to 1 for thresholds ≥25 dBZ and ≥35 dBZ. Regarding the FID metric (Figure 15c), since the three experimental setups output extrapolation results at different levels, the scheme employing more input levels needs to predict more output levels. This makes the model learning process more complex compared to the single-level input and output approach. Consequently, using more input levels can lead to a degradation in FID score. Based on the comprehensive Score results (Figure 15d), using more radar echo layers as input improves the Score for thresholds ≥35 dBZ (convective echoes). This suggests that using additional vertical echo levels can improve the forecast skill for convective systems.
As illustrated in Figure 16, the Zh19_3DRformer scheme, which integrates physics-constrained neural operators, achieves TS values that are comparable to or slightly lower than those of the Zh19_Transformer baseline at the 25 dBZ and 35 dBZ thresholds across a 0–2 h forecast lead time. However, a marked improvement is observed at the more intense convective echo threshold (≥45 dBZ), where the Zh19_3DRformer yields a 73.2% increase in TS relative to the Zh19_Transformer. This enhancement can be attributed to the incorporated physical neural operators, which are primarily designed to represent updraft motions within strong convective systems. By jointly optimizing the root mean square error and divergence loss, the scheme prioritizes the accurate representation of intense convection, albeit with a minor trade-off in skill for weaker echoes compared to a loss function relying solely on mean squared error. Furthermore, the addition of a convective-scale denoising diffusion model provides a secondary, albeit smaller, boost to the TS at the ≥45 dBZ threshold. In terms of Bias score (Figure 16b), the integration of physical constraints also leads to improved performance specifically for strong convective echoes. Regarding the FID (Figure 16c), both the Zh19_Transformer and Zh19_3DRformer yield similar results, as neither employs generative deep learning models to enhance extrapolation sharpness. Notably, augmenting the Zh19_3DRformer with the convective-scale diffusion model significantly reduces the FID score, indicating a substantial gain in the spatial clarity and realism of the forecast fields. The comprehensive scoring results (Figure 16d) confirm that at the ≥45 dBZ threshold, the Zh19_DIFF-3DRformer configuration outperforms the other experimental schemes. In summary, the inclusion of physics-constrained neural operators effectively improves the forecast skill for intense convective echoes, and this benefit is further augmented by the application of a diffusion model, which markedly enhances the structural sharpness of the extrapolation results.
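Since lower FID is read above as sharper and more realistic forecasts, it may help to recall what the distance measures. FID is conventionally the Fréchet distance between two Gaussians fitted to feature embeddings of generated and reference images, ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). The sketch below computes only that distance from precomputed feature arrays. The feature extractor used by the authors is not described in this section, so the random features here are placeholders, and `frechet_distance` is a hypothetical helper name.

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (n_samples, n_features). The trace of the matrix
    square root of cov_a @ cov_b is obtained from the eigenvalues of that
    product, which are real and non-negative for covariance matrices.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    # Clip tiny negative real parts caused by round-off before the sqrt.
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(np.sum((mu_a - mu_b) ** 2))
    return mean_term + float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Placeholder features standing in for embeddings of observed and forecast fields.
rng = np.random.default_rng(0)
observed_feats = rng.normal(size=(200, 4))
shifted_feats = observed_feats + 1.0  # same covariance, mean shifted by 1 per feature

print(frechet_distance(observed_feats, observed_feats))
print(frechet_distance(observed_feats, shifted_feats))
```

With identical covariances the distance reduces to the squared mean shift, which is why the second call returns approximately 4 for four features each shifted by 1.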