1. Introduction
Typhoons, also known as tropical cyclones, are among the most destructive natural disasters, causing widespread devastation through high winds, heavy rainfall, and storm surges [1]. Improving forecasts of such high-impact weather events is a high-priority task for machine learning applications aimed at climate change adaptation and disaster risk reduction [2]. In the Western North Pacific basin, where a large share of the world's typhoons form each year, accurate forecasting is essential for mitigating economic losses and saving lives. Traditional numerical weather prediction (NWP) models, such as those based on the Weather Research and Forecasting (WRF) system, rely on physical equations to simulate atmospheric dynamics; however, their real-time applicability is limited by high computational costs and sensitivity to initial conditions. The advent of satellite remote sensing has generated massive archives of high-resolution cloud imagery, enabling data-driven approaches to complement conventional physics-based methods. The Digital Typhoon dataset [3,4], which spans over 40 years of hourly infrared images, provides a rich resource for studying typhoon evolution, yet leveraging it for predictive tasks remains challenging due to the spatiotemporal complexity of cloud patterns.
The spatiotemporal forecasting of typhoon satellite cloud imagery aims to infer future sequences from historical data, requiring models to simultaneously capture macro-level global structures (e.g., eye formation, spiral rainbands) and micro-level local dynamics (e.g., cloud convection). The task is difficult because atmospheric dynamics in typhoons are strongly nonlinear and multi-scale. Deep learning has become a powerful paradigm for this task. Convolutional neural networks (CNNs) [5] excel at extracting spatial features from images, while recurrent units such as LSTMs [6] are effective at modeling temporal dependencies in sequences. Combining CNNs and LSTMs enables spatiotemporal feature extraction from image sequences for forecasting. Furthermore, physics-constrained neural networks improve prediction realism by incorporating domain-specific physical knowledge. More recently, denoising diffusion probabilistic models (DDPMs) [7] have demonstrated remarkable capability in generating high-quality, diverse samples, thereby enabling probabilistic forecasting of spatiotemporal sequences. This paradigm has been successfully adapted to the atmospheric sciences [8] to address the challenges posed by strong nonlinearity. Despite these advances, most existing approaches either prioritize efficient global context modeling at the cost of fine-grained local details or suffer from temporal incoherence in recurrent processing, frequently resulting in blurred typhoon eye walls, distorted spiral rainbands, and rapid performance degradation beyond short lead times. Few approaches integrate dual-scale spatiotemporal modeling to directly predict future cloud image sequences.
In this paper, we introduce FusionTyphoonPredictor, a novel encoder–decoder framework augmented with dual-branch mechanisms for enhanced spatiotemporal prediction of typhoon cloud images. The model processes input sequences of shape $(B, T, C, H, W)$, employing AdaptiveSampleConv layers for progressive compression and PixelShuffle for upsampling, with a skip connection preserving initial features [9]. The Global Fusion Blocks capture multi-scale spatial interactions in latent space using LargeKernelAttention, DepthwiseFFN, and MultiScaleBlock, enabling efficient handling of large receptive fields without parameter explosion. Meanwhile, the ST Recurrent Refiner refines short-term dynamics through a recurrent ConvGRUCell, ResidualBlock, and pointwise convolution, reducing local redundancies via recurrent processing and dropout regularization.
The main contributions can be summarized as follows: (1) A dual-branch architecture for typhoon cloud image sequence prediction that balances macroscopic evolutionary trends and microscopic dynamic features. (2) A Global Fusion Module that integrates Large-Kernel Attention and multi-scale convolutions to enhance the expressive ability of spatial features. (3) An ST (Spatiotemporal) Recurrent Refiner, which uses ConvGRU and residual structures to improve temporal consistency and local detail restoration. Experimental validation on real-world typhoon datasets demonstrates the effectiveness of each module, with ablation studies quantifying their respective contributions.
2. Related Works
Typhoon forecasting has traditionally been dominated by Numerical Weather Prediction (NWP) models, such as the Weather Research and Forecasting (WRF) system [10,11,12]. These models simulate atmospheric physics through partial differential equations, offering strong interpretability. However, they are computationally intensive and sensitive to initial conditions, often requiring supercomputers for real-time predictions.
The increasing availability of satellite data, such as the Digital Typhoon dataset, has fueled the development of deep learning-based methods for typhoon analysis and forecasting. Early studies leveraged Convolutional Neural Networks (CNNs) for feature extraction from infrared imagery, primarily focusing on intensity estimation rather than full-sequence cloud imagery prediction. A key advance came with Convolutional LSTMs (ConvLSTMs) [13], which extended recurrent networks to spatiotemporal data, enabling short-term forecasting tasks like cloud motion prediction.
Subsequent research in spatiotemporal sequence forecasting has branched into multiple directions. Purely CNN-based architectures, such as SimVP [14,15], emphasize simplicity and efficiency. SimVP employs an encoder–translator–decoder framework with Inception-like modules for multi-scale temporal modeling, demonstrating strong performance and scalability without complex recurrent mechanisms.
Another line of work has focused on enhancing recurrent architectures with advanced memory mechanisms to better capture long-range spatiotemporal dependencies. The PredRNN family (PredRNN, PredRNN++, PredRNNv2) introduced novel memory cells that enable hierarchical spatiotemporal memory flow [16,17,18], reducing error accumulation in long-horizon predictions and achieving state-of-the-art results on benchmarks like Moving MNIST [19] and radar echo datasets. Building on this, the EMSN model [20] incorporates channel attention and skip connections into the ConvLSTM framework to enhance short-term (e.g., 3 h) satellite cloud image forecasting.
A further line of research focuses on the integration of physical knowledge. PhyDNet [21], for instance, structurally disentangles PDE-driven dynamics from learned residuals to improve physical realism. Similarly, C2PhyNet [22] introduces a novel Phy-Unit, which embeds typhoon dynamics through PDEs with an RNN architecture for sequential typhoon prediction. More recently, alternative architectures like PredFormer [23] have demonstrated that a pure transformer [24], free of both recurrence and convolution and equipped with gated attention and interleaved spatiotemporal modeling, can achieve strong performance in image sequence prediction. Meanwhile, SwinRDM [25] explores the application of diffusion models on the ERA5 dataset, and Earthformer [26] employs cuboid attention, decomposing input data into cuboids and applying cuboid-level self-attention in parallel for global weather modeling.
To effectively capture both global structures and fine-grained local dynamics in typhoon evolution, we introduce FusionTyphoonPredictor, a dual-branch framework that integrates a global fusion module for multi-scale contextual modeling with a recurrent refinement branch tailored to local spatiotemporal details. Unlike prior recurrent models (e.g., PredRNN family) that process information at a uniform scale or physics-constrained approaches (e.g., PhyDNet) that may sacrifice flexibility, our synergistic design enables complementary modeling of macro- and micro-scale phenomena. Experiments on the Digital Typhoon dataset demonstrate that our method achieves competitive performance against PredRNN series and PhyDNet.
3. Methods
3.1. Overview and Encoder–Decoder
We present FusionTyphoonPredictor, a novel spatiotemporal predictive model for typhoon cloud image sequence forecasting. The model builds upon an encoder–decoder framework, as shown in Figure 1 (overview) and Figure 2 (encoder–decoder detail), augmented with dual-branch spatiotemporal modeling to capture both macroscopic global evolution and microscopic local dynamics. The input consists of a sequence of typhoon cloud images with shape $(B, T, C, H, W)$, where $B$ is the batch size, $T$ is the sequence length, $C$ is the channel count (e.g., 3), and $H \times W$ is the spatial resolution. The encoder progressively down-samples the input frames into a compact latent representation through multiple AdaptiveSampleConv layers. During encoding, skip connections are preserved for later feature fusion. Symmetrically, the decoder employs PixelShuffle [27] for up-sampling, integrates the skip connections, and finally restores the output to the full resolution via convolutions.
This U-Net-inspired design aims to ensure progressive feature extraction and reconstruction, while alleviating the vanishing gradient problem through skip connections.
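The sub-pixel rearrangement behind PixelShuffle can be illustrated without a deep learning framework. The following is a minimal sketch for a single frame, assuming the channel ordering used by common implementations (input channel c·r² + i·r + j maps to output position (h·r + i, w·r + j)). The function name `pixel_shuffle` and the toy input are illustrative and are not taken from the model's code.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Assumed convention: out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w].
    """
    in_channels = len(x)
    height, width = len(x[0]), len(x[0][0])
    if in_channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    out_channels = in_channels // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(out_channels)]
    for c in range(out_channels):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Toy input: 4 channels of size 1x2, upscaled by r = 2 into 1 channel of size 2x4.
latent = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
upsampled = pixel_shuffle(latent, 2)
```

Each group of r² channels is spent to buy an r-fold increase in both spatial dimensions, so no interpolation is involved in the up-sampling step itself.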
3.2. Global Fusion Blocks
To capture macro spatiotemporal interactions in the compressed latent space, we propose Global Fusion Blocks, as shown in Figure 3. Each block integrates LargeKernelAttention, DepthwiseFFN, and MultiScaleBlock to handle large-scale spatiotemporal interactions. Specifically, each block first extracts global features using the LargeKernelAttention [28] module, which combines pointwise convolution with depthwise convolution and introduces dilation to expand receptive fields without increasing parameters. The attention module uses a sigmoid activation, denoted $\sigma$.
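The claim that dilation enlarges the receptive field without adding parameters can be checked with simple arithmetic. The sketch below uses hypothetical kernel sizes and a hypothetical channel count (none of these values come from the model) and compares a plain depthwise convolution with a dilated one.

```python
def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def depthwise_params(channels, kernel_size):
    """Weights in a depthwise 2D convolution (bias ignored).

    The count does not depend on the dilation rate.
    """
    return channels * kernel_size * kernel_size


channels = 64                                       # hypothetical latent width
plain = effective_kernel(7, dilation=1)             # extent of a plain 7x7 kernel
dilated = effective_kernel(7, dilation=3)           # extent of a 7x7 kernel, dilation 3
same_cost = depthwise_params(channels, 7)           # identical for both of the above
dense_equiv = depthwise_params(channels, dilated)   # dense kernel with the same extent
```

With these illustrative numbers, the dilated kernel spans 19 pixels per axis at the cost of a 7 × 7 kernel, whereas a dense depthwise kernel of the same extent would need roughly seven times as many weights.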
The DepthwiseFFN first applies a pointwise convolution and a depthwise convolution to expand feature diversity and enhance spatial coherence, complementing the global preference of the LargeKernelAttention. It then introduces non-linearity through the GELU [29] activation and applies dropout [30] to prevent overfitting. Finally, a pointwise convolution followed by dropout projects the features back to the original dimension for integration with other modules.
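For readers unfamiliar with GELU, the exact form of the activation is x·Φ(x), where Φ is the standard normal cumulative distribution function. The sketch below evaluates it next to ReLU on a few sample points; the helper names are illustrative.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    """ReLU, shown only for comparison."""
    return max(0.0, x)


samples = [-3.0, -1.0, 0.0, 1.0, 3.0]
table = [(x, gelu(x), relu(x)) for x in samples]
```

Unlike ReLU, GELU passes small negative values through with reduced magnitude instead of zeroing them, which gives a smooth transition around the origin.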
The MultiScaleBlock aims to enhance the overall expressive power of the block. It processes the output of the DepthwiseFFN in parallel, using multiple depthwise separable convolutions with three different kernel sizes to capture details at different scales. The outputs from these parallel branches are then concatenated and fused via a convolution, integrating features across scales to improve representation quality. Finally, robustness is enhanced through a skip connection and GroupNorm [31], while DropPath regularization is applied to prevent overfitting.
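The parameter savings of depthwise separable convolutions in a parallel multi-scale layout can be estimated as follows. This is a rough sketch under stated assumptions: the kernel sizes (3, 5, 7), the channel count, and the 1 × 1 fusion convolution are hypothetical choices made for illustration, and biases are ignored.

```python
def separable_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a pointwise 1 x 1 convolution."""
    return c_in * k * k + c_in * c_out


def standard_params(c_in, c_out, k):
    """Ordinary k x k convolution."""
    return c_in * c_out * k * k


channels = 64                 # hypothetical latent width
kernel_sizes = (3, 5, 7)      # hypothetical branch kernel sizes

branch_sep = {k: separable_params(channels, channels, k) for k in kernel_sizes}
branch_std = {k: standard_params(channels, channels, k) for k in kernel_sizes}

# Fusion over the concatenated branches, assumed here to be a 1 x 1 convolution.
fuse = standard_params(len(kernel_sizes) * channels, channels, 1)

total_sep = sum(branch_sep.values()) + fuse
total_std = sum(branch_std.values()) + fuse
```

Under these assumptions the separable layout needs about 30 k weights against about 352 k for standard convolutions, which is why adding larger-kernel branches stays affordable.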
3.3. ST Recurrent Refiner
Typhoons are phenomena characterized by strong spatiotemporal correlations. While the aforementioned Global Fusion Blocks effectively perform macro-global modeling, they learn temporal dynamics only implicitly, without an explicit recurrent mechanism. To compensate for this limitation in capturing local spatiotemporal details, we propose the ST Recurrent Refiner, as shown in Figure 4. It operates as follows.
At each time step $t$, the current input frame is fed into a ConvGRUCell. The cell updates its hidden state using GRU gating mechanisms implemented via convolution with GroupNorm and GELU activations. Its purpose is to capture short-term dynamic features and mitigate temporal inconsistency. The updated hidden state is then refined through a ResidualBlock, composed of two stacked Conv2d → GroupNorm → GELU layers with a shortcut connection [32]. This block aims to enhance local feature details. Following the residual block, a pointwise convolution aligns the features with the input, facilitating integration in the next iteration and final output fusion.
The entire process from ConvGRUCell to ResidualBlock to PointWiseConv is executed recurrently for all T time steps. After T iterations, a final Dropout layer and a skip connection are employed to prevent overfitting, promote generalization, and alleviate the degradation issue common in recurrent networks.
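For reference, a standard convolutional GRU update can be written as below. This is the textbook formulation and not necessarily the exact variant used in the ConvGRUCell: the symbols $x_t$ (input frame features), $h_t$ (hidden state), $W_z$, $W_r$, $W_h$ (convolution kernels), and the candidate activation $\phi$ are assumptions introduced for illustration, and the placement of GroupNorm is omitted. Here $*$ denotes convolution, $\odot$ element-wise multiplication, $\sigma$ the sigmoid, and $[\cdot,\cdot]$ channel-wise concatenation.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z * [x_t, h_{t-1}]\right) \\
r_t &= \sigma\left(W_r * [x_t, h_{t-1}]\right) \\
\tilde{h}_t &= \phi\left(W_h * [x_t, r_t \odot h_{t-1}]\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

The update gate $z_t$ controls how much of the previous hidden state is carried forward, which is the mechanism that lets a recurrent branch keep consecutive frames temporally consistent.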
4. Experiments
This section presents an empirical evaluation of the proposed FusionTyphoonPredictor framework. We begin by detailing the experimental settings, including the dataset construction, temporal partitioning protocol, and preprocessing procedures to ensure reproducibility. Subsequently, we compare our model against several state-of-the-art spatiotemporal forecasting baselines under consistent conditions, reporting quantitative results on key metrics such as Structural Similarity (SSIM) [33], Mean Absolute Error (MAE), and Mean Squared Error (MSE) across multiple prediction horizons. To dissect the contribution of the core architectural branches, we conduct an ablation study analyzing the roles of the Global Fusion Blocks and the ST (Spatiotemporal) Recurrent Refiner. Finally, we analyze and discuss the results, interpreting the model’s performance, training dynamics, and generalization capability.
4.1. Dataset
The Digital Typhoon dataset is a comprehensive resource for studying tropical cyclones. Spanning over 40 years (1978–2023 for the Western Pacific and 1979–2024 for the Around Australia basin), it includes 263,043 infrared satellite images from the Himawari series, with a 5 km spatial and 1 h temporal resolution. In this paper, we focus on the Northern Hemisphere data, summarized in Table 1, which comprises 192,956 images from 1116 typhoons; we use this subset to forecast satellite-based typhoon images. The dataset was created and is maintained by Asanobu Kitamoto, a professor at the National Institute of Informatics (NII), Japan, who has extensively researched typhoon image analysis and digital typhoon archives [3,4].
4.2. Experimental Settings
We partition the data temporally to reflect real-world forecasting scenarios: 2015–2021 for training, 2022 for validation, and 2023 for testing. This split adheres to best practices for time-series machine learning, ensuring models are trained on historical data and validated/tested on future unseen data, mirroring operational deployment.
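The year-based partition can be expressed as a small lookup. In this sketch, the function name `assign_split` and the example records are hypothetical; only the year ranges come from the text.

```python
def assign_split(year):
    """Map a typhoon season year to its data split (None if outside 2015-2023)."""
    if 2015 <= year <= 2021:
        return "train"
    if year == 2022:
        return "val"
    if year == 2023:
        return "test"
    return None


# Hypothetical (typhoon_id, year) records, for illustration only.
records = [("A", 2014), ("B", 2015), ("C", 2021), ("D", 2022), ("E", 2023)]
splits = {tid: assign_split(year) for tid, year in records}
```

Splitting by year rather than by random sampling keeps every frame of a given typhoon in a single split, so overlapping windows cannot leak between training and evaluation.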
We apply a tailored preprocessing pipeline: (1) Images are downsampled from their original resolution to a lower resolution to balance computational efficiency and spatial detail retention. (2) Sequences are constructed with an input length of 12 h and a target length of 12 h, adapted for time series prediction tasks. (3) Sequences are obtained via overlapping sliding windows to alleviate boundary effects and increase the sample size. (4) Typhoons with fewer than 13 images are filtered out to ensure that only those with at least one complete sequence are used.
To construct the input sequences, we use 12 consecutive time steps of typhoon cloud images as the input, with the subsequent 12 time steps serving as the ground truth, thereby forming a complete sequence. For instance, images from time steps 1 to 12 are designated as the input, while those from time steps 13 to 24 constitute the ground truth. At each prediction step, the model generates 12 images, which are evaluated against the corresponding ground truth. As shown in Figure 5, the first three rows depict the input typhoon cloud image sequence, the middle three rows the ground truth sequence, and the last three rows the model’s predicted output sequence. Following this sequence formulation strategy, the 2015–2023 subset is partitioned into three distinct subsets to ensure rigorous evaluation, yielding 32,259 sequences for training, 3366 for validation, and 3223 for testing.
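The sequence construction described above can be sketched as an overlapping sliding window over the hourly frames of one typhoon. The stride of 1 is an assumption (the text states only that the windows overlap), and `make_windows` is an illustrative name.

```python
def make_windows(frames, input_len=12, target_len=12, stride=1):
    """Split one typhoon's frame list into (input, target) pairs."""
    total = input_len + target_len
    windows = []
    for start in range(0, len(frames) - total + 1, stride):
        inputs = frames[start:start + input_len]
        targets = frames[start + input_len:start + total]
        windows.append((inputs, targets))
    return windows


# Frame indices stand in for images: a typhoon with 30 hourly frames.
frames = list(range(1, 31))
windows = make_windows(frames)
first_inputs, first_targets = windows[0]

# A typhoon shorter than input_len + target_len yields no window.
short = make_windows(list(range(1, 21)))
```

With a stride of 1, a typhoon of N frames contributes N − 23 samples, so long-lived typhoons dominate the training set.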
All experiments were conducted using Python 3.12 on a single NVIDIA GeForce RTX 5070 Ti GPU with 16 GB of VRAM, under CUDA 12.8. To ensure fair comparisons across models, we adopted a consistent optimization strategy: the AdamW optimizer [34] with weight decay, paired with the OneCycleLR scheduler [35] configured with pct_start = 0.3, div_factor = 25, and final_div_factor = 10,000. Due to memory constraints on the GPU, we used a batch size of 4 for all training and evaluation runs. The hyperparameter settings (optimizer, scheduler, learning rate, weight decay) largely follow the configurations established in the OpenSTL benchmark [36].
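The shape of a one-cycle schedule with these settings can be sketched in plain Python. This approximates a two-phase, cosine-annealed one-cycle policy and is not guaranteed to match any framework's implementation step for step. The peak learning rate `MAX_LR` and the step count are placeholders, since neither value is given here.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=10_000.0):
    """Learning rate at `step` (0-based) for a two-phase cosine one-cycle policy."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1  # last step of the warm-up phase

    def cos_anneal(start, end, pct):
        # Cosine interpolation from `start` (pct = 0) to `end` (pct = 1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    pct = (step - warmup_end) / (total_steps - 1 - warmup_end)
    return cos_anneal(max_lr, min_lr, pct)


MAX_LR = 1e-3  # placeholder peak learning rate, not taken from the paper
schedule = [one_cycle_lr(s, 100, MAX_LR) for s in range(100)]
```

The rate starts at `MAX_LR / 25`, peaks after 30% of the steps, and then decays to a value 10,000 times smaller than the starting rate, so most of training is spent annealing.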
4.3. Main Results
We evaluate FusionTyphoonPredictor against five spatiotemporal forecasting baselines on the Digital Typhoon dataset: PredRNN [16], PredRNN++ [17], PredRNNv2 [18], PredFormer [23], and PhyDNet [21]. All models are run under identical settings to ensure fair comparison.
All evaluation metrics are computed on a per-batch basis and then averaged across all batches at the end of each epoch, which ensures the reported values reflect the model’s average performance across the entire evaluation set.
Table 2 shows SSIM across prediction horizons ($t = 1$ to $t = 12$). FusionTyphoonPredictor achieves the highest average SSIM of 0.482, outperforming the strongest baseline (PredRNNv2) by 0.006. Notably, it achieves the highest SSIM values in short-term forecasting and maintains competitive performance at long-term horizons, outperforming most baseline models.
Figure 6 and Figure 7 show the training loss convergence. All models exhibit an initial rapid decline, followed by varied behaviors: PredRNN, PredRNN++, and PredFormer show slight loss increases with continued training, while PredRNNv2 displays a sharp spike around 20 k steps. PhyDNet begins with a relatively high initial loss but stabilizes at a lower value after descending. In contrast, FusionTyphoonPredictor starts from a relatively low loss and maintains a steady downward trend with only minor oscillations, indicating stable optimization.
Validation and test loss curves (Figure 8 and Figure 9) further support the stability and generalization capability of the proposed approach. Compared to other models, FusionTyphoonPredictor achieves and maintains lower loss values throughout the evaluation process.
Table 3 summarizes MAE ↓, MSE ↓, and SSIM ↑ at the epoch where validation SSIM reaches its maximum. FusionTyphoonPredictor achieves values of 0.07780 (MAE), 0.01367 (MSE), and 0.48164 (SSIM), ranking first in both MAE and SSIM and second in MSE. These results demonstrate its balanced performance in both error minimization and structural similarity preservation.
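The three metrics can be illustrated on flattened pixel lists. MAE and MSE follow their usual definitions. The SSIM function below is a simplified single-window version computed over the whole image, whereas the standard metric [33] averages the same expression over local windows, so its values are not comparable to those in the tables. Pixel values normalised to [0, 1] (`data_range = 1.0`) are an assumption.

```python
def mean(values):
    return sum(values) / len(values)


def mae(pred, target):
    return mean([abs(p - t) for p, t in zip(pred, target)])


def mse(pred, target):
    return mean([(p - t) ** 2 for p, t in zip(pred, target)])


def global_ssim(pred, target, data_range=1.0):
    """SSIM expression evaluated once over all pixels (no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_p, mu_t = mean(pred), mean(target)
    var_p = mean([(p - mu_p) ** 2 for p in pred])
    var_t = mean([(t - mu_t) ** 2 for t in target])
    cov = mean([(p - mu_p) * (t - mu_t) for p, t in zip(pred, target)])
    numerator = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return numerator / denominator


# Toy 2x2 "images", flattened; values are illustrative.
target = [0.0, 0.5, 1.0, 0.5]
pred = [0.1, 0.5, 0.9, 0.5]
scores = (mae(pred, target), mse(pred, target), global_ssim(pred, target))
```

In this toy case the prediction has the right mean but reduced contrast, which SSIM penalises through its variance terms even though the pixel-wise errors are small.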
4.4. Ablation Study
We conduct controlled ablations to validate the contribution of the Global Fusion Blocks and the ST Recurrent Refiner. Results are shown in Table 4.
Removing Global Fusion Blocks (retaining only ST Recurrent Refiner): The average SSIM decreases from 0.482 to 0.419 (a drop of 0.063), with the most significant degradation observed at t = 1 (dropping from 0.711 to 0.412). This confirms the critical role of multi-scale fusion in capturing typhoon cloud image macro-structures.
Removing ST Recurrent Refiner (retaining only Global Fusion Blocks): The average SSIM drops to 0.475 (a drop of 0.007 from 0.482), with consistent performance degradation across all forecasting horizons. We attribute this to the function of the looped refinement mechanism in suppressing local redundancies and ensuring temporal coherence.
The complete model (integrating both branches) achieves optimal performance, demonstrating the complementarity of dual-branch modeling: macroscopic global integration combined with recurrent local refinement.
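The relative size of the two ablation effects follows directly from the average SSIM values quoted above. The short calculation below uses only those three numbers; the dictionary keys are descriptive labels.

```python
full_model = 0.482
ablated = {
    "without Global Fusion Blocks": 0.419,
    "without ST Recurrent Refiner": 0.475,
}

# Absolute drop in average SSIM and the same drop as a percentage of the full model.
absolute_drop = {name: round(full_model - score, 3) for name, score in ablated.items()}
relative_drop = {name: round(100 * (full_model - score) / full_model, 1)
                 for name, score in ablated.items()}
```

Removing the Global Fusion Blocks costs about 13% of the average SSIM, against about 1.5% for removing the ST Recurrent Refiner, so the global branch carries most of the structural signal while the recurrent branch provides a smaller, consistent refinement.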
4.5. Analysis and Discussion
The competitive performance of FusionTyphoonPredictor across multiple evaluation metrics (SSIM, MAE, MSE) and temporal horizons demonstrates its robust forecasting capability. The model’s particular strength in short-term forecasting suggests effective initial condition capture and immediate temporal evolution modeling. This is crucial for operational typhoon forecasting, where near-term predictions are most critical for early warning systems.
The competitive performance maintained at longer-term horizons, while showing the gradual degradation expected of chaotic atmospheric systems, demonstrates the model’s ability to learn meaningful temporal dynamics beyond immediate transitions. The stable loss curves across training, validation, and test phases support the model’s training stability and generalization capability, in contrast to the instability observed in some baseline models.
Ablation studies reveal the complementary nature of the core architectural branches: while Global Fusion Blocks demonstrate clear importance in capturing coherent macro-structures, the ST Recurrent Refiner appears to contribute to maintaining spatiotemporal consistency and refining local details. The integration of these two branches seems to play a significant role in achieving balanced performance across spatiotemporal prediction tasks. Additional evaluation metrics, including PSNR and RMSE, are provided in Appendix A.
5. Conclusions
In this paper, we have proposed FusionTyphoonPredictor, a dual-branch spatiotemporal framework for typhoon cloud image forecasting. The model effectively integrates a Global Fusion Module for capturing macro spatiotemporal interactions and an ST Recurrent Refiner for refining local spatiotemporal dynamics, addressing the limitations of existing methods in balancing macroscopic evolution and microscopic details. Extensive experiments conducted on the Digital Typhoon dataset show that the proposed method achieves competitive or improved performance across multiple evaluation metrics, including SSIM, MAE, and MSE, with particular strength in short-term forecasting tasks while maintaining competitive performance over extended horizons.
Ablation studies confirm the complementary roles of the two core branches: the Global Fusion Blocks are crucial for preserving large-scale structural coherence, while the ST Recurrent Refiner enhances temporal consistency and reduces local redundancy. The model also exhibits favorable training stability and generalization capability, as evidenced by consistent loss curves across training, validation, and test phases.
Despite these advancements, our work has certain limitations. Constrained by available computational resources, we utilized only a subset of the Digital Typhoon dataset. Future work could incorporate data from all available years. Furthermore, our preprocessing involved downsampling of typhoon imagery, which may result in the loss of fine-grained details. Future work could explore training and inference on full-resolution typhoon images, avoiding cropping or downsampling to preserve complete spatial information. Additionally, our current study relies solely on infrared imagery from a single satellite band. Incorporating multimodal data sources, such as water vapor channels, radar estimates, and microwave imagery, could provide complementary atmospheric information and potentially improve prediction accuracy, particularly for cloud structures obscured in infrared imagery. Extending the evaluation to other ocean basins would also help verify the model’s generalizability under different climate regimes. Although these extensions would entail increased computational costs, they are likely to enhance both predictive performance and practical applicability.
In summary, FusionTyphoonPredictor provides a robust and scalable solution for typhoon cloud image forecasting, with potential implications for improving early warning systems and disaster preparedness in typhoon-prone regions.
Author Contributions
Conceptualization, H.L. and J.L.; methodology, H.L. and J.L.; software, H.L.; validation, H.L., J.L., Y.L. and Z.L.; formal analysis, H.L. and Z.L.; investigation, H.L. and J.L.; resources, J.L. and Y.L.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, J.L. and Y.L.; visualization, H.L. and Z.L.; supervision, J.L. and Y.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The dataset analyzed in this study, Digital Typhoon [4], is publicly available.
Acknowledgments
The authors would like to thank the authors of all references used in the paper, the editors, and the anonymous reviewers for their detailed comments and suggestions.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Additional Metrics
We conducted supplementary experiments incorporating additional metrics: Peak Signal-to-Noise Ratio (PSNR) and Root Mean Squared Error (RMSE). PSNR provides a distortion-sensitive measure of reconstruction quality on a logarithmic scale, while RMSE reports the squared-error penalty in the same units as the pixel intensities.
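Both metrics are simple functions of MSE, as the sketch below shows. Pixel values normalised to [0, 1] (`data_range = 1.0`) and the example batch values are assumptions made for illustration. The last lines show that averaging RMSE per batch is not the same as taking the square root of the averaged MSE, which matters when metrics are averaged batch by batch.

```python
import math


def rmse_from_mse(mse):
    return math.sqrt(mse)


def psnr_from_mse(mse, data_range=1.0):
    """PSNR in dB for signals whose maximum possible value is `data_range`."""
    return 10.0 * math.log10(data_range ** 2 / mse)


example_rmse = rmse_from_mse(0.01)
example_psnr = psnr_from_mse(0.01)

# Two hypothetical batches with different error levels.
batch_mse = [0.01, 0.04]
mean_of_rmse = sum(rmse_from_mse(m) for m in batch_mse) / len(batch_mse)
rmse_of_mean = rmse_from_mse(sum(batch_mse) / len(batch_mse))
```

Because the square root is concave, the batch-averaged RMSE is never larger than the square root of the batch-averaged MSE, so the two columns of a results table need not satisfy RMSE² = MSE exactly.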
We note that the values in Table A1 differ slightly from those reported in the main text (Table 3). The supplementary results were obtained from an independent re-run using identical settings (same data split, preprocessing, optimizer, and random seed). These minor discrepancies likely stem from the inherent nondeterminism of GPU computations and framework-level scheduling (such as the order of floating-point accumulation, CUDA kernel execution, and data loading), which can introduce slight run-to-run variation even when the random seed is fixed. The differences are within the range expected from such stochastic noise, and the relative rankings among models and the overall conclusions remain largely consistent with the results reported in the main text.
Table A1.
Additional evaluation metrics (PSNR, RMSE, MAE, MSE, SSIM) at the epoch of maximum validation SSIM.
| Model | PSNR ↑ | RMSE ↓ | MAE ↓ | MSE ↓ | SSIM ↑ |
|---|---|---|---|---|---|
| PredRNN | 16.14 | 0.1569 | 0.1218 | 0.0249 | 0.4658 |
| PredRNN++ | 14.51 | 0.1902 | 0.1595 | 0.0370 | 0.3941 |
| PhyDNet | 17.86 | 0.1309 | 0.0896 | 0.0179 | 0.4396 |
| PredRNNv2 | 18.59 | 0.1195 | 0.0877 | 0.0147 | 0.4761 |
| PredFormer | 18.93 | 0.1151 | 0.0786 | 0.0137 | 0.4595 |
| FusionTyphoonPredictor | 19.02 | 0.1142 | 0.0774 | 0.0135 | 0.4821 |
References
1. Knutson, T.; Camargo, S.J.; Chan, J.C.; Emanuel, K.; Ho, C.H.; Kossin, J.; Mohapatra, M.; Satoh, M.; Sugi, M.; Walsh, K.; et al. Tropical cyclones and climate change assessment: Part II: Projected response to anthropogenic warming. Bull. Am. Meteorol. Soc. 2020, 101, E303–E322.
2. Rolnick, D.; Donti, P.L.; Kaack, L.H.; Kochanski, K.; Lacoste, A.; Sankaran, K.; Ross, A.S.; Milojevic-Dupont, N.; Jaques, N.; Waldman-Brown, A.; et al. Tackling climate change with machine learning. ACM Comput. Surv. (CSUR) 2022, 55, 42.
3. Kitamoto, A.; Hwang, J.; Vuillod, B.; Gautier, L.; Tian, Y.; Clanuwat, T. Digital typhoon: Long-term satellite image dataset for the spatio-temporal modeling of tropical cyclones. Adv. Neural Inf. Process. Syst. 2023, 36, 40623–40636.
4. Kitamoto, A.; Dzik, E.; Faure, G. Machine Learning for the Digital Typhoon Dataset: Extensions to Multiple Basins and New Developments in Representations and Tasks. arXiv 2024, arXiv:2411.16421.
5. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
6. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
7. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
8. Price, I.; Sanchez-Gonzalez, A.; Alet, F.; Andersson, T.R.; El-Kadi, A.; Masters, D.; Ewalds, T.; Stott, J.; Mohamed, S.; Battaglia, P.; et al. Gencast: Diffusion-based ensemble forecasting for medium-range weather. arXiv 2023, arXiv:2312.15796.
9. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
10. Kalnay, E. Atmospheric Modeling, Data Assimilation and Predictability; Cambridge University Press: Cambridge, UK, 2003.
11. Bauer, P.; Thorpe, A.; Brunet, G. The quiet revolution of numerical weather prediction. Nature 2015, 525, 47–55.
12. Powers, J.G.; Klemp, J.B.; Skamarock, W.C.; Davis, C.A.; Dudhia, J.; Gill, D.O.; Coen, J.L.; Gochis, D.J.; Ahmad, N.; Peckham, S.E.; et al. The Weather Research and Forecasting Model: Overview, System Efforts, and Future Directions. Bull. Am. Meteorol. Soc. 2017, 98, 1717–1737.
13. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Adv. Neural Inf. Process. Syst. 2015, 28, 802–810.
14. Gao, Z.; Tan, C.; Wu, L.; Li, S.Z. Simvp: Simpler yet better video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 3170–3180.
15. Tan, C.; Gao, Z.; Li, S.; Li, S.Z. SimVPv2: Towards Simple Yet Powerful Spatiotemporal Predictive Learning. IEEE Trans. Multimed. 2025, 27, 5170–5184.
16. Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. Adv. Neural Inf. Process. Syst. 2017, 30, 879–888.
17. Wang, Y.; Gao, Z.; Long, M.; Wang, J.; Yu, P.S. Predrnn++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2018; pp. 5123–5132.
18. Wang, Y.; Wu, H.; Zhang, J.; Gao, Z.; Wang, J.; Yu, P.S.; Long, M. Predrnn: A recurrent neural network for spatiotemporal predictive learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2208–2225.
19. Srivastava, N.; Mansimov, E.; Salakhutdinov, R. Unsupervised learning of video representations using lstms. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2015; pp. 843–852.
20. Wang, X.; Qin, M.; Zhang, Z.; Wang, Y.; Du, Z.; Wang, N. Typhoon cloud image prediction based on enhanced multi-scale deep neural network. Front. Mar. Sci. 2023, 9, 956813.
21. Guen, V.L.; Thome, N. Disentangling physical dynamics from unknown factors for unsupervised video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 11474–11484.
22. Yuan, J.; Zhao, L.; Yu, R.; Lu, X.; Xia, M.; Liu, Y.; Wang, Y.; Wang, X. A Physics-Enhanced Network for Predicting Sequential Satellite Images of Typhoon Clouds. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 16798–16815.
23. Tang, Y.; Qi, L.; Xie, F.; Li, X.; Ma, C.; Yang, M.H. Video Prediction Transformers without Recurrence or Convolution. arXiv 2024, arXiv:2410.04733.
24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
25. Chen, L.; Du, F.; Hu, Y.; Wang, Z.; Wang, F. Swinrdm: Integrate swinrnn with diffusion model towards high-resolution and high-quality weather forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Palo Alto, CA, USA, 2023; Volume 37, pp. 322–330.
26. Gao, Z.; Shi, X.; Wang, H.; Zhu, Y.; Wang, Y.B.; Li, M.; Yeung, D.Y. Earthformer: Exploring space-time transformers for earth system forecasting. Adv. Neural Inf. Process. Syst. 2022, 35, 25390–25403.
27. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; pp. 1874–1883.
28. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752.
29. Hendrycks, D.; Gimpel, K. Gaussian Error Linear Units (Gelus). arXiv 2016, arXiv:1606.08415.
30. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
31. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–19.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; pp. 770–778.
33. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
34. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
35. Smith, L.N. Cyclical learning rates for training neural networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE: New York, NY, USA, 2017; pp. 464–472.
36. Tan, C.; Li, S.; Gao, Z.; Guan, W.; Wang, Z.; Liu, Z.; Wu, L.; Li, S.Z. OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning. In Proceedings of the Conference on Neural Information Processing Systems Datasets and Benchmarks Track; Curran Associates, Inc.: Red Hook, NY, USA, 2023.