ViViT-Prob: A Radar Echo Extrapolation Model Based on Video Vision Transformer and Spatiotemporal Sparse Attention
Abstract
1. Introduction
2. Data
2.1. Moving MNIST Dataset
2.2. Radar Echo Dataset
3. Method
3.1. Network Structure
3.1.1. Transformer Structure
3.1.2. Patch Coding Based on 3D Convolution
3.1.3. Sparse Attention Module Based on Multi-Head Spatiotemporal Fusion
3.1.4. Parallel Decoding with Temporal-Spatial Relationships
3.2. Loss Function
3.3. Evaluation Metrics
4. Experiments and Analysis
4.1. Moving MNIST Experiments
4.2. Radar Echo Dataset Experiments
4.3. Ablation Experiments
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
| Models | Description |
|---|---|
| ConvLSTM | A classic spatiotemporal sequence prediction network that applies convolutional operations to both the network states and the input data, capturing spatial features while modeling temporal and spatial correlations. |
| PredRNN | A recurrent neural network that employs a unified memory pool to store spatial representations and temporal changes. Hidden states are no longer confined within individual LSTM cells but can propagate both horizontally and vertically. |
| CausalLSTM | A spatiotemporal memory unit that enhances feature amplification through additional nonlinear operations while maintaining hierarchical invariance, enabling better capture of short-term dynamics. |
| E3D-LSTM | A spatiotemporal sequence prediction model that integrates LSTM with 3D convolutions and incorporates self-attention mechanisms, demonstrating strong capabilities in predicting multidimensional spatiotemporal data. |
| MIM | A neural network prediction model designed to learn higher-order non-stationarity in spatiotemporal dynamics. It combines historical information with the current state to predict future states, capturing complex relationships in spatiotemporal data. |
| SA-ConvLSTM | A ConvLSTM model enhanced with self-attention mechanisms. It dynamically adjusts information flow within the network to better learn complex patterns in spatiotemporal data. |
| STAE | A spatiotemporal sequence prediction model that uses temporal attention to weight time information, thereby improving prediction performance. |
| VPTR | A transformer-based video prediction model, available in three variants: fully autoregressive (VPTR-FAR), partially autoregressive (VPTR-PAR), and non-autoregressive (VPTR-NAR). |
| OpticalFlow | A radar echo extrapolation model based on optical flow. It tracks the motion of precipitation features across a series of radar echo images and advects the field to infer the echo at the next time step. |
| RainNet | A deep learning radar echo extrapolation model that uses quality-controlled weather radar data provided by the German Weather Service to predict persistent echoes and precipitation intensity. |
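To make the first row of the table concrete: the defining trait of ConvLSTM is that every gate transform is a convolution, so the hidden and cell states keep their 2D spatial layout instead of being flattened. Below is a minimal PyTorch sketch of one such cell; channel counts and kernel size are illustrative, not the configuration used by any of the benchmarked models.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: every gate transform is a convolution,
    so the hidden and cell states keep their 2D spatial layout."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # one convolution produces all four gates (i, f, g, o) at once
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                   # each (B, hid_ch, H, W)
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# one step on a batch of 64x64 single-channel frames
cell = ConvLSTMCell(in_ch=1, hid_ch=32)
x = torch.randn(8, 1, 64, 64)
h = c = torch.zeros(8, 32, 64, 64)
h, c = cell(x, (h, c))
```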
|  | Models | MSE↓ | MAE↓ | SSIM↑ |
|---|---|---|---|---|
| RNN-Based Models | ConvLSTM | 103.3 | 182.9 | 0.707 |
|  | PredRNN | 56.8 | 126.1 | 0.867 |
|  | CausalLSTM | 46.5 | 106.8 | 0.898 |
|  | MIM | 44.2 | 101.1 | 0.910 |
|  | E3D-LSTM | 41.3 | 86.4 | 0.910 |
|  | SA-ConvLSTM | 43.9 | 94.7 | 0.913 |
|  | STAE | 35.2 | * | 0.929 |
| Transformer-Based Models | VPTR-FAR | 107.2 | * | 0.844 |
|  | VPTR-PAR | 93.2 | * | 0.859 |
|  | VPTR-NAR | 63.6 | * | 0.882 |
|  | Ours (ViViT-Prob) | 29.1 | 89.1 | 0.923 |

\* MAE not reported for this model.
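The three column metrics are standard and straightforward to reproduce. The sketch below computes frame-averaged MSE, MAE, and SSIM for one predicted sequence using NumPy and scikit-image; note that papers differ in normalization (per-pixel vs. per-frame sums, pixel scaling), so absolute values from this sketch need not match the table.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def sequence_metrics(pred, target):
    """pred, target: (T, H, W) float arrays scaled to [0, 1].
    Returns frame-averaged MSE, MAE, and SSIM for one sequence."""
    mse = float(np.mean((pred - target) ** 2))
    mae = float(np.mean(np.abs(pred - target)))
    s = float(np.mean([ssim(t, p, data_range=1.0)
                       for p, t in zip(target, pred)]))
    return mse, mae, s

# usage on dummy data
pred = np.random.rand(10, 64, 64)
target = np.random.rand(10, 64, 64)
print(sequence_metrics(pred, target))
```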
|  | Models | MSE↓ | MAE↓ | SSIM↑ |
|---|---|---|---|---|
| Radar Echo Extrapolation Models | OpticalFlow | 87.8 | 1.745 | 0.604 |
|  | RainNet | 84.7 | 2.335 | 0.572 |
| Spatiotemporal Sequence Models | ConvLSTM | 100.6 | 1.719 | 0.613 |
|  | PredRNN | 74.3 | 1.657 | 0.717 |
|  | Ours (ViViT-Prob) | 58.4 | 1.542 | 0.744 |
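As described in the model table, the OpticalFlow baseline estimates a dense motion field between consecutive echo frames and advects the most recent frame along it. A minimal one-step sketch using OpenCV's Farneback estimator and backward (semi-Lagrangian) warping follows; the actual baseline may use a different flow estimator and advection scheme.

```python
import cv2
import numpy as np

def extrapolate_one_step(prev_frame, curr_frame):
    """prev_frame, curr_frame: uint8 grayscale radar echo images.
    Estimates dense motion with Farneback optical flow, then advects
    the current frame one step forward by backward warping."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_frame, curr_frame, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    h, w = curr_frame.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # semi-Lagrangian backward warping: each output pixel samples the
    # location the echo is assumed to have come from
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    return cv2.remap(curr_frame, map_x, map_y, cv2.INTER_LINEAR)

# usage on dummy frames; iterate the call to extrapolate further ahead
f0 = np.random.randint(0, 255, (64, 64), dtype=np.uint8)
f1 = np.random.randint(0, 255, (64, 64), dtype=np.uint8)
f2_pred = extrapolate_one_step(f0, f1)
```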
| Encoder | Decoder | Attention | MSE↓ | MAE↓ | SSIM↑ |
|---|---|---|---|---|---|
| Patch | * | Multi-head Self-Attention | 78.7 | 1.736 | 0.623 |
| Patch | Patch | Multi-head Self-Attention | 72.9 | 1.712 | 0.646 |
| Patch | Patch | Multi-head ProbSparse Self-Attention | 67.3 | 1.637 | 0.682 |
| 3DCNN | 3DCNN * | Multi-head Self-Attention | 70.4 | 1.692 | 0.667 |
| 3DCNN | 3DCNN | Multi-head Self-Attention | 64.8 | 1.583 | 0.727 |
| 3DCNN | 3DCNN | Multi-head Cross-Attention | 68.2 | 1.624 | 0.703 |
| 3DCNN | 3DCNN | Multi-head ProbSparse Self-Attention | 58.4 | 1.542 | 0.744 |
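The two ingredients the ablation isolates, 3D-convolutional patch encoding and multi-head ProbSparse self-attention, can both be sketched compactly. The first is a tubelet-style embedding: a `Conv3d` whose kernel equals its stride cuts the input video into non-overlapping spatiotemporal patches and maps each to a token. Sizes below are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

# Tubelet-style patch encoder: kernel == stride, so each output position
# sees exactly one non-overlapping 2x8x8 spatiotemporal patch.
to_tokens = nn.Conv3d(in_channels=1, out_channels=256,
                      kernel_size=(2, 8, 8), stride=(2, 8, 8))

x = torch.randn(2, 1, 10, 64, 64)                 # (B, C, T, H, W)
tokens = to_tokens(x).flatten(2).transpose(1, 2)  # (B, N, 256), N = 5*8*8
```

The second follows the ProbSparse idea popularized by Informer: rank queries by a max-minus-mean score measure, let only the top-u "active" queries attend over all keys, and let the remaining positions fall back to the mean of the values. The sketch below simplifies the original (no key subsampling, a single fixed sampling factor) and is not the paper's exact module.

```python
import math
import torch

def probsparse_attention(q, k, v, factor=5):
    """Simplified ProbSparse self-attention.
    q, k, v: (B, H, L, D). Only the top-u queries (by max-minus-mean
    score) get a full softmax row; the rest output the mean of v."""
    B, H, L, D = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(D)              # (B, H, L, L)
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)   # (B, H, L)
    u = min(L, int(factor * math.ceil(math.log(L + 1))))
    top = sparsity.topk(u, dim=-1).indices                       # (B, H, u)
    # "lazy" queries default to the average of the values
    out = v.mean(dim=-2, keepdim=True).expand(B, H, L, D).clone()
    top_rows = scores.gather(-2, top.unsqueeze(-1).expand(B, H, u, L))
    out.scatter_(-2, top.unsqueeze(-1).expand(B, H, u, D),
                 torch.softmax(top_rows, dim=-1) @ v)
    return out

q = k = v = torch.randn(2, 4, 100, 32)
print(probsparse_attention(q, k, v).shape)  # torch.Size([2, 4, 100, 32])
```

With L tokens per head, full attention computes all L softmax rows, while this variant computes only u ≈ c·ln L of them, which is where the efficiency gain of the sparse module comes from.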