Deep Learning for Regular Raster Spatio-Temporal Prediction: An Overview
Abstract
1. Introduction
2. Preliminaries
3. Convolutional Neural Networks
Benefits and Drawbacks of Convolutional Neural Networks
4. Recurrent Neural Networks
Benefits and Drawbacks of Recurrent Neural Networks
5. Hybrid Convolutional–Recurrent Networks
Benefits and Drawbacks of Hybrid Convolutional–Recurrent Networks
6. Transformers
6.1. Vision Transformers and Transformers for Video Processing
6.2. Transformers for Spatio-Temporal Processing in Heterogeneous Domains
6.3. Benefits and Drawbacks of Transformers
7. Diffusion Models for Spatio-Temporal Modelling
Benefits and Drawbacks of Diffusion Models
8. Discussions and Future Research Directions
- Uncertainty estimation for spatio-temporal predictions. With the exception of diffusion models, the reviewed deep learning approaches do not yield uncertainty estimates for their spatio-temporal predictions. This aspect is crucial in many spatio-temporal prediction tasks concerning environmental safety [152,153] or extreme event forecasting [154]. Prediction uncertainty can be quantified through the Bayesian learning framework [155] or by means of generative AI models, such as diffusion or flow models (see Section 7).
- Physics constraints in spatio-temporal prediction. The spatio-temporal dynamics of many real-world phenomena obey physical laws and properties. Deep learning is of great utility for inferring such dynamics, but in many cases its data-driven nature can be strengthened by embedding physical constraints, governing equations and first principles directly into the training process [156]. In this way, predictions are not only statistically accurate but also consistent with known physical laws. This property is particularly important in regular raster spatio-temporal prediction tasks such as climate modelling [157,158] and fluid dynamics, where respecting physical laws is critical.
- Model explainability. While deep learning models can achieve strong predictive performance, they act as black boxes, making it difficult, especially for users who are not familiar with deep learning, to understand why certain predictions are obtained, which limits trust and applicability. It is therefore important to develop strategies and methods that make deep learning models explainable in spatio-temporal prediction tasks [159]. A relevant direction towards model explainability is causality [160], namely, discovering cause–effect relationships within spatio-temporal input data. Integrating causality into deep learning models can provide more interpretable insights into the underlying spatio-temporal dynamics and can also yield models that generalise more robustly.
- Foundation models. Foundation models [161] are a relevant recent topic in deep learning. Representing the next generation of pre-trained models, they are trained on massive and heterogeneous datasets, providing general-purpose backbones that can be fine-tuned or extended for diverse tasks. Recently, pre-trained foundation models have also been proposed for spatio-temporal data, especially for Earth observation. Some of the most relevant models in this field are Aurora [162], Prithvi [163], Clay [164], Presto [115] and DOFA [165], which leverage the concept of neural plasticity to adaptively integrate multiple data modalities according to the task at hand. Indeed, multimodal learning is another relevant topic in regular raster spatio-temporal prediction. Multimodal learning refers to the ability to integrate and use multiple data sources to address a given task, leveraging the complementary information provided by heterogeneous data modalities. This capability plays a crucial role in many Earth observation tasks, such as cloud removal, where the combined use of optical and radar data has proven particularly advantageous [166]. Furthermore, it is worth remarking that a few foundation models based on Large Language Models (LLMs) have been applied to spatio-temporal prediction [167]. Finally, in the field of pre-trained models, the Mixture-of-Experts [168] is a further relevant topic. Mixture-of-experts architectures, generally built on transformers [169], can be used to train large pre-trained models while maintaining moderate computational cost. In these architectures, different experts, themselves neural networks, process different parts of the input; a routing network decides which experts should process a given input, activating only a few at a time. This sparse computation allows mixture-of-experts models to match the performance of dense models while being less computationally demanding and more scalable.
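The uncertainty-estimation point above can be illustrated with a deep-ensemble-style sketch: train (or here, simulate) several independent predictors and take the spread of their predictions as an epistemic uncertainty estimate. This is a purely illustrative toy, assuming hypothetical linear "members" in place of independently trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy ensemble: M noisy linear maps stand in for
# M independently trained spatio-temporal predictors.
M = 8
true_w = 2.0
ensemble_w = true_w + 0.1 * rng.standard_normal(M)  # per-member weights

def ensemble_predict(x):
    """Per-member predictions for input grid x; shape (M, len(x))."""
    return np.outer(ensemble_w, x)

x = np.linspace(0.0, 1.0, 5)
preds = ensemble_predict(x)

# Predictive mean, and standard deviation across members: the latter is
# a simple epistemic-uncertainty estimate in the deep-ensemble spirit.
mean = preds.mean(axis=0)
std = preds.std(axis=0)
```

With real networks, the same mean/std reduction over member outputs yields per-pixel uncertainty maps for raster predictions.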
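The physics-constraint idea above is commonly realised as a composite loss: a data-misfit term plus a penalty on the residual of a governing equation. A minimal sketch, assuming a hypothetical 1D advection equation u_t + c·u_x = 0 discretised with forward finite differences on a regular raster grid (the function name and setup are illustrative, not from the reviewed works):

```python
import numpy as np

def physics_informed_loss(u_pred, u_obs, dx, dt, c, lam=1.0):
    """MSE data loss plus a penalty on the residual of the 1D advection
    equation u_t + c * u_x = 0, via forward finite differences.
    u_pred, u_obs: arrays of shape (T, X) on a regular raster grid."""
    data_loss = np.mean((u_pred - u_obs) ** 2)
    u_t = (u_pred[1:, :-1] - u_pred[:-1, :-1]) / dt   # forward diff in time
    u_x = (u_pred[:-1, 1:] - u_pred[:-1, :-1]) / dx   # forward diff in space
    physics_loss = np.mean((u_t + c * u_x) ** 2)
    return data_loss + lam * physics_loss

# A travelling wave u(t, x) = sin(2*pi*(x - c*t)) solves the advection
# equation, so its physics residual is small (up to discretisation error).
c, dx, dt = 1.0, 0.005, 0.005
x = np.arange(0.0, 1.0, dx)
t = np.arange(0.0, 0.5, dt)
u_exact = np.sin(2 * np.pi * (x[None, :] - c * t[:, None]))
loss_exact = physics_informed_loss(u_exact, u_exact, dx, dt, c)
```

In a physics-informed training loop, this composite loss replaces the plain data loss, steering the network towards physically consistent predictions.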
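The sparse routing described for mixture-of-experts models can be sketched in a few lines: a gating network scores all experts, but only the top-k are actually evaluated and their outputs combined. This is a hypothetical toy (linear experts, single input vector), not the architecture of any specific reviewed model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical toy mixture-of-experts: each expert is a linear map; a
# routing network scores the experts and only the top-k are executed.
rng = np.random.default_rng(0)
d_in, d_out, n_experts, k = 4, 3, 8, 2
experts = rng.standard_normal((n_experts, d_in, d_out))
router_w = rng.standard_normal((d_in, n_experts))

def moe_forward(x):
    """Sparse MoE forward pass for a single input vector x."""
    scores = softmax(x @ router_w)          # gate probabilities over experts
    top = np.argsort(scores)[-k:]           # indices of the k best experts
    gate = scores[top] / scores[top].sum()  # renormalised gate weights
    # Only k of the n_experts weight matrices are touched: sparse computation.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

y = moe_forward(rng.standard_normal(d_in))
```

Because only k ≪ n_experts experts run per input, parameter count can grow with n_experts while per-input compute stays roughly constant, which is the scalability argument made above.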
9. Conclusions
- Methodology and techniques underlying the reviewed models;
- Benefits and drawbacks, allowing the identification of domains where some models can be more effective than others;
- Diverse applicative realms, with their respective datasets, where regular raster spatio-temporal prediction is relevant.
Author Contributions
Funding
Conflicts of Interest
References
- Xiao, C.; Chen, N.; Hu, C.; Wang, K.; Xu, Z.; Cai, Y.; Xu, L.; Chen, Z.; Gong, J. A spatiotemporal deep learning model for sea surface temperature field prediction using time-series satellite data. Environ. Model. Softw. 2019, 120, 104502. [Google Scholar] [CrossRef]
- Censi, A.M.; Ienco, D.; Gbodjo, Y.J.E.; Pensa, R.G.; Interdonato, R.; Gaetano, R. Attentive Spatial Temporal Graph CNN for Land Cover Mapping From Multi Temporal Remote Sensing Data. IEEE Access 2021, 9, 23070–23082. [Google Scholar] [CrossRef]
- Glomb, K.; Rué Queralt, J.; Pascucci, D.; Defferrard, M.; Tourbier, S.; Carboni, M.; Rubega, M.; Vulliémoz, S.; Plomp, G.; Hagmann, P. Connectome spectral analysis to track EEG task dynamics on a subsecond scale. NeuroImage 2020, 221, 117137. [Google Scholar] [CrossRef] [PubMed]
- Pillai, K.G.; Angryk, R.A.; Banda, J.M.; Schuh, M.A.; Wylie, T. Spatio-temporal Co-occurrence Pattern Mining in Data Sets with Evolving Regions. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining Workshops, Brussels, Belgium, 10–13 December 2012; pp. 805–812. [Google Scholar] [CrossRef]
- Morosin, R.; De La Cruz Rodríguez, J.; Díaz Baso, C.J.; Leenaarts, J. Spatio-temporal analysis of chromospheric heating in a plage region. Astron. Astrophys. 2022, 664, A8. [Google Scholar] [CrossRef]
- Cressie, N.A.C. Statistics for Spatial Data; Wiley series in probability and mathematical statistics; Wiley: New York, NY, USA, 1993. [Google Scholar]
- Cressie, N.A.C.; Wikle, C.K. Statistics for Spatio-Temporal Data; Wiley series in probability and statistics; Wiley: Hoboken, NJ, USA, 2011. [Google Scholar]
- Atluri, G.; Karpatne, A.; Kumar, V. Spatio-Temporal Data Mining: A Survey of Problems and Methods. ACM Comput. Surv. 2019, 51, 1–41. [Google Scholar] [CrossRef]
- Wang, S.; Cao, J.; Yu, P.S. Deep Learning for Spatio-Temporal Data Mining: A Survey. IEEE Trans. Knowl. Data Eng. 2020, 34, 3681–3700. [Google Scholar] [CrossRef]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 6000–6010. [Google Scholar]
- Shi, X.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-k.; Woo, W.-c. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28, pp. 802–810. [Google Scholar]
- Gao, Z.; Shi, X.; Wang, H.; Zhu, Y.; Wang, Y.B.; Li, M.; Yeung, D.Y. Earthformer: Exploring Space-Time Transformers for Earth System Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 25390–25403. [Google Scholar]
- Wang, Y.; Wu, H.; Zhang, J.; Gao, Z.; Wang, J.; Yu, P.S.; Long, M. PredRNN: A Recurrent Neural Network for Spatiotemporal Predictive Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2208–2225. [Google Scholar] [CrossRef]
- Benson, V.; Robin, C.; Requena-Mesa, C.; Alonso, L.; Carvalhais, N.; Cortés, J.; Gao, Z.; Linscheid, N.; Weynants, M.; Reichstein, M. Multi-modal Learning for Geospatial Vegetation Forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 27788–27799. [Google Scholar]
- Mou, L.; Bruzzone, L.; Zhu, X.X. Learning Spectral-Spatial-Temporal Features via a Recurrent Convolutional Neural Network for Change Detection in Multispectral Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 924–935. [Google Scholar] [CrossRef]
- Xiao, Y.; Yuan, Q.; He, J.; Zhang, Q.; Sun, J.; Su, X.; Wu, J.; Zhang, L. Space-time super-resolution for satellite video: A joint framework based on multi-scale spatial-temporal transformer. Int. J. Appl. Earth Obs. Geoinf. 2022, 108, 102731. [Google Scholar] [CrossRef]
- Ebel, P.; Fare Garnot, V.S.; Schmitt, M.; Wegner, J.D.; Zhu, X.X. UnCRtainTS: Uncertainty Quantification for Cloud Removal in Optical Satellite Time Series. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 2086–2096. [Google Scholar] [CrossRef]
- Panboonyuen, T.; Charoenphon, C.; Satirapod, C. SatDiff: A Stable Diffusion Framework for Inpainting Very High-Resolution Satellite Imagery. IEEE Access 2025, 13, 51617–51631. [Google Scholar] [CrossRef]
- Guo, S.; Lin, Y.; Li, S.; Chen, Z.; Wan, H. Deep Spatial–Temporal 3D Convolutional Neural Networks for Traffic Data Forecasting. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3913–3926. [Google Scholar] [CrossRef]
- Cui, Z.; Ke, R.; Pu, Z.; Wang, Y. Stacked bidirectional and unidirectional LSTM recurrent neural network for forecasting network-wide traffic state with missing values. Transp. Res. Part C Emerg. Technol. 2020, 118, 102674. [Google Scholar] [CrossRef]
- Yan, H.; Ma, X.; Pu, Z. Learning Dynamic and Hierarchical Traffic Spatiotemporal Features with Transformer. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22386–22399. [Google Scholar] [CrossRef]
- Yu, D.; Guo, G.; Wang, D.; Zhang, H.; Li, B.; Xu, G.; Deng, S. Modeling dynamic spatio-temporal correlations and transitions with time window partitioning for traffic flow prediction. Expert Syst. Appl. 2024, 252, 124187. [Google Scholar] [CrossRef]
- Huang, H.; Castruccio, S.; Genton, M.G. Forecasting High-Frequency Spatio-Temporal Wind Power with Dimensionally Reduced Echo State Networks. J. R. Stat. Soc. Ser. C Appl. Stat. 2022, 71, 449–466. [Google Scholar] [CrossRef]
- Žalik, M.; Mongus, D.; Lukač, N. High-resolution spatiotemporal assessment of solar potential from remote sensing data using deep learning. Renew. Energy 2024, 222, 119868. [Google Scholar] [CrossRef]
- Girdhar, R.; Carreira, J.; Doersch, C.; Zisserman, A. Video Action Transformer Network. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 244–253. [Google Scholar] [CrossRef]
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A Video Vision Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846. [Google Scholar] [CrossRef]
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar] [CrossRef]
- Zhai, S.; Ye, Z.; Liu, J.; Xie, W.; Hu, J.; Peng, Z.; Xue, H.; Chen, D.; Wang, X.; Yang, L.; et al. StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 26822–26833. [Google Scholar]
- Wikle, C.K.; Zammit-Mangion, A.; Cressie, N.A.C. Spatio-Temporal Statistics with R; Chapman & Hall/CRC The R Series; CRC Press: Boca Raton, FL, USA; Taylor & Francis Group: London, UK; New York, NY, USA, 2019. [Google Scholar]
- Pfeifer, P.E.; Deutsch, S.J. A Three-Stage Iterative Procedure for Space-Time Modeling. Technometrics 1980, 22, 35. [Google Scholar] [CrossRef]
- Pfeifer, P.E.; Deutsch, S.J. Seasonal Space-Time ARIMA Modeling. Geogr. Anal. 1981, 13, 117–133. [Google Scholar] [CrossRef]
- Stoffer, D.S. Estimation and Identification of Space-Time ARMAX Models in the Presence of Missing Data. J. Am. Stat. Assoc. 1986, 81, 762–772. [Google Scholar] [CrossRef]
- Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 2000. [Google Scholar] [CrossRef]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Bishop, C.M.; Bishop, H. Deep Learning: Foundations and Concepts, 1st ed.; Springer International Publishing: Berlin/Heidelberg, Germany, 2024. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal Convolutional Networks for Action Segmentation and Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1003–1012. [Google Scholar] [CrossRef]
- Zhang, J.; Zheng, Y.; Qi, D.; Li, R.; Yi, X.; Li, T. Predicting citywide crowd flows using deep spatio-temporal residual networks. Artif. Intell. 2018, 259, 147–166. [Google Scholar] [CrossRef]
- Ham, Y.G.; Kim, J.H.; Luo, J.J. Deep learning for multi-year ENSO forecasts. Nature 2019, 573, 568–572. [Google Scholar] [CrossRef] [PubMed]
- Ayzel, G.; Heistermann, M.; Sorokin, A.; Nikitin, O.; Lukyanova, O. All convolutional neural networks for radar-based precipitation nowcasting. Procedia Comput. Sci. 2019, 150, 186–192. [Google Scholar] [CrossRef]
- Zammit-Mangion, A.; Wikle, C.K. Deep integro-difference equation models for spatio-temporal forecasting. Spat. Stat. 2020, 37, 100408. [Google Scholar] [CrossRef]
- Andersson, T.R.; Hosking, J.S.; Pérez-Ortiz, M.; Paige, B.; Elliott, A.; Russell, C.; Law, S.; Jones, D.C.; Wilkinson, J.; Phillips, T.; et al. Seasonal Arctic sea ice forecasting with probabilistic deep learning. Nat. Commun. 2021, 12, 5124. [Google Scholar] [CrossRef]
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231. [Google Scholar] [CrossRef]
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar] [CrossRef]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Dumoulin, V.; Visin, F. A guide to convolution arithmetic for deep learning. arXiv 2016, arXiv:1603.07285. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 234–241. [Google Scholar]
- Wu, P.; Yin, Z.; Yang, H.; Wu, Y.; Ma, X. Reconstructing Geostationary Satellite Land Surface Temperature Imagery Based on a Multiscale Feature Connected Convolutional Neural Network. Remote Sens. 2019, 11, 300. [Google Scholar] [CrossRef]
- Taylor, G.W.; Fergus, R.; LeCun, Y.; Bregler, C. Convolutional Learning of Spatio-temporal Features. In Computer Vision—ECCV 2010; Daniilidis, K., Maragos, P., Paragios, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6316, pp. 140–153. [Google Scholar]
- Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Gao, Z.; Tan, C.; Wu, L.; Li, S.Z. SimVP: Simpler yet Better Video Prediction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3160–3170. [Google Scholar] [CrossRef]
- Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric Deep Learning: Going beyond Euclidean data. IEEE Signal Process. Mag. 2017, 34, 18–42. [Google Scholar] [CrossRef]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Yang, X.; DelSole, T. Systematic Comparison of ENSO Teleconnection Patterns between Models and Observations. J. Clim. 2012, 25, 425–446. [Google Scholar] [CrossRef]
- Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
- Jaeger, H. The “Echo State” Approach to Analysing and Training Recurrent Neural Networks—With an Erratum Note; GMD Technical Report 148; German National Research Center for Information Technology: Bonn, Germany, 2001. [Google Scholar]
- Zhang, T.; Zheng, W.; Cui, Z.; Zong, Y.; Li, Y. Spatial–Temporal Recurrent Neural Network for Emotion Recognition. IEEE Trans. Cybern. 2019, 49, 839–847. [Google Scholar] [CrossRef] [PubMed]
- Jain, A.; Zamir, A.R.; Savarese, S.; Saxena, A. Structural-RNN: Deep Learning on Spatio-Temporal Graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Capone, V.; Casolaro, A.; Camastra, F. Spatio-temporal prediction using graph neural networks: A survey. Neurocomputing 2025, 643, 130400. [Google Scholar] [CrossRef]
- McDermott, P.L.; Wikle, C.K. An ensemble quadratic echo state network for non-linear spatio-temporal forecasting. Stat 2017, 6, 315–330. [Google Scholar] [CrossRef]
- McDermott, P.L.; Wikle, C.K. Deep echo state networks with uncertainty quantification for spatio-temporal forecasting. Environmetrics 2019, 30, e2553. [Google Scholar] [CrossRef]
- Fragkiadaki, K.; Levine, S.; Felsen, P.; Malik, J. Recurrent Network Models for Human Dynamics. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Srivastava, N.; Mansimov, E.; Salakhudinov, R. Unsupervised Learning of Video Representations using LSTMs. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 843–852. [Google Scholar]
- Jia, X.; Khandelwal, A.; Nayak, G.; Gerber, J.; Carlson, K.; West, P.; Kumar, V. Predict Land Covers with Transition Modeling and Incremental Learning. In Proceedings of the 2017 SIAM International Conference on Data Mining (SDM), Houston, TX, USA, 27–29 April 2017; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2017; pp. 171–179. [Google Scholar] [CrossRef]
- Jia, X.; Khandelwal, A.; Nayak, G.; Gerber, J.; Carlson, K.; West, P.; Kumar, V. Incremental Dual-memory LSTM in Land Cover Prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 867–876. [Google Scholar] [CrossRef]
- Reddy, D.S.; Prasad, P.R.C. Prediction of vegetation dynamics using NDVI time series data and LSTM. Model. Earth Syst. Environ. 2018, 4, 409–419. [Google Scholar] [CrossRef]
- Ndikumana, E.; Ho Tong Minh, D.; Baghdadi, N.; Courault, D.; Hossard, L. Deep Recurrent Neural Network for Agricultural Classification using multitemporal SAR Sentinel-1 for Camargue, France. Remote Sens. 2018, 10, 1217. [Google Scholar] [CrossRef]
- McDermott, P.L.; Wikle, C.K. Bayesian Recurrent Neural Network Models for Forecasting and Quantifying Uncertainty in Spatial-Temporal Data. Entropy 2019, 21, 184. [Google Scholar] [CrossRef]
- Vlachas, P.; Pathak, J.; Hunt, B.; Sapsis, T.; Girvan, M.; Ott, E.; Koumoutsakos, P. Backpropagation algorithms and Reservoir Computing in Recurrent Neural Networks for the forecasting of complex spatiotemporal dynamics. Neural Netw. 2020, 126, 191–217. [Google Scholar] [CrossRef] [PubMed]
- Lees, T.; Tseng, G.; Atzberger, C.; Reece, S.; Dadson, S. Deep Learning for Vegetation Health Forecasting: A Case Study in Kenya. Remote Sens. 2022, 14, 698. [Google Scholar] [CrossRef]
- Liu, Q.; Yang, M.; Mohammadi, K.; Song, D.; Bi, J.; Wang, G. Machine Learning Crop Yield Models Based on Meteorological Features and Comparison with a Process-Based Model. Artif. Intell. Earth Syst. 2022, 1, e220002. [Google Scholar] [CrossRef]
- Interdonato, R.; Ienco, D.; Gaetano, R.; Ose, K. DuPLO: A DUal view Point deep Learning architecture for time series classificatiOn. ISPRS J. Photogramm. Remote Sens. 2019, 149, 91–104. [Google Scholar] [CrossRef]
- Qiu, C.; Mou, L.; Schmitt, M.; Zhu, X.X. Local climate zone-based urban land cover classification from multi-seasonal Sentinel-2 images with a recurrent residual network. ISPRS J. Photogramm. Remote Sens. 2019, 154, 151–162. [Google Scholar] [CrossRef]
- Kaur, A.; Goyal, P.; Sharma, K.; Sharma, L.; Goyal, N. A Generalized Multimodal Deep Learning Model for Early Crop Yield Prediction. In Proceedings of the 2022 IEEE International Conference on Big Data (Big Data), Osaka, Japan, 17–20 December 2022; pp. 1272–1279. [Google Scholar] [CrossRef]
- Graves, A. Generating Sequences with Recurrent Neural Networks. arXiv 2013, arXiv:1308.0850. [Google Scholar] [CrossRef]
- Shi, X.; Gao, Z.; Lausen, L.; Wang, H.; Yeung, D.-Y.; Wong, W.-k.; Woo, W.-c. Deep Learning for Precipitation Nowcasting: A Benchmark and A New Model. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5622–5632. [Google Scholar]
- Wang, Y.; Long, M.; Wang, J.; Gao, Z.; Yu, P.S. PredRNN: Recurrent Neural Networks for Predictive Learning using Spatiotemporal LSTMs. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- Wang, Y.; Zhang, J.; Zhu, H.; Long, M.; Wang, J.; Yu, P.S. Memory in Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity From Spatiotemporal Dynamics. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9146–9154. [Google Scholar] [CrossRef]
- Yang, Y.; Dong, J.; Sun, X.; Lima, E.; Mu, Q.; Wang, X. A CFCC-LSTM Model for Sea Surface Temperature Prediction. IEEE Geosci. Remote Sens. Lett. 2018, 15, 207–211. [Google Scholar] [CrossRef]
- Ienco, D.; Interdonato, R.; Gaetano, R.; Ho Tong Minh, D. Combining Sentinel-1 and Sentinel-2 Satellite Image Time Series for land cover mapping via a multi-source deep learning architecture. ISPRS J. Photogramm. Remote Sens. 2019, 158, 11–22. [Google Scholar] [CrossRef]
- Boulila, W.; Ghandorh, H.; Khan, M.A.; Ahmed, F.; Ahmad, J. A novel CNN-LSTM-based approach to predict urban expansion. Ecol. Inform. 2021, 64, 101325. [Google Scholar] [CrossRef]
- Wu, H.; Yao, Z.; Wang, J.; Long, M. MotionRNN: A Flexible Model for Video Prediction with Spacetime-Varying Motions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15430–15439. [Google Scholar] [CrossRef]
- Robin, C.; Requena-Mesa, C.; Benson, V.; Alonso, L.; Poehls, J.; Carvalhais, N.; Reichstein, M. Learning to Forecast Vegetation Greenness at Fine Resolution over Africa with ConvLSTMs. arXiv 2022, arXiv:2210.13648. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar] [CrossRef]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training; Technical Report; OpenAI: San Francisco, CA, USA, 2018. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Bertasius, G.; Wang, H.; Torresani, L. Is Space-Time Attention All You Need for Video Understanding? In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 813–824. [Google Scholar]
- Yan, S.; Xiong, X.; Arnab, A.; Lu, Z.; Zhang, M.; Sun, C.; Schmid, C. Multiview Transformers for Video Recognition. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3323–3333. [Google Scholar] [CrossRef]
- Li, K.; Wang, Y.; Peng, G.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
- Neimark, D.; Bar, O.; Zohar, M.; Asselmann, D. Video Transformer Network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Montreal, QC, Canada, 11–17 October 2021; pp. 3163–3172. [Google Scholar]
- Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The Long-Document Transformer. arXiv 2020, arXiv:2004.05150. [Google Scholar] [CrossRef]
- Meinhardt, T.; Kirillov, A.; Leal-Taixé, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8844–8854. [Google Scholar]
- Feichtenhofer, C.; Fan, H.; Li, Y.; He, K. Masked Autoencoders As Spatiotemporal Learners. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 35946–35958. [Google Scholar]
- Caballero, J.; Ledig, C.; Aitken, A.; Acosta, A.; Totz, J.; Wang, Z.; Shi, W. Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2848–2857. [Google Scholar] [CrossRef]
- Zeng, Y.; Fu, J.; Chao, H. Learning Joint Spatial-Temporal Transformations for Video Inpainting. In Computer Vision—ECCV 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12361, pp. 528–543. [Google Scholar]
- Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10428–10437. [Google Scholar] [CrossRef]
- Alfasly, S.; Chui, C.K.; Jiang, Q.; Lu, J.; Xu, C. An Effective Video Transformer with Synchronized Spatiotemporal and Spatial Self-Attention for Action Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 2496–2509. [Google Scholar] [CrossRef]
- Lin, K.; Li, L.; Lin, C.C.; Ahmed, F.; Gan, Z.; Liu, Z.; Lu, Y.; Wang, L. SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17928–17937. [Google Scholar] [CrossRef]
- Yu, Y.; Ni, R.; Zhao, Y.; Yang, S.; Xia, F.; Jiang, N.; Zhao, G. MSVT: Multiple Spatiotemporal Views Transformer for DeepFake Video Detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4462–4471. [Google Scholar] [CrossRef]
- Tao, R.; Huang, B.; Zou, X.; Zheng, G. SVT-SDE: Spatiotemporal Vision Transformers-Based Self-Supervised Depth Estimation in Stereoscopic Surgical Videos. IEEE Trans. Med. Robot. Bionics 2023, 5, 42–53. [Google Scholar] [CrossRef]
- Zhou, W.; Zhao, Y.; Zhang, F.; Luo, B.; Yu, L.; Chen, B.; Yang, C.; Gui, W. TSDTVOS: Target-guided spatiotemporal dual-stream transformers for video object segmentation. Neurocomputing 2023, 555, 126582. [Google Scholar] [CrossRef]
- Hsu, T.C.; Liao, Y.S.; Huang, C.R. Video Summarization with Spatiotemporal Vision Transformer. IEEE Trans. Image Process. 2023, 32, 3013–3026. [Google Scholar] [CrossRef]
- Gupta, A.; Tian, S.; Zhang, Y.; Wu, J.; Martín-Martín, R.; Fei-Fei, L. MaskViT: Masked Visual Pre-Training for Video Prediction. In Proceedings of the The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Liang, J.; Cao, J.; Fan, Y.; Zhang, K.; Ranjan, R.; Li, Y.; Timofte, R.; Van Gool, L. VRT: A Video Restoration Transformer. IEEE Trans. Image Process. 2024, 33, 2171–2182. [Google Scholar] [CrossRef]
- Korban, M.; Youngs, P.; Acton, S.T. A Semantic and Motion-Aware Spatiotemporal Transformer Network for Action Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6055–6069. [Google Scholar] [CrossRef]
- Gu, F.; Lu, J.; Cai, C.; Zhu, Q.; Ju, Z. RTSformer: A Robust Toroidal Transformer with Spatiotemporal Features for Visual Tracking. IEEE Trans. Hum.-Mach. Syst. 2024, 54, 214–225. [Google Scholar] [CrossRef]
- Li, M.; Li, F.; Meng, B.; Bai, R.; Ren, J.; Huang, Z.; Gao, C. Spatiotemporal Representation Enhanced ViT for Video Recognition. In MultiMedia Modeling; Rudinac, S., Hanjalic, A., Liem, C., Worring, M., Jonsson, B., Liu, B., Yamakata, Y., Eds.; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2024; Volume 14554, pp. 28–40. [Google Scholar] [CrossRef]
- Lin, F.; Crawford, S.; Guillot, K.; Zhang, Y.; Chen, Y.; Yuan, X.; Chen, L.; Williams, S.; Minvielle, R.; Xiao, X.; et al. MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 5751–5761. [Google Scholar] [CrossRef]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 548–558. [Google Scholar] [CrossRef]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
- Tseng, G.; Cartuyvels, R.; Zvonkov, I.; Purohit, M.; Rolnick, D.; Kerner, H. Lightweight, Pre-trained Transformers for Remote Sensing Timeseries. arXiv 2023, arXiv:2304.14065. [Google Scholar] [CrossRef]
- Tang, S.; Li, C.; Zhang, P.; Tang, R. SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 13470–13479. [Google Scholar]
- Li, Z.; Chen, G.; Zhang, T. A CNN-Transformer Hybrid Approach for Crop Classification Using Multitemporal Multisensor Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 847–858. [Google Scholar] [CrossRef]
- Aksan, E.; Kaufmann, M.; Cao, P.; Hilliges, O. A Spatio-temporal Transformer for 3D Human Motion Prediction. In Proceedings of the 2021 International Conference on 3D Vision (3DV), London, UK, 1–3 December 2021; pp. 565–574. [Google Scholar] [CrossRef]
- Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607514. [Google Scholar] [CrossRef]
- Huang, L.; Mao, F.; Zhang, K.; Li, Z. Spatial-Temporal Convolutional Transformer Network for Multivariate Time Series Forecasting. Sensors 2022, 22, 841. [Google Scholar] [CrossRef] [PubMed]
- Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure Transformer Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
- Wang, Y.; Hong, D.; Sha, J.; Gao, L.; Liu, L.; Zhang, Y.; Rong, X. Spectral–Spatial–Temporal Transformers for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5536814. [Google Scholar] [CrossRef]
- Yu, M.; Masrur, A.; Blaszczak-Boxe, C. Predicting hourly PM2.5 concentrations in wildfire-prone areas using a SpatioTemporal Transformer model. Sci. Total Environ. 2023, 860, 160446. [Google Scholar] [CrossRef]
- Yi, Z.; Zhang, H.; Tan, P.; Gong, M. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
- Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
- Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv 2018, arXiv:1809.11096. [Google Scholar] [CrossRef]
- Yoon, J.; Jarrett, D.; van der Schaar, M. Time-series Generative Adversarial Networks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., Alché-Buc, F.d., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
- Liao, S.; Ni, H.; Szpruch, L.; Wiese, M.; Sabate-Vidales, M.; Xiao, B. Conditional Sig-Wasserstein GANs for Time Series Generation. arXiv 2020, arXiv:2006.05421. [Google Scholar] [CrossRef]
- Li, X.; Metsis, V.; Wang, H.; Ngu, A.H.H. TTS-GAN: A Transformer-Based Time-Series Generative Adversarial Network. In Artificial Intelligence in Medicine; Michalowski, M., Abidi, S.S.R., Abidi, S., Eds.; Springer International Publishing: Cham, Switzerland, 2022; Volume 13263, pp. 133–143. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
- Luo, C. Understanding Diffusion Models: A Unified Perspective. arXiv 2022, arXiv:2208.11970. [Google Scholar] [CrossRef]
- Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar]
- Yuan, H.; Zhou, S.; Yu, S. EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models. arXiv 2023, arXiv:2303.05656. [Google Scholar]
- Rühling Cachay, S.; Zhao, B.; Joren, H.; Yu, R. DYffusion: A Dynamics-informed Diffusion Model for Spatiotemporal Forecasting. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Volume 36, pp. 45259–45287. [Google Scholar]
- Han, X.; Zheng, H.; Zhou, M. CARD: Classification and Regression Diffusion Models. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 18100–18115. [Google Scholar]
- Awasthi, A.; Ly, S.T.; Nizam, J.; Mehta, V.; Ahmad, S.; Nemani, R.; Prasad, S.; Nguyen, H.V. Anomaly Detection in Satellite Videos Using Diffusion Models. In Proceedings of the 2024 IEEE 26th International Workshop on Multimedia Signal Processing (MMSP), West Lafayette, IN, USA, 2–4 October 2024; pp. 1–6. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar] [CrossRef]
- Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
- Ye, X.; Bilodeau, G.A. STDiff: Spatio-Temporal Diffusion for Continuous Stochastic Video Prediction. Proc. AAAI Conf. Artif. Intell. 2024, 38, 6666–6674. [Google Scholar] [CrossRef]
- Zhao, Z.; Dong, X.; Wang, Y.; Hu, C. Advancing Realistic Precipitation Nowcasting with a Spatiotemporal Transformer-Based Denoising Diffusion Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
- Zou, X.; Li, K.; Xing, J.; Zhang, Y.; Wang, S.; Jin, L.; Tao, P. DiffCR: A Fast Conditional Diffusion Framework for Cloud Removal From Optical Satellite Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
- Liu, H.; Liu, J.; Hu, T.; Ma, H. Spatio-Temporal Probabilistic Forecasting of Wind Speed Using Transformer-Based Diffusion Models. IEEE Trans. Sustain. Energy 2025, 1–13. [Google Scholar] [CrossRef]
- Yao, S.; Zhang, X.; Liu, X.; Liu, M.; Cui, Z. STDD: Spatio-Temporal Dual Diffusion for Video Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 12575–12584. [Google Scholar]
- Anderson, B.D. Reverse-time diffusion equation models. Stoch. Processes Their Appl. 1982, 12, 313–326. [Google Scholar] [CrossRef]
- Lipman, Y.; Chen, R.T.Q.; Ben-Hamu, H.; Nickel, M.; Le, M. Flow Matching for Generative Modeling. arXiv 2022, arXiv:2210.02747. [Google Scholar] [CrossRef]
- Lipman, Y.; Havasi, M.; Holderrieth, P.; Shaul, N.; Le, M.; Karrer, B.; Chen, R.T.Q.; Lopez-Paz, D.; Ben-Hamu, H.; Gat, I. Flow Matching Guide and Code. arXiv 2024, arXiv:2412.06264. [Google Scholar] [CrossRef]
- Lu, C.; Zhou, Y.; Bao, F.; Chen, J.; Li, C.; Zhu, J. DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models. Mach. Intell. Res. 2025, 22, 730–751. [Google Scholar] [CrossRef]
- Holderrieth, P.; Havasi, M.; Yim, J.; Shaul, N.; Gat, I.; Jaakkola, T.; Karrer, B.; Chen, R.T.Q.; Lipman, Y. Generator Matching: Generative modeling with arbitrary Markov processes. In Proceedings of the International Conference on Representation Learning, Singapore, 24–28 April 2025; Volume 2025, pp. 52153–52219. [Google Scholar]
- Peebles, W.; Xie, S. Scalable Diffusion Models with Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023. [Google Scholar] [CrossRef]
- Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; Volume 2, pp. 729–734. [Google Scholar] [CrossRef]
- Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 2009, 20, 61–80. [Google Scholar] [CrossRef] [PubMed]
- Xu, L.; Chen, N.; Yang, C.; Yu, H.; Chen, Z. Quantifying the uncertainty of precipitation forecasting using probabilistic deep learning. Hydrol. Earth Syst. Sci. 2022, 26, 2923–2938. [Google Scholar] [CrossRef]
- Casolaro, A.; Capone, V.; Camastra, F. Predicting ground-level nitrogen dioxide concentrations using the Bayesian attention-based deep neural network. Ecol. Inform. 2025, 87, 103097. [Google Scholar] [CrossRef]
- Kapoor, A.; Negi, A.; Marshall, L.; Chandra, R. Cyclone trajectory and intensity prediction with uncertainty quantification using variational recurrent neural networks. Environ. Model. Softw. 2023, 162, 105654. [Google Scholar] [CrossRef]
- MacKay, D.J.C. A Practical Bayesian Framework for Backpropagation Networks. Neural Comput. 1992, 4, 448–472. [Google Scholar] [CrossRef]
- Raissi, M.; Perdikaris, P.; Karniadakis, G. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707. [Google Scholar] [CrossRef]
- Hess, P.; Drüke, M.; Petri, S.; Strnad, F.M.; Boers, N. Physically constrained generative adversarial networks for improving precipitation fields from Earth system models. Nat. Mach. Intell. 2022, 4, 828–839. [Google Scholar] [CrossRef]
- Harder, P.; Hernandez-Garcia, A.; Ramesh, V.; Yang, Q.; Sattegeri, P.; Szwarcman, D.; Watson, C.; Rolnick, D. Hard-Constrained Deep Learning for Climate Downscaling. J. Mach. Learn. Res. 2023, 24, 1–40. [Google Scholar]
- Verdone, A.; Scardapane, S.; Panella, M. Explainable Spatio-Temporal Graph Neural Networks for multi-site photovoltaic energy production. Appl. Energy 2024, 353, 122151. [Google Scholar] [CrossRef]
- Peters, J.; Janzing, D.; Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms; Adaptive Computation and Machine Learning Series; The MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
- Awais, M.; Naseer, M.; Khan, S.; Anwer, R.M.; Cholakkal, H.; Shah, M.; Yang, M.H.; Khan, F.S. Foundation Models Defining a New Era in Vision: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2245–2264. [Google Scholar] [CrossRef] [PubMed]
- Bodnar, C.; Bruinsma, W.P.; Lucic, A.; Stanley, M.; Allen, A.; Brandstetter, J.; Garvan, P.; Riechert, M.; Weyn, J.A.; Dong, H.; et al. A foundation model for the Earth system. Nature 2025, 641, 1180–1187. [Google Scholar] [CrossRef]
- Szwarcman, D.; Roy, S.; Fraccaro, P.; Gislason, T.E.; Blumenstiel, B.; Ghosal, R.; de Oliveira, P.H.; Almeida, J.L.d.S.; Sedona, R.; Kang, Y.; et al. Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation Applications. arXiv 2024, arXiv:2412.02732. [Google Scholar] [CrossRef]
- Clay. Clay Foundation Model. 2025. Available online: https://madewithclay.org/ (accessed on 18 October 2025).
- Xiong, Z.; Wang, Y.; Zhang, F.; Stewart, A.J.; Hanna, J.; Borth, D.; Papoutsis, I.; Saux, B.L.; Camps-Valls, G.; Zhu, X.X. Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation. arXiv 2024, arXiv:2403.15356. [Google Scholar] [CrossRef]
- Ebel, P.; Xu, Y.; Schmitt, M.; Zhu, X.X. SEN12MS-CR-TS: A Remote-Sensing Data Set for Multimodal Multitemporal Cloud Removal. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
- Fang, L.; Xiang, W.; Pan, S.; Salim, F.D.; Chen, Y.P.P. Spatiotemporal Pretrained Large Language Model for Forecasting with Missing Values. IEEE Internet Things J. 2025, 12, 13838–13850. [Google Scholar] [CrossRef]
- Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 2022, 23, 1–39. [Google Scholar]








**Summary of the reviewed convolutional neural network approaches for regular raster spatio-temporal prediction.**

| Ref. | Year | Type | Applications | Datasets |
|---|---|---|---|---|
| [50] | 2010 | 2D | Video Processing; Action Recognition | NORB Synthetic; KTH Action; Hollywood2 |
| [44] | 2013 | 3D | Action Recognition | TRECVID |
| [45] | 2014 | 3D | Video Classification; Action Recognition | Sports-1M; UCF-101 |
| [46] | 2015 | 3D | Video Processing; Action Recognition; Scene Recognition; Object Recognition | Sports-1M; UCF-101; YUPENN; Maryland; Egocentric |
| [39] | 2018 | 2D | Crowd Flow Forecasting | TaxiBJ; BikeNYC |
| [40] | 2019 | 2D | El Niño Index Prediction | CMIP5; SST Reanalysis Data |
| [49] | 2019 | 2D | Satellite Image Missing Data Reconstruction | FY-2G LST; MSG-SEVIRI LST |
| [20] | 2019 | 3D | Traffic Forecasting | TaxiBJ; BikeNYC |
| [41] | 2019 | 2D | Precipitation Forecasting | DWD RY Product |
| [51] | 2019 | 2D | Action Recognition | Kinetics; AVA |
| [42] | 2020 | 2D | Sea Surface Temperature Modelling | CMEMS SST |
| [43] | 2021 | 2D | Sea Ice Forecasting | EUMETSAT OSI-SAF SIC |
| [52] | 2022 | 2D | Video Prediction | MovingMNIST; TaxiBJ; Human3.6M; KTH |
| [18] | 2023 | 3D | Cloud Removal | SEN12MSCR; SEN12MS-CR-TS |
| [25] | 2024 | 2D | Solar Potential Forecasting | Custom Datasets |

**Summary of the reviewed recurrent neural network approaches.**

| Ref. | Year | Type | Applications | Datasets |
|---|---|---|---|---|
| [65] | 2015 | LSTM | Human Motion Prediction | Human3.6M |
| [66] | 2015 | LSTM | Video Representation Learning; Video Action Recognition | Sports-1M; UCF101; HMDB51 |
| [63] | 2017 | ESN | Sea Surface Temperature Modelling | NOAA ERSST |
| [67] | 2017 | RNN; LSTM | Land Cover Prediction | MODIS; RSPO; Tree Plantation |
| [68] | 2017 | LSTM | Land Cover Prediction | MODIS; RSPO; Tree Plantation |
| [69] | 2018 | LSTM | Vegetation Dynamics Forecasting | MODIS NDVI |
| [70] | 2018 | LSTM; GRU | Land Cover Classification | Sentinel-1A/1B SAR |
| [64] | 2019 | ESN | Soil Moisture Forecasting | NOAA CPC GMSM |
| [71] | 2019 | Bayesian RNN | Sea Surface Temperature Modelling | Lorenz-96; NOAA ERSST |
| [72] | 2020 | LSTM; GRU; ESN | Spatio-temporal Forecasting | Lorenz-96; Kuramoto–Sivashinsky |
| [21] | 2020 | LSTM | Traffic Forecasting; Missing Data Imputation | Loop-Sea; PeMS |
| [73] | 2022 | LSTM | Vegetation Health Forecasting | MODIS NDVI |
| [74] | 2022 | LSTM | Crop Yield Prediction | USDA NASS |
| [24] | 2022 | ESN | Wind Power Forecasting | WRF Simulation Data |

**Summary of the reviewed hybrid convolutional–recurrent network approaches.**

| Ref. | Year | Type | Applications | Datasets |
|---|---|---|---|---|
| [12] | 2015 | LSTM | Precipitation Nowcasting | MovingMNIST; Radar Echo |
| [79] | 2017 | GRU | Precipitation Nowcasting | MovingMNIST++; HKO-7 |
| [80] | 2017 | LSTM | Video Prediction | MovingMNIST; KTH; Radar Echo |
| [82] | 2018 | LSTM | Sea Surface Temperature Modelling | NOAA OISST; NASA AVHRR |
| [81] | 2019 | LSTM | Spatio-temporal Forecasting | MovingMNIST; TaxiBJ; Human3.6M; Radar Echo |
| [75] | 2019 | GRU | Land Cover Classification | Sentinel-2 Data |
| [76] | 2019 | LSTM | Urban Land Cover Classification | Sentinel-2 Data |
| [83] | 2019 | GRU | Land Cover Classification | Sentinel-1 Data; Sentinel-2 Data |
| [1] | 2019 | LSTM | Sea Surface Temperature Modelling | NOAA OISST |
| [16] | 2019 | RNN | Remote Sensing Change Detection | Landsat ETM+ |
| [84] | 2021 | LSTM | Urban Expansion Prediction | SPOT Satellite Data |
| [85] | 2021 | Various | Video Prediction; Precipitation Nowcasting | Human3.6M; Shanghai Precipitation Data; MovingMNIST |
| [77] | 2022 | LSTM | Crop Yield Prediction | MODIS; Landsat8; Sentinel-2 |
| [86] | 2022 | LSTM | NDVI Forecasting | Sentinel-2; ERA5; SMAP; SRTM |
| [14] | 2023 | LSTM | Spatio-temporal Forecasting | MovingMNIST; KTH; Radar Echo; Traffic4Cast; BAIR |

**Summary of the reviewed transformers for video processing.**

| Ref. | Year | Applications | Datasets |
|---|---|---|---|
| [98] | 2017 | Video Super-Resolution | CDVL |
| [26] | 2019 | Video Action Recognition | AVA |
| [99] | 2020 | Video Inpainting | YouTube-VOS; DAVIS |
| [94] | 2021 | Video Action Recognition | Kinetics; MiT |
| [91] | 2021 | Video Action Recognition | Kinetics; Something-Something; Diving-48 |
| [27] | 2021 | Video Action Recognition; Video Classification | Kinetics; Something-Something; Epic-Kitchens; MiT |
| [100] | 2021 | Object Tracking | LaSOT; GOT-10K; COCO2017; TrackingNet |
| [96] | 2022 | Object Tracking | MOT17; MOTS20 |
| [92] | 2022 | Video Classification | Kinetics; Something-Something; Epic-Kitchens; MiT |
| [93] | 2022 | Video Classification | Kinetics; Something-Something |
| [101] | 2022 | Video Action Recognition | Kinetics; Something-Something |
| [28] | 2022 | Video Classification | Kinetics; Something-Something |
| [102] | 2022 | Video Captioning | MSVD; YouCookII; MSRVTT; TVC; Vatex |
| [97] | 2022 | Video Representation Learning; Video Classification | Instagram Video Data; Kinetics |
| [103] | 2023 | Deep Fake Video Detection | FaceForensics++; DeepFakeDetection; Celeb-DF-v2; DeeperForensics-1.0; WildDeepfake |
| [104] | 2023 | Video Depth Estimation | dVPN; SCARED |
| [105] | 2023 | Video Object Segmentation | Youtube-VOS; DAVIS |
| [106] | 2023 | Video Summarization | SumME; TVSum |
| [107] | 2023 | Video Prediction | BAIR; KITTI; RoboNet |
| [108] | 2024 | Video Super-Resolution; Video Deblurring; Video Denoising | Multiple datasets |
| [109] | 2024 | Video Action Recognition | AVA; UCF101; Epic-Kitchens |
| [110] | 2024 | Object Tracking | LaSOT; GOT-10K; UAV123; NfS; OTB2015; VOT2018; TempleColor128 |
| [111] | 2024 | Video Action Recognition | Kinetics; Something-Something |

**Summary of the reviewed transformers for spatio-temporal processing in heterogeneous domains.**

| Ref. | Year | Applications | Datasets |
|---|---|---|---|
| [117] | 2020 | Crop Classification | Sentinel-2; Landsat-8 |
| [118] | 2021 | 3D Human Motion Prediction | Human3.6M; AMASS |
| [119] | 2022 | Remote Sensing Change Detection | LEVIR; WHU-CD |
| [17] | 2022 | Remote Sensing Super Resolution | Jilin-1 Custom |
| [13] | 2022 | Earth Systems Forecasting | MovingMNIST; SEVIR; ICAR-ENSO |
| [120] | 2022 | Multivariate Spatial Time Series Forecasting | PeMS; Electricity; Traffic |
| [22] | 2022 | Traffic Forecasting | METR-LA; Urban-BJ; Ring-BJ |
| [121] | 2022 | Remote Sensing Change Detection | CDD; WHU-CD; OSCD; HRSCD |
| [122] | 2022 | Remote Sensing Change Detection | Farmland-CD; Barbara-CD; BayArea-CD |
| [123] | 2023 | Spatio-temporal PM2.5 Forecasting | Custom EPA AQS |
| [116] | 2023 | 3D Human Motion Prediction; Traffic Forecasting; Action Recognition | MovingMNIST; Human3.6M; TaxiBJ; KTH |
| [112] | 2023 | Crop Yield Prediction | USDA Crop Data; HRRR Data; Sentinel-2 |
| [23] | 2024 | Traffic Forecasting | PeMS; Zhengzhou |
| [15] | 2024 | Vegetation Forecasting | GreenEarthNet |

**Summary of the reviewed diffusion models for spatio-temporal modelling.**

| Ref. | Year | Applications | Datasets |
|---|---|---|---|
| [134] | 2023 | Forecasting | SST; Navier–Stokes; Spring-Mesh |
| [136] | 2024 | Anomaly Detection | GOES-16; GOES-17 |
| [139] | 2024 | Video Prediction | KITTI; Cityscapes; KTH; BAIR; MovingMNIST |
| [140] | 2024 | Precipitation Nowcasting | SEVIR |
| [141] | 2024 | Cloud Removal | Sen2MTCNew; WHUS2-CRv; SEN12MS-CR |
| [142] | 2025 | Wind Speed Prediction | Wind Toolkit Data V2 |
| [19] | 2025 | Satellite Image Inpainting | Massachusetts Roads; DeepGlobe 2018 |
| [29] | 2025 | Video Generation | RealEstate; ACID; DL3DV |
| [143] | 2025 | Video Generation | Sky Time-Lapse; UCF101; MHAD; WebVid |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Capone, V.; Casolaro, A.; Camastra, F. Deep Learning for Regular Raster Spatio-Temporal Prediction: An Overview. Information 2025, 16, 917. https://doi.org/10.3390/info16100917

