# 2D Convolutional Neural Markov Models for Spatiotemporal Sequence Forecasting


## Abstract


## 1. Introduction

- The method introduces a DMM that maintains the spatial structure of the input data by running it through a fully 2D model, consisting of several 2D CNNs and a backward ConvLSTM, with the intention of capturing the inherent spatial features of the data. Using a DMM as the base model allows the integration of probabilistic modeling into the spatiotemporal forecasting problem, increasing the robustness of the proposed approach.
- The feasibility of our method is evaluated in two experiments using synthetic spatiotemporal data modeled after the 2D heat diffusion equation, as well as real-world precipitation data. We compare the results with those of baseline models, namely, the naive forecast, DMM, and ConvLSTM.
- The combination of 2D CNNs, ConvLSTM, and DMM in the proposed approach opens up the possibility of combining popular 2D CNN-based methods, further increasing DMM’s modeling capability to cater to various spatiotemporal forecasting problems. Conversely, the proposed approach also allows the usage of DMM in other fields, such as video prediction and generation, due to its autoencoder-like structure.

## 2. Related Work

## 3. Spatiotemporal Sequence Forecasting Task

## 4. 2D Convolutional Neural Markov Model

#### 4.1. Overview and Structure of the Model

#### 4.2. Inference Network

#### 4.3. Generative Network

#### 4.4. Training and Forecasting Flow

#### 4.4.1. Training Procedure

#### 4.4.2. Forecasting Flow

**Multi-step method**: By repeating these generative steps recursively, we can produce a forecast sequence of arbitrary length, i.e., repeating the steps $\Delta T-1$ times outputs ${\tilde{X}}_{T+1:T+\Delta T}$. This method requires a very well-trained generative network to be accurate, as problems such as high variance or bias introduced by suboptimally trained transition and emitter functions will result in chaotic predictions.

**One-step method**: Instead of forecasting every observation point with only the generative network, we update our observations in real time as new ones arrive and time-shift the input to the inference network by 1 (${X}_{2:T+1}$), acquiring new posterior latents ${\widehat{Z}}_{2:T+1}$. We use the newly estimated ${\widehat{Z}}_{T+1}$ to estimate ${\tilde{Z}}_{T+2}$, and in turn ${\tilde{X}}_{T+2}$. We then repeat this procedure to produce the rest of the forecast. Note the similarity of this method to data assimilation, in which we keep updating our estimates using newly obtained observations. This forecasting method is shown in Figure 3b.
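The difference between the two flows can be sketched as below. Here `transition`, `emit`, and `infer_last_latent` are hypothetical scalar stand-ins for the trained transition function, emitter, and inference network; the real versions operate on latent tensors and sample from predicted Gaussian distributions, but the control flow is the same.

```python
def transition(z):
    # Placeholder dynamics standing in for the trained transition function.
    return 0.9 * z

def emit(z):
    # Placeholder emission standing in for the trained emitter.
    return z + 1.0

def infer_last_latent(observations):
    # Placeholder for the inference network: posterior latent of the
    # final timestep of the observed sequence.
    return observations[-1] - 1.0

def multi_step_forecast(observations, horizon):
    """Recursive forecast: the latent is propagated with the transition
    function alone, so any error compounds over the horizon."""
    z = infer_last_latent(observations)
    forecasts = []
    for _ in range(horizon):
        z = transition(z)
        forecasts.append(emit(z))
    return forecasts

def one_step_forecast(observations, new_observations):
    """Data-assimilation-style forecast: after each step, the input window
    is time-shifted by 1 to include the newly arrived observation and the
    posterior latent is re-estimated before predicting the next step."""
    window = list(observations)
    forecasts = []
    for obs in new_observations:
        z = transition(infer_last_latent(window))
        forecasts.append(emit(z))
        window = window[1:] + [obs]  # shift the window by one timestep
    return forecasts
```

With these placeholders, the multi-step forecast drifts with the placeholder dynamics, while the one-step forecast is repeatedly re-anchored by the incoming observations.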

## 5. Experiments and Results

#### 5.1. 2D Heat Equation

#### 5.2. CPC Merged Analysis of Precipitation

#### 5.3. Model Specification and Experiment Details

We evaluate the models on the 2D heat equation data under two emission conditions (**Emission 1** and **Emission 2**, as shown in Equations () and ()) and on the CMAP data, with varying forecast lengths (5, 10, 15, and 20 timesteps). When evaluating under the **Emission 2** condition, we use the models trained on the first emission; this evaluates the robustness of the models with respect to noise. To ensure fairness, forecasting for every model uses the one-step forecast method. Furthermore, we run the training procedure five times (except for the naive forecast, as no training is required) and present the MSE averaged over the last epoch of every run as the final result. We present the resulting forecast MSE in both table and bar chart form. The forecast MSE for the 2D heat equation data is shown in Table 5 and Figure 4, while that for the CMAP data is shown in Table 6 and Figure 5. Note that the MSE is calculated on normalized rather than unnormalized data.
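The evaluation metric above amounts to computing the MSE in min-max-normalized space and then averaging over the five runs. A minimal sketch, assuming min-max normalization (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def normalized_mse(y_true, y_pred, data_min, data_max):
    """MSE computed on min-max-normalized values, matching the convention
    of scoring forecasts in normalized rather than physical units."""
    scale = data_max - data_min
    return float(np.mean(((y_true - y_pred) / scale) ** 2))

# The final reported score would then be the mean over the five runs,
# e.g. np.mean([mse_run1, mse_run2, mse_run3, mse_run4, mse_run5]).
```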

#### 5.4. Experimental Results

#### 5.4.1. 2D Heat Equation

#### 5.4.2. CMAP

## 6. Discussions

## 7. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. **1960**, 82, 35–45.
2. Julier, S.J.; Uhlmann, J.K. A New Extension of the Kalman Filter to Nonlinear Systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulations and Controls, Orlando, FL, USA, 20–25 April 1997; pp. 182–193.
3. Evensen, G. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J. Geophys. Res. Oceans **1994**, 99, 10143–10162.
4. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. **1997**, 9, 1735–1780.
5. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.; Wong, W.; Woo, W. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Proceedings of NeurIPS, Montréal, QC, Canada, 7–12 December 2015; pp. 802–810.
6. Krishnan, R.G.; Shalit, U.; Sontag, D. Structured Inference Networks for Nonlinear State Space Models. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 2101–2109.
7. Le, V.D.; Bui, C.; Cha, S.K. Spatiotemporal Deep Learning Model for Citywide Air Pollution Interpolation and Prediction. In Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), Busan, Korea, 19–22 February 2020; pp. 55–62.
8. Elsayed, N.; Maida, A.S.; Bayoumi, M. Reduced-Gate Convolutional LSTM Architecture for Next-Frame Video Prediction Using Predictive Coding. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–9.
9. Wang, J.; Sun, T.; Liu, B.; Cao, Y.; Zhu, H. CLVSA: A Convolutional LSTM Based Variational Sequence-to-Sequence Model with Attention for Predicting Trends of Financial Markets. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 10–16 August 2019; pp. 3705–3711.
10. Khan, Z.A.; Hussain, T.; Ullah, A.; Rho, S.; Lee, M.; Baik, S.W. Towards Efficient Electricity Forecasting in Residential and Commercial Buildings: A Novel Hybrid CNN with a LSTM-AE based Framework. Sensors **2020**, 20, 1399.
11. Yu, B.; Yin, H.; Zhu, Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018.
12. Zheng, C.; Fan, X.; Wang, C.; Qi, J. GMAN: A Graph Multi-Attention Network for Traffic Prediction. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 1234–1241.
13. Toyer, S.; Cherian, A.; Han, T.; Gould, S. Human Pose Forecasting via Deep Markov Models. In Proceedings of the Digital Image Computing: Techniques and Applications (DICTA), Canberra, Australia, 10–13 December 2017; pp. 1–8.
14. Farnoosh, A.; Rezaei, B.; Sennesh, E.; Khan, Z.; Dy, J.G.; Satpute, A.B.; Hutchinson, J.B.; van de Meent, J.W.; Ostadabbas, S. Deep Markov Spatio-Temporal Factorization. arXiv **2020**, arXiv:2003.09779.
15. Tan, Z.X.; Soh, H.; Ong, D.C. Factorized Inference in Deep Markov Models for Incomplete Multimodal Time Series. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 10334–10341.
16. Khurana, S.; Laurent, A.; Hsu, W.N.; Chorowski, J.; Lancucki, A.; Marxer, R.; Glass, J. A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning. arXiv **2020**, arXiv:2006.02547.
17. Van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.W.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv **2016**, arXiv:1609.03499.
18. Lee, A.X.; Zhang, R.; Ebert, F.; Abbeel, P.; Finn, C.; Levine, S. Stochastic Adversarial Video Prediction. arXiv **2018**, arXiv:1804.01523.
19. Kwon, Y.; Park, M. Predicting Future Frames Using Retrospective Cycle GAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1811–1820.
20. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251.
21. Halim, C.J.; Kawamoto, K. Deep Markov Models for Data Assimilation in Chaotic Dynamical Systems. In Advances in Artificial Intelligence; Springer Nature Switzerland AG: Cham, Switzerland, 2020; pp. 37–44.
22. Xie, P.; Arkin, P.A. Global Precipitation: A 17-Year Monthly Analysis Based on Gauge Observations, Satellite Estimates, and Numerical Model Outputs. Bull. Am. Meteorol. Soc. **1997**, 78, 2539–2558.
23. Kalnay, E.; Kanamitsu, M.; Kistler, R.; Collins, W.; Deaven, D.; Gandin, L.; Iredell, M.; Saha, S.; White, G.; Woollen, J.; et al. The NCEP/NCAR 40-Year Reanalysis Project. Bull. Am. Meteorol. Soc. **1996**, 77, 437–472.
24. Bingham, E.; Chen, J.P.; Jankowiak, M.; Obermeyer, F.; Pradhan, N.; Karaletsos, T.; Singh, R.; Szerlip, P.; Horsfall, P.; Goodman, N.D. Pyro: Deep Universal Probabilistic Programming. J. Mach. Learn. Res. **2019**, 20, 973–978.
25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.

**Figure 1.** Inference network. The observation is first encoded with the encoder to produce an encoded observation. The encoded observation for a timepoint and the hidden tensors from future timepoints are then fed into a backward ConvLSTM cell to produce the hidden tensors for the current timepoint. Once the hidden tensors for a sequence are calculated, they are fed into the combiner along with the previous posterior latent to produce the current latent mean and variance. We then sample the current latent, which follows a Gaussian distribution, from the produced mean and variance. Here, “/2 ds.” denotes a 1/2 reduction in spatial size (downsampling). Dashed lines denote sampling, while dotted lines denote repetition. The description of the ConvLSTM cell structure can be found in [5].

**Figure 2.** Generative network. Latent tensors are propagated by inserting the previous latent into a convolutional gated transition function (ConvGTF) to obtain the latent mean and variance, which are then used to sample the next latent. The observation is produced by inputting the latent into the emitter to output the observation mean and variance, which are used to sample the observation. “x2 us.” denotes a two-times increase in spatial size (upsampling). As in Figure 1, dashed lines denote sampling, while dotted lines denote repetition.

**Figure 3.** (**a**) The training flow of our approach. The observations are first fed into the inference network to produce a series of approximated posterior latents. These latents are then input into the transition function to produce a shifted series of prior latents. The approximated latents are also input into the emitter to produce reconstructed observations. Finally, the original and reconstructed observations, along with the approximated posterior and prior latents, are used to calculate the evidence lower bound (ELBO) as the objective function. (**b**) The one-step forecasting flow of our method. As in the training flow, the observations are first input into the inference network to produce posterior latents. We then use the last latent to produce the next latent with the transition function and calculate the next forecast with the emitter. We repeat this flow as new observations are obtained until the desired forecast length is reached.
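Under the usual diagonal-Gaussian assumptions, the two ELBO ingredients described in (a) — the reconstruction log-likelihood of the observations and the KL divergence between posterior and prior latents — have closed forms. A minimal numpy sketch (the function names are illustrative, not from the paper's code):

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over latent dimensions; one such term per timestep compares
    the approximated posterior latent with the prior from the transition."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def gaussian_log_likelihood(x, mu, var):
    """Diagonal-Gaussian log-likelihood of an observation under the
    emitter's predicted mean and variance (the reconstruction term)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# The ELBO summed over timesteps t is then
#   sum_t gaussian_log_likelihood(x_t, emitter mean, emitter var)
# - sum_t gaussian_kl(posterior mean/var at t, prior mean/var at t).
```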

**Figure 4.** Forecast MSE on the 2D heat equation data plotted as bar graphs. The y-axis represents the MSE; the x-axis represents the forecast length and the model with which the forecast is produced.

**Figure 5.** Forecast MSE on the CMAP data plotted as bar graphs. The y-axis represents the MSE; the x-axis represents the forecast length and the model with which the forecast is produced.

**Figure 6.** Spatially averaged (spatial mean) squared forecast error of the baseline models and our model on nine randomly selected 2D heat equation validation sequences. The data here are from the Emission 1 condition, and the forecast length is 20. The y-axis shows the error value, and the x-axis shows the timestep of the forecast. ConvLSTM's forecasts clearly underperform in the initial steps, while the other models forecast the dynamics of the data more stably.

**Figure 7.** Spatially averaged (spatial mean) squared forecast error of the baseline models and our model on nine randomly selected CMAP validation sequences. The forecast length is set to 20 timesteps. The y-axis shows the error value, and the x-axis shows the timestep of the forecast. In contrast to the 2D heat equation data, every DNN model outperformed the naive forecast, as expected, owing to the higher variance and chaos in the real-world data, which are better modeled by DNNs.

**Figure 8.** Squared error heatmap between the ground truth and the predicted forecast for the upper-left sequence in Figure 6 (2D heat equation data). Higher errors are shown as bright regions. The initially high prediction error in the heated area of ConvLSTM's forecasts correlates with the high prediction error shown in Figure 6.

**Figure 9.** Squared error heatmap between the ground truth and the predicted forecast for the upper-left sequence in Figure 7 (CMAP data). A brighter color means a higher error.

**Figure 10.** 2D heatmap visualization of the forecast result for the same sequence as in Figure 8 (2D heat equation data). Prediction starts from the 11th timestep onward. Brighter regions mean higher temperature.

**Table 1.** Generation details of the 2D heat equation data. Here, $\mathrm{random}(min,max)$ means that the parameter is sampled from a uniform distribution with the specified minimum ($min$) and maximum ($max$) values.

| Attributes | Value |
|---|---|
| (Minimum, maximum) values | (0, 1000) K |
| Plate size ($\mathrm{width}\times \mathrm{length}$) | $10\times 10$ m |
| Thermal diffusivity | $4.0$ ${\mathrm{m}}^{2}/\mathrm{s}$ |
| Base temperature | 0 K |
| Initial temperature of circle of heat | $\mathrm{random}(500,700)$ K |
| Radius of circle of heat | $\mathrm{random}(0.5,5)$ m |
| Central position of circle of heat ($(x,y)$) | $(\mathrm{random}(0,10),\mathrm{random}(0,10))$ m |
| Differentiation method | Finite difference |
| Spatial differences ($(dx,dy)$) | $(0.1,0.1)$ m |
| Timestep difference | $0.000625$ s ($\times 3$ from sampling) |
| Sequence length | 30 timesteps |
| (Training, validation) data size | (3000, 750) |
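The generation recipe in Table 1 can be sketched as an explicit finite-difference solver. This is an illustrative reconstruction from the table's parameters only; the grid size, fixed zero boundaries, and the specific circle-of-heat values below are assumptions (the paper samples the latter at random):

```python
import numpy as np

def heat_step(u, alpha=4.0, dx=0.1, dy=0.1, dt=0.000625):
    """One explicit finite-difference step of the 2D heat equation
    du/dt = alpha * (d2u/dx2 + d2u/dy2), using the grid spacing and
    timestep from Table 1 (boundaries held fixed for simplicity)."""
    lap = np.zeros_like(u)
    lap[1:-1, 1:-1] = (
        (u[2:, 1:-1] - 2 * u[1:-1, 1:-1] + u[:-2, 1:-1]) / dx**2
        + (u[1:-1, 2:] - 2 * u[1:-1, 1:-1] + u[1:-1, :-2]) / dy**2
    )
    return u + alpha * dt * lap

def initial_plate(size=101, plate_len=10.0, temp=600.0, radius=2.0,
                  center=(5.0, 5.0)):
    """Plate at the base temperature (0 K) with a circle of heat,
    mirroring Table 1; temp, radius, and center are fixed here for
    illustration instead of being sampled."""
    xs = np.linspace(0.0, plate_len, size)  # size=101 gives dx = 0.1 m
    x, y = np.meshgrid(xs, xs, indexing="ij")
    u = np.zeros((size, size))
    u[(x - center[0]) ** 2 + (y - center[1]) ** 2 <= radius**2] = temp
    return u
```

Frames would then be kept every 3 solver steps (the "$\times 3$ from sampling" in Table 1) to build 30-timestep sequences. Note that $\alpha\,dt\,(1/dx^2 + 1/dy^2) = 0.5$, which sits exactly at the stability limit of the explicit scheme.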

**Table 2.** Details of the CMAP data.

| Attributes | Value |
|---|---|
| (Minimum, maximum) values | (0, 80) mm/day |
| Spatial size ($\mathrm{width}\times \mathrm{length}$) | $72\times 72$ pixels |
| Sequence length | 30 timesteps |
| (Training, validation) data size | (1901, 815) |

**Table 3.** Our model's CNN and DCN channel specifications. Parenthesized tuples denote multiple layers, while a single value means the channel size is the same for all CNNs inside the corresponding section; Encoder, ConvLSTM, and Comb. belong to the inference network, while Trans. and Decoder belong to the generative network. Heat denotes the specifications for the 2D heat equation experiment and CMAP those for the CMAP experiment.

| Exp. | Encoder | ConvLSTM | Comb. | Trans. | Decoder * |
|---|---|---|---|---|---|
| Heat | (32, 64) | 64 | 64 | 64 | (32, 16) |
| CMAP | (32, 64) | 16 | 16 | 16 | (16, 8) |

**Table 4.** Training specifications. (a, b) denotes the parameters used in the two experiments: a is the value in the heat experiment and b the value in the CMAP experiment. A single value means the value is the same across experiments.

| Parameters | ConvLSTM | DMM | CNMM |
|---|---|---|---|
| Learning rate (LR) | 0.001 | 0.0001 | (0.00005, 0.0001) |
| ${\beta}_{1}$ | 0.9 | 0.9 | 0.9 |
| ${\beta}_{2}$ | 0.999 | 0.999 | 0.999 |
| Grad. clipping | - | 10.0 | 10.0 |
| LR decay | - | 1.0 | 1.0 |
| Epoch | (100, 150) | 150 | (300, 150) |
| Batch size | 16 | 16 | 16 |

**Table 5.** Forecast mean squared error (MSE) of the 2D heat equation data experiment for each model, with **Emission 1** and **Emission 2** denoting the two emission conditions of the data. The **length** shows over how many timesteps the forecast is made for each model. Bold numbers indicate the lowest error achieved for each forecast length.

(a) Forecast MSE on Emission 1 (MSE $\times {10}^{4}$)

| Length | Naive Forecast | ConvLSTM | DMM | CNMM (Ours) |
|---|---|---|---|---|
| 5 | 1.8890 | **0.7508** | 8.9915 | 6.0846 |
| 10 | 1.8891 | **0.7546** | 9.5014 | 6.1196 |
| 15 | 1.8882 | **0.8512** | 10.0842 | 6.1591 |
| 20 | **1.8876** | 7.7907 | 10.8438 | 6.2114 |

(b) Forecast MSE on Emission 2 (MSE $\times {10}^{4}$)

| Length | Naive Forecast | ConvLSTM | DMM | CNMM (Ours) |
|---|---|---|---|---|
| 5 | 8.4880 | **4.1452** | 13.5044 | 9.4584 |
| 10 | 8.4820 | **4.1485** | 14.1001 | 9.5014 |
| 15 | 8.4751 | **4.2451** | 14.7963 | 9.5528 |
| 20 | **8.4692** | 11.3309 | 15.6192 | 9.6146 |

**Table 6.** Forecast MSE (MSE $\times {10}^{3}$) of the CMAP data experiment for each model. The **length** shows over how many timesteps the forecast is made for each model. Bold numbers indicate the lowest error achieved for each forecast length.

| Length | Naive Forecast | ConvLSTM | DMM | CNMM (Ours) |
|---|---|---|---|---|
| 5 | 9.8527 | **6.5047** | 6.6175 | 7.5480 |
| 10 | 9.8489 | **6.5003** | 6.6126 | 7.5447 |
| 15 | 9.8533 | **6.5181** | 6.6143 | 7.5501 |
| 20 | 9.8574 | **6.5500** | 6.6155 | 7.5529 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Halim, C.J.; Kawamoto, K.
2D Convolutional Neural Markov Models for Spatiotemporal Sequence Forecasting. *Sensors* **2020**, *20*, 4195.
https://doi.org/10.3390/s20154195
