# History Marginalization Improves Forecasting in Variational Recurrent Neural Networks


## Abstract


## 1. Introduction

- A new inference model. We establish a new type of variational family for inference in sequential latent variable models. Instead of a structured variational approximation, VDM marginalizes over past states. This leads to an efficient mean-field factorization where each variational factor is multi-modal by construction.
- An evaluation metric for multi-modal forecasting. The negative log-likelihood measures predictive accuracy but neglects an important aspect of multi-modal forecasts: sample diversity. In Section 4, we propose a score inspired by the Wasserstein distance [7], which evaluates both prediction quality and diversity. This metric complements our evaluation based on log-likelihoods.
- An extensive empirical study. In Section 4, we use VDM to study various datasets, including synthetic data, a stochastic Lorenz attractor, taxi trajectories, basketball player trajectories, and a U.S. pollution dataset with measurements of various pollutants over time. We illustrate VDM’s ability to model multi-modal dynamics and provide quantitative comparisons showing that VDM compares favorably to previous work.
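To make the diversity-aware score concrete, the idea behind a Wasserstein-style comparison of sample sets can be illustrated in one dimension, where the empirical 1-Wasserstein distance between two equally sized sample sets reduces to matching order statistics. This is a sketch only; the paper's actual metric in Section 4 operates on multi-dimensional trajectories.

```python
import numpy as np

def empirical_w1(samples_a, samples_b):
    """Empirical 1-Wasserstein distance between two equally sized
    1-d sample sets: with equal sample counts, the optimal coupling
    simply matches sorted samples (order statistics)."""
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    assert a.shape == b.shape, "sample sets must have equal size"
    return np.abs(a - b).mean()
```

A score of zero means the two sample sets are identical as distributions, so a forecaster that collapses onto a single mode is penalized even if each individual sample looks plausible.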

## 2. Related Work

## 3. Method: Variational Dynamic Mixtures

#### 3.1. The Generative Model of VDM

#### 3.2. The Variational Posterior of VDM

- ${q}_{\mathrm{inf}}$ reflects the generative model’s transition dynamics and combines them with the current observation ${\mathbf{x}}_{t}$. It is a Gaussian distribution whose parameters are obtained by propagating ${\mathbf{z}}_{<t}$ through the RNN of the generative model and using an inference network to combine the output with ${\mathbf{x}}_{t}$.
- ${q}_{\mathrm{tar}}$ is a distribution we will use to sample past states for approximating the marginalization in Equation (3). Its name suggests that it is generally intractable and will be approximated via self-normalized importance sampling.
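The self-normalized importance sampling step mentioned above can be sketched as follows: given unnormalized log-weights, normalize them into a proper probability vector. This is a generic, numerically stable implementation (using the log-sum-exp trick), not the paper's exact code.

```python
import numpy as np

def snis_weights(log_w):
    """Self-normalized importance weights from unnormalized log-weights.
    Subtracting the max before exponentiating avoids overflow without
    changing the normalized result."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())
    return w / w.sum()
```

Because the weights are normalized by their own sum, the unnormalized target only needs to be known up to a constant, which is exactly what makes the intractable ${q}_{\mathrm{tar}}$ usable in practice.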

#### 3.2.1. Parametrization of the Variational Posterior

#### 3.2.2. Generalized Mixture Weights

Algorithm 1: Generative model.

Inputs: ${\mathbf{z}}_{\tau},{\mathbf{h}}_{\tau}$. Outputs: ${\mathbf{x}}_{\tau +1:T}$.

for $t=\tau +1:T$ do
- ${\mathbf{h}}_{t}={\varphi}^{\mathrm{GRU}}({\mathbf{z}}_{t-1},{\mathbf{h}}_{t-1})$
- $[{\mu}_{0,t},{\sigma}_{0,t}^{2}]={\varphi}^{\mathrm{tra}}\left({\mathbf{h}}_{t}\right)$ (Equation (1))
- ${\mathbf{z}}_{t}\sim \mathcal{N}({\mu}_{0,t},{\sigma}_{0,t}^{2}\mathbb{I})$
- $[{\mu}_{x,t},{\sigma}_{x,t}^{2}]={\varphi}^{\mathrm{dec}}({\mathbf{z}}_{t},{\mathbf{h}}_{t})$ (Equation (2))
- ${\mathbf{x}}_{t}\sim \mathcal{N}({\mu}_{x,t},{\sigma}_{x,t}^{2}\mathbb{I})$

end for
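The generative rollout of Algorithm 1 can be sketched in a few lines. This is a toy stand-in: the trained networks $\varphi^{\mathrm{GRU}}$, $\varphi^{\mathrm{tra}}$, and $\varphi^{\mathrm{dec}}$ are replaced by random affine maps, and mapping the raw network output to a standard deviation via softplus is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_h, d_x = 2, 4, 3

# Random affine stand-ins for the trained networks (illustrative only).
W_h = rng.normal(size=(d_h, d_z + d_h)) * 0.1    # plays the role of phi^GRU
W_tra = rng.normal(size=(2 * d_z, d_h)) * 0.1    # plays the role of phi^tra
W_dec = rng.normal(size=(2 * d_x, d_z + d_h)) * 0.1  # plays the role of phi^dec
softplus = lambda a: np.log1p(np.exp(a))

def rollout(z, h, steps):
    """Ancestral sampling: update the deterministic state h, sample z_t
    from the transition prior, then sample x_t from the decoder."""
    xs = []
    for _ in range(steps):
        h = np.tanh(W_h @ np.concatenate([z, h]))            # recurrent update
        mu0, s0 = np.split(W_tra @ h, 2)                     # transition params
        z = mu0 + softplus(s0) * rng.normal(size=d_z)        # z_t ~ N(mu0, sigma0^2 I)
        mux, sx = np.split(W_dec @ np.concatenate([z, h]), 2)  # decoder params
        xs.append(mux + softplus(sx) * rng.normal(size=d_x))   # x_t ~ N(mux, sigmax^2 I)
    return np.stack(xs)

x = rollout(np.zeros(d_z), np.zeros(d_h), steps=5)
```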

Algorithm 2: Inference model.

Inputs: ${\mathbf{x}}_{1:\tau},{\widehat{\mathbf{h}}}_{1}$. Outputs: ${\mathbf{z}}_{1:\tau},{\widehat{\mathbf{h}}}_{\tau}$.

$[{\mu}_{\mathrm{inf},1},{\sigma}_{\mathrm{inf},1}^{2}]={\varphi}^{\mathrm{inf}}({\mathbf{x}}_{1},{\widehat{\mathbf{h}}}_{1})$

${\mathbf{z}}_{1}^{\left(i\right)}\sim \mathcal{N}({\mu}_{\mathrm{inf},1},{\sigma}_{\mathrm{inf},1}^{2}\mathbb{I})$ for $i=1,\dots,K$

for $t=2:\tau$ do
- ${\mathbf{h}}_{t}^{\left(i\right)}={\varphi}^{\mathrm{GRU}}({\mathbf{z}}_{t-1}^{\left(i\right)},{\widehat{\mathbf{h}}}_{t-1})$ (Equation (5))
- $[{\mu}_{\mathrm{inf},t}^{\left(i\right)},{\sigma}_{\mathrm{inf},t}^{\left(i\right)2}]={\varphi}^{\mathrm{inf}}({\mathbf{x}}_{t},{\mathbf{h}}_{t}^{\left(i\right)})$ (Equation (6))
- ${\omega}_{t}^{\left(i\right)}=\omega ({\mathbf{x}}_{t},{\mathbf{h}}_{t}^{\left(i\right)})/{\sum}_{j=1}^{K}\omega ({\mathbf{x}}_{t},{\mathbf{h}}_{t}^{\left(j\right)})$ (Equation (8))
- ${\mathbf{z}}_{t}^{\left(i\right)}\sim {\sum}_{j=1}^{K}{\omega}_{t}^{\left(j\right)}\mathcal{N}({\mu}_{\mathrm{inf},t}^{\left(j\right)},{\sigma}_{\mathrm{inf},t}^{\left(j\right)2}\mathbb{I})$
- ${\widehat{\mathbf{h}}}_{t}={\sum}_{i=1}^{K}{\omega}_{t}^{\left(i\right)}{\mathbf{h}}_{t}^{\left(i\right)}$

end for
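One inner step of Algorithm 2 can be sketched as follows: normalize the unnormalized mixture weights (as in Equation (8)), draw $\mathbf{z}_t$ from the resulting Gaussian mixture, and form the weighted average hidden state. The interface is a simplification for illustration; in the paper, the weights come from a learned weight function and the per-sample factors from the inference network.

```python
import numpy as np

rng = np.random.default_rng(1)

def inference_step(mus, sigmas, hs, log_w):
    """One mixture-based inference step.

    mus, sigmas : [K, d_z]  per-history Gaussian factor parameters
    hs          : [K, d_h]  per-history hidden states
    log_w       : [K]       unnormalized log mixture weights
    """
    w = np.exp(log_w - log_w.max())
    w = w / w.sum()                                   # normalized weights
    i = rng.choice(len(w), p=w)                       # pick a mixture component
    z = mus[i] + sigmas[i] * rng.normal(size=mus.shape[1])  # sample z_t
    h_hat = w @ hs                                    # weighted hidden state
    return z, h_hat, w

K, d_z, d_h = 3, 2, 4
z, h_hat, w = inference_step(rng.normal(size=(K, d_z)),
                             np.full((K, d_z), 0.1),
                             rng.normal(size=(K, d_h)),
                             np.zeros(K))
```

Sampling a component index and then a Gaussian draw is exactly how one samples from a finite mixture, which is what makes each variational factor multi-modal by construction.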

#### 3.3. The Variational Objective of VDM

#### 3.4. Alternative Modeling Choices

## 4. Evaluation and Experiments

#### 4.1. Evaluation Metrics

#### 4.2. Baselines

#### 4.3. Ablations

#### 4.4. Results

#### 4.4.1. Synthetic Data with Multi-Modal Dynamics

#### 4.4.2. Stochastic Lorenz Attractor

#### 4.4.3. Taxi Trajectories

#### 4.4.4. NBA SportVu Data

#### 4.4.5. U.S. Pollution Data

The U.S. pollution dataset contains measurements of four major pollutants ($\mathrm{NO}_2$, $\mathrm{O}_3$, $\mathrm{SO}_2$, and CO). The goal is to predict monthly pollution values for the coming 18 months, given observations of the previous 6 months. We ignore geographical location and time information, treating the pollution trends of different counties and different periods as i.i.d. The unknown context information makes the dynamics multi-modal and challenging to predict accurately. Due to the small size and high dimensionality of the dataset, there are not enough samples with very similar initial observations, so we cannot evaluate the empirical W-distance in this experiment. VDM outperforms all baselines in both evaluations (Table 3).

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. ELBO Derivations

## Appendix B. Supplementary to Stochastic Cubature Approximation

## Appendix C. Supplementary to Experiments Setup

#### Appendix C.1. Stochastic Lorenz Attractor Setup

#### Appendix C.2. Taxi Trajectories Setup

#### Appendix C.3. U.S. Pollution Data Setup

The U.S. pollution dataset contains measurements of four major pollutants ($\mathrm{NO}_2$, $\mathrm{O}_3$, $\mathrm{SO}_2$, and CO), each with three major values (mean, max value, and air quality index), collected daily from counties in different states from 2000 to 2016. Since the daily measurements are very noisy and volatile, we compute the monthly average of each measurement and then extract non-overlapping segments of length 24 from the dataset. In total, we extract 1639 sequences as the training set, 25 sequences as the validation set, and 300 sequences as the test set.
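The preprocessing described above (monthly averaging followed by segmentation) can be sketched on a toy array. Real month lengths vary, so the fixed 30-day month here is a simplifying assumption, and the function is illustrative rather than the paper's actual pipeline.

```python
import numpy as np

def monthly_segments(daily, days_per_month=30, seg_len=24):
    """Average noisy daily measurements into monthly values, then cut the
    monthly series into non-overlapping segments of length seg_len.

    daily : [n_days, n_features] array of daily measurements.
    Returns an array of shape [n_segments, seg_len, n_features]."""
    n_months = len(daily) // days_per_month
    monthly = daily[: n_months * days_per_month]
    monthly = monthly.reshape(n_months, days_per_month, -1).mean(axis=1)
    n_seg = n_months // seg_len
    return monthly[: n_seg * seg_len].reshape(n_seg, seg_len, -1)
```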

#### Appendix C.4. NBA SportVu Data Setup

## Appendix D. Implementation Details

- Latent RNN: summarizes the historical latent states ${\mathbf{z}}_{<t}$ in the hidden states ${\mathbf{h}}_{t}$.
- Transition network: advances the latent states ${\mathbf{z}}_{t}$ through time.
- Emission network: maps the latent states ${\mathbf{z}}_{t}$ and hidden states ${\mathbf{h}}_{t}$ to observations ${\mathbf{x}}_{t}$.
- Inference network: updates the states ${\mathbf{z}}_{t}$ given observations ${\mathbf{x}}_{t}$ and hidden states ${\mathbf{h}}_{t}$.

All models are trained with a learning rate of ${10}^{-3}$. In all experiments, the networks have the same architectures but different sizes. The model size depends on the observation dimension ${\mathbf{d}}_{\mathbf{x}}$, the latent state dimension ${\mathbf{d}}_{\mathbf{z}}$, and the hidden state dimension ${\mathbf{d}}_{\mathbf{h}}$. The number of samples used at each time step during training is $2{\mathbf{d}}_{\mathbf{z}}+1$. Whenever a network outputs a variance, we apply a softplus to ensure non-negativity.

- Latent RNN: a one-layer GRU with input size ${\mathbf{d}}_{\mathbf{z}}$ and hidden size ${\mathbf{d}}_{\mathbf{h}}$.
- Transition network: input size ${\mathbf{d}}_{\mathbf{h}}$; 3 linear layers of size 64, 64, and $2{\mathbf{d}}_{\mathbf{z}}$, with ReLUs.
- Emission network: input size ${\mathbf{d}}_{\mathbf{h}}+{\mathbf{d}}_{\mathbf{z}}$; 3 linear layers of size 32, 32, and $2{\mathbf{d}}_{\mathbf{x}}$, with ReLUs.
- Inference network: input size ${\mathbf{d}}_{\mathbf{h}}+{\mathbf{d}}_{\mathbf{x}}$; 3 linear layers of size 64, 64, and $2{\mathbf{d}}_{\mathbf{z}}$, with ReLUs.

**Table A1.** Observation, latent, and hidden state dimensions in the four experiments.

| | ${\mathbf{d}}_{\mathbf{x}}$ | ${\mathbf{d}}_{\mathbf{z}}$ | ${\mathbf{d}}_{\mathbf{h}}$ |
|---|---|---|---|
| Lorenz | 3 | 6 | 32 |
| Taxi | 2 | 6 | 32 |
| Pollution | 12 | 8 | 48 |
| SportVu | 2 | 6 | 32 |
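As a rough sanity check on these sizes, the parameter counts of the four VDM networks can be tallied from the layer specifications above, taking the Lorenz dimensions ($d_x = 3$, $d_z = 6$, $d_h = 32$). This is a sketch only: the GRU count assumes the common three-gate parametrization with two bias vectors per gate, and the total will not exactly match Table A2, which also reflects parametrization details not listed here (such as the weight function).

```python
def mlp_params(sizes):
    """Parameter count of a stack of linear layers (weights + biases)."""
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

def gru_params(d_in, d_h):
    """Single-layer GRU: three gates, each with input weights, recurrent
    weights, and two bias vectors (the usual PyTorch-style layout)."""
    return 3 * (d_in * d_h + d_h * d_h + 2 * d_h)

d_x, d_z, d_h = 3, 6, 32  # Lorenz configuration
total = (gru_params(d_z, d_h)
         + mlp_params([d_h, 64, 64, 2 * d_z])           # transition network
         + mlp_params([d_h + d_z, 32, 32, 2 * d_x])     # emission network
         + mlp_params([d_h + d_x, 64, 64, 2 * d_z]))    # inference network
```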

**Table A2.** Number of parameters for each model in the four experiments. VDM, AESMC, DMM-IAF, VRNN, and RKN have comparable numbers of parameters; CF-VAE has far more.

| | RKN | VRNN | CF-VAE | DMM-IAF | AESMC | VDM |
|---|---|---|---|---|---|---|
| Lorenz | 23,170 | 22,506 | 7,497,468 | 24,698 | 22,218 | 22,218 |
| Taxi | 23,118 | 22,248 | 7,491,123 | 24,536 | 22,056 | 22,056 |
| Pollution | 35,774 | 33,192 | 8,162,850 | 36,328 | 31,464 | 31,464 |
| SportVu | 23,118 | 22,248 | 7,491,123 | 24,536 | 22,056 | 22,056 |

## References

- Le, T.A.; Igl, M.; Rainforth, T.; Jin, T.; Wood, F. Auto-Encoding Sequential Monte Carlo. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Krishnan, R.G.; Shalit, U.; Sontag, D. Structured Inference Networks for Nonlinear State Space Models. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
- Kingma, D.P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; Welling, M. Improved Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2016; Volume 29.
- Bhattacharyya, A.; Hanselmann, M.; Fritz, M.; Schiele, B.; Straehle, C.N. Conditional Flow Variational Autoencoders for Structured Sequence Prediction. arXiv **2019**, arXiv:1908.09008.
- Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A.C.; Bengio, Y. A Recurrent Latent Variable Model for Sequential Data. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2980–2988.
- Becker, P.; Pandya, H.; Gebhardt, G.; Zhao, C.; Taylor, J.; Neumann, G. Recurrent Kalman Networks: Factorized Inference in High-Dimensional Deep Feature Spaces. In Proceedings of the Thirty-Sixth International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
- Villani, C. Optimal Transport: Old and New; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; Volume 338.
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. **1997**, 9, 1735–1780.
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv **2014**, arXiv:1412.3555.
- Fraccaro, M.; Sønderby, S.K.; Paquet, U.; Winther, O. Sequential Neural Models with Stochastic Layers. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2199–2207.
- Gemici, M.; Hung, C.C.; Santoro, A.; Wayne, G.; Mohamed, S.; Rezende, D.J.; Amos, D.; Lillicrap, T. Generative Temporal Models with Memory. arXiv **2017**, arXiv:1702.04649.
- Li, Y.; Mandt, S. Disentangled Sequential Autoencoder. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
- Goyal, A.G.A.P.; Sordoni, A.; Côté, M.A.; Ke, N.R.; Bengio, Y. Z-Forcing: Training Stochastic Recurrent Networks. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6713–6723.
- Naesseth, C.; Linderman, S.; Ranganath, R.; Blei, D. Variational Sequential Monte Carlo. In Proceedings of the International Conference on Artificial Intelligence and Statistics (PMLR), Playa Blanca, Lanzarote, Spain, 9–11 April 2018; pp. 968–977.
- Hirt, M.; Dellaportas, P. Scalable Bayesian Learning for State Space Models Using Variational Inference with SMC Samplers. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Naha, Japan, 16–18 April 2019; pp. 76–86.
- Saeedi, A.; Kulkarni, T.D.; Mansinghka, V.K.; Gershman, S.J. Variational Particle Approximations. J. Mach. Learn. Res. **2017**, 18, 2328–2356.
- Schmidt, F.; Hofmann, T. Deep State Space Models for Unconditional Word Generation. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 6158–6168.
- Schmidt, F.; Mandt, S.; Hofmann, T. Autoregressive Text Generation Beyond Feedback Loops. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3391–3397.
- Ziegler, Z.M.; Rush, A.M. Latent Normalizing Flows for Discrete Sequences. In Proceedings of the Thirty-Sixth International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
- Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2172–2180.
- Li, Y.; Song, J.; Ermon, S. InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 3812–3822.
- Bhattacharyya, A.; Schiele, B.; Fritz, M. Accurate and Diverse Sampling of Sequences Based on a “Best of Many” Sample Objective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8485–8493.
- Sadeghian, A.; Kosaraju, V.; Sadeghian, A.; Hirose, N.; Rezatofighi, H.; Savarese, S. SoPhie: An Attentive GAN for Predicting Paths Compliant to Social and Physical Constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1349–1358.
- Kosaraju, V.; Sadeghian, A.; Martín-Martín, R.; Reid, I.; Rezatofighi, H.; Savarese, S. Social-BiGAT: Multimodal Trajectory Forecasting Using Bicycle-GAN and Graph Attention Networks. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 137–146.
- Karl, M.; Soelch, M.; Bayer, J.; Van der Smagt, P. Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
- Fraccaro, M.; Kamronn, S.; Paquet, U.; Winther, O. A Disentangled Recognition and Nonlinear Dynamics Model for Unsupervised Learning. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 3601–3610.
- Rangapuram, S.S.; Seeger, M.W.; Gasthaus, J.; Stella, L.; Wang, Y.; Januschowski, T. Deep State Space Models for Time Series Forecasting. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 7785–7794.
- Zheng, X.; Zaheer, M.; Ahmed, A.; Wang, Y.; Xing, E.P.; Smola, A.J. State Space LSTM Models with Particle MCMC Inference. arXiv **2017**, arXiv:1711.11179.
- Doerr, A.; Daniel, C.; Schiegg, M.; Nguyen-Tuong, D.; Schaal, S.; Toussaint, M.; Trimpe, S. Probabilistic Recurrent State-Space Models. In Proceedings of the Thirty-Fifth International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
- De Brouwer, E.; Simm, J.; Arany, A.; Moreau, Y. GRU-ODE-Bayes: Continuous Modeling of Sporadically-Observed Time Series. In Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 7377–7388.
- Gedon, D.; Wahlström, N.; Schön, T.B.; Ljung, L. Deep State Space Models for Nonlinear System Identification. arXiv **2020**, arXiv:2003.14162.
- Linderman, S.; Johnson, M.; Miller, A.; Adams, R.; Blei, D.; Paninski, L. Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 914–922.
- Nassar, J.; Linderman, S.; Bugallo, M.; Park, I.M. Tree-Structured Recurrent Switching Linear Dynamical Systems for Multi-Scale Modeling. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Becker-Ehmck, P.; Peters, J.; Van Der Smagt, P. Switching Linear Dynamics for Variational Bayes Filtering. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 553–562.
- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation. arXiv **2014**, arXiv:1406.1078.
- Auger-Méthé, M.; Field, C.; Albertsen, C.M.; Derocher, A.E.; Lewis, M.A.; Jonsen, I.D.; Flemming, J.M. State-Space Models’ Dirty Little Secrets: Even Simple Linear Gaussian Models Can Have Estimation Problems. Sci. Rep. **2016**, 6, 26677.
- Maaløe, L.; Sønderby, C.K.; Sønderby, S.K.; Winther, O. Auxiliary Deep Generative Models. In Proceedings of the International Conference on Machine Learning (PMLR), New York, NY, USA, 20–22 June 2016; pp. 1445–1453.
- Ranganath, R.; Tran, D.; Blei, D. Hierarchical Variational Models. In Proceedings of the International Conference on Machine Learning (PMLR), New York, NY, USA, 20–22 June 2016; pp. 324–333.
- Sobolev, A.; Vetrov, D. Importance Weighted Hierarchical Variational Inference. arXiv **2019**, arXiv:1905.03290.
- Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
- Owen, A.B. Monte Carlo Theory, Methods and Examples; 2013. Available online: https://artowen.su.domains/mc/ (accessed on 30 September 2021).
- Schön, T.B.; Lindsten, F.; Dahlin, J.; Wågberg, J.; Naesseth, C.A.; Svensson, A.; Dai, L. Sequential Monte Carlo Methods for System Identification. IFAC-Pap. **2015**, 48, 775–786.
- Wan, E.A.; Van Der Merwe, R. The Unscented Kalman Filter for Nonlinear Estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373), Lake Louise, AB, Canada, 4 October 2000; pp. 153–158.
- Wu, Y.; Hu, D.; Wu, M.; Hu, X. A Numerical-Integration Perspective on Gaussian Filters. IEEE Trans. Signal Process. **2006**, 54, 2910–2921.
- Arasaratnam, I.; Haykin, S. Cubature Kalman Filters. IEEE Trans. Autom. Control **2009**, 54, 1254–1269.
- Lee, N.; Choi, W.; Vernaza, P.; Choy, C.B.; Torr, P.H.; Chandraker, M. DESIRE: Distant Future Prediction in Dynamic Scenes with Interacting Agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 336–345.
- Tomczak, J.M.; Welling, M. VAE with a VampPrior. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, Playa Blanca, Spain, 9–11 April 2018.

**Figure 1.** Forecasting taxi trajectories is challenging due to the highly multi-modal nature of the data (**a**). VDM (**b**) succeeds in generating diverse plausible predictions (red), based on the beginning of a trajectory (blue). The other methods, auto-encoding sequential Monte Carlo (AESMC) [1], the deep Markov model [2] with variational posteriors based on inverse autoregressive flows [3] (DMM-IAF), the conditional flow variational autoencoder (CF-VAE) [4], the variational recurrent neural network (VRNN) [5], and the recurrent Kalman network (RKN) [6], all suffer from mode averaging.

**Figure 2.** Experiments on 2d synthetic data with 4 modes highlight the multi-modality of VDM. We train VDM (left), DMM-IAF (middle), and AESMC (right) on a training set $\mathcal{D}$ of trajectories of length 4, and plot generated trajectories $\widehat{\mathbf{X}}$ (2 colors for 2 dimensions). VDM and AESMC both use 9 samples. We also plot the aggregated posterior $p\left({\mathbf{z}}_{2}\right|\mathcal{D})$ and the predictive prior $p\left({\mathbf{z}}_{2}\right|{\mathbf{x}}_{\le 1})$ at the second time step (4 colors for 4 clusters, unrelated to the colors in the trajectory plots). Only VDM learns a multi-modal predictive prior, which explains its success in modeling multi-modal dynamics.

**Figure 3.** Generated samples from VDM and baselines for the stochastic Lorenz attractor. The models generate the future 990 steps (blue) based on the first 10 observations (red). Due to the chaotic nature of the system, exact reconstruction is impossible even if the model learns the right dynamics. VDM, AESMC, and DMM-IAF capture the stochastic dynamics well, while RKN fails.

**Figure 4.**An illustration of predictive priors $p\left({\mathbf{z}}_{t}\right|{\mathbf{x}}_{<t})$ of taxi trajectories from VDM, DMM-IAF, and AESMC at 3 forks in the road marked on the map. VDM and AESMC both use 13 samples. VDM succeeds in capturing the multi-modal distributions, while DMM-IAF and AESMC approximate them with uni-modal distributions. For visualization, the distributions are projected to 2d with KDE.

**Figure 5.**VDM and CF-VAE generate plausible multi-modal trajectories of basketball plays. Each model’s forecasts (blue) are based on the first 10 observations (red). Ground truth data is green.

**Table 1.** Definition of VDM variants. By tuning the modeling choices of sampling (MC sampling or SCA in Section 3.4), weights (uniform weights, soft weights in Equation (9), or hard weights in Equation (13)), and the loss function (with or without ${\mathcal{L}}_{pred}$), we propose 5 variants of VDM.

| | VDM | VDM (${\mathcal{L}}_{\mathbf{ELBO}}$) | VDM-SCA-S | VDM-MC-S | VDM-MC-U |
|---|---|---|---|---|---|
| Sampling | SCA | SCA | SCA | Monte Carlo | Monte Carlo |
| Weights | hard | hard | soft | soft | uniform |
| Loss | ${\mathcal{L}}_{\mathrm{VDM}}$ | $-{\mathcal{L}}_{\mathrm{ELBO}}$ | ${\mathcal{L}}_{\mathrm{VDM}}$ | ${\mathcal{L}}_{\mathrm{VDM}}$ | ${\mathcal{L}}_{\mathrm{VDM}}$ |

**Table 2.** Prediction error on the stochastic Lorenz attractor and taxi trajectories for three evaluation metrics (details in main text). On the stochastic Lorenz attractor, VDM achieves the best performance; AESMC and DMM-IAF give comparable results. On the taxi trajectories, CF-VAE achieves the best result in multi-step-ahead prediction, since it uses a global latent variable that guides the trajectories in generally the right direction. Meanwhile, the VDM variants outperform all sequential models and outperform CF-VAE on the other metrics. To test different modeling choices, we include the VDM variants of Table 1.

| | Lorenz: Multi-Step | Lorenz: One-Step | Lorenz: W-Distance | Taxi: Multi-Step | Taxi: One-Step | Taxi: W-Distance |
|---|---|---|---|---|---|---|
| RKN | 104.41 | 1.88 | 16.16 | 4.25 | −2.90 | 2.07 |
| VRNN | 65.89 ± 0.21 | −1.63 | 16.14 ± 0.006 | 5.51 ± 0.002 | −2.77 | 2.43 ± 0.0002 |
| CF-VAE | 32.41 ± 0.13 | n.a. | 8.44 ± 0.005 | 2.77 ± 0.001 | n.a. | 0.76 ± 0.0003 |
| DMM-IAF | 25.26 ± 0.24 | −1.29 | 7.47 ± 0.014 | 3.29 ± 0.001 | −2.45 | 0.70 ± 0.0003 |
| AESMC | 25.01 ± 0.22 | −1.69 | 7.29 ± 0.005 | 3.31 ± 0.001 | −2.87 | 0.66 ± 0.0004 |
| VDM | 24.49 ± 0.16 | −1.81 | 7.29 ± 0.003 | 2.88 ± 0.002 | −3.68 | 0.56 ± 0.0008 |
| $\mathrm{VDM}\left({\mathcal{L}}_{\mathrm{ELBO}}\right)$ | 25.01 ± 0.27 | −1.74 | 7.30 ± 0.004 | 3.10 ± 0.005 | −3.05 | 0.61 ± 0.0003 |
| VDM-SCA-S | 24.69 ± 0.16 | −1.83 | 7.30 ± 0.009 | 3.09 ± 0.001 | −3.24 | 0.64 ± 0.0005 |
| VDM-MC-S | 24.67 ± 0.16 | −1.84 | 7.30 ± 0.005 | 3.17 ± 0.001 | −3.21 | 0.68 ± 0.0008 |
| VDM-MC-U | 25.04 ± 0.28 | −1.81 | 7.31 ± 0.002 | 3.30 ± 0.002 | −2.42 | 0.69 ± 0.0002 |

**Table 3.** Prediction error on basketball players’ trajectories and U.S. pollution data for two evaluation metrics (details in main text). VDM makes the most accurate multi-step and one-step-ahead predictions. The tested variants of VDM are defined in Table 1.

| | SportVu: Multi-Step | SportVu: One-Step | Pollution: Multi-Step | Pollution: One-Step |
|---|---|---|---|---|
| RKN | 4.88 | 1.55 | 53.13 | 6.98 |
| VRNN | 5.42 ± 0.009 | −2.78 | 49.32 ± 0.13 | 8.69 |
| CF-VAE | 3.24 ± 0.003 | n.a. | 45.86 ± 0.04 | n.a. |
| DMM-IAF | 3.63 ± 0.002 | −3.74 | 44.82 ± 0.11 | 9.41 |
| AESMC | 3.74 ± 0.003 | −3.91 | 41.14 ± 0.13 | 6.93 |
| VDM | 3.23 ± 0.003 | −5.44 | 37.64 ± 0.07 | 6.91 |
| $\mathrm{VDM}\left({\mathcal{L}}_{\mathrm{ELBO}}\right)$ | 3.29 ± 0.003 | −5.04 | 39.87 ± 0.04 | 7.60 |
| VDM-SCA-S | 3.31 ± 0.001 | −5.08 | 39.58 ± 0.09 | 7.82 |
| VDM-MC-S | 3.35 ± 0.007 | −5.00 | 40.33 ± 0.03 | 8.12 |
| VDM-MC-U | 3.39 ± 0.006 | −4.82 | 41.81 ± 0.10 | 7.71 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Qiu, C.; Mandt, S.; Rudolph, M.
History Marginalization Improves Forecasting in Variational Recurrent Neural Networks. *Entropy* **2021**, *23*, 1563.
https://doi.org/10.3390/e23121563
