# History Marginalization Improves Forecasting in Variational Recurrent Neural Networks

## Abstract

## 1. Introduction

- A new inference model. We establish a new type of variational family for inference in sequential latent variable models. Instead of a structured variational approximation, VDM marginalizes over past states. This leads to an efficient mean-field factorization where each variational factor is multi-modal by construction.
- An evaluation metric for multi-modal forecasting. The negative log-likelihood measures predictive accuracy but neglects an important aspect of multi-modal forecasts—sample diversity. In Section 4, we propose a score inspired by the Wasserstein distance [7] which evaluates both prediction quality and diversity. This metric complements our evaluation based on log-likelihoods.
- An extensive empirical study. In Section 4, we use VDM to study various datasets, including synthetic data, a stochastic Lorenz attractor, taxi trajectories, basketball player trajectories, and a U.S. pollution dataset with measurements of various pollutants over time. We illustrate that VDM can model multi-modal dynamics and provide quantitative comparisons showing that VDM compares favorably to previous work.

## 2. Related Work

## 3. Method–Variational Dynamic Mixtures

#### 3.1. The Generative Model of VDM

#### 3.2. The Variational Posterior of VDM

- ${q}_{\mathrm{inf}}$ reflects the generative model’s transition dynamics and combines it with the current observation ${\mathbf{x}}_{t}$. It is a Gaussian distribution whose parameters are obtained by propagating ${\mathbf{z}}_{<t}$ through the RNN of the generative model and using an inference network to combine the output with ${\mathbf{x}}_{t}$.
- ${q}_{\mathrm{tar}}$ is a distribution we will use to sample past states for approximating the marginalization in Equation (3). Its name suggests that it is generally intractable and will be approximated via self-normalized importance sampling.

#### 3.2.1. Parametrization of the Variational Posterior

#### 3.2.2. Generalized Mixture Weights

**Algorithm 1.** Generative model.

- Inputs: ${\mathbf{z}}_{\tau},{\mathbf{h}}_{\tau}$
- Outputs: ${\mathbf{x}}_{\tau +1:T}$
- for $t=\tau +1:T$ do
  - ${\mathbf{h}}_{t}={\varphi}^{\mathrm{GRU}}({\mathbf{z}}_{t-1},{\mathbf{h}}_{t-1})$
  - $[{\mu}_{0,t},{\sigma}_{0,t}^{2}]={\varphi}^{\mathrm{tra}}({\mathbf{h}}_{t})$ (Equation (1))
  - ${\mathbf{z}}_{t}\sim \mathcal{N}({\mu}_{0,t},{\sigma}_{0,t}^{2}\mathbb{I})$
  - $[{\mu}_{x,t},{\sigma}_{x,t}^{2}]={\varphi}^{\mathrm{dec}}({\mathbf{z}}_{t},{\mathbf{h}}_{t})$ (Equation (2))
  - ${\mathbf{x}}_{t}\sim \mathcal{N}({\mu}_{x,t},{\sigma}_{x,t}^{2}\mathbb{I})$
- end for
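The sampling loop of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rollout` and the stand-ins `toy_gru`, `toy_tra`, and `toy_dec` are hypothetical placeholders for the learned networks ${\varphi}^{\mathrm{GRU}}$, ${\varphi}^{\mathrm{tra}}$, and ${\varphi}^{\mathrm{dec}}$.

```python
import math
import random


def rollout(z, h, steps, gru, tra, dec, rng):
    """Ancestral sampling in the spirit of Algorithm 1; returns sampled observations."""
    xs = []
    for _ in range(steps):
        h = gru(z, h)                      # h_t = phi_GRU(z_{t-1}, h_{t-1})
        mu0, var0 = tra(h)                 # parameters of the transition prior
        z = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu0, var0)]
        mux, varx = dec(z, h)              # parameters of the emission distribution
        xs.append([rng.gauss(m, math.sqrt(v)) for m, v in zip(mux, varx)])
    return xs


# Hypothetical toy networks, used only to make the sketch runnable.
def toy_gru(z, h):
    return [math.tanh(0.5 * hi + 0.5 * sum(z)) for hi in h]


def toy_tra(h):
    s = sum(h) / len(h)
    return [s, -s], [0.1, 0.1]             # mean and diagonal variance of z_t


def toy_dec(z, h):
    return [z[0] + h[0]], [0.01]           # mean and diagonal variance of x_t


samples = rollout([0.0, 0.0], [0.0, 0.0, 0.0], 5,
                  toy_gru, toy_tra, toy_dec, random.Random(0))
```

Each call draws one trajectory; repeating the call with the same starting state yields different forecasts because both the latent transition and the emission are sampled.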

**Algorithm 2.** Inference model.

- Inputs: ${\mathbf{x}}_{1:\tau},{\widehat{\mathbf{h}}}_{1}$
- Outputs: ${\mathbf{z}}_{1:\tau},{\widehat{\mathbf{h}}}_{\tau}$
- $[{\mu}_{\mathrm{inf},1},{\sigma}_{\mathrm{inf},1}^{2}]={\varphi}^{\mathrm{inf}}({\mathbf{x}}_{1},{\widehat{\mathbf{h}}}_{1})$
- ${\mathbf{z}}_{1}^{\left(i\right)}\sim \mathcal{N}({\mu}_{\mathrm{inf},1},{\sigma}_{\mathrm{inf},1}^{2}\mathbb{I})$
- for $t=2:\tau $ do
  - ${\mathbf{h}}_{t}^{\left(i\right)}={\varphi}^{\mathrm{GRU}}({\mathbf{z}}_{t-1}^{\left(i\right)},{\widehat{\mathbf{h}}}_{t-1})$ (Equation (5))
  - $[{\mu}_{\mathrm{inf},t}^{\left(i\right)},{\sigma}_{\mathrm{inf},t}^{\left(i\right)2}]={\varphi}^{\mathrm{inf}}({\mathbf{x}}_{t},{\mathbf{h}}_{t}^{\left(i\right)})$ (Equation (6))
  - ${\omega}_{t}^{\left(i\right)}=\omega ({\mathbf{x}}_{t},{\mathbf{h}}_{t}^{\left(i\right)})/{\sum}_{j=1}^{K}\omega ({\mathbf{x}}_{t},{\mathbf{h}}_{t}^{\left(j\right)})$ (Equation (8))
  - ${\mathbf{z}}_{t}^{\left(i\right)}\sim {\sum}_{j=1}^{K}{\omega}_{t}^{\left(j\right)}\mathcal{N}({\mu}_{\mathrm{inf},t}^{\left(j\right)},{\sigma}_{\mathrm{inf},t}^{\left(j\right)2}\mathbb{I})$
  - ${\widehat{\mathbf{h}}}_{t}={\sum}_{j=1}^{K}{\omega}_{t}^{\left(j\right)}{\mathbf{h}}_{t}^{\left(j\right)}$
- end for
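One iteration of the loop in Algorithm 2 can be sketched as below. This is an illustrative sketch only: the names `inference_step`, `toy_gru`, `toy_inf`, and `toy_weight` are hypothetical, the unnormalized weight function is an arbitrary positive placeholder rather than the paper's weighting, and new latent states are drawn from the mixture by plain Monte Carlo sampling rather than by the stochastic cubature approximation.

```python
import math
import random


def inference_step(x_t, z_prev, h_hat_prev, gru, inf, weight_fn, rng):
    """One time step: K past samples -> mixture posterior -> K new samples."""
    K = len(z_prev)
    hs = [gru(z, h_hat_prev) for z in z_prev]        # per-sample hidden states
    params = [inf(x_t, h) for h in hs]               # per-component (mean, variance)
    raw = [weight_fn(x_t, h) for h in hs]            # unnormalized, positive weights
    total = sum(raw)
    ws = [r / total for r in raw]                    # normalized mixture weights
    z_new = []
    for _ in range(K):
        j = rng.choices(range(K), weights=ws)[0]     # pick a mixture component
        mu, var = params[j]
        z_new.append([rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)])
    # Weighted average of the hidden states summarizes the marginalized history.
    h_hat = [sum(w * h[d] for w, h in zip(ws, hs)) for d in range(len(hs[0]))]
    return z_new, h_hat, ws


# Hypothetical toy networks, used only to make the sketch runnable.
def toy_gru(z, h):
    return [math.tanh(hi + sum(z)) for hi in h]


def toy_inf(x, h):
    return [x[0] + h[0], x[0] - h[0]], [0.05, 0.05]


def toy_weight(x, h):
    return math.exp(-(x[0] - h[0]) ** 2)


z_prev = [[0.1, -0.2], [1.0, 0.5], [-0.7, 0.3]]
z_new, h_hat, ws = inference_step([0.3], z_prev, [0.0, 0.0],
                                  toy_gru, toy_inf, toy_weight, random.Random(0))
```

Because every new sample is drawn from a weighted mixture of Gaussians, the variational factor at each step can be multi-modal even though each component is a simple Gaussian.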

#### 3.3. The Variational Objective of VDM

#### 3.4. Alternative Modeling Choices

## 4. Evaluation and Experiments

#### 4.1. Evaluation Metrics

#### 4.2. Baselines

#### 4.3. Ablations

#### 4.4. Results

#### 4.4.1. Synthetic Data with Multi-Modal Dynamics

#### 4.4.2. Stochastic Lorenz Attractor

#### 4.4.3. Taxi Trajectories

#### 4.4.4. NBA SportVu Data

#### 4.4.5. U.S. Pollution Data

The dataset contains measurements of four pollutants (NO$_2$, O$_3$, SO$_2$, and CO). The goal is to predict monthly pollution values for the coming 18 months, given observations of the previous 6 months. We ignore geographical location and time information, treating the pollution trends in different counties and at different times as i.i.d. The unknown context information makes the dynamics multi-modal and challenging to predict accurately. Because the dataset is small and high-dimensional, there are not enough samples with very similar initial observations, so we cannot evaluate the empirical W-distance in this experiment. VDM outperforms all baselines in both evaluations (Table 3).

## 5. Conclusions

## Appendix A. ELBO Derivations

## Appendix B. Supplementary to Stochastic Cubature Approximation

## Appendix C. Supplementary to Experiments Setup

#### Appendix C.1. Stochastic Lorenz Attractor Setup

#### Appendix C.2. Taxi Trajectories Setup

#### Appendix C.3. U.S. Pollution Data Setup

The dataset contains measurements of four pollutants (NO$_2$, O$_3$, SO$_2$, and CO). Each pollutant is described by 3 values (mean, maximum, and air quality index). The data were collected daily from counties in different states from 2000 to 2016. Since the daily measurements are very noisy and volatile, we compute the monthly average of each measurement and then extract non-overlapping segments of length 24 from the dataset. In total, we extract 1639 sequences as the training set, 25 sequences as the validation set, and 300 sequences as the test set.
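The preprocessing described above can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function names and the `(year, month, value)` record layout are assumptions, and the 6/18 split reflects the forecasting task stated in Section 4.4.5.

```python
from collections import defaultdict


def monthly_average(records):
    """Average daily values per (year, month); records are (year, month, value)."""
    buckets = defaultdict(list)
    for year, month, value in records:
        buckets[(year, month)].append(value)
    # Sort by (year, month) so the result is in chronological order.
    return [sum(vs) / len(vs) for _, vs in sorted(buckets.items())]


def non_overlapping_segments(series, length=24):
    """Cut a series into consecutive segments, dropping any incomplete tail."""
    n = len(series) // length
    return [series[i * length:(i + 1) * length] for i in range(n)]


def split_context_target(segment, n_context=6):
    """First n_context months are observed; the remaining months are forecast."""
    return segment[:n_context], segment[n_context:]


daily = [(2000, 1, 1.0), (2000, 1, 3.0), (2000, 2, 4.0)]
averages = monthly_average(daily)

monthly = list(range(50))                  # stand-in for 50 monthly values
segments = non_overlapping_segments(monthly)
context, target = split_context_target(segments[0])
```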

#### Appendix C.4. NBA SportVu Data Setup

## Appendix D. Implementation Details

- Latent RNN: summarize the past latent states ${\mathbf{z}}_{<t}$ in the hidden states ${\mathbf{h}}_{t}$.
- Transition network: propagate the latent states ${\mathbf{z}}_{t}$ forward in time.
- Emission network: map the latent states ${\mathbf{z}}_{t}$ and hidden states ${\mathbf{h}}_{t}$ to observations ${\mathbf{x}}_{t}$.
- Inference network: update states ${\mathbf{z}}_{t}$ given observations ${\mathbf{x}}_{t}$ and hidden states ${\mathbf{h}}_{t}$.

^{−3}. In all experiments, the networks have the same architectures but different sizes. The model size depends on the observation dimension ${\mathbf{d}}_{\mathbf{x}}$, the latent state dimension ${\mathbf{d}}_{\mathbf{z}}$, and the hidden state dimension ${\mathbf{d}}_{\mathbf{h}}$. The number of samples used at each time step during training is $2{\mathbf{d}}_{\mathbf{z}}+1$. When a network outputs a variance, we apply a softplus to ensure that it is non-negative.
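The two rules in the paragraph above, the softplus on variance outputs and the per-step sample count, can be made concrete with a short sketch. The function names are ours, and the numerically stable softplus form is one common choice rather than something the text prescribes.

```python
import math


def softplus(a):
    """log(1 + exp(a)), written so that large |a| does not overflow."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def num_samples(d_z):
    """Samples per time step during training: 2 * d_z + 1."""
    return 2 * d_z + 1


variance = softplus(-3.0)        # a raw network output mapped to a valid variance
taxi_samples = num_samples(6)    # d_z = 6 for the taxi data
```

With ${\mathbf{d}}_{\mathbf{z}}=6$ the rule gives 13 samples, which agrees with the 13 samples reported for the taxi experiment in Figure 4.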

- Latent RNN: one layer GRU of input size ${\mathbf{d}}_{\mathbf{z}}$ and hidden size ${\mathbf{d}}_{\mathbf{h}}$
- Transition network: input size is ${\mathbf{d}}_{\mathbf{h}}$; 3 linear layers of size 64, 64, and $2{\mathbf{d}}_{\mathbf{z}}$, with ReLUs.
- Emission network: input size is ${\mathbf{d}}_{\mathbf{h}}+{\mathbf{d}}_{\mathbf{z}}$; 3 linear layers of size 32, 32 and $2{\mathbf{d}}_{\mathbf{x}}$, with ReLUs.
- Inference network: input size is ${\mathbf{d}}_{\mathbf{h}}+{\mathbf{d}}_{\mathbf{x}}$; 3 linear layers of size 64, 64, and $2{\mathbf{d}}_{\mathbf{z}}$, with ReLUs.
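As a worked example of the layer sizes listed above, the following sketch counts the weights and biases of the four listed components. It is an estimate under assumptions: every linear layer is taken to have a bias, and the GRU is counted with three gates and two bias vectors per gate (the convention used by PyTorch's `nn.GRU`). The helper names are ours. The totals cover only these four components, so they are not expected to reproduce reported model sizes that may include further parameters.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(n_in, layer_sizes):
    """Parameter count of a stack of linear layers (ReLUs add no parameters)."""
    total = 0
    for n_out in layer_sizes:
        total += linear_params(n_in, n_out)
        n_in = n_out
    return total


def gru_params(n_in, n_hidden):
    """One-layer GRU: 3 gates, input and recurrent weights, two bias vectors."""
    return 3 * (n_hidden * n_in + n_hidden * n_hidden + 2 * n_hidden)


def listed_component_params(d_x, d_z, d_h):
    return {
        "latent_rnn": gru_params(d_z, d_h),
        "transition": mlp_params(d_h, [64, 64, 2 * d_z]),
        "emission": mlp_params(d_h + d_z, [32, 32, 2 * d_x]),
        "inference": mlp_params(d_h + d_x, [64, 64, 2 * d_z]),
    }


lorenz = listed_component_params(d_x=3, d_z=6, d_h=32)
lorenz_total = sum(lorenz.values())
```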

Dataset | ${\mathbf{d}}_{\mathbf{x}}$ | ${\mathbf{d}}_{\mathbf{z}}$ | ${\mathbf{d}}_{\mathbf{h}}$ |
---|---|---|---|
Lorenz | 3 | 6 | 32 |
Taxi | 2 | 6 | 32 |
Pollution | 12 | 8 | 48 |
SportVu | 2 | 6 | 32 |

**Table A2.** Number of parameters for each model in the four experiments. VDM, AESMC, DMM-IAF, VRNN, and RKN have a comparable number of parameters. CF-VAE has many more parameters.

Dataset | RKN | VRNN | CF-VAE | DMM-IAF | AESMC | VDM |
---|---|---|---|---|---|---|
Lorenz | 23,170 | 22,506 | 7,497,468 | 24,698 | 22,218 | 22,218 |
Taxi | 23,118 | 22,248 | 7,491,123 | 24,536 | 22,056 | 22,056 |
Pollution | 35,774 | 33,192 | 8,162,850 | 36,328 | 31,464 | 31,464 |
SportVu | 23,118 | 22,248 | 7,491,123 | 24,536 | 22,056 | 22,056 |

**Figure 1.** Forecasting taxi trajectories is challenging due to the highly multi-modal nature of the data (**a**). VDM (**b**) succeeds in generating diverse plausible predictions (red), based on the beginning of a trajectory (blue). The other methods, auto-encoding sequential Monte Carlo (AESMC) [1], deep Markov model [2] with variational posteriors based on inverse autoregressive flows [3] (DMM-IAF), conditional flow variational autoencoder (CF-VAE) [4], variational recurrent neural network (VRNN) [5], and recurrent Kalman network (RKN) [6], suffer from mode averaging.

**Figure 2.** Experiments on 2d synthetic data with 4 modes highlight the multi-modality of VDM. We train VDM (left), DMM-IAF (middle), and AESMC (right) on a training set of trajectories $\mathcal{D}$ of length 4, and plot generated trajectories $\widehat{\mathbf{X}}$ (2 colors for 2 dimensions). VDM and AESMC both use 9 samples. We also plot the aggregated posterior $p({\mathbf{z}}_{2}\mid \mathcal{D})$ and the predictive prior $p({\mathbf{z}}_{2}\mid {\mathbf{x}}_{\le 1})$ (4 colors for 4 clusters, not related to the colors in the trajectories plot) at the second time step. Only VDM learns a multi-modal predictive prior, which explains its success in modeling multi-modal dynamics.

**Figure 3.** Generated samples from VDM and baselines for the stochastic Lorenz attractor. The models generate the future 990 steps (blue) based on the first 10 observations (red). Because the system is chaotic, exact reconstruction is impossible even if the model learns the right dynamics. VDM, AESMC, and DMM-IAF capture the stochastic dynamics well, while RKN fails.

**Figure 4.** An illustration of predictive priors $p({\mathbf{z}}_{t}\mid {\mathbf{x}}_{<t})$ of taxi trajectories from VDM, DMM-IAF, and AESMC at 3 forks in the road marked on the map. VDM and AESMC both use 13 samples. VDM succeeds in capturing the multi-modal distributions, while DMM-IAF and AESMC approximate them with uni-modal distributions. For visualization, the distributions are projected to 2d with KDE.

**Figure 5.** VDM and CF-VAE generate plausible multi-modal trajectories of basketball plays. Each model’s forecasts (blue) are based on the first 10 observations (red). Ground truth data is green.

**Table 1.** Definition of VDM variants. By varying the sampling scheme (Monte Carlo sampling or SCA, Section 3.4), the weights (uniform weights, soft weights in Equation (9), or hard weights in Equation (13)), and the loss function (with or without ${\mathcal{L}}_{\mathrm{pred}}$), we obtain 5 variants of VDM.

Setting | VDM | VDM (${\mathcal{L}}_{\mathrm{ELBO}}$) | VDM-SCA-S | VDM-MC-S | VDM-MC-U |
---|---|---|---|---|---|
Sampling | SCA | SCA | SCA | Monte-Carlo | Monte-Carlo |
Weights | hard | hard | soft | soft | uniform |
Loss | ${\mathcal{L}}_{\mathrm{VDM}}$ | $-{\mathcal{L}}_{\mathrm{ELBO}}$ | ${\mathcal{L}}_{\mathrm{VDM}}$ | ${\mathcal{L}}_{\mathrm{VDM}}$ | ${\mathcal{L}}_{\mathrm{VDM}}$ |

**Table 2.** Prediction error on the stochastic Lorenz attractor and taxi trajectories for three evaluation metrics (details in main text). On the stochastic Lorenz attractor, VDM achieves the best performance, with AESMC and DMM-IAF giving comparable results. On the taxi trajectories, CF-VAE achieves the best result in multi-step ahead prediction, since it uses a global variable that guides the trajectories in generally the right direction. VDM variants outperform all sequential models, and outperform CF-VAE on the other metrics. To test different modeling choices, we include the VDM variants of Table 1.

Model | Lorenz: Multi-Step | Lorenz: One-Step | Lorenz: W-Distance | Taxi: Multi-Step | Taxi: One-Step | Taxi: W-Distance |
---|---|---|---|---|---|---|
RKN | 104.41 | 1.88 | 16.16 | 4.25 | −2.90 | 2.07 |
VRNN | 65.89 ± 0.21 | −1.63 | 16.14 ± 0.006 | 5.51 ± 0.002 | −2.77 | 2.43 ± 0.0002 |
CF-VAE | 32.41 ± 0.13 | n.a. | 8.44 ± 0.005 | 2.77 ± 0.001 | n.a. | 0.76 ± 0.0003 |
DMM-IAF | 25.26 ± 0.24 | −1.29 | 7.47 ± 0.014 | 3.29 ± 0.001 | −2.45 | 0.70 ± 0.0003 |
AESMC | 25.01 ± 0.22 | −1.69 | 7.29 ± 0.005 | 3.31 ± 0.001 | −2.87 | 0.66 ± 0.0004 |
VDM | 24.49 ± 0.16 | −1.81 | 7.29 ± 0.003 | 2.88 ± 0.002 | −3.68 | 0.56 ± 0.0008 |
VDM (${\mathcal{L}}_{\mathrm{ELBO}}$) | 25.01 ± 0.27 | −1.74 | 7.30 ± 0.004 | 3.10 ± 0.005 | −3.05 | 0.61 ± 0.0003 |
VDM-SCA-S | 24.69 ± 0.16 | −1.83 | 7.30 ± 0.009 | 3.09 ± 0.001 | −3.24 | 0.64 ± 0.0005 |
VDM-MC-S | 24.67 ± 0.16 | −1.84 | 7.30 ± 0.005 | 3.17 ± 0.001 | −3.21 | 0.68 ± 0.0008 |
VDM-MC-U | 25.04 ± 0.28 | −1.81 | 7.31 ± 0.002 | 3.30 ± 0.002 | −2.42 | 0.69 ± 0.0002 |

**Table 3.** Prediction error on basketball players’ trajectories and U.S. pollution data for two evaluation metrics (details in main text). VDM makes the most accurate multi-step and one-step ahead predictions. The tested variants of VDM are defined in Table 1.

Model | NBA SportVu: Multi-Steps | NBA SportVu: One-Step | US Pollution: Multi-Steps | US Pollution: One-Step |
---|---|---|---|---|
RKN | 4.88 | 1.55 | 53.13 | 6.98 |
VRNN | 5.42 ± 0.009 | −2.78 | 49.32 ± 0.13 | 8.69 |
CF-VAE | 3.24 ± 0.003 | n.a. | 45.86 ± 0.04 | n.a. |
DMM-IAF | 3.63 ± 0.002 | −3.74 | 44.82 ± 0.11 | 9.41 |
AESMC | 3.74 ± 0.003 | −3.91 | 41.14 ± 0.13 | 6.93 |
VDM | 3.23 ± 0.003 | −5.44 | 37.64 ± 0.07 | 6.91 |
VDM (${\mathcal{L}}_{\mathrm{ELBO}}$) | 3.29 ± 0.003 | −5.04 | 39.87 ± 0.04 | 7.60 |
VDM-SCA-S | 3.31 ± 0.001 | −5.08 | 39.58 ± 0.09 | 7.82 |
VDM-MC-S | 3.35 ± 0.007 | −5.00 | 40.33 ± 0.03 | 8.12 |
VDM-MC-U | 3.39 ± 0.006 | −4.82 | 41.81 ± 0.10 | 7.71 |

