Diffusion Probabilistic Modeling for Video Generation

Denoising diffusion probabilistic models are a promising new class of generative models that mark a milestone in high-quality image generation. This paper showcases their ability to sequentially generate video, surpassing prior methods in perceptual and probabilistic forecasting metrics. We propose an autoregressive, end-to-end optimized video diffusion model inspired by recent advances in neural video compression. The model successively generates future frames by correcting a deterministic next-frame prediction using a stochastic residual generated by an inverse diffusion process. We compare this approach against six baselines on four datasets involving natural and simulation-based videos. We find significant improvements in terms of perceptual quality and probabilistic frame forecasting ability for all datasets.

The goals and challenges of video prediction include (i) generating multi-modal, stochastic predictions that (ii) accurately reflect the high-dimensional dynamics of the data over the long term, while (iii) identifying architectures that scale to high-resolution content without blurry artifacts. These goals are complicated by occlusions, lighting conditions, and dynamics on different temporal scales. Broadly speaking, models relying on sequential variational autoencoders (Babaeizadeh et al., 2018; Denton & Fergus, 2018; Castrejon et al., 2019) tend to be stronger in goals (i) and (ii), while sequential extensions of generative adversarial networks (Aigner & Körner, 2018; Kwon & Park, 2019; Lee et al., 2018) tend to perform better in goal (iii). A probabilistic method that succeeds in all three desiderata on high-resolution video content is yet to be found.
Recently, diffusion probabilistic models have achieved considerable progress in image generation, with perceptual quality comparable to GANs while avoiding the optimization challenges of adversarial training (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021c;b). In this paper, we extend diffusion probabilistic models to stochastic video generation. Our ideas are inspired by the principles of predictive coding (Rao & Ballard, 1999; Marino, 2021) and neural compression algorithms (Yang et al., 2021b) and draw on the intuition that residual errors are easier to model than dense observations (Marino et al., 2021). Our architecture relies on two prediction steps: first, we employ a deterministic convolutional RNN to predict the next frame conditioned on a sequence of frames. Second, we correct this prediction by an additive residual generated by a conditional denoising diffusion process (see Figs. 1a and 1b). This approach is scalable to high-resolution video, stochastic, and relies on likelihood-based principles.

Figure 1: Overview: Our approach predicts the next frame μ^t of a video autoregressively along with an additive correction y^t_0 generated by a denoising process. Detailed model: Two convolutional RNNs (blue and red arrows) operate on a frame sequence x^{0:t−1} to predict the most likely next frame μ^t (blue box) and a context vector for a denoising diffusion model. The diffusion model is trained to model the scaled residual y^t_0 = (x^t − μ^t)/σ conditioned on the temporal context. At generation time, the generated residual is added to the next-frame estimate μ^t to form the next frame as x^t = μ^t + σ y^t_0.
Our ablation studies strongly suggest that predicting video frame residuals instead of naively predicting the next frames improves generative performance. By investigating our architecture on various datasets and comparing it against multiple baselines, we achieve a new state of the art in video generation that produces sharp frames at higher resolutions. In more detail, our achievements are as follows:
1. We show how to use diffusion probabilistic models to generate videos. This enables a new path towards probabilistic video forecasting while achieving perceptual quality better than or comparable with likelihood-free methods such as GANs.
2. We also study a marginal version of the Continuous Ranked Probability Score and show that it can be used to assess video prediction performance. Our method is also better at probabilistic forecasting than modern GAN and VAE baselines such as IVRNN, SVG-LP, RetroGAN, DVD-GAN, and FutureGAN (Castrejon et al., 2019; Denton & Fergus, 2018; Kwon & Park, 2019; Clark et al., 2019; Aigner & Körner, 2018).
3. Our ablation studies demonstrate that modeling residuals from the predicted next frame yields better results than directly modeling the next frames. This observation is consistent with recent findings in neural video compression. Figure 1a summarizes the main idea of our approach (Figure 1b has more details).
The structure of our paper is as follows. We first describe our method, followed by a discussion of our experimental findings along with ablation studies. We then discuss connections to the literature and summarize our contributions.

A Diffusion Probabilistic Model for Video
We begin by reviewing the relevant background on diffusion probabilistic models. We then discuss our design choices for extending these models to sequential models for video.

Background on Diffusion Probabilistic Models
Denoising diffusion probabilistic models (DDPMs) are a recent class of generative models with promising properties (Sohl-Dickstein et al., 2015; Ho et al., 2020). Unlike GANs, these models rely on the maximum likelihood training paradigm (and are thus stable to train) while producing samples of perceptual quality comparable to GANs (Brock et al., 2019).
Similar to hierarchical variational autoencoders (VAEs) (Kingma & Welling, 2013), DDPMs are deep latent variable models that describe data x_0 in terms of an underlying sequence of latent variables x_{1:N} such that p_θ(x_0) = ∫ p_θ(x_{0:N}) dx_{1:N}. The main idea is to impose a diffusion process on the data that incrementally destroys its structure. The incremental posterior of the diffusion process yields a stochastic denoising process that can be used to generate structure (Sohl-Dickstein et al., 2015; Ho et al., 2020). The forward, or diffusion, process is given by

q(x_{1:N} | x_0) = ∏_{n=1}^N q(x_n | x_{n−1}),   q(x_n | x_{n−1}) = N(x_n; √(1 − β_n) x_{n−1}, β_n I).   (1)

Besides a predefined incremental variance schedule with β_n ∈ (0, 1) for n ∈ {1, ..., N}, this process is parameter-free (Song & Ermon, 2019; Ho et al., 2020). The reverse process is called the denoising process,

p_θ(x_{0:N}) = p(x_N) ∏_{n=1}^N p_θ(x_{n−1} | x_n),   p_θ(x_{n−1} | x_n) = N(x_{n−1}; M_θ(x_n, n), γβ_n I).   (2)

The reverse process can be thought of as approximating the posterior of the diffusion process. Typically, one fixes the covariance matrix (with hyperparameter γ) and only learns the posterior mean function M_θ(x_n, n). The prior p(x_N) = N(0, I) is typically fixed. The parameter θ can be optimized by minimizing a variational bound on the negative log-likelihood, L_variational = E_q[− log (p_θ(x_{0:N}) / q(x_{1:N} | x_0))]. This bound can be efficiently estimated with stochastic gradients by subsampling steps n at random, since the marginal distributions q(x_n | x_0) can be computed in closed form (Ho et al., 2020).
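The closed-form marginal q(x_n | x_0) = N(√ᾱ_n x_0, (1 − ᾱ_n) I) is what makes the stochastic-gradient estimate above cheap: a noisy sample at an arbitrary step n can be drawn in one shot rather than by simulating n diffusion steps. A minimal NumPy sketch (the linear schedule, its endpoints, and the function names are illustrative choices, not taken from the paper):

```python
import numpy as np

def linear_beta_schedule(N, beta_start=1e-4, beta_end=0.02):
    """Predefined incremental variance schedule with beta_n in (0, 1)."""
    return np.linspace(beta_start, beta_end, N)

def alpha_bar(betas):
    """Cumulative products alpha_bar_n = prod_{i<=n} (1 - beta_i)."""
    return np.cumprod(1.0 - betas)

def diffuse(x0, n, betas, eps):
    """Sample x_n ~ q(x_n | x_0) in one shot via the closed-form marginal:
    x_n = sqrt(alpha_bar_n) * x0 + sqrt(1 - alpha_bar_n) * eps, eps ~ N(0, I).
    The step index n is 1-indexed, matching the paper's notation."""
    ab = alpha_bar(betas)[n - 1]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps
```

As expected of a structure-destroying process, ᾱ_n decreases monotonically towards zero, so q(x_N | x_0) approaches the prior N(0, I).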
In this paper, we use a simplified loss due to Ho et al. (2020), who showed that the variational bound can be reduced to the following denoising score matching loss,

L_simple = E_{x_0, ε, n} [ ‖ε − f_θ(√ᾱ_n x_0 + √(1 − ᾱ_n) ε, n)‖² ],   (3)

where we define ᾱ_n = ∏_{i=1}^n (1 − β_i). The intuitive explanation of this loss is that f_θ tries to predict the noise ε ~ N(0, I) injected at denoising step n (Ho et al., 2020). Once the model is trained, it can be used to generate data by ancestral sampling, starting with a draw from the prior p(x_N) and successively generating more and more structure through an annealed Langevin dynamics procedure (Song & Ermon, 2019; Song et al., 2021c).
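To make the training signal concrete, the following sketch computes one Monte Carlo estimate of the simplified loss for an arbitrary noise-prediction network f_θ, here left as an abstract callable. This is a didactic stand-in, not the paper's training code:

```python
import numpy as np

def simple_loss(f_theta, x0, betas, rng):
    """One Monte Carlo estimate of the simplified denoising loss (Ho et al., 2020):
    subsample a step n and noise eps, corrupt x0 in closed form, and measure how
    well f_theta recovers eps from the corrupted sample."""
    N = len(betas)
    alpha_bar = np.cumprod(1.0 - betas)
    n = rng.integers(1, N + 1)               # subsample a diffusion step uniformly
    eps = rng.standard_normal(x0.shape)      # the noise the network must predict
    x_n = np.sqrt(alpha_bar[n - 1]) * x0 + np.sqrt(1 - alpha_bar[n - 1]) * eps
    return np.mean((eps - f_theta(x_n, n)) ** 2)
```

In practice one would average such estimates over a minibatch and backpropagate through f_θ; here any array-to-array function can be plugged in to inspect the loss.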

Residual Video Diffusion Model
Experience shows that it is often simpler to model differences from our predictions than the predictions themselves. For example, masked autoregressive flows (Papamakarios et al., 2017) transform random noise into an additive prediction error residual, and boosting algorithms train a sequence of models to predict the error residuals of earlier models (Schapire, 1999). Residual errors also play an important role in modern theories of the brain. For example, predictive coding (Rao & Ballard, 1999) postulates that neural circuits estimate probabilistic models of other neural activity, iteratively exchanging information about error residuals. This theory has interesting connections to VAEs (Marino, 2021; Marino et al., 2021) and neural video compression (Agustsson et al., 2020; Yang et al., 2021a), where one also compresses the residuals relative to the most likely next-frame predictions.
This work uses a diffusion model to generate residual corrections to a deterministically predicted next frame, adding stochasticity to the video generation task. Both the deterministic prediction and the denoising process are conditioned on a long-range context provided by a convolutional RNN. We call our approach "Residual Video Diffusion" (RVD). Details are explained next.
Notation. We consider a frame sequence x^{0:T} and a set of latent variables y^{1:T} ≡ y^{1:T}_{0:N}, where the lower indices are specified by a diffusion process. We refer to y^{1:T}_0 as the (scaled) frame residuals.
Generative Process. We consider a joint distribution over x^{0:T} and y^{1:T} of the following form:

p(x^{0:T}, y^{1:T}) = ∏_{t=1}^T p(x^t | y^t, x^{<t}) p(y^t | x^{<t}).   (4)

We first specify the data likelihood term p(x^t | y^t, x^{<t}), which we model autoregressively as a Masked Autoregressive Flow (MAF) (Papamakarios et al., 2017) applied to the frame sequence. This involves an autoregressive prediction network outputting μ_φ and a scale parameter σ,

x^t = μ_φ(x^{<t}) + σ y^t_0.   (5)

Conditioned on y^t_0, this transformation is deterministic. The forward MAF transform (y → x) converts the residuals into the data sequence; the inverse transform (x → y) decorrelates the sequence. The temporally decorrelated, sparse residuals y^{1:T}_0 pose a simpler modeling task than generating the frames themselves. While the scale parameter σ could also be conditioned on past frames, we did not find a benefit in practice.
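The two directions of the transform in Eq. 5 can be sketched as a pair of mutually inverse functions. The copy-last-frame predictor in the usage example is a hypothetical stand-in for the paper's convolutional RNN μ_φ:

```python
import numpy as np

def frames_to_residuals(frames, predict_mu, sigma=2.0):
    """Inverse MAF transform (x -> y): scaled residuals y_t = (x_t - mu(x_{<t})) / sigma."""
    return [(frames[t] - predict_mu(frames[:t])) / sigma for t in range(1, len(frames))]

def residuals_to_frames(context, residuals, predict_mu, sigma=2.0):
    """Forward MAF transform (y -> x): rebuild frames as x_t = mu(x_{<t}) + sigma * y_t."""
    frames = list(context)
    for y in residuals:
        frames.append(predict_mu(frames) + sigma * y)
    return frames
```

Round-tripping x → y → x recovers the original frames exactly, which is the property that lets the model train on residuals while still defining a distribution over frames.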
The autoregressive transform in Eq. 5 has also been adopted in a VAE model (Marino et al., 2021) as well as in neural video compression architectures (Agustsson et al., 2020; Yang et al., 2020; 2021a;b). These approaches separately compress latent variables that govern the next-frame prediction as well as the frame residuals, thereby achieving state-of-the-art rate-distortion performance on high-resolution video content. While these works focused on compression, this paper focuses on generation.
We now specify the second factor in Eq. 4, the generative process of the residual variable, as

p(y^t | x^{<t}) = p(y^t_N) ∏_{n=1}^N p_θ(y^t_{n−1} | y^t_n, x^{<t}).   (6)

We fix the top-level prior distribution p(y^t_N) to be a multivariate Gaussian with identity covariance. All other denoising factors are conditioned on past frames and involve prediction networks M_θ,

p_θ(y^t_{n−1} | y^t_n, x^{<t}) = N(y^t_{n−1}; M_θ(y^t_n, n, x^{<t}), γβ_n I).   (7)

As in Eq. 2, γ is a hyperparameter. Our goal is to learn θ.
Inference Process. Having specified the generative process, we next specify the inference process conditioned on the observed sequence x^{0:T}:

q(y^{1:T} | x^{0:T}) = ∏_{t=1}^T q_φ(y^t_0 | x^{≤t}) ∏_{n=1}^N q(y^t_n | y^t_{n−1}).   (8)

Since the residual is a deterministic function of the observed and predicted frame, the first factor is deterministic, q_φ(y^t_0 | x^{≤t}) = δ(y^t_0 − (x^t − μ_φ(x^{<t}))/σ). The remaining N factors are identical to Eq. 1 with x_n replaced by y^t_n. Following Nichol & Dhariwal (2021), we use a cosine schedule to define the variances β_n ∈ (0, 1). The architecture is shown in Figure 1b.
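A sketch of the cosine schedule of Nichol & Dhariwal (2021), which defines β_n implicitly through a squared-cosine ᾱ_n; the offset s = 0.008 and the 0.999 clip follow that reference, but treat the exact constants as assumptions here:

```python
import numpy as np

def cosine_beta_schedule(N, s=0.008):
    """Cosine variance schedule (Nichol & Dhariwal, 2021): alpha_bar(n) follows a
    squared cosine, and beta_n = 1 - alpha_bar_n / alpha_bar_{n-1}."""
    steps = np.arange(N + 1)
    f = np.cos(((steps / N) + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]                      # normalize so alpha_bar_0 = 1
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, 0.999)         # clip near n = N, as in the reference
```

Compared with a linear schedule, the cosine schedule destroys information more gradually at early steps, which the reference found to help likelihoods.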
Equations 7 and 8 generalize and improve the previously proposed TimeGrad (Rasul et al., 2021) method. This approach showed promising performance in forecasting time series of comparatively small dimension, such as electricity prices or taxi trajectories, but not video. Besides differences in architecture, this method neither models residuals nor considers the temporal dependency in the posterior, which we identify as a crucial aspect in making the model competitive with strong VAE and GAN baselines (see Section 3.6 for an ablation).

Optimization and Sampling.
In analogy to time-independent diffusion models, we can derive a variational bound that we optimize using stochastic gradient descent. In analogy to the derivation of Eq. 3 (Ho et al., 2020), and using the same definitions of ᾱ_n and ε, this results in

L = E_{x^{0:T}, ε, n} [ ‖ε − f_{θ,φ}(√ᾱ_n y^t_0 + √(1 − ᾱ_n) ε, n, x^{<t})‖² ],  where y^t_0 = (x^t − μ_φ(x^{<t}))/σ.   (9)

We can optimize this function using the reparameterization trick (Kingma & Welling, 2013), i.e., by randomly sampling ε and n and taking stochastic gradients with respect to φ and θ. For a practical scheme involving multiple time steps, we also employ teacher forcing (Kolen & Kremer, 2001). See Algorithm 1 for the detailed training and sampling procedure, where we abbreviate the conditional noise-prediction network as f_{θ,φ}(y^t_n, n, x^{<t}).

Algorithm 1: Training (left) and Video Generation (right)
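The generation half of Algorithm 1 can be sketched as follows, using the standard DDPM ancestral-sampling update with per-step variance β_n (the paper's exact update may differ, e.g., in its choice of γ). Here `predict_mu` and `eps_net` are placeholder callables standing in for the convolutional RNN and the conditional noise-prediction network f_{θ,φ}:

```python
import numpy as np

def generate_frame(context, predict_mu, eps_net, betas, sigma=2.0, rng=None):
    """Generate one frame: ancestrally sample the residual y_0 from the reverse
    diffusion process conditioned on past frames, then form x_t = mu + sigma * y_0."""
    rng = rng or np.random.default_rng()
    N = len(betas)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    mu = predict_mu(context)                   # deterministic next-frame prediction
    y = rng.standard_normal(mu.shape)          # y_N ~ N(0, I)
    for n in range(N, 0, -1):                  # reverse (denoising) process
        eps_hat = eps_net(y, n, context)
        y = (y - betas[n - 1] / np.sqrt(1 - alpha_bar[n - 1]) * eps_hat) \
            / np.sqrt(alphas[n - 1])
        if n > 1:                              # no noise injected at the final step
            y = y + np.sqrt(betas[n - 1]) * rng.standard_normal(mu.shape)
    return mu + sigma * y
```

Full video generation then appends each new frame to the context and repeats, matching the autoregressive structure of Eq. 4.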

Experiments
We compare Residual Video Diffusion (RVD) against five strong baselines, including three GAN-based models and two sequential VAEs. We consider four different video datasets and both probabilistic (CRPS) and perceptual (FVD, LPIPS) metrics, discussed below. Our model achieves a new state of the art in terms of perceptual quality while being comparable with or better than the best-performing sequential VAE in its frame forecasting ability.

Datasets
We consider four video datasets of varying complexity and resolution. Among the simpler datasets with frame dimensions of 64 × 64, we consider the BAIR Robot Pushing dataset (Ebert et al., 2017) and KTH Actions (Schuldt et al., 2004). Among the high-resolution datasets (frame sizes of 128 × 128), we use Cityscape (Cordts et al., 2016), a dataset involving urban street scenes, and a two-dimensional Simulation dataset of turbulent flow of our own making, computed using the Lattice Boltzmann Method (Chirila, 2018). These datasets cover various complexities, resolutions, and types of dynamics.
Preprocessing. For KTH and BAIR, we preprocess the videos as commonly proposed (Denton & Fergus, 2018; Marino et al., 2021). For Cityscape, we download the portion titled leftImg8bit_sequence_trainvaltest from the official website. Each video is a 30-frame sequence from which we randomly select a sub-sequence. All videos are center-cropped and downsampled to 128 × 128. For the simulation dataset, we use an LBM solver to simulate the flow of a fluid (with pre-specified bulk and shear viscosity and rate of flow) interacting with a static object. We extract 10000 frames sampled every 128 ticks, using 8000 for training and 2000 for testing.

Training and Testing Details
The diffusion models are trained with 8 consecutive frames for all datasets, of which the first two frames are used as context frames. We set the batch size to 4 for all high-resolution videos and to 8 for all low-resolution videos. The pixel values of all video frames are normalized to [−1, 1]. The models are optimized using the Adam optimizer with an initial learning rate of 5 × 10⁻⁵, which decays to 2 × 10⁻⁵. All models are trained on an NVIDIA RTX Titan GPU. The diffusion depth is fixed to N = 1600 and the scale term is set to σ = 2. For testing, we use 4 context frames and predict 16 future frames for each video sequence. Wherever applicable, these frames are generated recursively.
Baselines. SVG-LP (Denton & Fergus, 2018) is an established sequential VAE baseline. It leverages recurrent architectures in the encoder, decoder, and prior to capture the dynamics in videos. We adapt the official implementation from the authors while replacing all LSTM layers with ConvLSTM layers, which helps the model scale to different video resolutions. IVRNN (Castrejon et al., 2019) is currently the state-of-the-art video VAE trained end-to-end from scratch. The model improves on SVG by introducing a hierarchy of latent variables. We use the official codebase to train the model. FutureGAN (Aigner & Körner, 2018) relies on an encoder-decoder GAN that uses spatio-temporal 3D convolutions to process video tensors. To make the output more perceptually appealing, the paper employs progressively growing GANs. We use the official codebase to train the model. Retrospective Cycle GAN (Kwon & Park, 2019) employs a single generator that can predict both future and past frames given a context and enforces retrospective cycle constraints. Besides the usual discriminator that identifies fake frames, the method also introduces sequence discriminators to identify sequences containing such fake frames. We used an available third-party implementation. DVD-GAN (Clark et al., 2019) proposes an alternative dual-discriminator architecture for video generation on complex datasets. We also adapt a third-party implementation of the model to conduct our experiment.

Evaluation Metrics
We address two key aspects of determining the quality of generated sequences: perceptual quality and the models' probabilistic forecasting ability. For the former, we adopt FVD (Unterthiner et al., 2019) and LPIPS (Zhang et al., 2018), while the latter is evaluated using a new extension of CRPS (Matheson & Winkler, 1976) that we propose for video evaluation.
Fréchet Video Distance (FVD) compares sample realism by calculating the 2-Wasserstein distance between the ground-truth video distribution and the distribution defined by the generative model. Typically, an I3D network pretrained on an action-recognition dataset is used to extract low-dimensional feature representations, the distributions of which are used in the metric. Learned Perceptual Image Patch Similarity (LPIPS), on the other hand, computes the L2 distance between deep embeddings across all layers of a pretrained network, which are then averaged spatially. The LPIPS score is calculated on individual frames and then averaged.
Apart from realism, it is also important that the generated sequences cover the full distribution of possible outcomes. Given the multi-modality of future video outcomes, such an evaluation is challenging. In this paper, we draw inspiration from the Continuous Ranked Probability Score (CRPS), a forecasting metric and proper scoring rule typically associated with time-series data. We calculate CRPS for each pixel in every generated frame and then average both spatially and temporally. This results in a CRPS metric on marginal (i.e., pixel-level) distributions that scales well to high-dimensional data. To the best of our knowledge, we are the first to suggest this metric for video. We also visualize this quantity spatially.
CRPS measures the agreement of a cumulative distribution function (CDF) F with an observation x,

CRPS(F, x) = ∫_ℝ (F(z) − I{x ≤ z})² dz,

where I is the indicator function. In the context of our evaluation task, F is the CDF that the generative model assigns to a single pixel within a single future frame. CRPS measures how well this distribution matches the empirical CDF of the data, approximated by a single observed sample. The integral can be well approximated by a finite sum since we are dealing with standard 8-bit frames. We approximate F by an empirical CDF F̂(z) = (1/S) Σ_{s=1}^S I{X_s ≤ z}; we stress that this does not require a likelihood model but only a set of S stochastically generated samples X_s ~ F from the model, enabling comparisons across methods.
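The finite-sum approximation described above can be written in a few lines; the function name and single-channel layout are our own choices for illustration:

```python
import numpy as np

def marginal_crps(samples, observed, z_grid=None):
    """Pixel-wise CRPS via a finite sum over 8-bit intensity levels (dz = 1).
    `samples`: array of shape (S, H, W) holding S model-generated frames;
    `observed`: ground-truth frame of shape (H, W). Returns the spatial average."""
    if z_grid is None:
        z_grid = np.arange(256)                # 8-bit intensity levels
    # Empirical CDF of the S samples, evaluated at every grid level z.
    F_hat = (samples[None, ...] <= z_grid[:, None, None, None]).mean(axis=1)  # (Z, H, W)
    H = (observed[None, ...] <= z_grid[:, None, None])                        # (Z, H, W)
    crps_per_pixel = ((F_hat - H) ** 2).sum(axis=0)  # finite sum over z
    return crps_per_pixel.mean()
```

As a sanity check, a degenerate forecaster whose samples all equal the observation scores exactly zero, and scores grow as the sample distribution drifts away from the observed pixel values.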

Qualitative and Quantitative Analysis
Using the perceptual and probabilistic metrics described above, we compare test-set predictions of our video diffusion architecture against a wide range of baselines, which model the underlying data density either explicitly or implicitly.
Table 1 lists all metric scores for our model and the baselines. Our model performs best in all cases in terms of FVD. For LPIPS, our model also performs best on 3 out of 4 datasets. The perceptual performance is also verified visually in Figure 2, where RVD produces sharper frames and shows less blurriness in regions that are hard to predict due to fast motion.
For a more quantitative assessment, we report CRPS scores in Table 1. These support that the proposed model predicts the future with higher accuracy on high-resolution videos than the other models. We also compute CRPS on the basis of individual frames: Figure 3 shows 1/CRPS (higher is better) as a function of the frame index, revealing a monotonically decreasing trend along the time axis. This matches our intuition that long-term predictions worsen over time for all models. Our method performs best in 3 out of 4 cases. We can also resolve this score spatially, as we do in Figure 4, where areas of distributional disagreement within a frame are shown in blue (right). See the supplemental materials for generated videos on the other datasets.

Ablation Studies
We consider two ablations of our model. The first studies the impact of applying the diffusion generative model to residuals as opposed to directly predicting the next frames. The second studies the impact of the number of frames that the model sees during training.

Modeling Residuals vs. Frames
Our proposed method uses a denoising diffusion generative model to generate residuals to a deterministic next-state prediction (see Figure 1b). A natural question is whether this architecture is necessary or whether it could be simplified by directly generating the next frame x^t_0 instead of the residual y^t_0. Since y^t_0 and x^t_0 have equal dimensions, the ablation can be realized by setting μ^t = 0 and σ = 1. To distinguish it from our proposed "Residual Video Diffusion" (RVD), we call this ablation "Video Diffusion" (VD). Note that this ablation can be considered a customized version of TimeGrad (Rasul et al., 2021) applied to video.
Table 2 shows the results. Across all perceptual metrics, the residual model performs better on all datasets. In terms of CRPS, VD performs slightly better on the simpler KTH and BAIR datasets but worse on the more complex Simulation and Cityscape data. We therefore confirm our earlier claim that modeling residuals rather than frames is crucial for obtaining better performance, especially on more complex, high-resolution video.

Influence of Training Sequence Length
We train both our diffusion model and IVRNN on video sequences of varying lengths. As Table 2 reveals, the diffusion model maintains robust performance, showing only a small degradation on significantly shorter sequences. In contrast, IVRNN is more sensitive to the sequence length. We note that in most experiments, we outperform IVRNN even though we trained our model on shorter sequences.

Table 2: Ablation studies on (1) modeling residuals (RVD, proposed) versus future frames (VD) and (2) training with different sequence lengths, where (p + q) denotes p context frames and q future frames for prediction.

Related Work
Our paper combines ideas from video generation, diffusion probabilistic models, and neural video compression. In the following, we discuss related work along these lines.

Video Generation Models
Video prediction can sometimes be treated as a supervised problem, where the focus is often on error metrics such as PSNR and SSIM (Lotter et al., 2017; Byeon et al., 2018; Finn et al., 2016). In contrast, in stochastic generation, the focus is typically on distribution matching as measured by held-out likelihoods or perceptual no-reference metrics.
A large body of video generation research relies on deep sequential latent variable models (Babaeizadeh et al., 2018; Li & Mandt, 2018; Kumar et al., 2019; Unterthiner et al., 2018; Clark et al., 2019). Among the earliest works, Bayer & Osendorfer (2014) and Chung et al. (2015) extended recurrent neural networks to stochastic models by incorporating latent variables. Later work (Denton & Fergus, 2018) extended the sequential VAE by incorporating more expressive priors conditioned on a longer frame context. IVRNN (Castrejon et al., 2019) further enhanced generation quality with a hierarchy of latent variables and is, to our knowledge, currently the best end-to-end trained sequential VAE; it can be further refined by greedy fine-tuning (Wu et al., 2021). Normalizing-flow-based models for video have been proposed but typically suffer from high memory and compute demands (Kumar et al., 2019). Some works (Zhao et al., 2018; Franceschi et al., 2020; Marino, 2021; Marino et al., 2021) explored the use of residuals for improving video generation in sequential VAE settings but did not achieve state-of-the-art results.
Another line of sequential models relies on GANs (Vondrick et al., 2016b; Aigner & Körner, 2018; Kwon & Park, 2019; Wu et al., 2021). While these models do not show blurry artifacts, they tend to suffer from a lack of long-term consistency, as our experiments confirm.

Diffusion Probabilistic Models
DDPMs have recently shown impressive performance in high-fidelity image generation. Sohl-Dickstein et al. (2015) first introduced and motivated this model class from a non-equilibrium thermodynamics perspective. Song & Ermon (2019) proposed a single-network model for score estimation, using annealed Langevin dynamics for sampling. Furthermore, Song et al. (2021c) used stochastic differential equations (related to diffusion processes) to train a network to transform random noise into the data distribution.
DDPM by Ho et al. (2020) is the first instance of a diffusion model scalable to high-resolution images. This work also showed the equivalence of DDPMs and the denoising score-matching methods described above. Subsequent work includes extensions of these models to image super-resolution (Saharia et al., 2021) and hybrids of these models with VAEs (Pandey et al., 2022). Beyond traditional computer vision tasks, diffusion models have proven effective in audio synthesis (Chen et al., 2021; Kong et al., 2021), while Luo & Hu (2021) hybridized normalizing flows and diffusion models to generate 3D point cloud samples.
To the best of our knowledge, TimeGrad (Rasul et al., 2021) is the first sequential diffusion model for time-series forecasting. Its architecture was designed not for video but for traditional lower-dimensional correlated time-series datasets. A concurrent preprint also studies a video diffusion model (Ho et al., 2022). That work is based on an alternative architecture and focuses primarily on perceptual metrics.

Neural Video Compression Models
Video compression models typically employ frame prediction methods optimized to minimize code length and distortion. In recent years, sequential generative models have proven effective on video compression tasks (Han et al., 2019; Yang et al., 2020; Agustsson et al., 2020; Yang et al., 2021a; Lu et al., 2019; Yang et al., 2022). Some of these models show impressive rate-distortion performance with hierarchical structures that separately encode the prediction and the error residual. While compression models have different goals than generative models, both benefit from predictive sequential priors (Yang et al., 2021b). Note, however, that these models are ill-suited for generation.

Discussion
We proposed "Residual Video Diffusion": a new model for stochastic video generation based on denoising diffusion probabilistic models. Our approach uses a denoising process, conditioned on the context vector of a convolutional RNN, to generate a residual to a deterministic next-frame prediction. We showed that such residual prediction yields better results than directly predicting the next frame.
To benchmark our approach, we studied a variety of datasets of different degrees of complexity and pixel resolution, including Cityscape and a physics simulation dataset of turbulent flow. We compared our approach against two state-of-the-art VAE and three GAN baselines in terms of both perceptual and probabilistic forecasting metrics. Our method leads to a new state of the art in perceptual quality while being competitive with or better than state-of-the-art hierarchical VAE and GAN baselines in terms of probabilistic forecasting.
Our results suggest several promising directions and could benefit world-model-based RL approaches as well as neural video codecs.

Limitations
The autoregressive setup of the proposed model allows conditional generation with at least one context frame pre-selected from the test dataset. To achieve unconditional generation of a complete video sequence, we would need an auxiliary image generative model to sample the initial context frames. It is also worth mentioning that we only conduct experiments on single-domain datasets with homogeneous content (e.g., the Cityscape dataset only contains traffic videos recorded by a camera mounted at the front of a car), as training a large model on multi-domain datasets like Kinetics (Smaira et al., 2020) exceeds our limited computing resources. Finally, diffusion probabilistic models tend to be slow to train and sample, which could be accelerated by incorporating DDIM sampling (Song et al., 2021a) or model distillation (Salimans & Ho, 2022).
Potential Negative Impacts. Like other generative models, video generation models pose the danger of being misused to generate deepfakes, running the risk of being used to spread misinformation. Note, however, that a probabilistic video prediction model could also be used for anomaly detection (scoring anomalies by likelihood) and hence may help to detect such forgery.
• Channel Dim refers to the channel dimension of all components in the first downsampling layer of the U-Net-style structure (Ronneberger et al., 2015) used in our approach.
• Denoising/Transform Multipliers are the channel dimension multipliers for subsequent downsampling layers (including the first layer) in the denoising/transform modules. The upsampling layer multipliers follow the reverse sequence.
• Each ResBlock (He et al., 2016) leverages a standard implementation of the ResNet block with a 3 × 3 kernel, LeakyReLU activation, and Group Normalization.
• All ConvGRU blocks (Ballas et al., 2016) use a 3 × 3 kernel to process the temporal information.
• To condition our architecture on the denoising step n, we use positional encodings to encode n and add this encoding to the ResBlocks (as in Figure 5).
Figure 5a shows the overall U-Net-style architecture adopted for the denoising module. It predicts the noise from the noisy residual (flowing through the blue arrows) at an arbitrary n-th step (note that we perform a total of N = 1600 steps in our setup), conditioned on all past context frames (flowing through the green arrows). The figure shows a low-resolution setting, where the number of downsampling and upsampling layers is set to L_denoise = 4. Skip concatenations (shown as red arrows) are performed between the Linear Attention of a downsampling layer and the first ResBlock of the corresponding upsampling layer, as detailed in Figure 5b. Context conditioning is provided by a ConvGRU block within the downsampling layers that generates a context, which is concatenated with the residual processing stream in the second ResBlock module. Additionally, each ResBlock module in either layer type receives a positional encoding indicating the denoising step.
Figure 6a shows the U-Net-style architecture adopted for the transform module, with the number of downsampling and upsampling layers L_transform = 4 in the low-resolution setting. Skip concatenations (shown as red arrows) are performed between the ConvGRU of a downsampling layer and the first ResBlock of the corresponding upsampling layer, as detailed in Figure 6b.

B Deriving the Optimization Objective
The following derivation closely follows Ho et al. (2020) to derive the variational bound objective for our sequential generative model.
As discussed in the main paper, let x^{0:T} denote the observed frames and y^{1:T}_{0:N} the variables associated with the diffusion process. Among them, only y^{1:T}_{1:N} are latent variables, while y^{1:T}_0 are the observed scaled residuals, given by y^t_0 = (x^t − μ_φ(x^{<t}))/σ for t = 1, ..., T. The variational bound is as follows:

L_variational = Σ_{t=1}^T E_q[ −log (p(y^t_N) / q(y^t_N | x^{≤t})) − Σ_{n>1} log (p_θ(y^t_{n−1} | y^t_n, x^{<t}) / q(y^t_{n−1} | y^t_n, x^{≤t})) − log p_θ(y^t_0 | y^t_1, x^{<t}) ].   (10)

The first term in Eq. 10, −log (p(y^t_N) / q(y^t_N | x^{≤t})), tries to match q(y^t_N | x^{≤t}) to the prior p(y^t_N) = N(0, I). The prior has a fixed variance and is centered around zero, while q(y^t_N | x^{≤t}) = N(√ᾱ_N y^t_0, (1 − ᾱ_N) I). Because the variance of q is also fixed, the only effect of the first term is to pull √ᾱ_N y^t_0 towards zero. However, in practice, ᾱ_N = ∏_{i=1}^N (1 − β_i) ≈ 0, and hence the effect of this term is very small. For simplicity, we therefore drop it.
To understand the third term in Eq. 10, −log p_θ(y^t_0 | y^t_1, x^{<t}), we simplify it in Eq. 11. Eq. 11 suggests that the third term matches the diffusion model's output to the frame residual, which is also a special case of the term L_mid we elaborate on below.
Deriving a parameterization for the second term of Eq. 10, L_mid := −Σ_{n>1} log (p_θ(y^t_{n−1} | y^t_n, x^{<t}) / q(y^t_{n−1} | y^t_n, x^{≤t})), we recognize that in expectation it is a KL divergence between two Gaussians with fixed variances: q(y^t_{n−1} | y^t_n, x^{≤t}) = q(y^t_{n−1} | y^t_n, y^t_0) = N(y^t_{n−1}; μ̃_n(y^t_n, y^t_0), β̃_n I), where μ̃_n and β̃_n are the closed-form posterior mean and variance of the diffusion process (Ho et al., 2020).
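For completeness, a sketch of the standard reduction of each such KL term (following Ho et al., 2020; the γβ_n covariance of the denoising factors is taken from Eq. 7, and schedule-dependent constants are absorbed into "const"):

```latex
\mathcal{L}_{\text{mid}}^{(n)}
  = \mathbb{E}_q\!\left[
      D_{\mathrm{KL}}\!\big(
        q(y^t_{n-1}\mid y^t_n, y^t_0)
        \,\big\|\,
        p_\theta(y^t_{n-1}\mid y^t_n, x^{<t})
      \big)\right]
  = \mathbb{E}_q\!\left[
      \frac{\big\lVert \tilde{\mu}_n(y^t_n, y^t_0) - M_\theta(y^t_n, n, x^{<t}) \big\rVert^2}
           {2\gamma\beta_n}\right]
    + \text{const}.
```

Writing y^t_n = √ᾱ_n y^t_0 + √(1 − ᾱ_n) ε and letting M_θ predict the posterior mean through a noise network then turns each term into a weighted version of the noise-prediction loss of Eq. 3.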

Figure 3: Inverse CRPS scores (higher is better) as a function of the future frame index. The best performances are obtained by RVD (proposed) and IVRNN. Scores also monotonically decrease as the predictions worsen over time.

Figure 4: Spatially resolved CRPS scores (right two plots, lower is better). We compare the performance of RVD (proposed) against IVRNN on predicting the 10th future frame of a video from Cityscape. Darker areas point to larger disagreements with respect to the ground truth.
Figure 5: (a) Overview of the autoregressive denoising module using a U-Net-inspired architecture with skip connections. We focus on the example of L_denoise = 4 downsampling and upsampling layers. Each of these layers, DD and DU, is explained in (b) below. All DD and DU layers are furthermore conditioned on a positional encoding (PE) of the denoising step n. (b) Downsampling/upsampling layer design for the autoregressive denoising module. Each arrow corresponds to the arrows with the same color in (a). As in (a), each residual block is conditioned on a positional encoding (PE) of the denoising step n.

Figure 7: Prediction Quality for Simulation data. The top row shows the ground truth, wherein we feed 4 frames as context (from t = 0 to t = 3) and predict the next 16 frames (from t = 4 to t = 19). This is a high-resolution dataset (128 × 128) that we generated using a Lattice Boltzmann solver. It simulates the von Kármán vortex street using the Navier-Stokes equations. A fluid (with pre-specified viscosity) flows through a 2D plane, interacting with a circular obstacle placed at the center left. This leads to the formation of a repeating pattern of swirling vortices, caused by a process known as vortex shedding, which is responsible for the unsteady separation of the flow of a fluid around blunt bodies. Colors indicate the vorticity of the simulated solution.

Figure 8: Prediction Quality for BAIR Robot Pushing. As before, the top row shows the ground truth, wherein we feed 4 frames as context (from t = 0 to t = 3) and predict the next 16 frames (from t = 4 to t = 19). This is a low-resolution dataset (64 × 64) that captures the motion of a robotic arm as it manipulates multiple objects. Temporal consistency and occlusion handling are among the major challenges of this dataset.

Figure 9: Prediction Quality for KTH Actions. As before, the top row shows the ground truth, wherein we feed 4 frames as context (from t = 0 to t = 3), but unlike the other datasets, we only predict the next 12 frames (from t = 4 to t = 15).

Table 3: Configuration table; see Section A for definitions.