Sequence-to-Sequence Video Prediction by Learning Hierarchical Representations

Video prediction, which maps a sequence of past video frames into realistic future frames, is a challenging task because it is difficult to generate realistic frames and to model the coherent relationship between consecutive frames. In this paper, we propose a hierarchical sequence-to-sequence prediction approach to address this challenge. We present an end-to-end trainable architecture in which the frame generator automatically encodes input frames into different levels of latent Convolutional Neural Network (CNN) features, and then recursively generates future frames conditioned on the estimated hierarchical CNN features and the previous prediction. Our design is intended to automatically learn hierarchical representations of video and their temporal dynamics. Convolutional Long Short-Term Memory (ConvLSTM) is used in combination with skip connections so as to separately capture the sequential structure at multiple levels of the feature hierarchy. We adopt Scheduled Sampling for training our recurrent network in order to facilitate convergence and to produce high-quality sequence predictions. We evaluate our method on the Bouncing Balls, Moving MNIST, and KTH human action datasets, and report favorable results as compared to existing methods.


Introduction
The goal of video prediction is to predict future frames given past observations. Being able to model and predict the future is essential to many applications of machine learning and computer vision, such as human pose estimation and recognition [1], pedestrian detection and tracking [2], and weather forecasting [3]. In recent years, a number of video prediction methods have been proposed [4-7], driven by the success of convolutional neural networks (CNNs) [8-12], recurrent neural networks (RNNs) [13], and generative adversarial networks (GANs) [14]. However, video prediction remains a challenging task in computer vision, not only because of the uncertain nature of future events, but also because of the complexity of spatio-temporal dynamics and the high dimensionality of video data.
Recent approaches for video prediction focus on separating representations that exhibit different temporal dynamics. For example, DRNet [15] and MCnet [16] propose to decompose video frames into stationary and time-varying parts, and use RNNs to model the dynamics of the time-varying components. Other recent works [17-19] extract high-level features from input frames, predict the temporal evolution of those high-level features, and use generative models to reconstruct future frames. However, the latent representations to be decomposed are often chosen manually. The aforementioned methods require the estimation of high-level structures, which needs domain-specific knowledge or relies on heuristics, e.g., joint locations in human poses [17] or temporal stationarity of frames [15,16]. Thus, it would be desirable if (i) hierarchical features could be learned automatically from data, and (ii) their temporal dynamics could be modeled without domain-specific knowledge.

Related Work
A number of video prediction approaches [6,7,19,27-33] have been proposed in recent years. The work of [4] introduced a sequence-to-sequence LSTM model, originating from language modeling [34,35], to predict video frames. Shi et al. [3] extended the LSTM to the ConvLSTM by introducing convolutional operations in the recurrent transitions to better capture the spatiotemporal correlations for precipitation nowcasting. Finn et al. [36] found that transforming pixels from previous frames better captures object motion and produces more accurate video predictions, and developed an unsupervised action-conditioned video prediction model that estimates the distribution over pixel motion from previous frames. Lotter et al. [37] extended the ConvLSTM to build neural networks that use predictive coding to predict future frames. MCnet [16] and DRnet [15] proposed decomposition-based approaches in which each frame is decomposed into content and motion components that are fed into separate encoders. Other methods [38-40] model the individual dynamics of decomposed objects: they decompose frames into separate objects and model the motion of each object. The works of [17,18] proposed to use high-level features (e.g., human joint landmarks) to model the dynamics of motion, while the method of [19] focuses on predicting high-level features. PredRNN++ [41] proposed Causal LSTMs with cascaded dual memories to model short-term video dynamics. Pan et al. [42] proposed to generate video frames from a single semantic map. Wang et al. [43] developed a point-to-point network which generates intermediate frames given the start and end frames.
Besides the methods mentioned above, the GAN framework has been increasingly used in video generation and prediction [44-47]. The work of [5] applied adversarial training to a multi-scale architecture to obtain sharp frames for prediction. A dual-motion GAN framework was proposed in [48], fusing future-flow and future-frame predictions. The work of [49] combined the GAN framework with a variational autoencoder to obtain a stochastic adversarial video prediction model. In this paper, we present a novel hierarchical architecture for video prediction which exploits the hierarchical structure of spatiotemporal features.

Methods
Given a sequence of input video frames of length T, our goal is to learn a function that maps the input frames into K realistic future video frames. We denote the ground-truth video frame at time t by x_t, and the sequence of input frames by x_{1:T}. The output frame at time t is denoted by x̂_t for t = 2, ..., T+K. We would like to generate predictions x̂_{T+1:T+K} that well approximate the ground-truth future frames x_{T+1:T+K}.

Model
Overview. The proposed architecture makes use of the hierarchy of representations obtained from cascaded, multi-layer CNNs. Features output from different layers of cascaded CNNs capture visual information at different levels: in object detection, for example, high-level features may represent object identities such as cars, whereas mid-level (resp. low-level) features may represent shapes such as circles (resp. edges). We posit that features at different levels tend to exhibit different temporal dynamics. For example, features representing a walking person as a whole will change slowly over time as compared to those representing the motion of the person's limbs. Separating and capturing features at multiple levels of the hierarchy is useful: capturing high-level features makes the prediction task simpler, whereas capturing low-level features helps generate realistic frames. We thus propose to use multiple ConvLSTMs, each dedicated to modeling the temporal dynamics of one level of CNN features. Our goal is to automatically capture the sequential structure of low-, mid-, and high-level features with ConvLSTMs, and to generate future frames by combining the hierarchical features estimated by the ConvLSTMs. Our Hierarchical Recurrent Predictive Auto-Encoder (HRPAE) consists of three modules: an Encoder CNN, ConvLSTMs, and a Decoder CNN. These modules are applied recurrently over the time horizon. An overview of the 4-level HRPAE is shown in Figure 1.
Hierarchical Recurrent Predictive Auto-Encoder. The HRPAE architecture consists of three parts: a multi-layer Encoder CNN (EncCNN) with pooling layers, four ConvLSTMs, and a multi-layer Decoder CNN (DecCNN) with up-sampling convolutional layers, as shown in Figure 1. The EncCNN extracts hierarchical CNN features of different levels from a frame. Four parallel ConvLSTMs model the dynamics of the hierarchical (i.e., low-, mid-low-, mid-high-, and high-level) features of the input sequence, and the DecCNN generates predictions from the estimated hierarchical features. During training, our model computes context features by first obtaining different levels of features from the EncCNN. Next, each ConvLSTM receives the corresponding level of features and computes its hidden states given the previous hidden states. The DecCNN combines the outputs from the ConvLSTM of the corresponding level and the previous decoder layer, and sequentially propagates the context information. The last layer of the decoder network is a 1 × 1 convolutional layer that produces output consistent with the frame size. The output from the last layer is fed back into the EncCNN to generate the next frame. Following the VGG structure [9], each block of the EncCNN consists of two successive 3 × 3 convolutional layers with Batch Normalization [50] and leaky ReLU activations, with a max pooling layer at the end of each block. The DecCNN can be viewed as a mirrored version of the EncCNN; the difference is that we use bilinear interpolation to upsample the CNN feature maps by a factor of 2. Finally, a 1 × 1 convolutional layer is used as the last block of the DecCNN. The ConvLSTMs are equivalent to standard ConvLSTMs except that we add Layer Normalization [51] after each convolutional layer.
The motivation is that Layer Normalization is effective at stabilizing the hidden state dynamics in recurrent networks, and thus accelerates training.

Skip-and-Update. The HRPAE is inspired by skip-connection architectures such as the "U-Net" structure [20] (in Figure 1a, the "U"-shaped structure is rotated clockwise and appears as a "C" shape). U-Net consists of multi-layer CNNs with skip connections between the same levels of the encoding and decoding layers. Such connections alleviate the "information bottleneck" problem [52] in encoder-decoder networks: skip connections enable the sharing of low-level information between encoding and decoding layers, which is essential for reconstructing details in the target images in image translation tasks. The difference in our network is that the skip connections are intercepted by ConvLSTMs. The features at multiple levels of the hierarchy, passed from the encoding layers of the current frame, are combined with the latent variables of the previous frame; as a result, each ConvLSTM outputs updated information, which is essentially a predicted feature map, to the decoding layer. Sharing such updated information between encoder and decoder layers is useful for prediction, e.g., for preserving the continuity of motion when the next frame is highly correlated with the current one. The skip-and-update mechanism in the prediction network is depicted in Figure 1b. In summary, the main purpose of skip-and-update is (i) to skip the encoder bottleneck so as to retain low- and mid-level information, and (ii) to update this information so as to capture the sequential structure of each level of the hierarchy, and to propagate it to the decoder.
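The two components described above — the per-level ConvLSTM with Layer Normalization and the skip-and-update pass — can be sketched together in NumPy. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the EncCNN/DecCNN convolution blocks are reduced to pooling/upsampling, all levels share one hidden channel count, a sum stands in for the decoder's combination of features (the real model doubles channels per level and uses convolutions to match shapes), and Layer Normalization is applied to each gate pre-activation (the paper states only that it follows each convolutional layer). All function names and weight shapes are illustrative.

```python
import numpy as np

def conv2d_same(x, w):
    """'Same' 2D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_in, H, W = x.shape
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for c in range(c_in):
            for i in range(k):
                for j in range(k):
                    out[o] += w[o, c, i, j] * xp[c, i:i + H, j:j + W]
    return out

def layer_norm(z, eps=1e-5):
    """Normalize one gate pre-activation over channels and spatial positions."""
    return (z - z.mean()) / np.sqrt(z.var() + eps)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, Wx, Wh, b):
    """One ConvLSTM step, with Layer Normalization on each gate pre-activation."""
    z = conv2d_same(x, Wx) + conv2d_same(h, Wh)
    n = h.shape[0]
    i = sigmoid(layer_norm(z[:n]) + b[:n])                    # input gate
    f = sigmoid(layer_norm(z[n:2 * n]) + b[n:2 * n])          # forget gate
    g = np.tanh(layer_norm(z[2 * n:3 * n]) + b[2 * n:3 * n])  # candidate
    o = sigmoid(layer_norm(z[3 * n:]) + b[3 * n:])            # output gate
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def avg_pool2(x):
    """Halve spatial resolution (stand-in for an EncCNN block's max pooling)."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """Double spatial resolution (stand-in for the DecCNN's bilinear upsampling)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def hrpae_step(frame, states, params, levels=4):
    """One skip-and-update pass: encode, update each level with its ConvLSTM,
    then decode by combining upsampled deeper output with the updated skips."""
    feats, f = [], frame
    for _ in range(levels):              # encoder side: one feature map per level
        feats.append(f)
        f = avg_pool2(f)
    updated = []
    for lv in range(levels):             # each skip is intercepted by a ConvLSTM
        h, c = convlstm_step(feats[lv], *states[lv], *params[lv])
        states[lv] = (h, c)
        updated.append(h)
    out = updated[-1]
    for lv in range(levels - 2, -1, -1): # decoder side, deepest to shallowest
        out = upsample2(out) + updated[lv]
    return out
```

Calling `hrpae_step` repeatedly, feeding each output back in as the next frame, mirrors the recurrent use of the three modules over the time horizon, with the ConvLSTM states carrying the per-level temporal context between steps.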

Training
Scheduled Sampling. RNNs are difficult to train, because early mistakes in the sequence prediction can be accumulated over the remaining prediction process. The issue is studied in Scheduled Sampling [53] which proposes to randomly sample either the current prediction or the ground-truth as input to the model during training. This makes the model robust to prediction mistakes during inference, as the model is gradually trained to correct such mistakes [36].
We use Scheduled Sampling to train the HRPAE model as follows. We initialize the sampling probability to 0, and gradually increase it by a constant step δ after each iteration during training. In the prediction process of Figure 1b, we randomly select either the ground-truth or the generated video frame according to this probability, and feed it as input to the HRPAE. As a result, the HRPAE gradually learns how to recover from early mistakes in its predictions as the sampling probability increases, and is able to generate robust predictions during testing.
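A minimal sketch of this schedule, assuming (as in our Bouncing Balls setting) that the probability reaches 1 after 50,000 iterations; the helper names are illustrative.

```python
import random

def sampling_prob(iteration, delta=1.0 / 50000):
    """Probability of feeding back the model's own prediction; grows by a
    constant step delta per iteration and saturates at 1."""
    return min(1.0, iteration * delta)

def choose_input(gt_frame, pred_frame, p, rng=random):
    """Scheduled Sampling: with probability p use the model's previous
    prediction as the next input, otherwise use the ground-truth frame."""
    return pred_frame if rng.random() < p else gt_frame
```

Early in training (p ≈ 0) the model is conditioned on ground truth; late in training (p ≈ 1) it is conditioned on its own predictions, matching the test-time setting.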
Loss function. We not only want to generate realistic frames, but also wish the generated frames to be temporally coherent and to capture the motion of objects fluently. To that end, we minimize the Mean Squared Error (MSE) reconstruction loss, denoted by L_2, which enforces the generated frames to be close to the ground-truth frames in the L_2 sense:

L_2 = Σ_{t=T+1}^{T+K} ||x̂_t − x_t||_2^2,    (1)

where we sum the pixel-wise L_2 frame loss over the predicted frames. Experimental results show that, by training the HRPAE with respect to Equation (1), our model is able to generate sharp and coherent video frames without an additional loss regularizer.
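The loss can be computed directly from the predicted and ground-truth frame sequences; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def l2_loss(predicted, ground_truth):
    """Sum of pixel-wise squared L2 frame losses over the K predicted frames."""
    return sum(float(((p - g) ** 2).sum())
               for p, g in zip(predicted, ground_truth))
```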

Experiments
In our experiments, we use 3 × 3 convolutional kernels, Batch Normalization, and ConvLSTMs. The generator network G is trained to predict K = 10 future frames conditioned on the first T = 10 input frames. We compare HRPAE with state-of-the-art baselines for video prediction, namely DRnet, MCnet, and PredRNN++, by using either the code released by the authors or re-implementations. We use MSE, PSNR, and SSIM [26] as evaluation metrics. PSNR computes the peak signal-to-noise ratio between a ground-truth frame and the corresponding predicted frame; the higher the PSNR, the better the quality of the predicted frame. SSIM measures the structural similarity between a ground-truth frame and the corresponding predicted frame in terms of three visual factors: luminance, contrast, and structure; as with PSNR, higher SSIM scores indicate better predicted image quality.
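For reference, PSNR can be computed from the MSE as follows; this is a standard-definition sketch (assuming pixel values scaled to [0, 1]), not the exact evaluation code used in our experiments.

```python
import numpy as np

def psnr(ground_truth, predicted, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two frames of equal shape."""
    mse = float(np.mean((ground_truth - predicted) ** 2))
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```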
We consider three datasets in order to demonstrate the capability of HRPAE in automatically capturing dynamics of features at multiple levels of hierarchy, but to varying degrees depending on the dataset:

Bouncing Balls
The Bouncing Balls dataset simulates 4 balls bouncing within the frame. It is challenging to predict future frames of bouncing balls because of the underlying physical dynamics and interactions, e.g., the balls can bounce off walls, and may collide with each other and change directions. The purpose of this experiment is to evaluate the capacity of HRPAE of predicting object trajectories in an interactive environment.
Implementation. Following the settings in [23,40], we generated 10,000 training and 1000 testing sequences with a 128 × 128 frame size, where each sequence has 30 frames. The batch size is set to 8, and the default HRPAE (as shown in Figure 1) is trained for 10^5 iterations (about 80 epochs) by randomly sampling 20 frames. The rate of Scheduled Sampling is gradually increased to 1 during the first 50,000 iterations. We trained PredRNN++ under the same experimental settings, except that the first layer of PredRNN++ is expanded to 64 channels; compared to MMNIST, we doubled the patch size in order to maintain the width and height of the input frames, because the frame height and width are 128 pixels.
Results. We present qualitative results with the Bouncing Ball dataset in Figure 2. We observe that HRPAE can successfully predict how balls interact with their environment, e.g., colliding with other balls and bouncing off walls. As shown in Figure 2, HRPAE correctly predicts that the balls will collide, but will not be merged/entangled, unlike the digits in MMNIST dataset. Furthermore, HRPAE is able to predict complex interactions such as the balls bouncing off the wall and then colliding with another ball as shown in the 6th row of Figure 2.
Next we present quantitative results with the Bouncing Balls dataset. As shown in Table 1, HRPAE shows superior performance in the overall MSE and PSNR per frame; this is achieved with a model size less than one-third of that of PredRNN++. Next, we evaluate the prediction accuracy measured in terms of ball center positions. HRPAE only outputs predicted frames and does not explicitly compute the ball positions. Thus, we perform image processing, such as binary erosion [54], to extract the ball center positions, denoted by c_t at time step t, from the predicted frames. In Figure 3a we show the average error in the ball center positions in the L_2 distance. The results show that HRPAE is able to predict the ball positions with high accuracy. The positional error is observed to be within 2% of the frame size on average across 10 frames (less than 2 pixels).
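The center-extraction step can be sketched as follows, assuming a single ball and a binary foreground mask; the real pipeline handles four balls, which additionally requires connected-component labeling (not shown), and the 3 × 3 structuring element is an assumption.

```python
import numpy as np

def binary_erosion(mask):
    """3x3 binary erosion: a pixel survives only if its whole 3x3
    neighborhood is foreground (shrinks each ball toward its core)."""
    H, W = mask.shape
    padded = np.pad(mask, 1, constant_values=False)
    out = np.ones_like(mask)
    for di in range(3):
        for dj in range(3):
            out &= padded[di:di + H, dj:dj + W]
    return out

def center_of_mass(mask):
    """Center position c_t of the (single) foreground blob, as (row, col)."""
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean()])
```

The positional error reported in Figure 3a is then the L_2 distance between the extracted and ground-truth centers.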
Following the experimental settings in [23,40], the normalized velocity of the balls, defined as v_t = c_{t+1} − c_{t−1}, and the differences in L_2 norm between the ground-truth and extracted ball velocities are shown in Figure 3b. The cosine similarity between the ground-truth and predicted velocities is presented in Figure 3c. We observe that HRPAE outperforms PredRNN++ despite its smaller model size; e.g., the average cosine similarity of our method is 0.973, while that of PredRNN++ is 0.962. The results indicate that HRPAE excels at capturing the high-level dynamics of the physical states associated with the motion of the balls in a complex interactive environment.
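The velocity and similarity measures above can be sketched directly (helper names illustrative):

```python
import numpy as np

def velocity(centers, t):
    """Normalized velocity v_t = c_{t+1} - c_{t-1}, given an array of
    per-frame ball center positions (following the definition in [23,40])."""
    return centers[t + 1] - centers[t - 1]

def cosine_similarity(u, v):
    """Cosine similarity between ground-truth and predicted velocity vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```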

Moving MNIST
The MMNIST dataset consists of two randomly moving hand-written digits in a gray-scale frame. MMNIST contains 10,000 sequences of 20 frames. Each sequence shows a pair of digits bouncing around in a 64 × 64 frame. The digits can bounce off the boundary and may overlap with each other. We randomly split the dataset into a fixed set of 9000 sequences for training and validation, and 1000 for testing. For all methods, we use the unmodified/original version of the MMNIST dataset from [4], publicly available at [55] (in contrast, in DRnet [15] and PredRNN++ [41], the authors synthesized their own Moving MNIST datasets for training and testing).
Implementation. The MMNIST image pixel values are re-scaled to the range [0, 1]. All compared models are trained to predict 10 future frames given the first 10 input frames. Our model is trained under similar training settings as PredRNN++. We use the Adam optimizer [56] with a constant learning rate of 0.001 and a batch size of 8. Our model is trained for 80,000 iterations (about 72 training epochs). During the first 50,000 iterations, the probability of Scheduled Sampling is gradually increased to 1.
Ablation Study. The HRPAE consists of multiple levels of CNN blocks and ConvLSTMs. We conduct an ablation study by varying the number of levels, and compare the following three variants.

• The default 4-level HRPAE, as shown in Figure 1a, with output channel sizes of 32, 64, 128, and 256.
• A 3-level HRPAE.
• A 2-level HRPAE.

Results. The experimental results are summarized in Table 2, which shows the comparison with the state-of-the-art models in terms of the MSE, PSNR, and SSIM metrics. We observe that the 4-level HRPAE outperforms DRNet, MCnet, and PredRNN++, and gives the best performance. All three variants of our model show reasonable results on the testing MMNIST dataset. We find that variants with a larger number of hierarchical levels exhibit better performance. For instance, the averaged MSE loss decreases from 66 to 49 as the number of levels increases from 2 to 4. These results demonstrate that it is important to exploit the sequential structure of features at various levels of the hierarchy. The frame-wise average losses over 10 prediction steps are shown in Figure 4. As shown in the figure, in which we exclude the results for DRNet and MCnet, the 4-level HRPAE achieves the best results. Also, the 3-level HRPAE performs comparably to PredRNN++. The boost in performance is greater when the number of levels is increased from 2 to 3 than from 3 to 4. Thus, the 3-level HRPAE achieves a good trade-off between model size and performance; this aspect will be discussed in follow-up research. In Figure 5, we present examples of the predicted testing frames for the discussion of qualitative results. As shown in the figure, even when the digits get entangled with each other, our model is able to make accurate predictions. For example, in the left-side example of Figure 5, the two "2" digits overlap with each other at the beginning of the target future sequence. Our model is able to maintain the shapes of the digits and make accurate predictions of their locations. This demonstrates the strength of our models: even a small one, e.g., the 2-level HRPAE, is able to preserve the distance information between the digits. Similar observations can be made for the right-side example.

Resource Efficiency. Next we discuss the resource efficiency of the models, for which we compare the model sizes. The numbers of trainable parameters are compared in Table 3. The 2nd to 5th rows show the total number of convolutional weights in the corresponding LSTMs. The 6th row shows the total number of convolutional weights. Overall, our models use a smaller number of trainable weights compared with PredRNN++. For instance, our model has only about 74 K convolutional weights in LSTM-1. By contrast, PredRNN++ requires 5921 K convolutional weights in the first layer of its Causal LSTM unit, as shown in the 1st row of Table 3. With only 15% of the weights of PredRNN++, the 3-level HRPAE achieves comparable performance, which demonstrates the effectiveness of our model. Note that our training settings are identical to the default settings of PredRNN++. This efficiency is attributable to the model's capability of automatically learning hierarchical structures, e.g., low-level features for generating the details of the digit shapes, and high-level features for tracking the digit positions.

KTH Action Dataset
The KTH dataset consists of 600 videos of 25 people performing 6 types of actions (boxing, hand-clapping, hand-waving, jogging, running, and walking) in 4 scenarios. Each video has a duration of 4 seconds on average. Implementation. We follow the settings of MCnet [16]. We use persons 1-16 for training and 17-25 for testing. We gather 1525 video clips of various lengths for training and 2819 video clips (30 frames each) for testing.
Each frame is resized to a resolution of 120 × 120, and the pixel values are re-scaled to the range [−1, 1]. We train our network and the baselines by feeding the first 10 frames as input and having the models predict the next 10 frames. The batch size is set to 8, and training is performed for 2 × 10^5 iterations with the Adam optimizer at a constant learning rate of 0.001. The rate of Scheduled Sampling is gradually increased to 1 during the first 10^5 iterations. At testing time, the prediction horizon is extended to 20 future frames. For a fair comparison, we trained MCnet, PredRNN++, and HRPAE under the same experimental settings.
Results. In this section, we present quantitative results comparing HRPAE with the baselines. Table 4 shows the overall prediction results compared with PredRNN++ and MCnet in terms of PSNR and SSIM. We observe that the 4-level HRPAE outperforms PredRNN++ and MCnet. Figure 6 shows the frame-wise losses over 20 prediction steps; our model performs consistently better than the baseline models at all time steps. Figure 7 presents qualitative prediction examples, which show that our model is able to make accurate predictions about human motion trajectories and generates sharp video frames.

Conclusions and Future Work
In this paper, we propose a Hierarchical Recurrent Predictive Auto-Encoder (HRPAE) for video prediction. The idea is to automatically separate different levels of CNN representations, and to use ConvLSTMs to model the temporal dynamics at each level of hierarchical features, combined with skip connections for sharing feature information between the encoder and decoder. Experimental results on the MMNIST, Bouncing Balls, and KTH human action datasets show that HRPAE is able to generate highly accurate predictions in terms of high-level structures, such as locations, velocities, and interactions of objects, as well as low-level details such as pixel values. Our future work includes improving the performance of our model by incorporating the GAN framework for generating realistic frames, and enhancing our model to process higher resolution videos.

Conflicts of Interest:
The authors declare no conflict of interest.