The Importance of Loss Functions for Increasing the Generalization Abilities of a Deep Learning-Based Next Frame Prediction Model for Traffic Scenes

Abstract: This paper analyzes in detail how different loss functions influence the generalization abilities of a deep learning-based next frame prediction model for traffic scenes. Our prediction model is a convolutional long short-term memory (ConvLSTM) network that generates the pixel values of the next frame after having observed the raw pixel values of a sequence of four past frames. We trained the model with 21 combinations of seven loss terms using the Cityscapes Sequences dataset and an identical hyper-parameter setting. The loss terms range from pixel-error based terms to adversarial terms. To assess the generalization abilities of the resulting models, we generated predictions up to 20 time-steps into the future for four datasets of increasing visual distance to the training dataset: KITTI Tracking, BDD100K, UA-DETRAC, and KIT AIS Vehicles. All predicted frames were evaluated quantitatively with both traditional pixel-based evaluation metrics, that is, mean squared error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM), and recent, more advanced, feature-based evaluation metrics, that is, Fréchet inception distance (FID) and learned perceptual image patch similarity (LPIPS). The results show that solely by choosing a different combination of losses, we can boost the prediction performance on new datasets by up to 55%, and by up to 50% for long-term predictions.


Introduction
The ability to predict possible future actions of traffic participants is essential for anticipatory driving. As human drivers, we can make safe decisions in traffic because we automatically anticipate events based on our experience. In autonomous driving scenarios, predictions of probable future events can prove beneficial when used as additional inputs to the system. They can help the system to plan its next action more efficiently and to make better-informed decisions.
One way of realizing this is to extract information from an automatically rendered future frame. However, to extract information that can reliably support an autonomous driving system, the predicted frames have to be of high and stable visual quality. Therefore, the underlying video prediction network must consistently produce high-quality predictions, independent of variations in the input observations. Because of the domain shift between datasets, this is hard to achieve in practice. Yu et al. [1], for example, demonstrated this problem when they tested a semantic segmentation network that was trained on the Cityscapes [2] training subset. It achieved good results on the Cityscapes test subset, but poor results on the BDD100K [1] test subset. One solution to reduce the effects caused by the domain shift between the training examples and the test data is to force the prediction network to learn generic representations of the appearance and motion of objects. An advantage of generic features is that they can quickly be fine-tuned to new scene contents or tasks when used to initialize another network. The learned features of an ideal network for video prediction match both of the following criteria. First, they are generic enough to enable the model to generalize well over a variety of different scene contents. Second, they produce high-quality predictions that preserve details of the observed input scene across multiple prediction steps.
Simple prediction models are already capable of producing next frames of sufficient quality, while still being lightweight and requiring little training time. However, these models often fail for new datasets, and their long-term predictions are generally blurry. In this paper, the focus lies on investigating to what extent it is possible to have both the advantages of such a lightweight model and a good generalization performance. The idea is to find a loss function that forces the model to learn features that meet the criteria described earlier. Our model is a three-layer convolutional long short-term memory (ConvLSTM) [3] network, which predicts the next frame based on four past frames. We train the network using 21 different combinations of seven loss terms. The loss terms range from terms that perform a pixel- or feature-based comparison to adversarial terms. To properly assess their generalization abilities, we evaluate all models on four datasets of increasing visual distance to the training data. Figure 1 shows example next frame predictions for two of these datasets. Further, we generate frames up to 20 time-steps ahead to evaluate the long-term prediction performance of the models. To quantify the model performance, we calculate traditional pixel-based evaluation metrics, as well as more advanced feature-based ones. Our main contribution is the in-depth evaluation of the generalization abilities of a deep learning-based next frame video prediction network. In particular, this work provides detailed analyses of how different loss combinations influence the prediction quality. Based on these analyses, we can draw informed conclusions about the learned representations of the network. Our experiments show that an intelligently designed loss function is crucial, as it helps to stabilize the visual quality of the predictions over a variety of datasets and to improve the training convergence. The best performing loss combination can boost the prediction performance by 55% for the next frame predictions, and by 50% for the 10th frame predictions, in comparison to other loss combinations.

Related Work
The application of deep learning-based video prediction models to traffic scenes has become a popular field of research in recent years. After Ranzato et al. [5] first introduced a general baseline for deep learning-based video prediction in 2014, many approaches started to focus explicitly on predicting traffic scenes.
Although a lot of effort has been put into this topic, generating plausible predictions of high visual quality over a variety of datasets is still not solved, especially not for real-world scenarios. The predictions often lack realism, particularly for distant future frames. The main problem of the video prediction task is that the future is uncertain and the nature of the model output is multi-modal.
One approach to tackle this is to directly address the uncertainty of the prediction output. Bhattacharyya et al. [15], for instance, recently proposed a novel Bayesian formulation that jointly captures the model and the observation uncertainty to anticipate future scene states. Another way to address the uncertainty is to use a generative adversarial network (GAN) [16] as a framework for training [17-21].
A second approach to handle the problem of implausible prediction outputs that lack realism is to reduce the complexity of the problem. Many authors, for example, used data with lower-dimensional image content, such as label images, instead of natural image scenes [12,15,22-25]. Others split the problem into two problems, motion and content prediction, and learn separate representations for the static and dynamic components. For training, these approaches either use a motion prior, such as optical flow information [9,20,23,26-28], as a conditional input or use learned features to represent pixel dynamics [29].
Our approach builds on the idea of utilizing the loss function to force the network to learn feature representations that are more generic and less influenced by dataset-specific content. A loss term that helps the network to map motion to learned object representations rather than solely to individual pixel values could lead to a more realistic foreground and background separation. In this paper, we evaluate the influence of different loss terms on the generalization abilities of a prediction model. For traffic prediction models, it is common to train models on the KITTI [30] dataset and test them on the Caltech Pedestrian [31] dataset, or vice versa [20,29]. However, the domain shift between these datasets is comparatively small. To our knowledge, Luc et al. [22] are the only other authors who directly investigate the generalization abilities of their traffic scene prediction model. Other authors provide detailed ablation studies that focus on the influence of the model complexity [11] or the loss functions [19], but they only measure the in-domain performance of the models.

Methodology for Predicting the Next Frame of Traffic Scenes
We follow a purely data-driven approach to predict the next frame of traffic scenes, assuming that additional input information or costly ground truth labels are not always available. Our approach incorporates a generative model to assist in the development of model-inherent attention mechanisms. This generative model, the prediction network, is based on a ConvLSTM architecture that is commonly used in similar forms as a baseline network [6,7,13], which makes our results easy to transfer. Its convolution operations function as a spatial attention mechanism and its LSTM units as a temporal one. During training, the 3,726,019 parameters of the prediction network are optimized by minimizing the loss function. Our loss functions contain different combinations of non-adversarial and adversarial loss terms. The non-adversarial loss terms are trained in a supervised setting by directly comparing the ground truth frame and the predicted frame. The adversarial loss terms are trained in a self-supervised setting, where a second network, the discriminator network, is used [16]. To efficiently demonstrate the influence of each loss term, we built on low-level processing without costly upstream mechanisms. In the following, we describe the technical details of the next frame prediction model and the different loss terms.

Prediction Network
The resolution-preserving three-layer ConvLSTM [3] network $G$, illustrated in Figure 2, is the core network that generates the predictions. It sequentially processes the frames of an input sequence $z = (x_{t-t_{in}+1}, \ldots, x_t)$ and transforms them into the next future frame $\hat{x} = G(z) = (\hat{x}_{t+1})$ of the sequence. The parameter $t_{in}$ corresponds to the temporal depth of the input sequence. We use three ConvLSTM layers with convolutional kernels of size 5 × 5, a stride of 1, zero-padding of 2, and feature sizes of 128, 64, and 64. Additionally, we use one 2D convolutional layer with a kernel size of 1 × 1, a stride of 1, and zero-padding of 0 to map from feature space to RGB space.
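The layer specification above can be checked against the parameter count of 3,726,019 stated in the methodology overview. The following sketch is only a plausibility check under stated assumptions: each ConvLSTM layer uses a single biased convolution over the concatenated input and hidden state to produce the four gate pre-activations, there are no peephole connections, and the input frames have three (RGB) channels. The helper names `convlstm_params` and `conv2d_params` are hypothetical and not taken from the original implementation.

```python
def convlstm_params(in_channels, hidden_channels, kernel_size):
    """Parameters of one ConvLSTM layer (assumed: no peepholes, biased gates).

    One convolution maps the concatenated input and hidden state to the
    four gate pre-activations (input, forget, cell candidate, output).
    """
    weights = 4 * hidden_channels * (in_channels + hidden_channels) * kernel_size ** 2
    biases = 4 * hidden_channels
    return weights + biases


def conv2d_params(in_channels, out_channels, kernel_size):
    """Parameters of a plain 2D convolution with bias."""
    return out_channels * in_channels * kernel_size ** 2 + out_channels


hidden_sizes = [128, 64, 64]  # feature sizes of the three ConvLSTM layers
channels = 3                  # assumed RGB input frames

total = 0
for hidden in hidden_sizes:
    total += convlstm_params(channels, hidden, kernel_size=5)
    channels = hidden         # the output of one layer feeds the next
total += conv2d_params(channels, 3, kernel_size=1)  # feature space -> RGB

print(total)  # 3726019
```

Under these assumptions the sum reproduces the stated number of parameters exactly, which suggests that the description of the architecture is complete.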

Discriminator Network
When using an adversarial loss term to train the prediction model, we utilize a second network, the discriminator network $D$. For $D$, we adopt the structure of the discriminator network of Aigner and Körner [18] for resolutions of 128 × 128 px, without progressive growing. As an input, $D$ alternately receives the ground truth sequence $x = (x_{t-t_{in}+1}, \ldots, x_{t+1})$, whose frames are taken from the training set, and the sequence $\hat{x} = (z, G(z)) = (x_{t-t_{in}+1}, \ldots, \hat{x}_{t+1})$. The latter sequence consists of the input and output frames of $G$. $D$ outputs a score $s = D(x)$ or $\hat{s} = D(\hat{x})$, respectively. This score rates the given input as being either real or fake. The labels for real sequences are set to $l_{real} = 1$ and the labels for fake sequences to $l_{fake} = 0$. We use weight scaling in $G$ and in $D$ to stabilize the training, as originally proposed by Karras et al. [32].

Loss Terms
The following paragraphs briefly describe the individual terms of our training losses. When combining different loss terms in one loss function, we multiplied each loss term by a loss-specific weight factor $\lambda_{Loss}$. For simplicity, we refer to the ground truth frame $x_{t+1}$ as $x$ and the predicted frame $\hat{x}_{t+1}$ as $\hat{x}$.
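The weighting scheme amounts to a weighted sum of the individual terms. The sketch below illustrates this with a hypothetical helper `combine_losses`; the loss values are made up for the example, and the weight of 0.01 for the perceptual term is the one reported in the training settings. Terms without an explicit weight default to a factor of 1.

```python
def combine_losses(terms, weights=None):
    """Weighted sum of already-computed loss terms.

    terms:   dict mapping a loss name to its scalar value
    weights: dict mapping a loss name to its weight factor (lambda);
             names that are missing default to a weight of 1.0
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())


# Made-up loss values, purely for illustration.
terms = {"L1": 0.05, "Perceptual": 2.0}
total = combine_losses(terms, weights={"Perceptual": 0.01})
print(total)  # approximately 0.07 (0.05 + 0.01 * 2.0)
```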

L1 Loss
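This loss measures the mean absolute error (MAE) between the elements of the ground truth and the predicted frame. It is defined as
$$\mathcal{L}_{L1}(x, \hat{x}) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left| x_{i,j} - \hat{x}_{i,j} \right|,$$
where $n$ and $m$ are the width and height of the frames.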

L2 Loss
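The L2 loss measures the mean squared error (MSE) between each element of the ground truth and the predicted frame. It is defined as
$$\mathcal{L}_{L2}(x, \hat{x}) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( x_{i,j} - \hat{x}_{i,j} \right)^2.$$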
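As a small worked example of the two pixel-error terms, the following NumPy sketch computes the mean absolute error and the mean squared error for a 2 × 2 toy frame. The function names and the toy values are illustrative assumptions; a training implementation would operate on batched image tensors instead.

```python
import numpy as np


def l1_loss(x, x_hat):
    """Mean absolute error over all elements of the frame."""
    return float(np.mean(np.abs(x - x_hat)))


def l2_loss(x, x_hat):
    """Mean squared error over all elements of the frame."""
    return float(np.mean((x - x_hat) ** 2))


# Toy ground truth and prediction with intensities in [0, 1].
x = np.array([[0.0, 0.5], [1.0, 0.25]])
x_hat = np.array([[0.1, 0.5], [0.6, 0.25]])

print(l1_loss(x, x_hat))  # approximately 0.125  = (0.1 + 0.4) / 4
print(l2_loss(x, x_hat))  # approximately 0.0425 = (0.01 + 0.16) / 4
```

The squared error penalizes the single large deviation (0.4) far more strongly than the small one (0.1), which is one reason why the two terms can lead to visibly different predictions.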

BCE Loss
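This loss measures the binary cross-entropy (BCE) between the ground truth and the predicted frame. It is defined as
$$\mathcal{L}_{BCE}(x, \hat{x}) = -\frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ x_{i,j} \log \hat{x}_{i,j} + (1 - x_{i,j}) \log\left(1 - \hat{x}_{i,j}\right) \right],$$
where $x$ takes its values in {0, 1} and $\hat{x}$ in [0, 1].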

Perceptual Loss
The perceptual loss [33] measures the L2 difference between the feature maps of the ground truth and the predicted frame of a specific layer from the VGG-19 [34] network, pre-trained on ImageNet [35]. Contrary to the L1 and L2 losses, which directly measure the image differences in pixel space, the perceptual loss measures the differences in feature space. It is defined as
$$\mathcal{L}_{Perceptual}(x, \hat{x}) = \frac{1}{W_{k,l} H_{k,l}} \sum_{i=1}^{W_{k,l}} \sum_{j=1}^{H_{k,l}} \left( \phi_{k,l}(x)_{i,j} - \phi_{k,l}(\hat{x})_{i,j} \right)^2,$$
where $\phi_{k,l}$ is the feature map obtained before the $k$-th max-pooling layer and after the $l$-th convolutional layer of the pre-trained VGG-19 network. $W_{k,l}$ and $H_{k,l}$ are the width and height dimensions of the feature maps.

GDL Loss
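The image gradient difference loss (GDL) [36] computes the differences between the image gradients of the ground truth and the predicted frame. The GDL loss is given by
$$\mathcal{L}_{GDL}(x, \hat{x}) = \sum_{i,j} \Big| \left| x_{i,j} - x_{i-1,j} \right| - \left| \hat{x}_{i,j} - \hat{x}_{i-1,j} \right| \Big|^{\alpha_{GDL}} + \Big| \left| x_{i,j-1} - x_{i,j} \right| - \left| \hat{x}_{i,j-1} - \hat{x}_{i,j} \right| \Big|^{\alpha_{GDL}},$$
where $\alpha_{GDL} \in \mathbb{N}$ and $\alpha_{GDL} \geq 1$.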
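To illustrate what the GDL term responds to, the sketch below implements a gradient difference loss in NumPy, assuming the formulation of [36]: absolute differences between neighboring pixels are computed for both frames, and the deviations between them are summed over the image. The function name, the summation without normalization, and the toy frames are assumptions made for this example.

```python
import numpy as np


def gdl_loss(x, x_hat, alpha=1):
    """Gradient difference loss for single-channel frames (assumed form of [36])."""
    # Absolute intensity changes between vertically adjacent pixels.
    gt_dy = np.abs(x[1:, :] - x[:-1, :])
    pr_dy = np.abs(x_hat[1:, :] - x_hat[:-1, :])
    # Absolute intensity changes between horizontally adjacent pixels.
    gt_dx = np.abs(x[:, 1:] - x[:, :-1])
    pr_dx = np.abs(x_hat[:, 1:] - x_hat[:, :-1])
    vertical = np.sum(np.abs(gt_dy - pr_dy) ** alpha)
    horizontal = np.sum(np.abs(gt_dx - pr_dx) ** alpha)
    return float(vertical + horizontal)


# A sharp vertical edge versus a completely flat (blurred) prediction.
sharp = np.array([[0.0, 1.0], [0.0, 1.0]])
flat = np.full((2, 2), 0.5)

print(gdl_loss(sharp, flat))   # 2.0: both rows miss a horizontal gradient of 1
print(gdl_loss(sharp, sharp))  # 0.0: identical gradients
```

The flat prediction is the minimizer of a per-pixel error when the edge position is uncertain, yet it is penalized by the gradient term, which is why such a term is typically combined with a pixel-error loss to counteract blur.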

GAN Loss
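This loss term is the standard loss function of the GAN [16]. It is based on the Jensen-Shannon divergence between the distributions of the ground truth frames and the predicted frames. The loss function to train $D$ is
$$\mathcal{L}_{GAN}^{D} = -\mathbb{E}_{x}\left[\log D(x)\right] - \mathbb{E}_{\hat{x}}\left[\log\left(1 - D(\hat{x})\right)\right],$$
and the loss function to train $G$ is
$$\mathcal{L}_{GAN}^{G} = \mathbb{E}_{\hat{x}}\left[\log\left(1 - D(\hat{x})\right)\right].$$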

WGAN-gp Loss with Epsilon Penalty
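This loss consists of the Wasserstein GAN with gradient penalty (WGAN-gp) [37] loss and an epsilon penalty [32] term that prevents the loss from drifting. It is based on measuring the Wasserstein distance between the distributions of the ground truth frames and the predicted frames. The WGAN-gp loss with epsilon penalty for optimizing $D$ is defined as
$$\mathcal{L}_{WGAN}^{D} = \mathbb{E}_{\hat{x} \sim P_g}\left[D(\hat{x})\right] - \mathbb{E}_{x \sim P_r}\left[D(x)\right] + \lambda_{gp}\, \mathbb{E}_{\tilde{x} \sim P_{\tilde{x}}}\left[\left(\lVert \nabla_{\tilde{x}} D(\tilde{x}) \rVert_2 - 1\right)^2\right] + \epsilon\, \mathbb{E}_{x \sim P_r}\left[D(x)^2\right].$$
As described by Gulrajani et al. [37], $P_r$ is the data distribution, $P_g$ is the model distribution, implicitly defined by $\hat{x} = G(z)$, $\hat{x} \sim p(\hat{x})$, $\epsilon$ is the epsilon-penalty coefficient, and $\lambda_{gp}$ is the gradient-penalty coefficient. $P_{\tilde{x}}$ is implicitly defined, sampling uniformly along straight lines between pairs of points sampled from the data distribution $P_r$ and the $G$ distribution $P_g$. The WGAN-gp loss for optimizing $G$ is defined as
$$\mathcal{L}_{WGAN}^{G} = -\mathbb{E}_{\hat{x} \sim P_g}\left[D(\hat{x})\right].$$
The penalty coefficients of the WGAN-gp loss with epsilon penalty are $\lambda_{gp} = 10$ and $\epsilon = 0.001$, as proposed by Karras et al. [32].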

Experiments and Evaluation
To analyze the influence of each loss term on the model performance, we conducted experiments on 5 different datasets and trained our model on 21 different loss combinations. The next subsections contain details about the training settings, the datasets, and the analyses of the quantitative and qualitative results.

Training Settings
We trained the model described in Section 3.1 using the 21 losses listed in Table 1. To weight the loss terms in a combined loss function, we set $\lambda_{GDL} = 0.0001$ and $\lambda_{Perceptual} = 0.01$. The other weight factors were set to 1. When combining the perceptual loss term solely with the GDL term, we set $\lambda_{GDL} = 0.01$ and $\lambda_{Perceptual} = 1$. These values were chosen heuristically to bring the individual loss terms into a similar range. For the GDL loss, we set $\alpha_{GDL} = 1$ when combining it with an L1 term, and $\alpha_{GDL} = 2$ when combining it with an L2 loss term. To train the networks with adversarial loss terms, we applied weight scaling in $G$ and $D$, as described by Karras et al. [32]. All 21 prediction models were trained to predict the next frame after receiving four past frames as an input. We trained each model on the full Cityscapes Sequences [2] dataset with a batch size of 4 and a fixed random seed. As an optimization algorithm, we used the Adam optimizer [38] with $\beta_1 = 0.0$ and $\beta_2 = 0.99$. The initial learning rate was $l = 0.001$. Every 10th epoch, we decayed the learning rate by a heuristically set factor of 0.87. In total, all networks were trained for 30 epochs. Intermediate states were saved every 5th epoch for evaluation purposes. We trained the networks on an Asus GeForce RTX 2080 Ti GPU with 11 GB of RAM, except for most of the networks with a GAN loss term, which we trained on an NVIDIA Titan X Pascal with 12 GB of RAM. The code was implemented in PyTorch.
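The learning rate schedule described above is a step decay. The following sketch reproduces it in plain Python, assuming zero-based epoch indices and that the factor is applied once after every block of 10 completed epochs; the exact epoch at which the decay takes effect is not stated in the text. In PyTorch, a schedule of this kind is typically expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.87)`, although the original code may have realized it differently.

```python
def learning_rate(epoch, initial_lr=0.001, decay=0.87, step=10):
    """Step decay: multiply the rate by `decay` after every `step` epochs.

    `epoch` is assumed to be zero-based.
    """
    return initial_lr * decay ** (epoch // step)


# Learning rate for each of the 30 training epochs.
schedule = [learning_rate(epoch) for epoch in range(30)]

print(schedule[0])   # 0.001 for epochs 0-9
print(schedule[10])  # approximately 0.00087 for epochs 10-19
print(schedule[29])  # approximately 0.0007569 for epochs 20-29
```

With only two decay steps over 30 epochs, the learning rate in the last block is still about 76% of its initial value, so the schedule is comparatively mild.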

Datasets
We conducted experiments on five different datasets. For training, we used the full Cityscapes Sequences [2] dataset. For testing, we used four other datasets with an increasing domain shift to the training dataset: KITTI Tracking [30], BDD100K [1], UA-DETRAC [4], and KIT AIS Vehicles [39]. We chose these test datasets to investigate to what extent each model can generalize to new scenes. All frames for training and testing were retrieved by first center cropping and then resizing them bilinearly from their original resolution to a resolution of 128 × 128 px. Figure 3 shows example images from every dataset. The following paragraphs describe the datasets and the specifications of our customized subsets that we used to calculate and compare the evaluation metrics.

Cityscapes
The Cityscapes Sequences dataset consists of 5000 videos, that is, 2975 for training, 500 for validation, and 1525 for testing. These 8-bit color videos were recorded with a frame rate of 17 fps and an original resolution of 2048 × 1024 px in 50 different cities, primarily in Germany. The videos mainly show urban street scenes and a few different highway scenarios in similar weather and time conditions, that is, sunny, partly cloudy, and cloudy during daytime in spring, summer, and fall. All videos are 30 frames long. We used the full 5000 videos of the dataset for training. Since we trained our networks to predict the next frame based on four past frames, we had 30,000 training sequences in total.
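The total of 30,000 training sequences is consistent with cutting each 30-frame video into non-overlapping sequences of five frames, that is, four input frames plus one target frame. This split is an inference from the reported numbers rather than a stated detail; the sketch below, with a hypothetical helper `count_sequences`, contrasts it with a sliding window of stride one.

```python
def count_sequences(num_videos, frames_per_video, seq_len, stride):
    """Number of fixed-length sequences that can be cut from all videos."""
    per_video = (frames_per_video - seq_len) // stride + 1
    return num_videos * per_video


seq_len = 4 + 1  # four observed frames plus the ground truth next frame

non_overlapping = count_sequences(5000, 30, seq_len, stride=seq_len)
sliding = count_sequences(5000, 30, seq_len, stride=1)

print(non_overlapping)  # 30000, matching the reported number
print(sliding)          # 130000, which does not match
```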

KITTI Tracking
The KITTI Tracking sequences were recorded in Karlsruhe, Germany. The dataset contains 21 training and 29 testing videos, all of a varying sequence length and with an original resolution of 1392 × 512 px. The videos were captured at a frame rate of 10 fps, which results in larger motion differences between frames compared to our training examples. Otherwise, the displayed scenes are similar to those of Cityscapes but more evenly distributed between rural, urban street, and highway scenarios. The weather and time conditions match those of Cityscapes. For testing, we used the test split as provided by Geiger et al. [30]. To calculate the evaluation metrics and for comparison with the other datasets, we built a subset of 100 sequences using 24-frame-long snippets that were evenly distributed across the test sequences.

BDD100K
The complete BDD100K dataset consists of 100,000 videos with an original resolution of 1280 × 720 px. All videos are 40 seconds long and captured at 30 fps in either New York, Berkeley, San Francisco, or the Bay Area. The test subset of 20,000 videos, provided by Yu et al. [1], contains 20 splits with 1000 videos each. For testing and evaluating, we took the first split of this test set. To roughly match the Cityscapes frame rate, we sub-sampled it to 15 fps. We then used the first 24 frames of every 10th sequence of the resulting split to build a customized test set of 100 sequences. The BDD100K videos were recorded under six different weather conditions, that is, clear, partly cloudy, overcast, rainy, snowy, and foggy, and at three different times of day, that is, day, night, and dusk/dawn. This means the BDD100K scenes display completely different locations and a greater variety of weather and lighting conditions compared to the Cityscapes scenes.
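The construction of the customized BDD100K subset can be sketched as index arithmetic. The example assumes that sub-sampling from 30 fps to 15 fps keeps every second frame starting at the first one, and that "every 10th sequence" starts with the first video of the split; neither detail is specified in the text, and the function name is hypothetical.

```python
def subsample_indices(num_frames, src_fps=30, dst_fps=15, length=24):
    """Indices of the first `length` frames after temporal sub-sampling."""
    step = src_fps // dst_fps  # 2: keep every second frame
    return list(range(0, num_frames, step))[:length]


video_ids = list(range(1000))  # one test split with 1000 videos
selected_videos = video_ids[::10]  # every 10th video (assumed offset 0)

frames = subsample_indices(num_frames=40 * 30)  # 40 s recorded at 30 fps

print(len(selected_videos))   # 100
print(frames[:4], frames[-1]) # [0, 2, 4, 6] 46
```

Under these assumptions, each of the 100 test sequences covers the first 1.6 seconds of its source video (24 frames at 15 fps).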

UA-DETRAC
The full UA-DETRAC dataset consists of 100 videos, 60 for training and 40 for testing, all of a varying sequence length and an original resolution of 960 × 540 px. The videos were captured at a frame rate of 25 fps at 24 different static locations in Beijing and Tianjin, China. The recorded scenes contain surveillance views of residential roads, highways, tunnels, gas stations, and a parking lot during daytime, nighttime, and different weather conditions. As a result, the UA-DETRAC videos not only show different scene contents compared to the training examples, but they also have different viewing angles and do not display any ego-motion. Additionally, the higher UA-DETRAC frame rate (25 fps compared to 17 fps for Cityscapes) causes smaller differences in object motion between frames. To test and evaluate our models, we built a customized subset of 100 evenly distributed sequences of length 24 frames from the original test split.

KIT AIS Vehicles
The KIT AIS Vehicles dataset [39] consists of a single training split, which contains 9 sequences of aerial images with varying sequence lengths. The videos display different highway, crossroads, and street scenarios. All sequences are of varying original frame resolutions, captured at 2 fps from varying heights above the ground during similar weather and time conditions. In comparison to Cityscapes, this is the most challenging dataset. The viewing angle, the object motions, and the scene contents differ completely. We used the whole dataset, as provided by Schmidt [39], for testing. Due to insufficient sequence lengths, we predicted 10 future frames based on four input frames for this dataset. This resulted in a customized subset of 24 sequences for evaluation.

Evaluation Metrics
There is no consistent evaluation scheme for quantitatively rating the performance of video prediction models. Traditionally, most authors use pixel value-based image comparison metrics, such as the mean squared error (MSE), the peak signal-to-noise ratio (PSNR), and the structural similarity index (SSIM) [40]. Although these metrics are very common for comparing video prediction approaches, they have one major drawback: their values often do not correlate well with the human perception of visual image quality. To address this problem, we calculate two more recent evaluation metrics, the Fréchet inception distance (FID) [41] and the learned perceptual image patch similarity (LPIPS) [42], in addition to the MSE, the PSNR, and the SSIM. These metrics have been shown to correlate better with human judgments of visual image quality. In contrast to the traditional metrics, which directly compare the pixel values of two images, the FID and the LPIPS values measure the distance between two images not in pixel space, but in feature space. Their values are obtained based on the feature activations of one or more layers of a second, pre-trained neural network. To calculate the FID and LPIPS values, we followed the procedures described by Heusel et al. [41] and Zhang et al. [42]. For the FID metric, we used an InceptionV3 [43] network, pre-trained on ImageNet [35]. For the LPIPS metric, we used the pre-trained network provided by Zhang et al. [42].
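The two simplest pixel-based metrics are directly related: the PSNR is a logarithmic rescaling of the MSE relative to the maximum possible pixel value. The NumPy sketch below shows this relation for frames with intensities in [0, 1]; the function names and the value range are assumptions for the example and do not describe the evaluation code used for the reported results.

```python
import numpy as np


def mse(x, x_hat):
    """Mean squared error between two frames."""
    return float(np.mean((x - x_hat) ** 2))


def psnr(x, x_hat, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    err = mse(x, x_hat)
    if err == 0.0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_val ** 2 / err))


x = np.zeros((4, 4))
x_hat = np.full((4, 4), 0.1)  # constant error of 0.1 per pixel

print(mse(x, x_hat))   # approximately 0.01
print(psnr(x, x_hat))  # approximately 20.0 dB
```

Because both values depend only on per-pixel differences, a blurry prediction and a sharp but slightly shifted prediction can receive similar scores, which illustrates why feature-based metrics are calculated in addition.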

Qualitative and Quantitative Analyses
To properly assess the generalization abilities of a prediction model, it is important to evaluate its capability to generalize both to new datasets and a higher number of prediction steps. Therefore, we generated long-term predictions for four test datasets with every model. During testing, we let the models predict 20 future frames for the KITTI Tracking, the BDD100K, and the UA-DETRAC dataset and 10 future frames for the KIT AIS Vehicles dataset, because of insufficient sequence lengths in the dataset. To generate the long-term predictions, each predicted next frame of the model was recursively fed back in as an input. This means the long-term predictions during test time were based on only four real observations. Figure 4 shows the qualitative results of these predictions by three selected models of different loss combinations for all four datasets. The qualitative results of all loss combinations can be found in Appendix B. Additional videos and images are included in the supplementary material.
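The recursive prediction scheme can be sketched as a rollout loop. The example assumes a sliding window that always holds the four most recent frames, so that each prediction replaces the oldest observation; whether the original implementation re-feeds a window in this way or carries the recurrent state of the ConvLSTM forward is not specified. The names `rollout` and `toy_model` are hypothetical, and the toy model extrapolates scalar values linearly instead of predicting images.

```python
from collections import deque


def rollout(model, observed, num_steps):
    """Generate `num_steps` future frames by feeding predictions back in.

    model:    callable that maps a list of frames to the next frame
    observed: the real frames the rollout starts from
    """
    window = deque(observed, maxlen=len(observed))
    predictions = []
    for _ in range(num_steps):
        next_frame = model(list(window))
        predictions.append(next_frame)
        window.append(next_frame)  # the oldest frame drops out of the window
    return predictions


def toy_model(frames):
    """Stand-in predictor: linear extrapolation of the last two values."""
    return frames[-1] + (frames[-1] - frames[-2])


preds = rollout(toy_model, [0, 1, 2, 3], num_steps=20)
print(preds[0], preds[-1], len(preds))  # 4 23 20
```

After four steps, the window contains only predicted frames, so any error of the model is propagated and can accumulate over the remaining steps. This is why long-term predictions are a demanding test of the learned representations.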
For the quantitative evaluation of the models, we calculated the metrics described in Section 4.3. To calculate these quantitative measures, if not otherwise stated, we used our customized subsets, as described in Section 4.2. They each contain 100 sequences of length 24 frames, except for KIT AIS Vehicles, where only 24 sequences of length 14 frames were available.

Discussion and Conclusions
In this paper, we have shown that an intelligently designed loss function is essential for a prediction model to generate plausible next frames of traffic scenes. An optimal choice of the training loss leads to both good test performance and high generalization abilities of the model. We provided qualitative and quantitative evaluations on the influence of the individual loss terms. These evaluations strongly suggest that the combination of loss terms is particularly important for enabling the network to learn generic representations of object motion and appearance.
For our experiments, we used a ConvLSTM video prediction network that was trained on the Cityscapes dataset to predict the next frame after observing a sequence of four frames. In total, we trained 21 different combinations of seven individual loss terms. To draw informed conclusions about the generalization capabilities, we tested the resulting models on four different datasets of increasing visual distance to the training dataset. During testing, we generated long-term predictions for every dataset. After evaluating the predictions qualitatively, we could see great performance differences between the different loss combinations, especially when inspecting the long-term prediction results. The best performing model was the model that was trained on a combination of the perceptual and the L1 loss term. This model preserved object-specific features such as color and detailed content of the input scene across multiple prediction steps for all datasets. Models that were solely trained with a per-pixel error loss or an adversarial loss often averaged out such features, leading to a quick loss of detail after a few prediction steps. These predictions, therefore, tended to get blurry earlier. The predictions of the best performing model, on the other hand, remained sharp for a higher number of prediction steps. Additionally, the best performing model was able to identify moving objects and correctly propagate motion patterns across several time-steps. Interestingly, this was even the case for the KIT AIS Dataset, although it was recorded at a completely different frame rate and from a different viewing angle than the training data. For the quantitative evaluation of the models, we calculated three traditional pixel-based image comparison metrics, the MSE, the PSNR, and the SSIM. In addition to those metrics, we calculated two more advanced feature-based image comparison metrics, the FID, and the LPIPS. 
These feature-based evaluation metrics confirmed our visual impression of the qualitative results. The best performing loss combination generated next frame predictions up to 55% better and 10th frame predictions up to 50% better compared to the predictions of models trained with other loss combinations. These numbers were obtained from the LPIPS values.
Our experiments verify that an intelligent combination of loss terms is essential. It enables even a very lightweight model to reliably produce high-quality predictions over a variety of datasets. The evaluations suggest that the well-performing loss functions, in contrast to the other ones, helped the model to learn generic representations of the appearance and motion of objects and how to propagate these features correctly across time.

Acknowledgments:
The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan X Pascal GPU, used for this research.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Extended Quantitative Results
The figures in the following paragraphs display the mean quantitative evaluation values per predicted frame for all four test datasets. To obtain these values, we used the models that were trained on Cityscapes for 20 epochs. All models were trained to predict the next frame based on four past frames. The results for the non-adversarial and the adversarial loss combinations are visualized separately for each dataset.
[Figure panels: (a) MSE; horizontal axis: number of the predicted frame.]

Appendix B. Extended Qualitative Results
Figure A5. Qualitative results for the KITTI Tracking test split. To generate these images, we used the models that were trained on Cityscapes for 20 epochs. The models were trained to predict the next frame based on four past frames. All images are included in the supplementary material. (The images are best viewed on screen.)
Figure A6. Qualitative results for the BDD100K test split. To generate these images, we used the models that were trained on Cityscapes for 20 epochs. The models were trained to predict the next frame based on four past frames. All images are included in the supplementary material. (The images are best viewed on screen.)
Figure A7. Qualitative results for the UA-DETRAC test split. To generate these images, we used the models that were trained on Cityscapes for 20 epochs. The models were trained to predict the next frame based on four past frames. All images are included in the supplementary material. (The images are best viewed on screen.)
Figure A8. Qualitative results for the KIT AIS Vehicles test split. To generate these images, we used the models that were trained on Cityscapes for 20 epochs. The models were trained to predict the next frame based on four past frames. All images are included in the supplementary material. (The images are best viewed on screen.)