Learning Latent Representation of Freeway Traffic Situations from Occupancy Grid Pictures Using Variational Autoencoder

Abstract: Several problems can be encountered in the design of autonomous vehicles. Their software is organized into three main layers: perception, planning, and actuation. The planning layer deals with short- and long-term situation prediction, which is crucial for intelligent vehicles. Whatever method is used to make forecasts, the vehicles' dynamic environment must be processed for accurate long-term forecasting. In the present article, a method is proposed to preprocess the dynamic environment in a freeway traffic situation. The method uses the structured data of surrounding vehicles and transforms it into an occupancy grid, which is processed by a Convolutional Variational Autoencoder (CVAE). The grids (2048 pixels) are compressed to a 64-dimensional latent vector by the encoder and reconstructed by the decoder. The output pixel intensities are interpreted as the probabilities that the corresponding field is occupied by a vehicle. The benefit of this method is that it preprocesses the structured data of the dynamic environment and represents it in a lower-dimensional vector that can be used in any further task built on it. This representation is not handmade or heuristic but is extracted from the patterns in the database in an unsupervised way.


Introduction
Hierarchical design is the most common approach when designing software for self-driving vehicles. According to this approach, one can think about three main layers: perception, planning, and actuation [1]. Perception combines sensor information to form a model of the surrounding world. In the planning layer, the vehicle model and the environment model are used to plan a route, a trajectory, and maneuvers, taking into account the specified driving style, destination, and other input instructions. The actuation layer is responsible for the implementation of the plans and gives orders to the actuators. The operation of the actuation and perception layers is outside the scope of this article. The focus is on the second layer, planning, and more specifically on behavior prediction. Autonomous vehicles need to make decisions continuously about their maneuvers so that they navigate safely. These decisions can be individual or cooperative, depending on the traffic system [2].
For the planning layer to function correctly, i.e., create safe, accurate, feasible plans, it must be able to interpret and predict the latent intentions or behaviors of drivers of surrounding vehicles [3]. This task can be approached by trajectory prediction, by maneuver detection and prediction, or by combining these aspects via multi-modal trajectory prediction. As highway transport is an interactive system, the decisions of vehicle drivers are influenced by all other drivers, and since common road rules must be followed, it can be said to be a strongly coupled system. This means that the intentions of the agents in a traffic situation, or the trajectory itself, can be predicted by using information indicating the inner states of each agent. If these are neglected, one can make only limited statements about the future states [4]. In such a strongly coupled system, the past states of a given agent do not encode the internal states of the traffic situation that are required for prediction. In other words, we cannot infer the future from the past trajectory of a vehicle without environmental information. In light of these findings, behavior prediction is very challenging and attracts great attention; the following examples are not exhaustive. Trajectory patterns can be classified or predicted by Gaussian process regression [5]. Another well-known model is a hierarchical dynamic Bayesian network, which is utilized for predicting future behavior patterns [6].
According to one possible categorization, motion prediction systems can be classified as physics-based, maneuver-based, or interaction-aware models [7]. The simplest are physics-based models, for they consider only the laws of physics behind the motion of the vehicles. One gets a more advanced model if the drivers' intentions are also considered. These are the maneuver-based models. Interaction-aware models aim to recognize and consider all the inter-dependencies between the vehicles in a traffic situation. In this article, a method is introduced in which the effect of the agents in the traffic situation is considered through an occupancy grid. Such an image has a large dimensionality, so some compression is required. The proven Variational Autoencoder scheme for trajectory copying [8] could only be used for the prediction task if environmental information were included in the input data, from which the decoder part can infer suitable predictions about the possible future outcomes. The point of this is to supplement the context vector formed from the trajectory with a "situation" vector that encodes the environmental information. This is how the encoder and decoder are trained to provide the future trajectory at the output. Kim et al. propose a model that predicts the future location of vehicles with 0.5 s, 1 s, and 2 s time horizons on an occupancy grid map [9]. In contrast, the goal of this article is only to filter out the most valuable and concise information possible from the location of surrounding vehicles.
The situation vector can be determined in several ways. One approach is to use structured data about the traffic environment. A novel approach is presented in [10] for predicting the movement of vehicles on the freeway. The output of the model is a multi-modal distribution of trajectories. The input is constructed from the longitudinal and lateral positions of the self (ego) vehicle, appended by the relative positions of six surrounding vehicles: two cars in front of and behind the ego and two more pairs in the adjacent lanes. This input tensor is organized in a specific structured order so that the encoder can learn the proper context representation. Note that the input vectors are embedded using a 64-unit dense layer before the Long Short-Term Memory (LSTM) layer, so the situation encoding representation is 64-dimensional, as in the present article. A very similar representation can be found in the research of Feng et al. [11]. They propose a Conditional Variational Autoencoder for intention-based trajectory prediction. The inputs are organized similarly to the previous example. A structured historical observation sequence is taken, containing the displacement and velocity of the ego and five surrounding vehicles, and the longitudinal distance between the ego and each other vehicle. The difference is that the rear vehicle is not included in the structured input since the front vehicle takes no responsibility for the rear vehicle according to traffic regulations.
Agents adjust their path and trajectory by reasoning about their neighbors' movement, and other surrounding agents influence these neighbor agents. Every agent in a freeway may have different neighbors, and, in a crowded traffic situation, the number of correlations will be so high that it cannot be calculated. This is why social pooling is introduced [12] to combine the information from all neighboring states. It was used to handle human trajectory prediction in crowded places. This idea was improved for multi-modal vehicle trajectory prediction in [13] and the authors proposed convolutional social pooling instead of fully connected layers to social-tensors of the LSTM states encoding the historical motion of surrounding vehicles.
Another important way to consider the environment is the occupancy grid. Hoerman et al. introduce a deep convolutional network trained by dynamic occupancy grid maps [14]. The learning-based situation prediction approach utilizes one single neural network. The input is a time series of the data from multiple sensors. The network can segment the static and dynamic areas of the perceivable scene. Cui et al. propose a network architecture that handles the neighbor actors with Convolutional Neural Networks (CNN) [15]. The essence of the method is that they rasterize a birds-eye-view raster image encoding the map of the traffic environment of a given actor. A CNN network processes it, and a raster feature vector is created. The actor's state vector is then concatenated to it and further processed by fully connected layers. Thus, it yields a context vector encoding the actor's state and the surrounding vehicles.
Occupancy grid maps are often used in navigation tasks, without the need for completeness, either for GPS data processing or for the movement of automotive robots or vehicles. Nawaz et al. [16] propose a bidirectional recurrent autoencoder to generate the missing point of a trajectory from GPS data over an occupancy grid map. Small automotive robots need a representation of the environment [17,18] and occupancy grids are suitable for that purpose. Most importantly, occupancy grids are used for long-term prediction of time evolution in a municipal complex traffic system [19]. Deep CNN networks utilize long short-term memories to seize the data's static and dynamic features and focus on the dynamic part for prediction. Park et al. attempt to predict vehicles' future trajectories over an occupancy grid [20]. In the article, an autoencoder architecture analyzes the patterns of past trajectories using LSTM. In the work of Lu et al. [21], a Variational Autoencoder (VAE) neural network is used to encode the front-view visual information of the driving scene and to decode it into a bird-eye view semantic-metric occupancy grid that outperforms the deterministic mapping approach with the flat-plane assumption by more than 12% mean intersection-over-union. These research directions and results suggest that it is worthwhile to extract dynamic information influencing driver decisions from occupancy grids with a VAE neural network.
This paper presents a 2D Convolutional Variational Autoencoder application to copy and compress occupancy grids, thus constructing a situation vector. Deep neural networks have received great attention in several fields since they showed promising performance for various tasks in machine learning [22]. Section 4.1 covers the data preparation and describes how the dataset of grid images is constructed. The methodology and mathematical derivation are described in Section 4.2, and the details of the training in Section 4.3. The discussion of the training results and the quality of the encoder are summarized in Section 5, with a focus on the reconstruction capability in Section 5.1 and the latent space in Section 5.2. Section 6 discusses possible advances in research, new directions, and conclusions.

Problem Statement
In this article, we address behavior prediction, an essential part of the analysis and prediction of the behavior of other agents in a traffic situation. It is essential for planning to estimate the possible trajectories of other vehicles and other road users as accurately as possible.
Extracting the environmental information needed for reliable behavior prediction is a very important subtask. Extracting latent information relevant to the prediction from the full information in the occupancy grid using deep learning is an exciting problem. Therefore, a CVAE architecture and a procedure to reduce the occupancy grid dimensionality to 3.125% are presented in this article. The size of the images is 16 times 128 pixels, and the latent space dimension is 64. The value of the pixels covered by a vehicle is 1, while the value of the other pixels is 0. The data preparation is explained in detail in Section 4.1. The model used is explained in Section 4.2, as well as details on training in Section 4.3.

Contributions of the Paper
In this paper, a method for depicting a traffic situation is presented. Of the possible representations (structured data, lidar data, and occupancy grids), the last one is examined. By compressing and copying the grids, the proposed model learns a representation that can be used as input in other tasks. It can copy an image of 2048 pixels while creating a 64-dimensional situation vector that contains valuable information about the traffic situation at a given time. Four Variational Autoencoder models have been trained for comparison, combining two different prior distributions with two kinds of training methods. In the variational autoencoder, the decoder distribution is Bernoulli, but for the latent space generated by the encoder and the prior distribution, the Gaussian case gives significantly better results than the Bernoulli case. Furthermore, more accurate reconstruction and faster convergence can be achieved with Adversarial Training than without it.

Solution
This section details the proposed solution to the problem discussed in Section 2, starting with the source of the data and the details of its processing in Section 4.1. The model used and the loss function, and the mathematical considerations related to them are presented in Section 4.2. Finally, details of the training are given in Section 4.3.

Training Dataset
The data required for the training task were extracted from the NGSIM trajectory database. This database contains vehicle trajectories passing through two U.S. highway sections, the US-101 and I-80 [23,24]. Precise locations, velocities, and acceleration values for vehicles are included every 0.1 s. There are 11.8 million records, each representing one vehicle in a specific time frame. The trajectories of I-80 are used for preparation and for creating occupancy grid images. In Figure 1, one can see the I-80 freeway scene from where the data were collected. The occupancy grid images were extracted in the following steps. First, the algorithm goes through each vehicle. All vehicles in the database are considered ego vehicles, and the algorithm iterates through each time T at which the ego is found. The grid consists of 0.5 m × 0.5 m squares. Every square that the extent of a vehicle falls into is considered to be occupied. Each square is taken as one pixel, so we get a 1-channel image. In the center of each image is the center of the rear of the ego vehicle. The algorithm then locates any vehicle that is near the ego vehicle at time T and inserts it into the grid. In the lateral (x) direction, a distance of 4 m is taken into account, both to the right and to the left. In the longitudinal (y) direction, a distance of 32 m is taken into account in both directions, so that the shape of the samples is 16 × 128 (8 m and 64 m at a resolution of 0.5 m).
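The rasterization steps above can be illustrated with a minimal NumPy sketch. It is not the authors' extraction code: the function name `rasterize`, the `(x, y, width, length)` tuple layout, and the assumption that `(x, y)` is the center of each vehicle's footprint are hypothetical choices made only for illustration; the cell size and window extents are taken from the text.

```python
import numpy as np

CELL = 0.5          # grid resolution in metres (from the text)
HALF_WIDTH = 4.0    # lateral half-extent of the window in metres
HALF_LENGTH = 32.0  # longitudinal half-extent of the window in metres
GRID_SHAPE = (int(2 * HALF_WIDTH / CELL), int(2 * HALF_LENGTH / CELL))  # (16, 128)


def rasterize(ego_xy, vehicles):
    """Return a (16, 128) binary occupancy grid centred on ego_xy.

    vehicles: iterable of (x, y, width, length) in metres. Treating (x, y)
    as the centre of the footprint is an assumption of this sketch.
    """
    grid = np.zeros(GRID_SHAPE, dtype=np.float32)
    ex, ey = ego_xy
    for x, y, width, length in vehicles:
        # Footprint bounds measured from the lower-left corner of the window.
        x0 = x - width / 2 - ex + HALF_WIDTH
        x1 = x + width / 2 - ex + HALF_WIDTH
        y0 = y - length / 2 - ey + HALF_LENGTH
        y1 = y + length / 2 - ey + HALF_LENGTH
        # Every cell touched by the footprint counts as occupied.
        i0, i1 = int(np.floor(x0 / CELL)), int(np.ceil(x1 / CELL))
        j0, j1 = int(np.floor(y0 / CELL)), int(np.ceil(y1 / CELL))
        # Clip to the window; vehicles entirely outside are skipped.
        i0, i1 = max(i0, 0), min(i1, GRID_SHAPE[0])
        j0, j1 = max(j0, 0), min(j1, GRID_SHAPE[1])
        if i0 < i1 and j0 < j1:
            grid[i0:i1, j0:j1] = 1.0
    return grid
```

Calling `rasterize` once per ego vehicle and time step would yield one 16 × 128 training sample each; how the NGSIM position and size columns map onto the tuple is left open here.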

Methodology
Variational Autoencoder and Adversarial Autoencoder neural networks have been successfully trained for the previously defined grid copying task and, through it, for the compression task. In the following, we briefly summarize the theoretical considerations behind this choice. Starting from the probabilistic interpretation of the encoder and decoder, we describe the line of reasoning that led to the choice of the appropriate loss function.
Variational Autoencoders [25] explicitly have a regularization term in the training method because the encoder transforms the input to a distribution over the latent space and not to a single point. The decoder receives a sample from that distribution and tries to reconstruct the original data as accurately as possible. Without a constraint, this latent distribution could converge to a Dirac delta. Adversarial Autoencoders (ADVAE) regularize the latent space differently from the VAE. In adversarial training, a discriminator neural network competes with the encoder neural network for conflicting objectives [26]. The encoder plays the role of the generator, producing samples from the latent distribution that are similar to samples from the prior distribution. Meanwhile, the discriminator is trained to find the differences between the generated and prior distribution samples. In other words, the encoder attempts to mislead the discriminator, and simultaneously the latter tries to separate the generated samples from the prior samples. By means of the variational inference formulation [27], one can define a prior distribution p(z) over the latent space, where z is the latent vector. The probabilistic decoder and encoder are defined by the conditional probability distributions p(x|z) and p(z|x). The probabilistic decoder denotes the conditional distribution of the decoded variable x given the encoded variable z, while the probabilistic encoder is the opposite. Furthermore, one can suppose that the prior is a known distribution which can be easily sampled during training. The sampling process must not prevent the back-propagation of the training loss. Here, two notable distributions are applied as a hypothesis, the Standard Normal and the Bernoulli distributions. The Standard Normal in Equation (1) means that we expect the latent vector components to be independent and to have zero expected value with unit standard deviation:

$$p(z) = \mathcal{N}(0, I) \tag{1}$$
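On the other hand, in Equation (2), our expectation is that the latent vector components are independent and take the values 1 and 0 with a probability of 0.5 each:

$$p(z) = \mathcal{B}(0.5) \tag{2}$$

The encoder distribution can be expressed by Bayes' theorem:

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} = \frac{p(x \mid z)\, p(z)}{\int p(x \mid u)\, p(u)\, du} \tag{3}$$

The integral is intractable, so one has to settle for an approximation. This means that p(z|x) is approximated by a normal distribution or a Bernoulli distribution:

$$q_x(z) = \mathcal{N}(m(x), s(x)),\ m \in M,\ s \in S \qquad \text{or} \qquad q_x(z) = \mathcal{B}(\theta(x)),\ \theta \in \Theta \tag{4}$$

The M, S, and Θ function sets are parametrized by encoder neural networks. The optimal approximation is yielded by the optimal function parameters (m*, s*) for which the Kullback-Leibler Divergence (KLD) [28] is minimal:

$$(m^*, s^*) = \arg\min_{(m, s) \in M \times S} \mathrm{KL}\big(q_x(z) \,\|\, p(z \mid x)\big) \tag{5}$$

By the definition of the KLD and Equation (3):

$$\mathrm{KL}\big(q_x(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_{z \sim q_x}\big[\log q_x(z)\big] - \mathbb{E}_{z \sim q_x}\left[\log \frac{p(x \mid z)\, p(z)}{p(x)}\right] \tag{6}$$

where the terms can be factorized further:

$$\mathrm{KL}\big(q_x(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_{z \sim q_x}\big[\log q_x(z)\big] - \mathbb{E}_{z \sim q_x}\big[\log p(z)\big] - \mathbb{E}_{z \sim q_x}\big[\log p(x \mid z)\big] + \log p(x) \tag{7}$$

The last term is independent of the optimization parameters, and the rest is formulated as

$$(m^*, s^*) = \arg\min_{(m, s) \in M \times S} \Big( -\mathbb{E}_{z \sim q_x}\big[\log p(x \mid z)\big] + \mathrm{KL}\big(q_x(z) \,\|\, p(z)\big) \Big) \tag{8}$$

The effect of the second term is regularization. Minimizing the KLD between the generated q_x(z) and the prior keeps the latent codes not too far from the origin and encourages the encoder to map similar patterns to similar distributions. The first term is the negative log-likelihood of the decoder distribution, which still cannot be computed; therefore, we take advantage of what is known about the data and what we expect from the generated data. The input image x is a matrix with values of zero and one. This can be considered as every pixel being labeled one if it is occupied and zero if it is empty. The output of the decoder is between 0 and 1 because of the sigmoid nonlinearity at the output layer, so f(z) can be interpreted as the probability that the original pixel value is 1. Thus, the Bernoulli distribution is a reasonable assumption for the decoder distribution:

$$p(x \mid z) = \mathcal{B}(f(z)),\ f \in D \tag{9}$$

Here, D is a set of functions that can be parametrized by decoder neural networks.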
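The task is to maximize the log-likelihood of the decoder or, in other words, to minimize the negative log-likelihood:

$$-\log p(x \mid z) = -\sum_{i} \Big[ x_i \log f_i(z) + (1 - x_i) \log\big(1 - f_i(z)\big) \Big] \tag{10}$$

where i runs over the pixels. Thus, the result is the Binary Cross-Entropy (BCE) of the output generated image distribution and the ground-truth image plus the KLD between the latent and prior distribution:

$$\mathcal{L} = \mathrm{BCE}\big(x, f(z)\big) + \mathrm{KL}\big(q_x(z) \,\|\, p(z)\big) \tag{11}$$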
Returning to the assumptions about the latent distribution taken in Equations (1) and (2), one can explicitly calculate the second term of Equation (11). Substituting Equation (2) into Equation (11) yields the loss function for the Variational Autoencoder using the Bernoulli distribution as prior:

$$\mathcal{L}_{B} = \mathrm{BCE}\big(x, f(z)\big) + \sum_{j} \Big[ \theta_j \log(2\theta_j) + (1 - \theta_j) \log\big(2(1 - \theta_j)\big) \Big] \tag{12}$$

where θ_j is the j-th component of the encoder output θ(x). Similarly, using Equation (1), one gets the loss function for the Variational Autoencoder with the Normal distribution:

$$\mathcal{L}_{N} = \mathrm{BCE}\big(x, f(z)\big) + \frac{1}{2} \sum_{j} \Big( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \Big) \tag{13}$$

where μ_j and σ_j² are the mean and variance of the j-th component of q_x(z). The Autoencoders with both prior assumptions were also trained by adversarial training [26]. The reconstruction loss function used for this is the same as the first term in (12) and (13). The regularisation term is ignored. Instead, two other losses come into play: the discriminator term and the generator term. Each can be understood as a binary classification, since the discriminator's output is a single value δ(z) = p ∈ [0, 1], the probability that the input was not generated. The discriminator input consists of the batch of latent vectors sampled from the latent distribution generated by the encoder and a same-sized batch of vectors from the prior distribution. The expected output is 0 and 1, respectively. The discriminator loss term is formed by the BCE of the expected output and the network's output. It is used to train only the discriminator. Lastly, the generator term is formed the same way, with the difference that the input is only the batch of the latent vectors, and the discriminator's expected output is 1 instead of 0. Then, the BCE loss is back-propagated but used for training the encoder only; meanwhile, the discriminator is left unchanged. This is the concept of competing objectives: the discriminator is trained to separate the latent vectors from prior samples, while the encoder is trained to deceive the discriminator, making it mix them up. In equilibrium, the discriminator yields p = 1/2 for every input.
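As an illustration of the BCE-plus-KLD loss for the two priors, the following PyTorch sketch implements the standard closed-form KL divergences against N(0, I) and B(0.5). The function names, the `kld_weight` parameter, and the reduction (sum over pixels and latent dimensions, mean over the batch) are assumptions of this sketch, not details confirmed by the text.

```python
import torch
import torch.nn.functional as F


def kld_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=1)


def kld_bernoulli(theta, eps=1e-7):
    """KL( B(theta) || B(0.5) ), summed over latent dims."""
    theta = theta.clamp(eps, 1.0 - eps)  # avoid log(0)
    return torch.sum(
        theta * torch.log(2.0 * theta)
        + (1.0 - theta) * torch.log(2.0 * (1.0 - theta)),
        dim=1,
    )


def vae_loss(x, x_hat, kld, kld_weight=1.0):
    """Per-sample BCE (summed over pixels) plus weighted KLD, averaged over the batch.

    kld_weight is a hypothetical knob of this sketch for scaling the
    regularization term.
    """
    bce = F.binary_cross_entropy(x_hat, x, reduction="none").flatten(1).sum(dim=1)
    return (bce + kld_weight * kld).mean()
```

Both KL terms vanish when the encoder output equals the prior (zero mean and unit variance, or θ = 0.5), which is a quick sanity check on the formulas.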

Training Details
The schematic architecture of the CVAE is illustrated in Figures 2-4. The input and output layers are the 16 × 128-pixel grids. The encoder for the Gaussian prior is shown in Figure 2. After every transposed convolutional layer in the decoder part and after every layer in the first segment of the encoder, a 2D batch normalization layer and a Leaky ReLU nonlinearity with a parameter of 0.2 are applied. The second segment of the encoder applies only one 2D batch normalization layer after the first convolution and a Leaky ReLU with a parameter of 0.2 after the first three. The encoder for the Bernoulli prior is shown in Figure 3. The differences are that the last layer is a Sigmoid nonlinearity, and the second segment of complex layers does not have two branches. All parameters are organized in Table 1, and the learning curves are presented in Figure 5.
The decoder reconstructs the initial images from the latent samples, as illustrated in Figure 4. The parameters are in Table 2. A Sigmoid nonlinearity is the last layer in the decoder. The discriminator is a Multilayer Perceptron (MLP) with five complex, fully connected layers. The input dimension is 64, and the hidden layer sizes are 32, 16, and 8. Batch normalization and a Leaky ReLU nonlinearity with a parameter of 0.2 are used for better gradient flow and much faster learning. The output layer is one-dimensional, and a Sigmoid nonlinearity is applied.

The training of the VAE networks took place in a PyTorch [29] environment, and ADAM optimization [30] was applied with a learning rate of 0.001. The loss function is the sum of BCE and KLD for the reasons discussed in the previous section. For better convergence, the KLD term was multiplied by a factor of 0.1.
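The discriminator described above could be sketched in PyTorch as follows. The helper name `make_discriminator` is hypothetical, and the exact layer count is only one reading of the stated sizes (64 → 32 → 16 → 8 → 1); the paper's Table and Figures would be needed to confirm the real layout.

```python
import torch
from torch import nn


def make_discriminator(latent_dim=64, hidden=(32, 16, 8)):
    """MLP mapping a latent vector to the probability that it came from the prior."""
    layers, in_dim = [], latent_dim
    for out_dim in hidden:
        # Fully connected block: linear map, batch normalization, Leaky ReLU(0.2).
        layers += [nn.Linear(in_dim, out_dim), nn.BatchNorm1d(out_dim), nn.LeakyReLU(0.2)]
        in_dim = out_dim
    # One-dimensional output squashed to [0, 1] by a sigmoid.
    layers += [nn.Linear(in_dim, 1), nn.Sigmoid()]
    return nn.Sequential(*layers)
```

For a batch of latent vectors of shape `(B, 64)`, the module returns a `(B, 1)` tensor of probabilities.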
Adversarial training was also performed with ADAM optimization, except that the MLP discriminator was trained with Stochastic Gradient Descent (SGD). Separate ADAM optimizer instances were used: one for the reconstruction step, updating both the encoder and the decoder, and one for the regularisation step, updating only the encoder. In all cases, the learning rate was set to 0.001.
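The three-step adversarial update can be sketched as below. The `encoder`, `decoder`, and `discriminator` here are deliberately tiny hypothetical stand-ins (single linear layers, a deterministic encoder, and a Gaussian prior) so that the example runs on its own; only the optimizer split and the label convention (prior samples → 1, encoded samples → 0) follow the description in the text.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
# Hypothetical stand-ins for the convolutional encoder/decoder and the MLP.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(16 * 128, 64))
decoder = nn.Sequential(nn.Linear(64, 16 * 128), nn.Sigmoid())
discriminator = nn.Sequential(
    nn.Linear(64, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1), nn.Sigmoid()
)

# Adam for reconstruction (encoder + decoder) and for the generator step
# (encoder only); SGD for the discriminator. Learning rate 0.001 throughout.
opt_recon = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)
opt_gen = torch.optim.Adam(encoder.parameters(), lr=1e-3)
opt_disc = torch.optim.SGD(discriminator.parameters(), lr=1e-3)


def adversarial_step(x):
    ones = torch.ones(len(x), 1)
    zeros = torch.zeros(len(x), 1)

    # 1) Reconstruction step: update encoder and decoder with the BCE loss.
    opt_recon.zero_grad()
    recon = F.binary_cross_entropy(decoder(encoder(x)), x.flatten(1))
    recon.backward()
    opt_recon.step()

    # 2) Discriminator step: prior samples are labeled 1, encoded samples 0.
    opt_disc.zero_grad()
    z_fake = encoder(x).detach()          # no gradient flows into the encoder
    z_prior = torch.randn_like(z_fake)    # Gaussian prior assumed here
    d_loss = F.binary_cross_entropy(discriminator(z_prior), ones) \
        + F.binary_cross_entropy(discriminator(z_fake), zeros)
    d_loss.backward()
    opt_disc.step()

    # 3) Generator step: only the encoder is updated, with target label 1.
    opt_gen.zero_grad()
    g_loss = F.binary_cross_entropy(discriminator(encoder(x)), ones)
    g_loss.backward()
    opt_gen.step()
    return recon.item(), d_loss.item(), g_loss.item()
```

The gradients that step 3 leaves on the discriminator parameters are discarded by `opt_disc.zero_grad()` at the start of the next discriminator step, so the discriminator is never updated by the generator loss.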
Sampling of the prior and of the latent distributions yielded by the encoder during training is essential. The sampling method must not block the backward pass that calculates the gradient of the losses. Therefore, one cannot sample z ∼ N(µ, σ) directly; instead, a random variable ε is sampled from the standard normal distribution N(0, I) for the Gaussian autoencoders, and it is transformed by the mean µ and the logarithmic variance λ yielded by the encoder:

$$z = \mu + \exp(\lambda / 2) \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

In the case of Bernoulli, the sampling is not that simple: being a discrete distribution, sampling from Bernoulli with the θ parameters from the encoder is not a continuous operation, which means there is no gradient to be calculated. A random variable u is sampled from the uniform distribution U[0, 1], and b ∼ B(θ) is sampled by the rule

$$b = \begin{cases} 1, & u < \theta \\ 0, & \text{otherwise} \end{cases}$$

but b ∈ {0, 1} does not allow the gradients to be calculated. As a workaround, one avoids using b directly: (b − θ) is detached from the PyTorch computation graph [29], so it becomes just a randomized constant C. Adding C to θ gives the same values as b while the gradient calculation is not hindered. The networks were trained using GPU acceleration on an "NVIDIA GeForce GTX 1060 6 GB". The training and validation losses are shown in Figure 5. The training of neural networks is inherently slow, but inference itself is fast since one does not need to calculate gradients and backpropagate the losses through the network. Furthermore, at a given timestep, the network does not have a large dataset to process, only a few samples based on the actual traffic situation, so a CPU can deal with it within a reasonable amount of time.
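Both sampling tricks described above can be written in a few lines of PyTorch. This is a minimal sketch with hypothetical function names: the Gaussian case uses the usual reparameterization with a log-variance input, and the Bernoulli case adds the detached constant C = b − θ back to θ, so the forward value equals b while the gradient with respect to θ is the identity.

```python
import torch


def sample_gaussian(mu, logvar):
    """Reparameterized sample z = mu + exp(logvar / 2) * eps with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps


def sample_bernoulli_st(theta):
    """Bernoulli sample whose gradient passes straight through to theta."""
    u = torch.rand_like(theta)      # u ~ U[0, 1]
    b = (u < theta).float()         # hard sample in {0, 1}, no gradient
    c = (b - theta).detach()        # randomized constant C
    return theta + c                # equals b in value, d/d(theta) = 1
```

A short usage check: for `theta = torch.full((4, 64), 0.3, requires_grad=True)`, calling `sample_bernoulli_st(theta).sum().backward()` leaves `theta.grad` filled with ones, even though the sampled values are binary.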

Results
In this section, the results of the training are discussed. In Section 5.1, the neural network's reconstruction ability is illustrated. The good reconstruction ability shows the good quality of the decoder and that the 64-dimensional code contains information that represents the original picture. The quality of the mapping to latent space is discussed in Section 5.2. It is shown that slightly different grid samples are mapped close to each other. In the time series of grids, similar images are located in adjacent time steps, but, with interpolation, it is not possible to model temporal dynamics. The rest of the columns are the generated content for these samples, respectively. The copying process is loaded with some noise, which is not surprising for lossy compression. The location of the objects follows the ground truth pattern well. Perfect accuracy is not required anyway since the purpose of the decoder part is merely to learn a proper representation when trained together with the encoder. One significant difference between ADVAE and VAE is that the parameters of the ADVAE encoder are updated twice within an epoch, the second time by propagating back the discriminator error. Thus, in each epoch, the encoder was encouraged not only to represent the data very well but also to generate content similar to the prior distribution. Compared to VAE, the distance of the distributions is not measured by an analytically calculated loss function but by the discriminator network acting as a function. This may allow for more effective learning. During the training, the model parameters were saved if they produced a lower validation loss than before. Table 3 shows the lowest validation loss values, as well as the loss values obtained on the training set with the same model parameters. Note that the ADVAE scores are better than the VAE scores, and the Gaussian prior scores better than the Bernoulli prior.
This supports the hypothesis that the prior distribution is more similar to the Gaussian distribution than to the Bernoulli distribution. Although the decoder distribution was intuitively assumed to be Bernoulli, this no longer applies to the latent space distribution.

Latent Space
It is important to note that, during compression and copying, the encoder does not learn any information about the temporal dynamics of the grids. This is not surprising since the training samples are snapshots. What is illustrated in this subsection is that the encoder does not capture temporal dynamics. However, due to the regularizing effect of VAE, similar samples are mapped to similar latent vectors. Consider a continuous time series of grid images for an ego vehicle. Let G_0 = G(t_0) be a grid image at time t_0 and G_N = G(t_0 + N∆t) be another one N time steps away. Let their images in the latent space be m(G_0) and m(G_N). The latent vector m_i is obtained by linear interpolation:

$$m_i = m(G_0) + \frac{i}{N} \big( m(G_N) - m(G_0) \big), \qquad i = 0, 1, \dots, N$$

The interpolation is said to be meaningful if the preimage of m_i is the same as or similar to the corresponding grid image G_i in the time series. Preimages can be approximated by the decoder as d(m_i).
An illustrative example is shown in Figure 7. The second column is the original sequence of grids G_i, and the first column is the associated interpolated preimages d(m_i). The first and the last ones from the preimages are equal to the reconstructions of the first and last samples from the original pictures. While we see the real transition of the grids on the right side, on the left, we did not obtain this property by linear interpolation. It is not encoded whether there is an object in a particular location. Grids obtained by interpolation do not have fields for vehicles in an intermediate location. Instead, objects at the beginning and final grid of the sequence gradually disappear and appear, and objects do not continuously transform into each other.
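The latent interpolation used in this experiment can be sketched in NumPy as follows. The function name `interpolate_latents` is hypothetical, and the decoding of each interpolated vector by the trained decoder is omitted, since the trained networks are not part of this sketch.

```python
import numpy as np


def interpolate_latents(m0, mN, n_steps):
    """Return an (n_steps + 1, latent_dim) array of linearly interpolated vectors.

    Row i equals m0 + (i / n_steps) * (mN - m0), so row 0 is m0 and the
    last row is mN.
    """
    t = np.linspace(0.0, 1.0, n_steps + 1)[:, None]
    return (1.0 - t) * m0 + t * mN
```

Each row would then be passed through the decoder to approximate the preimage and compared with the grid at the corresponding time step.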

Conclusions and Future Work
The presented CVAE neural network can significantly reduce the dimensionality of occupancy grids. Figures 6 and 7 show the ability of the network to learn the momentary context of the traffic situation but not to interpret it over a time series. The network presented in this article may be helpful for further research in which the traffic situation should also be used in some way, but it is not tied to how these data are plotted. The latent information of these grids is included in the context vector and can thus be used for situation-dependent maneuvers or trajectory predictions.
There are many environmental representation solutions for autonomous vehicles. The current approach is generic and not sensitive to the number of surrounding vehicles, since it takes a fixed window around the ego vehicle, and the other agents are registered within it regardless of how many there are. In our opinion, it can be extended to other topologies as well. The present study provides a proof of concept based on the data of one NGSIM set.
If we also want to encode the dynamic nature of the traffic situation, we need to analyze the time series of occupancy grids. This can be tested with 3D convolutional Autoencoders. Another method is to train recurrent neural networks, such as LSTM, to encode the time series of latent vectors into a latent space. Comparing the two approaches will be useful in developing effective predictive models. The recurrent models can take the series of occupancy grids. A pretrained 2D convolutional encoder would create the compressed form of each grid and pass it to the recurrent unit. To each such hidden code, the positions of the ego vehicle can be concatenated so that the recurrent unit can find a correlation between them. Thus, the recurrent unit is able to encode the trajectory while also processing the environmental information, which is necessary for the prediction.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: