Compression of Vehicle Trajectories with a Variational Autoencoder

: The perception and prediction of the surrounding vehicles’ trajectories play a signiﬁcant role in designing safe and optimal control strategies for connected and automated vehicles. The compression of trajectory data and the drivers’ strategic behavior’s classiﬁcation is essential to communicate in vehicular ad-hoc networks (VANETs). This paper presents a Variational Autoencoder (VAE) solution to solve the compression problem, and as an added beneﬁt, it also provides classiﬁcation information. The input is the time series of vehicle positions along actual real-world trajectories obtained from a dataset containing highway measurements, which also serves as the target. During training, the autoencoder learns to compress and decompress this data and produces a small, few element context vector that can represent vehicle behavior in a probabilistic manner. The experiments show how the size of this context vector affects the performance of the method. The method is compared to other approaches, namely, Bidirectional LSTM Autoencoder and Sparse Convolutional Autoencoder. According to the results, the Sparse Autoencoder fails to converge to the target for the speciﬁc tasks. The Bidirectional LSTM Autoencoder could provide the same performance as the VAE, though only with double context vector length, proving that the compression capability of the VAE is better. The Support Vector Machine method is used to prove that the context vector can be used for maneuver classiﬁcation for lane changing behavior. The utilization of this method, considering neighboring vehicles, can be extended for maneuver prediction using a wider, more complex network structure.


Introduction
The general approach for modeling autonomous vehicles contains hierarchical three-layered modeling, i.e., the perception layer, a planning layer, and a trajectory control layer. The perception layer preprocesses data from multiple sensors (lidar, radar, video camera, GPS, etc.) to fuse them into a reliable perception of the surrounding environment and traffic situation. The trajectory control layer drives the actuators and gives commands based on the second planning layer. This contains three sublayers: route planning, behavioral decision-making, and path planning [1].
The route and path planning layer generates a global route and feasible local trajectories. The behavioral decision-making layer provides safe and feasible driving actions on the strategic level, such as "left lane change, follow, etc.". Self-driving vehicles' decision-making system should foresee the near future to predict the surrounding vehicles' future driving behavior [2]. Without the prediction function, emergency incidents may happen, such as the collision between MIT's "Talos" AV and Cornell's "Skynet" AV during the 2007 urban challenge, which was held by the Defense Advanced Research Projects Agency (DAPRA) [3].
Autonomous vehicles must navigate safely and efficiently in a complex vehicle traffic system, either by individual or cooperative decisions [4]. To do this, they need to make decisions continuously about what steps to take in a given traffic situation, such as when to start an overtaking or lane change maneuver. A necessary condition for decision-making is that the autonomous vehicle could interpret and consider the movement of surrounding vehicles' behavior [5]. The behavioral prediction for autonomous driving is challenging and surrounded by great attention. Dagli, Brost, and Breuel present a hierarchical dynamic Bayesian network utilized for predicting behavioral patterns [6]. Gaussian process regressions are also used for trajectory pattern classification and prediction by Trautman and Krause [7]. The Gaussian mixture model for trajectory prediction is also a subject of research. Wiest et al. [8] present a solution which can predict the future trajectory by learning patterns in trajectories, to infer a joint probability distribution as a motion model. The future path is predicted by calculating the probability for the motion, conditioned on the current observed trajectory. Their work's novelty is that they provide a distribution over the future trajectories. Hence, the evaluation of the statistical properties can predict a specific scenario.
Cruz et al. present a study of location prediction applied to trajectories obtained from sensors placed on road networks [9]. A variety of Recurrent Neural Networks (RNN) is applied using a different combination of features to measure the features' impact on the prediction task. The authors show the underlined road network's use to estimate a finer granularity trajectory definition and obtain better models in terms of accuracy and error of distance.
A novel approach presents a Long-Short Term Memory (LSTM) model for motion prediction of surrounding vehicles on freeways, aware of all the surrounding cars [10]. This model's output is not a single motion trajectory but a multi-modal distribution over future motion. A trajectory encoder LSTM encodes the vehicle's track histories and relative positions being predicted and its adjacent vehicles in a context vector. The context vector is appended with maneuver encodings of the lateral and longitudinal maneuver classes. The decoder LSTM generates maneuver specific future distributions of vehicle positions at each time step, and the maneuver classification branch assigns maneuver probabilities.
Rodrigues et al. claim that, in a system consisting of three planners (global path, behavioral, and local path), the behavior planner is the limitation of a successful oath planning solution. A new tactical behavior planner is proposed, motivated by how expert human drivers behave in intersections and is made up of a three-module architecture [11].
High-reliability lane change maneuver prediction is achieved by combining Support Vector Machine (SVM) and Artificial Neural Network (ANN) methods by Dou, Yan, and Feng [12]. The SVM and ANN classifiers predict the feasibility and suitability to change lane under certain environmental conditions. Three different classifiers to predict lane changes are compared, and the best performance is the proposed combined model with 94% accuracy for non-merge behavior and 78% accuracy for merge behavior.
Izquierdo et al. have examined vehicle movement and lane changing predictions using ANN and SVM classifiers. In their paper [13], they evaluate the performance of two kinds of ANNs predicting the lateral motion of the ego vehicle. Vehicle-to-vehicle communication systems can perform such prediction extensible to the surrounding vehicles. These two ANN architectures are evaluated in two datasets to achieve different results in different datasets with different variability. The authors use a Nonlinear Autoregressive Neural Network (NARNN), which is specially tuned to predict time series in dynamical models, which indicates when mapping between inputs and outputs is desired. In conclusion, the authors propose a baseline method to evaluate the lateral position prediction algorithms' performance. The NARNN specially indicated to predict dynamical systems is not better than the baseline method. However, the FFNN can reduce the mean absolute error on the predicted positions about 23% and 30% up to 4 s in the University of Peking and University of Alcala datasets, respectively. For lane change detection, an SVM classifier is tested. The SVM can detect precisely a lane change 3 s before it happens [13].
Maneuver detection and short-term forecasting of road vehicles are essential parts of the autonomous cars' behavior planning algorithms.
Several articles address the problem of classifying the maneuvers of vehicles moving in the ego vehicle's environment based on their trajectory data [10][11][12][13]. In [14], an LSTM RNN classifier is presented, which can classify 3 s of vehicle trajectory data to three lateral maneuver classes with 86% precision without using any information about surrounding vehicles. For this specific task, other approaches and input compilations are compared and discussed. On the other hand, a completely unified framework for surrounding vehicle maneuver recognition and motion prediction is presented with a greater generality by Nachiket et al. [15]. Their framework outperforms an interacting multiple model-based trajectory prediction baselines and runs in real-time at about six frames per second. Trajectory prediction has an inherent uncertainty because of the surrounding vehicles, and it can be handled by multi-modal prediction. In [16], this problem is addressed even just in [10], but a CNN based approach is presented. Their approach first generates a raster image encoding each vehicle actor's context and uses a CNN model to output several possible trajectories and their probabilities.
These operations are computationally intensive and require computational capabilities for real-time traffic analysis. On the other hand, the data transferred by V2X are sensitive and need secure communication, which places additional overhead on the computational requirements and hence hardens the real-time application [17,18]. It is advisable to compress the trajectory data in such a way that the compressed data are at least one order of magnitude smaller and capable of reconstructing the real trajectory with relatively high accuracy. At the same time, carrying useful information about the latent properties beyond the data distribution.

Contributions of the Paper
This paper presents a Variational Autoencoder (VAE) to compress trajectory data, which can select useful latent features with little loss and small reconstruction error. It is shown that the representation learned during compression inherently separates the trajectories according to three maneuver classes (lane keeping, right, and left lane change). VAEs can be used for learning images and, based on the encoded information, annotate or label them in an unsupervised way [19,20]. It opens new possibilities for clustering multi-dimensional data [21,22].
Furthermore, this tool can be used to classify texts even better than LSTM-based encoder-decoder tools [23]. It is not a new idea to use VAE and generative models for trajectory analysis either. Krajewski et al. show that Generative Adversarial Networks (GAN)s and VAEs can learn latent features that are unintelligible by polynomial models [24]. Classification and compression are two very distinct tasks. Compression is trained in an unsupervised way, without class labels, merely copying the input trajectory to the output as accurately as possible. During classification, the device learns in a supervised manner using class labels, only the separation without being forced to find the data's appropriate latent properties. Overall, there are two main findings of the research: • First, the data compression capabilities of the VAE is presented, which is proved to be an effective tool for representing road vehicle trajectories in a dense, highly compressed context vector. • Second, coming from the continuous encoding nature of the VAE, the encoding of similar trajectories are also close to each other in the context vector, hence it can provide maneuver classification information without training information.
In Section 2, the problem is outlined regarding sequential data encoding and interpretation and gives the motivation of this research. Briefly, the purpose of this article is to provide examples of efficient trajectory data compression for behavior prediction using autoencoder neural networks. Trajectory prediction is not the subject of this article as it attempts to analyze trajectories in more depth. Sequential data compression and classification are separate tasks. All the experiments are performed using the Next Generation Simulation (NGSIM) trajectory database [25,26]. The database is presented in Section 3, along with each preprocessing step. In Section 4, the models used in the research are introduced, with special emphasis on the VAE, divided into three sub-sections. The LSTM and Convolutional Sparse Autoencoders (Sections 4.1 and 4.2) and the Convolutional Variational Autoencoders (CVAE) Section 4.3 are shown in separate sub-sections. The explanation of the CVAE's loss function is a key point, and therefore its introduction is justified in more detail. Section 5 presents the results of model training and data research. All details regarding the training can be found in Section 5.1, while the coding ability in Section 5.2 is illustrated with figures. The result of the network's coding capability is used for a classification problem and discussed in Section 5.3, and finally Section 5.4 is devoted to the analysis of latent spatial behavior.

Problem Formulation
Convolutional neural networks (CNN) and LSTM Recurrent Neural Networks [27,28] for the optimization are studied in this paper. Different tools can be applied for the maneuver classification problem, such as Support Vector Machine, Gaussian Classifier, and LSTM neural networks [10][11][12][13][14]. For more precise trajectory forecasting and analysis, one way is to use some method of dimension reduction.
The obvious choice for sequence analysis is an RNN architecture. Still, in the case of longer sequences, the vanishing and exploding gradient problem become the main obstacle of the optimization, according to [29]. To avoid this, one can use, e.g., LSTM RNN, but its complexity (the number of parameters) can be high. Therefore, the training, as well as the inference process, can be slow. It is not an important aspect during the research work if GPUs are available for accelerating the optimization and the inference. Lately, several forecasting and classification concepts have been thoroughly proven in the literature. However, carefully choosing the neural networks' architecture to minimize the model complexity can be useful in the practice.
A CNN can also find and learn temporal features and patterns in a one-dimensional "image"; in our case, this dimension is the time [16]. Every pixel is one-time step, and the number of channels is equal to the d feature number. CNN based autoencoder can yield the same or better result as a Bidirectional LSTM with a tenth of its complexity.
The context vector can be used for maneuver detection, classification, and data visualization. This paper presents autoencoders trained to copy the trajectories while the context vector is low dimensional and carries useful information. Using the same architecture for predicting future trajectory fails because of the lack of information about the surrounding vehicles [15,30,31]. A hypothesis is set up predicting the future trajectory's context vector instead of the future trajectory itself. From this, it is decoded that the performance should be better. It is shown that the experimental result overturns this hypothesis, so the trajectories do not decode enough latent information about surrounding vehicle motions.

The Training Dataset
The data set for training is X ≡ R d×s and every x ∈ X data point is a vehicle trajectory with sequence length s and feature dimension d. Sequence length is s = 60, which means 6 s, and the feature dimension is d = 2, which is the longitudinal and lateral coordinate of the vehicle in every time step. The traffic situation is a multi-lane highway road in the US, and the data collection was made by the New Generation Simulations (NGSIM).
The NGSIM Unites States Highway 101 (US-101) ( Figure 1) and Interstate 80 (I-80) ( Figure 2) datasets [25,26] are used for training and evaluating the trajectory autoencoding problem. This trajectory data has precise location, velocity, and acceleration of each vehicle within a specific area every 0.1 s. Furthermore, it provides relative positions to surrounding vehicles and lane position in every frame. There are 11.8 million rows and 25 columns in this dataset. Each row represents one vehicle in a specific frame with all the information. The columns in ascending order are the following: vehicle identification number, frame identification number ascending by start time, the total number of frames in which the vehicle appears, global time, lateral (x) and longitudinal (y) coordinate of the vehicle's front center concerning the left-most edge of the section in the direction of travel, global x, y coordinate, length and width of the vehicle, vehicle type (1-motorcycle, 2-auto, 3-truck), instantaneous velocity and acceleration, and lane position.

Data Preprocessing and Preparations
The database for the training consists of 1614 vehicles. One-third of them perform a lane-changing maneuver to the left, and another one third is right lane changing; the rest is lane-keeping.
Two trajectory pieces are extracted from the complete path data of each vehicle. One is 6 s (60 time-steps) ahead of the occurrence of the lane index changing, and one other 6 s after. In the case of the lane-keeping ones, two consecutive pieces of trajectory are taken. All the x, y coordinate units have been changed from feet to meters. The following steps have been performed on both subsets.
First, we checked if a vehicle had at least 120 time-steps in the data set because two 60 long sequences are desired, and separated according to what maneuver is done. Second, a vehicle is only considered in the training set if the lane changing occurred later than the first 60 time-steps, with the two sequences extracted. Thirdly, the two sets have been added together and the largest possible balanced data set is randomly chosen and drawn, with the same number of lane-keeping, left, and right lane changing vehicles.
For the sake of the higher generalization capability, every trajectory sample is translated to start from the origin.

Methodology
The idea behind the presented research uses autoencoders to train a neural network to a simple task copying the input data into the output as accurately as possible [20]. An Autoencoder (AE) contains an encoder neural network and a decoder network. The encoder takes the input vector and transforms it into another vector (usually with a smaller dimension) called the context vector. The decoder receives this code and dilates it back to the size of the input data. This scheme fits into any AE's basic concept, and it is summarized in Figure 3. At each epoch, one feeds the data forward, the encoder, and the decoder takes the reconstruction error, backpropagates it through the network, and updates its weights. The process can be trained by any gradient descent optimization method. All the models are optimized by Adam [32] with different learning rates between 0.0001-0.001, depending on the convergence performance.
The training uses mean square error (MSE) loss, which is the reconstruction loss. An additional loss function is also taken to smooth the output. The derivative of the target and decoded sequence is formed, their MSE is calculated and added to the reconstruction loss after multiplication by the weight of 0.1: . .

Autoencoder
This device is capable of lossy copying of 6-second trajectories.
Meanwhile, in latent space, it tries to map the probability distribution that generates the trajectories.

Bidirectional LSTM Autoencoder
Based on LSTM networks' success in learning sequential data generation process's nonlinear, time-dependent dynamics [27,28], the application of LSTM networks is evident. The structure of the model is described below. The Encoder is a Bidirectional LSTM RNN with an adjustable c context dimension-meaning that the LSTM [33] cell has a hidden state and cell state dimension equal to c.
Coordinates x, y are fed in every time-step. The hidden and cell states are initialized randomly in the first time step, and the resulting new one is passed to the next step. The last ones are taken as the context vector and used as the initialization of the Encoder LSTM. During training mode, teacher forcing [34] is applied with a ratio of 0.5. The decoder in the first step takes the original trajectory's last time step x, y coordinates. The other steps take the previous output or the corresponding ground truth from the original trajectory. In the case of validation and test, teacher forcing is turned off. The context dimension is set to be c = 10, 12, 15, 20 using three LSTM layers and 0.5 dropout.

Convolutional Sparse Autoencoder
The encoder and decoder are one-dimensional CNNs with four compound convolutional layers and one linear layer. The numbers of input channels of the layers are 2, 6, 12, and 8 and the kernel sizes 4, 6, 8, and 5, respectively. PReLU activation function is applied in every layer with one parameter per channel. The last two layers use maximum pooling to reduce sequence length. Their kernel size is 3, and 2 with the dilation of 1. Thus, the output shape of the compound convolutional layers is 5 × 5 = 25. A single fully connected layer is used to reduce this number to the c context dimension.
The decoder's transposed convolutional layers upscale the channel number and the sequence length. The channel numbers are 1, 3, 8, 4, and 2:

Convolutional Variational Autoencoder
Variational Autoencoder inherently has an explicit regularization during the training process [35]. VAEs encode the input as a distribution over the latent space instead of encoding it as a point. The context vector is then sampled from this distribution, the sampled context is decoded, and the reconstruction error is computed. One data point from the data set is denoted by x and the encoded representation, called "context vector" is z. Let's denote the prior distribution over the latent space by p(z).
Probabilistic decoder: This is defined by p(x|z), and it describes the distribution of the decoded variable given the encoded variable.
Probabilistic encoder: This is defined by p(z|x), and it describes the distribution of the encoded variable given the decoded variable.
The probabilistic encoder and decoder concept is not enough to solve the regularization problem regarding the content generation. It is easy to see that the model is not prevented from returning distributions like Dirac delta or punctual functions behave like classic models leading to overfitting. In the next paragraphs, the details of the mathematical concepts of the VAE are covered, and the deduction of the correct loss function for the optimization is presented.
Below, two assumptions are made about the distributions: F denotes a set of functions that can be parametrized by a finite number of parameters, so it does not cover all the functions in general. The F elements can be approximated by neural networks, which is the key to finding the optimal decoding. On the other hand, the Bayesian theorem can unfold the probabilistic encoder which is used in Equation (7) for the deduction of a practical formula. Otherwise, the integral in Equation (5) cannot be evaluated, so it cannot be used in practice. By means of the variational inference formulation, the p(z|x) encoder distribution is approximated by an other Gaussian q x (z), where the mean and the variance are functions of variable x: Here, G and H are sets of parametrized functions just like F. Among these functions, one needs to find the best approximation of the g, h functions by determining their parameters. The best approximation is supposed to minimize the Kullback-Leibler divergence [36] between the approximation and the original p(z|x) distribution. The optimal g * and h * functions obtained by Here, Equation (5) is substituted and one gets After rearranging the terms, the Kullback-Leibler divergence of q x (z) and p(z) distributions appears, which can be calculated. The last term does not depend on g and h, so, generally, it gives only a constant contribution so that it can be omitted in the minimization problem: (g * , h * ) = arg min The last equation expresses that the optimizer needs to find a balance between the reconstruction loss, which is the maximization of the log-likelihood for the p(x|z) and the KL divergence of q x (z) and the prior which enforce the encoder to stay close to the standard normal distribution. This is a trade-off between how much one can rely on the data and how fair assumption is the prior.
In case of the function f , one should maximize the expected log-likelihood of output x given a z context vector sampled from g x (z): Taking Equation (11) into account, the final optimization problem is formulated in maximizing the following expression with respect to the parameters of the three functions ( f , g, h): The mean square error approximates the first term. The second term is the Kullback-Leibler divergence between the normal distribution q x (z) with a diagonal covariance matrix and the standard normal distribution. After performing the calculation, one gets the loss function for the VAE; In addition, using the mean squared error of the first derivatives gives a better and smoother result, thus the final loss function for the VAE with Equation (1) takes the following form: All the functions f , g, h are parametrized by the parameters of neural networks, and F, G, H sets are defined by their architectural design. Empirically determined parameters for the encoder and decoder are given in Tables 1 and 2. The Two Separate Part for the µ and σ Generation

Training
All the models and the training are implemented using the Pytorch [37] framework. In training mode, random Gaussian noise with zero mean and 0.1 variance is added to every trajectory sample in the training set. Meanwhile, during the validation and test mode, the noise is not applied. This method helps the neural network to learn denoising capabilities and reconstruct the trajectories more smoothly. In this subsection, the training loss values are presented in Tables 3 and 4. The tables contain the lowest value of the validation losses and the corresponding training losses. All values are derived from the reconstruction loss and the regularization term. An average per sample is taken on the data set, so this value is the expected value of the reconstruction and regularization error of a sample trajectory with 2 × 60-parameters. Furthermore, the learning curves of SAE, LSTM AE, and CVAE are presented in Figure 4 in the case of 10-dimensional latent space. A moving average smoothes the curves with a window size of 1000 epochs for the sake of better visibility.  The Convolutional Sparse Autoencoder (SAE) described in Section 4.2 fails to converge during the training process. Using the regularization term in loss function Equation (2) gives a better result than without using it though the training loss freezes around 50-100 based on the learning rate and latent space dimension. In Figure 4, the red lines show that the SAE gets stuck in a local optimum during the optimization, unable to make that progress like the CVAE (cyan blue curves) does. Since the loss values are the sum of the losses along the 2 times 60 components of the meters' trajectory, the value above 35 means that the average deviation is around 0.5 m for each component. This is too big to say that SAE training is satisfactory. A similar finding can be made for the LSTM AE.
The LSTM AE (Section 4.1) with the same regularization gives slightly better results regarding the loss values in Table 4. The LSTM can only approach CVAE's loss values with much greater complexity. Once the latent space dimension is chosen below 10, the training process can not converge, so the losses become unmanageable big values and extremely fluctuating. On Figure 4, the purple lines correspond to the LSTM AE. Much less epoch number is used for the training due to the larger run-time. The training error could be further reduced, but the validation loss's rise indicates the overfitting phenomenon; therefore, it does not make sense. These results suggest that the sparse regularization term in Equation (2) assumes a prior distribution in the latent space that does not approximate the real one well.
The lowest loss values (Table 3) are resulted by the Convolutional Variational Autoencoder (Section 4.3) with the inherent regularization term in Equation (14). This is included in the assumption of the standard normal distribution for the prior distribution in the latent space. The result of the CVAE is reported below.

Encoding and Decoding Performance
In Figure 5, the CVAE encoding performance is illustrated on three maneuvers. From top to bottom: Reconstruction with 10-, 6-, and 4-dimensional latent space. From left to right: The same trajectory of a lane-keeping and lane changing on the right and left. The red curve is the ground truth trajectory, while the green one is the most probable reconstruction given the latent distribution. With blue lines, several reconstructions are plotted using 100 different context vectors sampled from the latent distribution. The transparency of the line is proportional to its likelihood. The horizontal axis is the longitudinal y coordinate of the vehicle's trajectory, and the vertical is the lateral x coordinate. Under the trajectory graphs, the hidden state vector is plotted.
It can be seen that, by reducing the dimension, the reconstruction ability deteriorates, which can also be observed in the values in Table 3. However, this does not have such a significant effect on the classification, see Table 5. Table 5. Results of SVM classification on latent representations.

Maneuver Classification
In a maneuver classification task, a learning algorithm is trained to specify which maneuver from a fixed class belongs to the input trajectories. Supervised learning algorithms solve this task by producing a function mapping from the input space X to a finite category class C, which is a subset of the natural numbers N. In this case, the input trajectory is a tensor of X ≡ R 2×60 and the category set is C = {left lane changing, lane keeping, right lane changing} and can be identified by a numeric label code {0, 1, 2} respectively. Although the CVAE is not trained as a classifier, it can cluster the data by learning its intrinsic properties, which is useful for classification as well.  The CVAE has learned latent information about the maneuvers in an unsupervised way, illustrated by the 3D latent space visualization in Figure 6. An arbitrary 6-second trajectory is mapped into the latent space as a Gaussian distribution by the trained CVAE encoder part. The centrum of this distribution is the expected value vector. These vectors can be visualized using a projection from the latent space to the three-dimensional space. On the three-dimensional space in Figure 6, every dot colored by the maneuver label belongs to a trajectory. According to the maneuvers, the points representing different trajectories are automatically separated in the latent space, although the labels belonging to the maneuvers haven't been used during the training. This is due to the construction of the CVAE with adequate regularizing ability, in which the prior distribution of the latent space variable is assumed to be standard normal, and the probabilistic encoder and decoder follow a normal distribution. The CVAE preprocessed the trajectories by selecting the characteristic features, making it easier to find planes that separate the classes with a small error. A Support Vector Machine [38] on context vectors can classify trajectories according to maneuvers with 88-91% accuracy (see Table 5), which is more than the tools can that have been specifically trained for maneuver detection [14].

Latent Space Visualization
In Figure 7, an interpolation is visualized in the latent representation space. The series begins with a right lane change and ends with a left lane change. The three internal states result from decoding the weighted average of the two extreme states' context vectors. Continuous transition is observed in terms of the distance traveled longitudinally and laterally and in the trajectory shape. The continuous transition between two completely different trajectories suggests that the regularization properly organizes the latent space. The decoded trajectories from the internal states are not uncertain as expected without regularization but eligible ones, thus the decoder can be used for content generation. The VAE described in Section 4.3 learns meaningful mapping and coding.
The t-distributed Stochastic Neighbor Embedding (t-SNE) is a widespread technique for the latent space visualization [39]. After 120 data trajectories are encoded by the VAE into a 3 to 10-dimensional vector space, this procedure is applied to visualize the data in three dimensions.

Conclusions
The research presented in this paper shows that a CVAE, with an appropriate neural network structure, is capable of the lossy compression of real-world vehicle trajectories. The compressed code, also known as the context vector, or latent representation, can also be used for maneuver detection and classification. In addition, a representation that continuously maps similar patterns is learned, successfully modeling the vehicle trajectories automatically and meaningfully.
In the future, the input for trajectory data analysis and behavior prediction should be further investigated. A more realistic model should have the three lateral classes for lane changing behavior, multiplied by several longitudinal ones. Longitudinal classes could separate acceleration behaviors. Environmental conditions can also affect driver behavior, as light and dense traffic have different inner structures. Hence, a more sophisticated model should handle maneuvers that are specific to this situation.
The paper also shows that the encoded context vector can be used for maneuver classification. It could be used for trajectory prediction, but it is essential to consider more information about the traffic situation to make predictions. One should take the trajectory of surrounding vehicles into account in some manner. It would be possible to test whether such networks can encode the vehicle's trajectory under investigation with the information of vehicles moving around it. An autoencoder network should be trained for yielding the future trajectory of the vehicle to the output by feeding the historical trajectory of the ego vehicle and the surroundings to the input. It is possible to separate the input encoding (and copying) task to the prediction task. In other words, on the one hand, one can train to predict the future trajectory, but one can also train the method to predict in latent space. In the latter case, it is quite simply a matter of a separately trained device encoding the trajectories and mapping them into the latent space, followed by another performing the prediction on the context vector from which the decoder decrypts the future trajectory. Funding: The research reported in this paper was supported by the Higher Education Excellence Program in the frame of Artificial Intelligence research area of Budapest University of Technology and Economics (BME FIKP-MI/FM). The project is also supported by the Hungarian Government and co-financed by the European Social Fund, EFOP-3.6.3-VEKOP-16-2017-00001: Talent management in autonomous vehicle control technologies.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: