GPU-Accelerated Interaction-Aware Motion Prediction

: Before their massive deployment, autonomous vehicles need to prove in complex scenarios such that they can reach human driving proﬁciency while guaranteeing higher safety levels. One of the most important human traits to negotiating trafﬁc is the ability to predict the future behavior of surrounding vehicles as a basis for agile and safe navigation. This capability is particularly challenging for an autonomous system in highly interactive driving situations, such as intersections or roundabouts. In this paper, a set of techniques to bring a computationally expensive state-of-the-art motion prediction algorithm to real-time execution are presented with the goal of meeting a standard motion-planning algorithm execution frequency of 5 Hz, which is the primary consumer of motion predictions. This is achieved by applying novel and existing parallelization algorithms that take advantage of graphic processing units (GPUs) through the compute uniﬁed device architecture (CUDA) programming language and managing to produce an average 5 × speedup over raw C++ in the cases studied. The optimizations are then evaluated in public datasets and a real vehicle on a test track.


Introduction
Autonomous vehicles have seen a huge influx of resources and research attention in recent years, as there is a real drive to make this technology common and of everyday use. Some of the challenges that autonomous driving tries to address are high collision rates [1], traffic congestion [2], difficult access to public transport [3] and lack of workforce [4]. A solution to these problems needs to be technically ready to be deployed in automated vehicles to allow them to navigate real traffic.
One of the main requirements for autonomous navigation in real traffic is the ability to predict the future motion of the surrounding vehicles in a timely manner. The main characteristic of vehicle motion prediction is that the traffic agents influence one another and that interaction needs to be taken into account to produce accurate predictions [5]. An illustrative example of these interactions can be seen in Figure 1, where the predictions of each vehicle at a given time are depicted with multi-modal footprints (i.e., a spatiotemporal probabilistic distribution per potential corridor). However, interaction-aware motion prediction poses a problem in computing scalability, as every increase in the number of surrounding vehicles incurs added computing costs in the motion prediction of all other vehicles.
The objective of this work is to produce an interaction-aware motion prediction algorithm that can run in real time while dealing with dense traffic. A specific goal for this system is to be able to feed reliable motion predictions into a real-time path planning algorithm, which is often working at frequencies not lower than 5 Hz. In this context, a collection of GPU acceleration methods are designed and implemented to bring down the computational cost of a multi-modal probabilistic interaction-aware motion prediction algorithm. GPUs are selected as the acceleration driver for this work because of their ubiquity in modern computers, their relation between price and processing power, and the relative ease of development thanks to modern parallel frameworks. The set of techniques presented are both (i) problem specific, taking advantage of application knowledge to bring down computation time further than available libraries could, and (ii) general acceleration methods that are proven to work in other existing scenarios. The proposed methods tackle the most computationally expensive activities, such as memory accesses, arithmetic instructions and sequential computation bottlenecks. A video of the execution of the experiment cases containing the motion grid for the last time step can be found in the Supplementary Materials.
The main contributions of this paper are as follows: • A set of novel GPU parallelization techniques to allow a state-of-the-art interactionaware motion prediction algorithm to run in real time. • A study of the developed technique's performance in public datasets and on a real vehicle in a test track. • A comparison of the accelerated algorithm's prediction error with a state-of-the-art algorithm. The rest of the paper is structured as follows: Section 2 presents a review of the most relevant related work. Section 4 describes the set of techniques developed to accelerate motion prediction through the use of a GPU. Section 5 describes the experimental setup and the experiments carried out to document the effect of the proposed strategy. Section 6 gives insights regarding the prediction performance of the accelerated algorithm and contextualizes it with a comparison. Finally, Section 7 draws some concluding remarks.

Related Work
As of late, there has been a development push to bring scientific advances to the real world by accelerating algorithms. This push has been made in the algorithms themselves and in the architectures that host them. One of the most popular embedded architectures used is field programmable gate arrays (FPGAs) due to their low latency, efficient power consumption and robustness to adverse conditions. In [6], the authors argue that FPGA development is mature enough to host expensive simultaneous localization and mapping algorithms. In [7], they apply an FPGA to offload algorithm components while acting as a fallback system in the case of failure. In [8], they demonstrate parallelization techniques for a decision-making algorithm deployed on an FPGA with its corresponding execution time study.
Together with FPGAs, another hardware platform used for accelerating software is application-specific integrated circuits (ASICs), which are chips that are specifically designed for a particular use. ASICs can be extremely efficient and cost effective when acquired in bulk but require a steep investment upfront, which makes them unsuitable for most robotic applications as well as for scientific investigation.
A more common computing platform used in robotics is the GPU. GPUs offer the ability to perform parallel computation on large amounts of data that, when coupled with parallel algorithms, can perform at speeds orders of magnitude faster than linear central processing unit (CPU) computations. GPUs have been used to optimize perception systems based on LiDAR sensors and convolutional neural networks [9], moving object lidar-based segmentation [10], vehicle collision analysis from point clouds [11], path planning for multiple unmanned aerial vehicles [12] and motion prediction as a stage of a larger smart mobility application [13] amongst others. Lately, they have been adopted by several key players in the autonomous driving industry through the use of Nvidia drive computing platforms.
To fully exploit the GPU capabilities, it is often necessary to develop dedicated parallel algorithms, which has brought an influx of novel techniques to classic algorithms. One example is the resampling stage in particle filters, which is historically complex to parallelize due to their sequential nature. In [14], the authors offer a parallel reimplementation of existing resampling methods while improving upon them by taking advantage of the multi-threaded architecture of GPUs. Another example is the inference of probabilistic predictions. In [15], the authors propose a Bayesian temporal factorization framework that is able to generate predictions by modeling multidimensional time series that accounts for parallelization in their model inference algorithms. In [16], the authors use a GPU in training and inference to perform traffic flow prediction based on a temporal graph neural network.
Another field that has stirred interest in optimization is motion prediction based on Markov chain matrix multiplication, due to the high computational cost that comes associated with it. In [17], the authors acknowledge this fact and propose the possibility of accelerating the matrix multiplication using dedicated hardware. Even though there has been some effort to make this method feasible in online execution [18], there is still a struggle to reach a prediction frequency of over 1 Hz in complex scenarios.

Architecture Overview
The interaction-aware motion prediction (IAMP) strategy considered in this work first infers the surrounding traffic agents' intentions, using a dynamic Bayesian network, and then predicts their motion using Markov chains. It explicitly considers interactions, the road layout and traffic rules, and can be used in almost any driving context, as it does not rely on training data. IAMP receives information from the perception module that informs the algorithm about the position and velocities of the surrounding vehicles and the ego vehicle and from a pre-computed vectorized map that is loaded once in the Lanelet [19] format.
At each time step, the interactions amongst vehicles and between vehicles and the layout are taken into account to infer the probabilities to stop at intersections, the probabilities to change lanes and the probability of being in each of the possible navigable corridors. These interactions are computed for every vehicle in the scene against the rest and take into account both information that is coded into the traffic rules and also potential deviations from the expected road-safe behavior to deal with unsafe driving situations, such as violating a stop line. This feature makes this interaction-aware motion prediction algorithm especially useful in complex situations with dense traffic, like unsignalized intersections [20], sudden unsafe lane changes [21], or roundabouts [22]. The intentions are then fused with the motion predictions computed with a kinematic model to produce a 3D motion grid used by the ego vehicle to navigate through the scene.
The motion grid is a cube composed of a collection of matrices, one per time step until the prediction horizon, that host the grid of predictions for each vehicle. The grid-based probabilistic representation of the predictions is centered around the ego vehicle and takes into account the uncertainties both in the motion model and in the input data.
The algorithm can be summarized in the scheme of Figure 2. The main parts where the GPU-based acceleration has significantly reduced the computation time are highlighted in blue in the flowchart, namely in the particle filter used to compute the intentions and in the matrix multiplication of the motion prediction. This flowchart is extensively described in [20][21][22], where the algorithm is validated and compared with state-of-the-art approaches. Hence, it will only be briefly reviewed below.

Corridors
One of the core parts of interaction-aware motion prediction is to define the potential driving lanes of each vehicle to determine which ones influence each other and which ones are independent. The current navigation model is reliant on maps that are formed by lanelets [19].
Corridors are the collection of adjacent lanelets that start from the current position of the vehicle and reach the distance that the vehicle can travel at a certain time at its current speed. The adjacent lanelets are found via a graph search in the physical layer using the neighborhood relations stored in the lanelets. Corridors store the list of lanelet ids that form them and their topological relation to other corridors, resulting in an inexpensive way to find out if a vehicle's corridors intersect with another, and therefore a relation of dependence can be established. Figure 3 show an example of a vehicle's corridors.

Relations
Given a map and the list of corridors obtained for all vehicles present at the scene, three types of relations are obtained: lateral relation, corridor-to-intersection (distance to intersections) and corridor-to-corridor (corridors dependencies):

•
Lateral relation: For each target vehicle, a search of the surrounding vehicles is performed, and the bumper-to-bumper distance and their velocities are stored. • Corridor-to-intersection: The intersections each corridor goes through are determined by intersecting the lanelet's identifiers. Then, for all intersections a corridor goes through, the distance to the intersection is computed in the Frenet frame. • Corridor-to-corridor: In order to determine which corridor will influence the predictions of the other, a search is performed by pairwise intersecting the centerlines of all corridors. This generates a list of possible collisions. Based on a set of rules, one (if any) corridor is selected as the corridor influencing the motion of another vehicle.
The main function of the relation computation is to create influence matrices that determine which vehicles affect each other's future behavior in the intention estimation step of the algorithm.

Intention Estimation
The interactions between various elements in a vehicle's environment play a crucial role in determining its behavior. In order to estimate the intentions of the traffic participants, the dynamic Bayesian network (DBN) described in [23] inspired the one used in this work. It takes as input the corridors and relations obtained in the previous steps. DBN is a powerful modeling tool that tackles uncertainty and promotes explainability [24]. The network represented in Figure 4 is instantiated for each of the vehicles present in the scene. Note that bold arrows represent the influences of the other vehicles on vehicle n through some key variables (E n t , I n t , Φ n t , Z n t ) described below, where n ∈ 1 . . . N and N is the number of vehicles present at the scene. The relationships between the variables in Figure 4 can be represented by the following joint distribution [23]: where we have the following: • Expectation The relations obtained from the pairwise analysis of the corridors are used to estimate the longitudinal E s [20] and lateral E c [21] behaviors of the vehicles.
An exact inference of (1) is often unfeasible. Hence, a particle filter [25] is used to estimate the hidden states E t , I t and Φ t , given the observed variables Z t .
The number of vehicles at the scene might change at each time step. This may occur due to the ego vehicle's sensors range or due to vehicles entering or leaving the covered zone in the case of a dataset. To manage such situations, for every new vehicle that appears, the DBN is instantiated, and the initial intentions and route are randomly set from the list of possible values. Once a vehicle leaves the scene, its physical state is updated for a few steps before being completely discarded. This is done to avoid having to initialize a vehicle already seen in previous steps.

Motion Prediction
The computation of the surrounding vehicles' predictions is inspired by the method proposed by [26]. The system dynamics are abstracted into Markov chains, where the state space X and input space U are discretized into intervals. The state space consists of longitudinal position s and velocity v, and the input space represents the potential acceleration a.
The transition probability matrices of the Markov chains for a time step Υ(τ), and for a time interval Υ([0, τ]), where τ is the time increment, are computed offline with reachability analysis [26], using the following differential equation as the vehicle's longitudinal dynamics:ṡ where a max and v max are the maximum allowed acceleration and velocity, respectively, and u is a value in the [−1, 1] range sampled from the discretized input space U.
The states probability distributions for future time steps p(t k+1 ) and time intervals p(t k , t k+1 ) are computed as follows: where Γ(t k ) is the input transition matrix that represents the transition probabilities between the input states. These probabilities take into account the acceleration profiles obtained for each corridor considering the vehicle-to-vehicle and vehicle-to-layout relations as explained in [20]. These predictions are projected onto the corridor in a road-shaped format and are fused with the corridors' probabilities, inferred by the DBN, to generate a motion grid.

GPU Programming Model
GPUs were created to process graphics in a dedicated manner independently of CPUs. Due to its advantages in performing simple math, in parallel, the GPU was divorced from its original use. In 2007, Nvidia released the CUDA API [27], which allows developers to make use of the GPU from a high-level programming language on top of C++. The usual workflow of a CUDA program is allocating memory on the GPU, transferring data from CPU, making program calls to the GPU from the CPU and transferring back the data to the CPU.
CUDA establishes an execution hierarchy and a memory hierarchy. In the execution hierarchy, it defines kernels, which are blocks of code executed by a single thread. The threads are executed simultaneously in groups called warps, which are scheduled within blocks. Blocks are then grouped into grids and executed in the different streaming multiprocessors (SM) of the GPU. The memory hierarchy defines types of memory based on their size, speed and accessibility.

•
Global memory is accessible from all threads in the GPU and is the highest in size but it also has the most access latency. • Shared memory is accessed from within the streaming multiprocessors, is shared across the threads in a block, and has very low latency. • Local memory is the memory that each thread can statically allocate and is only visible from within the thread.
Analyzing performance, also called profiling, is a core part of CUDA development. CUDA defines several concepts to study code performance such as occupancy, coalescence and instruction optimization. Occupancy is the ratio between the number of active warps to the maximum number of warps per SM; it gives information about the used percentage of the GPU, which should be as high as possible for maximum performance. Coalescence is a memory access technique that aims to group memory transactions to reduce them and make optimal use of memory bandwidth. Instruction optimization studies the number of low-level instructions issued to the processors to keep them to a minimum.

Parallel Particle Filter
This section describes the implementation of acceleration techniques on the different stages of a particle filter (predict, update, resample and output generation) that are used for the inference of the intention, expectation and future pose of the surrounding vehicles.

Predict
The predict stage of the particle filter produces an estimation of the future state of the system based on the previously estimated state and the measured state.
Each particle contains a complete representation of the environment: the estimated and measured states of all vehicles and their intentions and expectations. Every particle holds a vector of vehicle states that define a complete scene within itself. The vehicle state is defined by its position in the Cartesian coordinate system (x, y), orientation θ and velocity v. This maintains full independence across particles and unlocks the possibility of assigning each particle to a thread, therefore enabling the parallel computing of the prediction. It also minimizes the amount of information that needs to be exchanged from CPU to GPU at the beginning of each iteration, as everything except from the measured state is computed and remains in the GPU memory.
The data structure of a particle can be seen in Figure 5, where the subindex k reflects the number of particles minus one, the subindex n reflects the number of vehicles minus one, and the variables with subindex e represent the estimated state of a vehicle. Every variable is stored in a C++ float, which takes 4 bytes of memory. The design choice made here was to create an array of structs of vehicle states to enable a concurrent memory access pattern per particle. This is necessary because in each iteration, the algorithm accesses every field in a vehicle state to compute the intention and expectation, and therefore it needs them to be contiguous in memory to allow coalescing to happen. GPUs require natural alignment in their data; in other words, memory reads and writes will always work with words of size equal to 1, 2, 4, 8, or 16 bytes or a number divisible by any of those [28]. The parallel programming design consideration taken here was to keep the data size to a number divisible by 8 bytes to guarantee contiguous memory alignment. The complete particle struct is formed by 12 floats per vehicle, which amounts to 48 bytes and is divisible by 8. If the amount of vehicles present in the scene results in a particle data structure that is not divisible by 8, padding is added. This contiguous memory alignment allows for faster performance, as it minimizes the number of memory transactions from global memory to the individual threads necessary per iteration.

Update
The update phase of the particle filter computes the weights of each particle based on the quality of the prediction. To do that, it compares the predicted pose and velocity to the measured ones using the following normal distribution: The process works by setting the mean µ as the measured value, the standard deviation σ as a fixed value that is set based on how spread the particle population needs to be, and the input x, y, θ, v as the predicted position, orientation and velocity, respectively.
This process is performed for every vehicle, and then their weights are multiplied to produce the total particle weight w k as follows: where N is the number of vehicles. This combination allows for the weight to define the prediction quality of a full scene instead of evaluating predictions per vehicle in isolation.
The weights of all particles are computed simultaneously using a kernel with an optimized version of the normal distribution function. This aims to optimize the number of arithmetic instructions issued to the processors by substituting the square root for its numeric equivalent. The square root is chosen to be replaced since the CPU uses Newton's method to calculate it [29], which is computationally expensive. There is a loss of precision with a maximum error of ±10 −8 [m, rad, m/s²], which is the float precision and is deemed an acceptable trade-off for the gained performance.

Resample
The next stage in the particle filter is resampling, which works to avoid weight degradation and filters out poor-performing particles. The measurement of the overall particle filter performance is the effective population N e f f , which reflects the degree of the weight quality [30]: where K is the total number of particles. This stage is only triggered when this effective population falls below a threshold that is set to be half the total number of particles.
The resampling algorithm is based on the low variance sampler [31], which is an efficient sampler of complexity O(M), where M is the size in bits needed to represent the input that selects particles with a probability directly proportional to its weight but only computing a single random number. Figure 6 shows a schematic of the resampling algorithm that works by splitting the particle range into sections proportional to their weight w p . The selection of the new particles occurs by first sampling a random value r ∈ [0, K −1 ] and then adding this value to the inverse of the total number of particles K multiplied by the new particle number m: The particle whose section corresponds to u is added to the new set of particles. This process duplicates performing particles and removes low-performing particles. The algorithm in CUDA works by generating a vector of indexes using a single thread and storing it in global memory. Then, it uses a kernel to copy the chosen particles in a new vector using grid-stride looping. Grid-stride looping is a CUDA technique that uses the grid dimension as the period for a loop wrapping the code within the kernel and therefore achieving complete coalescence in memory by ensuring that all memory accesses are contiguous. Figure 7 shows a simplified representation of the technique in which a grid of size 2 with blocks of size 2 use their threads to access the memory spaces, in blue, with a stride of grid size × block size, meaning that the first thread of the grid will access memory at spaces 0, 4, 8, etc. This unit-stride memory access is what achieves coalescence.

Output Generation
Once the particles have been resampled based on their weight, the algorithm goes into its final stage, which is producing an estimation of the system's state S as follows: where each particle specific substate S s (E s , I s , E c , I c , x, y, θ or v) is multiplied by its weight w and then added to the rest of the particles to produce a result. This state estimation output is then used two-fold: as feedback for the particle filter's next iteration and to produce a probability distribution per corridor by discriminating the sum of weights per group of particles with that corridor. This problem is solved by using a reduction algorithm in which a large dataset is transformed into a small dataset, or a single value per corridor in this case, by applying an operator to each element. In CUDA, the algorithm can be accelerated by taking advantage of its execution hierarchy and its memory hierarchy at the same time using a technique called parallel reduction sequential addressing [32]. The technique consists in using thread blocks to iteratively reduce fragments of the dataset while storing them in shared memory. The sequential addressing part of the technique consists in combining the values by accessing the memory in strides of a size equal to a multiple of the block size. As a result, the accesses are contiguous, and therefore better coalescence is achieved. Figure 8 shows a simplified representation of the reduction algorithm used for the computation of the longitudinal expectation E s . Note that there is an array of eight elements and a parallel reduction with strides of four, two and one per iteration is applied.

Parallel Motion Prediction
This section describes the technique used to accelerate the motion prediction stage of the algorithm. It uses the predicted intention, pose and velocity of the surrounding vehicles from the particle filter and outputs a set of predicted motions for each perceived vehicle.

Memory-Aligned Matrix Combination
The motion prediction of the vehicles is computed by abstracting their system dynamics into Markov chains, which produce a discretized state space X s,v and input space U a , and then multiplying them with transition matrices to produce an estimation of their future states. The state space X s,v is formed by the discretized position space s and the discretized velocity space v; the input space U a is formed by the discretized acceleration space a.
One of the issues with this approach is that it escalates poorly with the resolution of those state spaces as the matrices grow and thus the computational cost to multiply them. A good opportunity for acceleration appears on the computation of the input transition matrix Γ, which stores the probabilities of a given system to change from one input state to another as can be seen in Equation (3). It is computed by multiplying the priority vector λ as a diagonal matrix with the input dynamic matrix ψ: The priority vector λ describes the restrictions imposed on the vehicle by the road geometry or other vehicles present in the scene and, as it changes on every iteration, it cannot be computed offline and stored. The input dynamic matrix ψ represents the intrinsic characteristics of the vehicle input dynamics when there are no restrictions at play. The dimensions of each matrix are n s · n v · n a × n s · n v · n a , each one being the number of possible discretized positions, velocities and accelerations, respectively.
One interesting characteristic of ψ is that it is composed of repeating submatrices with dimensions n a × n a because it is assumed that the changes between inputs are not time-dependent and do not influence other states. λ is a vector transformed into a diagonal matrix, and its shape can be seen below. Using this problem-specific knowledge, it is possible to produce an informed combination of the vector λ and a submatrix of ψ to substitute the costly matrix multiplication, thus reducing the number of instructions executed and memory accesses required.
A simplified, albeit lacking physical sense, numeric example is presented below that illustrates the combination of λ and ψ with n s = 1, n v = 1 and n a = 2: The combination consists in multiplying each λ n value with the first set of values from the ψ n a ×n a submatrix in a repeating pattern. This reduces both the number of calculations and memory since only ψ n a ×n a and the diagonal vector are required. This process is conducted following Γ i = λ f loor(i/n a ) * ψ n a ×n a mod(i,n 2 a ) where ψ n a ×n a = ψ 1 ψ 2 ψ 3 ψ 4 and each Γ i value is computed by multiplying individual values of ψ n a ×n a by each λ n and then repeating with a stride of size n a . The indexes of λ and ψ n a ×n a are precomputed and stored offline, as they do not change throughout the execution, significantly reducing the computational cost of the mod operation. The ψ n a ×n a submatrix is stored in local memory, as the input space is usually small (<10), producing a submatrix of floats that fits in the faster local memory space. Then, each thread performs the multiplication of the corresponding λ n and ψ n a ×n a value while tracking the Γ i index to maximize memory coalescing on the Γ memory space.

Experimental Setup
The GPU acceleration methods presented in this paper were evaluated using two different approaches: first, using the open source and publicly available datasets inD [33] and openDD [34]; second, acquiring a dataset using the AUTOPIA test track and experimental vehicle.

inD and openDD
The experiments conducted in this section compared the original Matlab version of the algorithm against a pure C++ adaptation and the CUDA version that includes all the work presented in this paper. The chosen maps from the datasets can be seen in Figure 9 and were chosen to have two different intersections and two different roundabouts to showcase enough interaction between the cars.

Autopia Test Track
The datasets in this section were acquired using a Citroen DS3 modified by the Autopia Group [35] to be fully autonomous as can be seen in Figure 11. The layout of the test track can be seen in Figure 12.  The results of this section were obtained in a Nvidia Drive PX2 that is hosted on the Autopia car. The hardware architecture of the Nvidia Drive PX2 hosts a twin system, each with an ARM Cortex-A57 CPU, an integrated Pascal GPU and a dedicated GP106 GPU. The GPU-accelerated interaction-aware motion prediction was assigned the Tegra SoC B along with its dedicated GPU.

Timing Computation
One of the core parts of this work is the precise measurement of the performance of each implementation. This section aims to describe the inner workings of the three separate timing acquisition systems for Matlab, C++ and CUDA. All the Matlab timings were computed using the tic/toc [36] functions that query the gettimeofday [37] Linux function directly. The gettimeofday function uses a frequency of 10 6 Hertz [38] that is traduced into a 1 µs resolution that is considered sufficient for timing the Matlab interaction-aware motion prediction functions that very rarely go under 10 ms.
For C++, the timing system used was the steady_clock [39] function from the standard chrono library. In Linux, and under the GCC 7.5.0 compiler, steady_clock gets traduced into the clock_gettime [40] function working in monotonic mode, which guarantees an ever-increasing clock that is not affected by changes in the system time-of-day clock and is the most appropriate way to measure elapsed time between functions. The resolution of this system varies depending on the computer characteristics, but an empirical measurement showed that for the computer used in the experiments, the resolution was ∼67.5 ns.
In CUDA, the standard C++ functions are not appropriate since they measure CPU performance and would only measure the time it takes to launch a kernel and not its execution time. An alternative to keep using pure CPU timing measurements would be to synchronize after every kernel call but that would stall the GPU pipeline and affect the overall performance. For timing the GPU execution, CUDA offers the Event API [41] that can measure independent timings for code that executes simultaneously in the GPU. The resolution of this clock is reported to be 0.5 µs, which is fitting for measuring the CUDA implementation since the execution time never reaches 1 µs for the measured functions.
The rest of this chapter is structured as follows: an extensive study on the acceleration of the different parts of the algorithm studied on the inD and openDD datasets is presented in Sections 5.2-5.5. Then, the experiments conducted on the Autopia test track are shown in Section 5.6.

Particle Filter
This section compares the time of execution of the different stages of the particle filter between the three different implementations of the algorithm. Every experiment was run three times on each map and then averaged. Then, to reduce the influence of the specificities of each map, the results across the four maps were averaged. All the experiments in this section were run with a pool size of 20,000 particles. Figure 13 shows the overall time comparison. In Figure 13a, the box plots for the predict section of the algorithm can be seen. Here, it is interesting to see that even though the predict CUDA implementation has no algorithm modifications, apart from the fact that each particle prediction is computed in parallel, the gains are already visible in terms of computation time. Its box plot is significantly shorter than the Matlab and C++ implementations and shows no outliers, which means that its execution time is predictable and remains constant across different situations and car combinations.  Figure 13b,d show the update and result stages of the algorithm, which are the less-time-intensive stages for the three implementations. They show the least variance between maximum and minimum values in the Matlab and C++ implementations, denoting a constant effort demanded by the algorithm regardless of the driving context. As it happens in the other graphs, Matlab shows the biggest amount of outliers, which the C++ implementation manages to mitigate and CUDA almost eradicates. CUDA manages to outperform the other two implementations by at least a median factor of 8 in both stages.
Finally, Figure 13c shows the resample stage times. Here, C++ shows the biggest spread between the maximum and minimum values potentially due to slow memory allocations when restructuring the vector that hosts the particles; in contrast, the Matlab matrix access architecture does a more consistent job of creating copies of the high-scoring particles. CUDA shows here the biggest speedup (×21.8 over Matlab and ×12.7 over C++) thanks to the parallel reduction algorithm.

Effective Population
This section compares the quality of the prediction when executing the algorithm on the datasets. Only the C++ and CUDA versions are compared, as the Matlab implementation is not able to run online due to its long execution time. Table 1 shows the average accumulated weight of all the particles in the particle filter in the four different scenarios. It is possible to see that the CUDA implementation shows a higher accumulated weight across all scenarios. Higher levels of accumulated weight indicate that there is a better match between prediction and reality within the particle filter, which also improves the motion estimation in later stages of the algorithm. The difference is more acute in the roundabout areas possibly because a faster iterating algorithm is better prepared to deal with predicting the future position on turning vehicles. Indeed, the internal physical model used does not take into account the Ackermann configuration of the vehicle and could therefore incur overshooting when predicting an orientation with a big time increment. The impact of this phenomenon would be smaller on longer straight sections that are less demanding on execution times, which is consistent with the higher scores of C++ at the intersection maps.

Scalability
The goal of this section is to evaluate the scalability of the particle filter by comparing the average execution time of the CUDA and C++ implementations while varying its number of particles. Figure 14 shows the study in which the C++ curves have a consistently higher slope, making the implementation less adaptable to a potential need to increase the number of particles to deal with more complex scenarios or to produce a more accurate prediction. The C++ curves are also non-linear with the slope changing, as can be seen more noticeably in Figure 14b, making it harder to estimate the time investment required to increase the number of particles.

Motion Prediction
The time gains that were aimed at with the substitution of the matrix multiplication in the Markov Chain computation presented in Section 4.3.1 are described in this section. Figure 15 shows the average time of execution of the motion prediction computation across the four different scenarios. The Matlab and C++ versions include their own naive matrix multiplication optimizations that consist in creating smaller matrices from the non-zero elements of the full matrices and saving time by multiplying those. The CUDA implementation still manages to outperform the other two across all scenarios. It does so most dramatically over the Matlab version, while the C++ version manages to stay within reasonable limits of the time execution that could potentially be of use in a simple scenario with a low number of cars and corridors. The horizon prediction chosen for these experiments was 4 seconds.

Complete Loop
This section presents the overall speedup produced by the CUDA implementation over the complete execution of the algorithm. Results can be seen in Table 2. The speedup is obtained from parallelizing the most expensive sections of the interaction-aware motion prediction algorithm, such as the matrix multiplications and the parallel prediction of the particles while avoiding the expensive GPU-CPU memory transfers by keeping most of the processing in the GPU VRAM.  Table 3 shows the prediction frequency that each implementation is able to produce on average with each dataset. A minimum rate of prediction production of roughly 5 Hz means that the algorithm is ready to feed a potential path-planning system that runs at that speed (e.g., refs. [42,43]), validating the purpose of these optimizations.

Baseline Comparison
This section aims to further illustrate the time gain of the proposed GPU accelerations by comparing them to the baseline GPU algorithms. Using the same methodology as in Section 5.2, each experiment was run three times in every dataset and then averaged. In the following, the three major GPU acceleration proposals of the publication are studied: parallel resampling, parallel output generation and memory-aligned matrix combination.
The baseline chosen for the parallel resampling comparison was a combination of the permutation iterator [44] with the vector copy algorithm [45] from Thrust [46]. Thrust is a C++ library that aims to offer standard C++ functionality in parallel algorithms. This experiment was run with 20,000 particles for the particle filter. The results that can be seen in Figure 16 show the proposed approach outperforming the baseline a median of 24.22 times. This is due to the importance of having coalesced memory accesses when copying thousands of particles which are memory-heavy and escalate poorly with naive copy methods.  Figure 17 shows the comparison of the proposed parallel output generation using the parallel reduction sequential addressing method versus a baseline that consists in accessing the memory values in parallel and then using the atomicAdd CUDA function [47] to safely reduce a vector. The experiment was run using 20,000 particles for the particle filter. The figure shows a median speedup of 4.56 times over the baseline, which can be explained by the avoidance of serialization that the parallel reduction algorithm is able to perform when using shared memory within the blocks as a middle step to the final reduction. In Figure 18, the memory-aligned matrix combination baseline comparison is presented. Here, the baseline chosen was the CUDA cusparseScsrgemm [48] sparse matrix multiplication function. The state space has dimensions n s = 200, n v = 20 and n a = 6, which results in the matrices λ and ψ of size 24,000 × 24,000. The results show that the proposed method achieves a median speedup of 108.61 times over the stock matrix multiplication method. This speedup happens because using an informed combination of the matrices allows for skipping a significant amount of arithmetic computations and also unlocks faster memory accesses by keeping the relatively small ψ submatrix in GPU local memory. Finally, Figure 19 shows a qualitative comparison of the motion prediction output using the baseline methods and the proposed methods, depicting very similar outputs that support the proposed approach's validity.  Figure 19. Qualitative comparison of the prediction output using cuSPARSE matrix multiplication and the proposed method. Blue rectangle represents the studied vehicle and the blue oval-shaped shadows represent its probabilistic motion predictions.

Nvidia Drive PX2 Tests
This section presents the results from running the algorithm on a Nvidia Drive PX2 that was fed on perception data produced by the sensors and algorithms [49,50] running on the car presented in Figure 11. The test was performed following a car on the Autopia Test Track depicted in Figure 12 and computing its motion prediction for 9 min. Figure 20a shows the complete loop execution time with the number of possible corridors of the following car overlapped. It illustrates a direct relation between the number of corridors and the execution time. The computing time remains under 200 ms at all times, even when the car has a maximum of nine potential corridors that require the algorithm to produce nine different predictions. Figure 20b shows the time of producing a prediction per corridor, which for this test remains on average at 20 ms.   Figure 21 shows the effective population of the particle filter during the test run on the Autopia Test Track. The horizontal black line at 10,000 shows the resampling threshold of the filter. With 60.3% of the iterations staying above the threshold and therefore requiring no resampling, the algorithm shows proficiency at predicting the pose, speed and orientation of the followed vehicle. It also creates a positive feedback loop, in which the algorithm can save the time of resampling to produce a faster prediction that tends to be more accurate, as it is closer to the measured state of the car.

Prediction Evaluation
In order to contextualize the motion prediction algorithm being accelerated, this section presents a study of the prediction error through a comparison of IAMP with a state-of-the-art motion prediction algorithm.
The state-of-the-art algorithm chosen for the comparison is the enhanced graph-based interaction-aware trajectory prediction (GRIP++) model [51], which uses a long short-term memory neural network to produce vehicle motion prediction. The reasons that this algorithm was selected for comparison are that the training framework is open source, which offered the possibility to train it in the same cases that IAMP was evaluated in. Also, it is the same algorithm that was used in [20], and it was deemed coherent to use the same algorithm from comparison across publications.
The quantitative prediction metrics used are as follows: • ADE (average displacement error): average L 2 distance between the ground truth positions and the weighted average position of the predictions, • FDE (final displacement error): L 2 distance between the last ground truth position and the weighted average position of the last prediction, Given that the output from the IAMP framework is multi-modal, the minimal value from the metrics is utilized for each vehicle. The results for the evolution of the situations in all cases are presented in Table 4.  Figure 22 presents the position error distribution over the prediction horizon for all vehicles in all situations with mean value and ±3σ boundary for both approaches. With IAMP, the average error for the last prediction is 3.53 ± 8.64 m. One factor that affects these values is the resolution of the Markov chain abstraction since the probability distribution within each state is uniform. The results in the last prediction for GRIP are significantly worse, being that the position error is 8.12 ± 24.62 m. A qualitative comparison of the predictions for the whole prediction horizon for both approaches is presented in Figure 23. The output from IAMP is shown in blue and from GRIP, in red. The ground truth position is displayed in black. Finally, Table 5 shows the execution time comparison of the motion prediction modules of IAMP versus GRIP. It is possible to see that GRIP has the edge, even though both stay within a comparable range. A possible explanation could come from the fact that GRIP produces a single point of motion prediction while IAMP produces a full probabilistic footprint. Moreover, IAMP produces multi-modal predictions, while GRIP produces a single prediction per vehicle. This causes an execution time trade-off that arguably pays off in the prediction performance based on the results that can be seen in Table 4 and Figure 22.

Conclusions
In this paper, a novel set of techniques is presented to bring a computationally expensive motion prediction strategy from theoretical simulation to real-time execution. It exploits the latest GPU architectures and demonstrates its suitability in a real-time setting with real traffic conditions. The proposed acceleration techniques combine existing methods that reduce memory accesses, arithmetic instructions and increase GPU occupancy with novel application-specific techniques that make use of informed combinations to minimize the computation workload. The paper demonstrates that the implemented techniques produce a scalable solution that runs faster across different scenarios and platforms. In future work, multi-GPU techniques will be implemented to exploit the additional integrated GPU on the Nvidia Drive PX2 and the memory regions that are shared with the CPU. Additionally, new features from IAMP, such as the integration of learn-based models [52], will be included.