Article

GPU-Accelerated Interaction-Aware Motion Prediction

Centro de Automática y Robótica, CSIC—Universidad Politécnica de Madrid, Ctra. Campo Real, Km 0.200, Arganda del Rey, 28500 Madrid, Spain
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(18), 3751; https://doi.org/10.3390/electronics12183751
Submission received: 26 July 2023 / Revised: 30 August 2023 / Accepted: 4 September 2023 / Published: 5 September 2023
(This article belongs to the Special Issue Emerging Technologies in Autonomous Vehicles)

Abstract

Before their massive deployment, autonomous vehicles need to prove in complex scenarios that they can reach human driving proficiency while guaranteeing higher safety levels. One of the most important human traits for negotiating traffic is the ability to predict the future behavior of surrounding vehicles as a basis for agile and safe navigation. This capability is particularly challenging for an autonomous system in highly interactive driving situations, such as intersections or roundabouts. In this paper, a set of techniques to bring a computationally expensive state-of-the-art motion prediction algorithm to real-time execution is presented, with the goal of meeting the standard 5 Hz execution frequency of a motion-planning algorithm, the primary consumer of motion predictions. This is achieved by applying novel and existing parallelization algorithms that take advantage of graphic processing units (GPUs) through the compute unified device architecture (CUDA) programming language, producing an average 5× speedup over raw C++ in the cases studied. The optimizations are then evaluated on public datasets and on a real vehicle on a test track.

1. Introduction

Autonomous vehicles have seen a huge influx of resources and research attention in recent years, as there is a real drive to make this technology common in everyday use. Some of the challenges that autonomous driving tries to address are high collision rates [1], traffic congestion [2], difficult access to public transport [3] and workforce shortages [4]. A solution to these problems needs to be technically ready to be deployed in automated vehicles to allow them to navigate real traffic.
One of the main requirements for autonomous navigation in real traffic is the ability to predict the future motion of the surrounding vehicles in a timely manner. The main characteristic of vehicle motion prediction is that the traffic agents influence one another and that interaction needs to be taken into account to produce accurate predictions [5]. An illustrative example of these interactions can be seen in Figure 1, where the predictions of each vehicle at a given time are depicted with multi-modal footprints (i.e., a spatio-temporal probabilistic distribution per potential corridor). However, interaction-aware motion prediction poses a problem in computing scalability, as every increase in the number of surrounding vehicles incurs added computing costs in the motion prediction of all other vehicles.
The objective of this work is to produce an interaction-aware motion prediction algorithm that can run in real time while dealing with dense traffic. A specific goal for this system is to be able to feed reliable motion predictions into a real-time path planning algorithm, which often works at frequencies not lower than 5 Hz. In this context, a collection of GPU acceleration methods is designed and implemented to bring down the computational cost of a multi-modal probabilistic interaction-aware motion prediction algorithm. GPUs are selected as the acceleration driver for this work because of their ubiquity in modern computers, their ratio of price to processing power, and the relative ease of development thanks to modern parallel frameworks. The set of techniques presented comprises both (i) problem-specific methods, which take advantage of application knowledge to bring down computation time further than available libraries could, and (ii) general acceleration methods that are proven to work in other existing scenarios. The proposed methods tackle the most computationally expensive activities, such as memory accesses, arithmetic instructions and sequential computation bottlenecks. A video of the execution of the experiment cases containing the motion grid for the last time step can be found in the Supplementary Materials.
The main contributions of this paper are as follows:
  • A set of novel GPU parallelization techniques to allow a state-of-the-art interaction-aware motion prediction algorithm to run in real time.
  • A study of the developed techniques' performance on public datasets and on a real vehicle on a test track.
  • A comparison of the accelerated algorithm’s prediction error with a state-of-the-art algorithm.
The rest of the paper is structured as follows: Section 2 presents a review of the most relevant related work. Section 3 gives an overview of the interaction-aware motion prediction architecture. Section 4 describes the set of techniques developed to accelerate motion prediction through the use of a GPU. Section 5 describes the experimental setup and the experiments carried out to document the effect of the proposed strategy. Section 6 gives insights regarding the prediction performance of the accelerated algorithm and contextualizes it with a comparison. Finally, Section 7 draws some concluding remarks.

2. Related Work

As of late, there has been a development push to bring scientific advances to the real world by accelerating algorithms. This push has been made both in the algorithms themselves and in the architectures that host them. One of the most popular embedded architectures is the field programmable gate array (FPGA), due to its low latency, efficient power consumption and robustness to adverse conditions. In [6], the authors argue that FPGA development is mature enough to host expensive simultaneous localization and mapping algorithms. In [7], the authors apply an FPGA to offload algorithm components while acting as a fallback system in the case of failure. In [8], the authors demonstrate parallelization techniques for a decision-making algorithm deployed on an FPGA, with a corresponding execution time study.
Together with FPGAs, another hardware platform used for accelerating software is application-specific integrated circuits (ASICs), which are chips that are specifically designed for a particular use. ASICs can be extremely efficient and cost effective when acquired in bulk but require a steep investment upfront, which makes them unsuitable for most robotic applications as well as for scientific investigation.
A more common computing platform used in robotics is the GPU. GPUs offer the ability to perform parallel computation on large amounts of data that, when coupled with parallel algorithms, can perform at speeds orders of magnitude faster than linear central processing unit (CPU) computations. GPUs have been used to optimize perception systems based on LiDAR sensors and convolutional neural networks [9], moving object lidar-based segmentation [10], vehicle collision analysis from point clouds [11], path planning for multiple unmanned aerial vehicles [12] and motion prediction as a stage of a larger smart mobility application [13] amongst others. Lately, they have been adopted by several key players in the autonomous driving industry through the use of Nvidia drive computing platforms.
To fully exploit the GPU capabilities, it is often necessary to develop dedicated parallel algorithms, which has brought an influx of novel techniques to classic algorithms. One example is the resampling stage in particle filters, which is historically complex to parallelize due to their sequential nature. In [14], the authors offer a parallel reimplementation of existing resampling methods while improving upon them by taking advantage of the multi-threaded architecture of GPUs. Another example is the inference of probabilistic predictions. In [15], the authors propose a Bayesian temporal factorization framework that is able to generate predictions by modeling multidimensional time series that accounts for parallelization in their model inference algorithms. In [16], the authors use a GPU in training and inference to perform traffic flow prediction based on a temporal graph neural network.
Another field that has stirred interest in optimization is motion prediction based on Markov chain matrix multiplication, due to the high computational cost that comes associated with it. In [17], the authors acknowledge this fact and propose the possibility of accelerating the matrix multiplication using dedicated hardware. Even though there has been some effort to make this method feasible in online execution [18], there is still a struggle to reach a prediction frequency of over 1 Hz in complex scenarios.

3. Architecture Overview

The interaction-aware motion prediction (IAMP) strategy considered in this work first infers the surrounding traffic agents’ intentions, using a dynamic Bayesian network, and then predicts their motion using Markov chains. It explicitly considers interactions, the road layout and traffic rules, and can be used in almost any driving context, as it does not rely on training data. IAMP receives information from the perception module that informs the algorithm about the position and velocities of the surrounding vehicles and the ego vehicle and from a pre-computed vectorized map that is loaded once in the Lanelet [19] format.
At each time step, the interactions amongst vehicles and between vehicles and the layout are taken into account to infer the probabilities of stopping at intersections, the probabilities of changing lanes and the probability of being in each of the possible navigable corridors. These interactions are computed for every vehicle in the scene against the rest and take into account both information that is coded into the traffic rules and potential deviations from the expected road-safe behavior, in order to deal with unsafe driving situations, such as violating a stop line. This feature makes this interaction-aware motion prediction algorithm especially useful in complex situations with dense traffic, like unsignalized intersections [20], sudden unsafe lane changes [21], or roundabouts [22]. The intentions are then fused with the motion predictions computed with a kinematic model to produce a 3D motion grid used by the ego vehicle to navigate through the scene.
The motion grid is a cube composed of a collection of matrices, one per time step until the prediction horizon, that host the grid of predictions for each vehicle. The grid-based probabilistic representation of the predictions is centered around the ego vehicle and takes into account the uncertainties both in the motion model and in the input data.
The algorithm can be summarized in the scheme of Figure 2. The main parts where the GPU-based acceleration has significantly reduced the computation time are highlighted in blue in the flowchart, namely in the particle filter used to compute the intentions and in the matrix multiplication of the motion prediction. This flowchart is extensively described in [20,21,22], where the algorithm is validated and compared with state-of-the-art approaches. Hence, it will only be briefly reviewed below.

3.1. Corridors

One of the core parts of interaction-aware motion prediction is to define the potential driving lanes of each vehicle to determine which ones influence each other and which ones are independent. The current navigation model is reliant on maps that are formed by lanelets [19].
Corridors are collections of adjacent lanelets that start from the current position of the vehicle and reach the distance that the vehicle can travel within a certain time at its current speed. The adjacent lanelets are found via a graph search in the physical layer using the neighborhood relations stored in the lanelets. Corridors store the list of lanelet ids that form them and their topological relation to other corridors, providing an inexpensive way to find out whether a vehicle's corridors intersect with another's, so that a relation of dependence can be established. Figure 3 shows an example of a vehicle's corridors.

3.2. Relations

Given a map and the list of corridors obtained for all vehicles present at the scene, three types of relations are obtained: lateral relations, corridor-to-intersection (distances to intersections) and corridor-to-corridor (corridor dependencies):
  • Lateral relation: For each target vehicle, a search of the surrounding vehicles is performed, and the bumper-to-bumper distance and their velocities are stored.
  • Corridor-to-intersection: The intersections each corridor goes through are determined by intersecting the lanelet’s identifiers. Then, for all intersections a corridor goes through, the distance to the intersection is computed in the Frenet frame.
  • Corridor-to-corridor: In order to determine which corridor will influence the predictions of the other, a search is performed by pairwise intersecting the centerlines of all corridors. This generates a list of possible collisions. Based on a set of rules, one (if any) corridor is selected as the corridor influencing the motion of another vehicle.
The main function of the relation computation is to create influence matrices that determine which vehicles affect each other’s future behavior in the intention estimation step of the algorithm.

3.3. Intention Estimation

The interactions between various elements in a vehicle's environment play a crucial role in determining its behavior. In order to estimate the intentions of the traffic participants, the dynamic Bayesian network (DBN) described in [23] inspired the one used in this work. It takes as input the corridors and relations obtained in the previous steps. DBN is a powerful modeling tool that tackles uncertainty and promotes explainability [24]. The network represented in Figure 4 is instantiated for each of the vehicles present in the scene. Note that bold arrows represent the influences of the other vehicles on vehicle $n$ through some key variables ($E_t^n$, $I_t^n$, $\Phi_t^n$, $Z_t^n$) described below, where $n \in \{1, \dots, N\}$ and $N$ is the number of vehicles present at the scene.
The relationships between the variables in Figure 4 can be represented by the following joint distribution [23]:
$P(E_{0:T}, I_{0:T}, \Phi_{0:T}, Z_{0:T}) = P(E_0, I_0, \Phi_0, Z_0) \times \prod_{t=1}^{T} \prod_{n=1}^{N} \left[ P(E_t^n \mid I_{t-1}\, \Phi_{t-1}) \times P(I_t^n \mid \Phi_{t-1}^n\, I_{t-1}^n\, E_t^n) \times P(\Phi_t^n \mid \Phi_{t-1}^n\, I_{t-1}^n\, I_t^n) \times P(Z_t^n \mid \Phi_t^n) \right]$ (1)
where we have the following:
  • Expectation $E_t^n$: This describes the expected maneuver of a vehicle $n$ at time $t$ based on the road geometry, traffic regulations and the behavior of the other vehicles. It is decomposed into two dimensions: longitudinal and lateral. The longitudinal expectation, also called Expectation to Stop ($E_s$), estimates the probability of a vehicle stopping at an intersection. The lateral expectation, also called Expectation to Change ($E_c$), estimates the probability of a vehicle changing lanes. Both variables are binary: the possible values are go and stop for $E_s$, and stay and change for $E_c$.
  • Intention $I_t^n$: This describes the intended maneuver of a vehicle $n$ at time $t$ based on the previous intention, the current expectation and the vehicle's physical state. There are also lateral and longitudinal intentions ($I_c$ and $I_s$, respectively).
  • Physical vehicle state $\Phi_t^n$: This describes the estimated position, orientation and speed of each vehicle.
  • Measured vehicle state $Z_t^n$: This describes the measured position, orientation and speed of each vehicle, obtained through the available sensors of the vehicle.
The relations obtained from the pairwise analysis of the corridors are used to estimate the longitudinal $E_s$ [20] and lateral $E_c$ [21] behaviors of the vehicles.
An exact inference of (1) is often infeasible. Hence, a particle filter [25] is used to estimate the hidden states $E_t$, $I_t$ and $\Phi_t$, given the observed variables $Z_t$.
The number of vehicles at the scene might change at each time step. This may occur due to the ego vehicle’s sensors range or due to vehicles entering or leaving the covered zone in the case of a dataset. To manage such situations, for every new vehicle that appears, the DBN is instantiated, and the initial intentions and route are randomly set from the list of possible values. Once a vehicle leaves the scene, its physical state is updated for a few steps before being completely discarded. This is done to avoid having to initialize a vehicle already seen in previous steps.

3.4. Motion Prediction

The computation of the surrounding vehicles' predictions is inspired by the method proposed in [26]. The system dynamics are abstracted into Markov chains, where the state space $X$ and input space $U$ are discretized into intervals. The state space consists of the longitudinal position $s$ and velocity $v$, and the input space represents the potential acceleration $a$.
The transition probability matrices of the Markov chains for a time step, $\Upsilon(\tau)$, and for a time interval, $\Upsilon([0, \tau])$, where $\tau$ is the time increment, are computed offline with reachability analysis [26], using the following differential equation as the vehicle's longitudinal dynamics:

$\dot{s} = v, \qquad \dot{v} = \begin{cases} a_{max}\, u, & 0 < v < v_{max} \\ 0, & v \le 0 \,\lor\, v \ge v_{max} \end{cases}$ (2)

where $a_{max}$ and $v_{max}$ are the maximum allowed acceleration and velocity, respectively, and $u$ is a value in the $[-1, 1]$ range sampled from the discretized input space $U$.
The state probability distributions for future time steps $p(t_{k+1})$ and time intervals $p(t_k, t_{k+1})$ are computed as follows:

$p(t_{k+1}) = \Gamma(t_k)\, \Upsilon(\tau)\, p(t_k), \qquad p(t_k, t_{k+1}) = \Upsilon([0, \tau])\, p(t_k)$ (3)

where $\Gamma(t_k)$ is the input transition matrix that represents the transition probabilities between the input states. These probabilities take into account the acceleration profiles obtained for each corridor considering the vehicle-to-vehicle and vehicle-to-layout relations, as explained in [20].
These predictions are projected onto the corridor in a road-shaped format and are fused with the corridors’ probabilities, inferred by the DBN, to generate a motion grid.

4. Real-Time Framework

4.1. GPU Programming Model

GPUs were created to process graphics in a dedicated manner, independently of CPUs. Due to its advantages in performing simple arithmetic in parallel, the GPU was eventually decoupled from its original use. In 2007, Nvidia released the CUDA API [27], which allows developers to make use of the GPU from a high-level programming language built on top of C++. The usual workflow of a CUDA program is allocating memory on the GPU, transferring data from the CPU, making program calls to the GPU from the CPU and transferring the data back to the CPU.
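For illustration, a minimal sketch of this workflow is shown below; the kernel, sizes and data are hypothetical and not taken from the presented system:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one element of the array.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* h = new float[n];                 // host (CPU) buffer
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d;
    cudaMalloc(&d, n * sizeof(float));                            // 1. allocate on the GPU
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // 2. CPU -> GPU transfer

    int block = 256;                      // threads per block (8 warps of 32)
    int grid = (n + block - 1) / block;   // blocks per grid
    scale<<<grid, block>>>(d, 2.0f, n);   // 3. kernel call issued from the CPU

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // 4. GPU -> CPU transfer
    cudaFree(d);
    delete[] h;
    return 0;
}
```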
CUDA establishes an execution hierarchy and a memory hierarchy. In the execution hierarchy, it defines kernels, which are blocks of code that each individual thread executes. The threads are executed simultaneously in groups called warps, which are scheduled within blocks. Blocks are then grouped into grids and executed in the different streaming multiprocessors (SMs) of the GPU. The memory hierarchy defines types of memory based on their size, speed and accessibility:
  • Global memory is accessible from all threads in the GPU and is the highest in size but it also has the most access latency.
  • Shared memory is accessed from within the streaming multiprocessors, is shared across the threads in a block, and has very low latency.
  • Local memory is the memory that each thread can statically allocate and is only visible from within the thread.
Analyzing performance, also called profiling, is a core part of CUDA development. CUDA defines several concepts to study code performance such as occupancy, coalescence and instruction optimization. Occupancy is the ratio between the number of active warps to the maximum number of warps per SM; it gives information about the used percentage of the GPU, which should be as high as possible for maximum performance. Coalescence is a memory access technique that aims to group memory transactions to reduce them and make optimal use of memory bandwidth. Instruction optimization studies the number of low-level instructions issued to the processors to keep them to a minimum.
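As an illustration of how occupancy can be inspected, the following sketch queries the CUDA runtime's occupancy calculator for a hypothetical kernel and block size:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void some_kernel(float* data) { /* hypothetical kernel body */ }

int main() {
    int blockSize = 256, numBlocks = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Maximum number of resident blocks of some_kernel per SM at this block size.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, some_kernel,
                                                  blockSize, 0);
    // Occupancy = active threads per SM over the maximum supported by the SM
    // (equivalent to the ratio of active warps to maximum warps).
    float occupancy = (numBlocks * blockSize) /
                      float(prop.maxThreadsPerMultiProcessor);
    printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
    return 0;
}
```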

4.2. Parallel Particle Filter

This section describes the implementation of acceleration techniques on the different stages of a particle filter (predict, update, resample and output generation) that are used for the inference of the intention, expectation and future pose of the surrounding vehicles.

4.2.1. Predict

The predict stage of the particle filter produces an estimation of the future state of the system based on the previously estimated state and the measured state.
Each particle contains a complete representation of the environment: the estimated and measured states of all vehicles and their intentions and expectations. Every particle holds a vector of vehicle states that defines a complete scene within itself. The vehicle state is defined by its position in the Cartesian coordinate system (x, y), orientation θ and velocity v. This maintains full independence across particles and unlocks the possibility of assigning each particle to a thread, therefore enabling the parallel computing of the prediction. It also minimizes the amount of information that needs to be exchanged from CPU to GPU at the beginning of each iteration, as everything except the measured state is computed and remains in the GPU memory.
The data structure of a particle can be seen in Figure 5, where the subindex k reflects the number of particles minus one, the subindex n reflects the number of vehicles minus one, and the variables with subindex e represent the estimated state of a vehicle. Every variable is stored in a C++ float, which takes 4 bytes of memory. The design choice made here was to create an array of structs of vehicle states to enable a concurrent memory access pattern per particle. This is necessary because in each iteration, the algorithm accesses every field in a vehicle state to compute the intention and expectation, and therefore it needs them to be contiguous in memory to allow coalescing to happen.
GPUs require natural alignment of their data; in other words, memory reads and writes always work with words of size equal to 1, 2, 4, 8 or 16 bytes, or a number divisible by any of those [28]. The parallel programming design consideration taken here was to keep the data size at a number divisible by 8 bytes to guarantee contiguous memory alignment. The complete particle struct is formed by 12 floats per vehicle, which amounts to 48 bytes and is divisible by 8. If the number of vehicles present in the scene results in a particle data structure that is not divisible by 8, padding is added. This contiguous memory alignment allows for faster performance, as it minimizes the number of memory transactions from global memory to the individual threads necessary per iteration.
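A data layout consistent with this description could look as follows; this is a sketch, and the field and type names are illustrative rather than taken from the actual implementation:

```cuda
constexpr int MAX_VEHICLES = 8;  // hypothetical capacity; padded as needed

// Per-vehicle state: 12 floats = 48 bytes, divisible by 8, so consecutive
// vehicle states remain naturally aligned in global memory.
struct VehicleState {
    float x,  y,  theta,  v;    // measured position, orientation and velocity
    float xe, ye, thetae, ve;   // estimated state
    float Es, Ec, Is, Ic;       // longitudinal/lateral expectations, intentions
};

// A particle holds a complete scene: one state per surrounding vehicle.
// Particles are stored back to back in one global-memory array so that the
// thread assigned to particle k reads a single contiguous chunk, enabling
// coalesced accesses across the threads of a warp.
struct Particle {
    VehicleState vehicles[MAX_VEHICLES];
};
```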

4.2.2. Update

The update phase of the particle filter computes the weights of each particle based on the quality of the prediction. To do that, it compares the predicted pose and velocity to the measured ones using the following normal distribution:
$f(x \mid \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$ (4)
The process works by setting the mean μ as the measured value, the standard deviation σ as a fixed value that is set based on how spread the particle population needs to be, and the input x, y, θ, v as the predicted position, orientation and velocity, respectively.
This process is performed for every vehicle, and then their weights are multiplied to produce the total particle weight w k as follows:
$w_k = \prod_{n=1}^{N} f(x \mid \mu_x, \sigma_x)\, f(y \mid \mu_y, \sigma_y)\, f(\theta \mid \mu_\theta, \sigma_\theta)\, f(v \mid \mu_v, \sigma_v)$ (5)
where N is the number of vehicles. This combination allows for the weight to define the prediction quality of a full scene instead of evaluating predictions per vehicle in isolation.
The weights of all particles are computed simultaneously using a kernel with an optimized version of the normal distribution function. This aims to minimize the number of arithmetic instructions issued to the processors by substituting the square root with its precomputed numeric equivalent. The square root is chosen for replacement since the CPU uses Newton's method to calculate it [29], which is computationally expensive. There is a loss of precision with a maximum error of $\pm 10^{-8}$ [m, rad, m/s], which corresponds to float precision and is deemed an acceptable trade-off for the gained performance.
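A minimal sketch of such a weight kernel is shown below, reusing the hypothetical VehicleState/Particle layout sketched in Section 4.2.1 and assuming one thread per particle; the normalization constants that replace the runtime square root are precomputed on the host:

```cuda
// inv_norm = 1/(sigma*sqrt(2*pi)) and inv_2s2 = 1/(2*sigma^2) are computed
// once on the CPU, so no square root is ever evaluated on the GPU.
__device__ __forceinline__
float normal_pdf(float x, float mu, float inv_norm, float inv_2s2) {
    float d = x - mu;
    return inv_norm * __expf(-d * d * inv_2s2);  // fast intrinsic exponential
}

__global__ void update_weights(const Particle* particles,
                               const VehicleState* measured, float* weights,
                               int n_particles, int n_vehicles,
                               float4 inv_norm, float4 inv_2s2) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per particle
    if (k >= n_particles) return;
    float w = 1.0f;
    for (int n = 0; n < n_vehicles; ++n) {  // product over the whole scene
        const VehicleState& p = particles[k].vehicles[n];
        const VehicleState& z = measured[n];
        w *= normal_pdf(p.xe,     z.x,     inv_norm.x, inv_2s2.x)
           * normal_pdf(p.ye,     z.y,     inv_norm.y, inv_2s2.y)
           * normal_pdf(p.thetae, z.theta, inv_norm.z, inv_2s2.z)
           * normal_pdf(p.ve,     z.v,     inv_norm.w, inv_2s2.w);
    }
    weights[k] = w;  // Equation (5) for particle k
}
```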

4.2.3. Resample

The next stage in the particle filter is resampling, which works to avoid weight degradation and filters out poor-performing particles. The measurement of the overall particle filter performance is the effective population N e f f , which reflects the degree of the weight quality [30]:
$N_{eff} = \frac{1}{\sum_{k=1}^{K} w_k^2}$ (6)
where K is the total number of particles. This stage is only triggered when this effective population falls below a threshold that is set to be half the total number of particles.
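Since the weights already reside in GPU memory, the effective population can be computed there as well. A sketch using Thrust, assuming normalized weights stored in a device vector, could be:

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>

// Squares one weight; used to accumulate the sum of squared weights.
struct Square {
    __host__ __device__ float operator()(float w) const { return w * w; }
};

// N_eff = 1 / sum_k(w_k^2), Equation (6), computed on the device without
// copying the weight vector back to the CPU.
float effective_population(const thrust::device_vector<float>& weights) {
    float sum_sq = thrust::transform_reduce(weights.begin(), weights.end(),
                                            Square(), 0.0f,
                                            thrust::plus<float>());
    return 1.0f / sum_sq;
}
```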
The resampling algorithm is based on the low variance sampler [31], an efficient sampler of complexity O(M), where M is the number of particles, that selects particles with a probability directly proportional to their weight while computing only a single random number.
Figure 6 shows a schematic of the resampling algorithm, which works by splitting the particle range into sections proportional to their weight $w_p$. The selection of the new particles occurs by first sampling a random value $r \in [0, K^{-1}]$ and then adding to it the inverse of the total number of particles $K$ multiplied by the new particle number $m$:

$u = r + (m - 1) \cdot K^{-1}$ (7)
The particle whose section corresponds to u is added to the new set of particles. This process duplicates performing particles and removes low-performing particles.
The algorithm in CUDA works by generating a vector of indexes using a single thread and storing it in global memory. Then, it uses a kernel to copy the chosen particles into a new vector using grid-stride looping. Grid-stride looping is a CUDA technique that uses the grid dimension as the period of a loop wrapping the code within the kernel, thereby achieving complete coalescence in memory by ensuring that all memory accesses are contiguous. Figure 7 shows a simplified representation of the technique, in which a grid of size 2 with blocks of size 2 uses its threads to access the memory spaces, in blue, with a stride of grid size × block size, meaning that the first thread of the grid will access memory at spaces 0, 4, 8, etc. This unit-stride memory access is what achieves coalescence.
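The two kernels below sketch this process under the same hypothetical Particle layout sketched in Section 4.2.1: a single-thread kernel that draws the low variance indexes, and a grid-stride copy kernel that duplicates the chosen particles with fully coalesced accesses.

```cuda
// Single-thread kernel implementing the low variance sampler: one random
// draw r in [0, 1/K]; u = r + m/K then selects the particle whose
// cumulative-weight section contains u (weights assumed normalized).
__global__ void low_variance_indexes(const float* weights, int* chosen,
                                     int K, float r) {
    float c = weights[0];
    int i = 0;
    for (int m = 0; m < K; ++m) {
        float u = r + m * (1.0f / K);
        while (u > c && i < K - 1) c += weights[++i];
        chosen[m] = i;  // high-weight particles are selected several times
    }
}

// Grid-stride copy: the loop period is gridDim.x * blockDim.x, so
// consecutive threads always touch consecutive particles and the copy
// stays coalesced regardless of the number of particles.
__global__ void resample_copy(const Particle* src, Particle* dst,
                              const int* chosen, int K) {
    int stride = gridDim.x * blockDim.x;
    for (int m = blockIdx.x * blockDim.x + threadIdx.x; m < K; m += stride)
        dst[m] = src[chosen[m]];
}
```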

4.2.4. Output Generation

Once the particles have been resampled based on their weight, the algorithm goes into its final stage, which is producing an estimation of the system’s state S as follows:
$S = \sum_{k=1}^{K} w_k S_s$ (8)
where each particle-specific substate $S_s$ ($E_s$, $I_s$, $E_c$, $I_c$, $x$, $y$, $\theta$ or $v$) is multiplied by its weight $w_k$ and then added across all particles to produce the result. This state estimation output is then used two-fold: as feedback for the particle filter's next iteration and to produce a probability distribution per corridor by summing the weights of the group of particles sharing that corridor.
This problem is solved by using a reduction algorithm in which a large dataset is transformed into a small dataset, or a single value per corridor in this case, by applying an operator to each element. In CUDA, the algorithm can be accelerated by taking advantage of its execution hierarchy and its memory hierarchy at the same time using a technique called parallel reduction sequential addressing [32]. The technique consists in using thread blocks to iteratively reduce fragments of the dataset while storing them in shared memory. The sequential addressing part of the technique consists in combining the values by accessing the memory in strides of a size equal to a multiple of the block size. As a result, the accesses are contiguous, and therefore better coalescence is achieved.
Figure 8 shows a simplified representation of the reduction algorithm used for the computation of the longitudinal expectation $E_s$. Note that the array has eight elements, and a parallel reduction with strides of four, two and one per iteration is applied.
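The pattern can be sketched as follows for a generic sum (in the actual use case the reduced values are the weighted substates, e.g., the $w_k E_s$ products); shared memory holds each block's fragment, and the stride halves every iteration exactly as in Figure 8:

```cuda
// Parallel reduction with sequential addressing: thread tid combines
// sdata[tid] and sdata[tid + s], so the active threads always access a
// contiguous range of shared memory, avoiding divergence and bank conflicts.
__global__ void reduce_sum(const float* in, float* out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : 0.0f;  // load one element per thread
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // strides 4, 2, 1 for 8
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial sum per block
}

// Launched as reduce_sum<<<grid, block, block * sizeof(float)>>>(in, out, n);
// the per-block partial sums in `out` are then reduced again the same way.
```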

4.3. Parallel Motion Prediction

This section describes the technique used to accelerate the motion prediction stage of the algorithm. It uses the predicted intention, pose and velocity of the surrounding vehicles from the particle filter and outputs a set of predicted motions for each perceived vehicle.

4.3.1. Memory-Aligned Matrix Combination

The motion prediction of the vehicles is computed by abstracting their system dynamics into Markov chains, with a discretized state space $X_{s,v}$ and input space $U_a$, and then multiplying the state distributions with transition matrices to produce an estimation of their future states. The state space $X_{s,v}$ is formed by the discretized position space $s$ and the discretized velocity space $v$; the input space $U_a$ is formed by the discretized acceleration space $a$.
One of the issues with this approach is that it scales poorly with the resolution of those state spaces, as the matrices grow and with them the computational cost of multiplying them. A good opportunity for acceleration appears in the computation of the input transition matrix $\Gamma$, which stores the probabilities of a given system changing from one input state to another, as can be seen in Equation (3). It is computed by multiplying the priority vector $\lambda$, as a diagonal matrix, with the input dynamic matrix $\psi$:
$\Gamma = \lambda \psi$ (9)
The priority vector $\lambda$ describes the restrictions imposed on the vehicle by the road geometry or other vehicles present in the scene; as it changes on every iteration, it cannot be computed offline and stored. The input dynamic matrix $\psi$ represents the intrinsic characteristics of the vehicle input dynamics when there are no restrictions at play. The dimensions of each matrix are $n_s \cdot n_v \cdot n_a \times n_s \cdot n_v \cdot n_a$, where $n_s$, $n_v$ and $n_a$ are the numbers of possible discretized positions, velocities and accelerations, respectively.
One interesting characteristic of $\psi$ is that it is composed of repeating submatrices of dimensions $n_a \times n_a$, because it is assumed that the changes between inputs are not time-dependent and do not influence other states. $\lambda$ is a vector transformed into a diagonal matrix, and its shape can be seen below. Using this problem-specific knowledge, it is possible to produce an informed combination of the vector $\lambda$ and a submatrix of $\psi$ to substitute the costly matrix multiplication, thus reducing the number of instructions executed and memory accesses required.
A simplified numeric example, albeit lacking physical sense, that illustrates the combination of $\lambda$ and $\psi$ with $n_s = 2$, $n_v = 1$ and $n_a = 2$ is presented below:

$\lambda = \begin{pmatrix} \lambda_1 & 0 & 0 & 0 \\ 0 & \lambda_2 & 0 & 0 \\ 0 & 0 & \lambda_3 & 0 \\ 0 & 0 & 0 & \lambda_4 \end{pmatrix}$

$\psi = \begin{pmatrix} \psi_1 & \psi_2 & 0 & 0 \\ \psi_3 & \psi_4 & 0 & 0 \\ 0 & 0 & \psi_1 & \psi_2 \\ 0 & 0 & \psi_3 & \psi_4 \end{pmatrix}$

$\Gamma = \begin{pmatrix} \lambda_1 \psi_1 & \lambda_1 \psi_2 & 0 & 0 \\ \lambda_2 \psi_3 & \lambda_2 \psi_4 & 0 & 0 \\ 0 & 0 & \lambda_3 \psi_1 & \lambda_3 \psi_2 \\ 0 & 0 & \lambda_4 \psi_3 & \lambda_4 \psi_4 \end{pmatrix}$
The combination consists in multiplying each $\lambda_n$ value with one row of the $\psi^{n_a \times n_a}$ submatrix, cycling through the rows in a repeating pattern. This reduces both the number of calculations and the memory required, since only $\psi^{n_a \times n_a}$ and the diagonal vector need to be stored. This process is conducted following

$\Gamma_i = \lambda_{\lfloor i/n_a \rfloor} \cdot \psi^{n_a \times n_a}_{\mathrm{mod}(i,\, n_a^2)}$ (10)
where
$\psi^{n_a \times n_a} = \begin{pmatrix} \psi_1 & \psi_2 \\ \psi_3 & \psi_4 \end{pmatrix}$
and each $\Gamma_i$ value is computed by multiplying individual values of $\psi^{n_a \times n_a}$ by each $\lambda_n$ and then repeating with a stride of size $n_a$. The indexes of $\lambda$ and $\psi^{n_a \times n_a}$ are precomputed and stored offline, as they do not change throughout the execution, significantly reducing the computational cost of the mod operation. The $\psi^{n_a \times n_a}$ submatrix is stored in local memory, as the input space is usually small (<10), producing a submatrix of floats that fits in the faster local memory space. Then, each thread performs the multiplication of the corresponding $\lambda_n$ and $\psi^{n_a \times n_a}$ value while tracking the $\Gamma_i$ index to maximize memory coalescing on the $\Gamma$ memory space.
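A sketch of the resulting kernel is given below (names hypothetical); each thread produces one non-zero $\Gamma$ value from the $\lambda$ vector and the small $\psi$ submatrix using the offline-precomputed indexes, instead of performing a full sparse matrix product:

```cuda
// Memory-aligned combination, Equation (10):
// gamma[i] = lambda[floor(i/na)] * psi_sub[mod(i, na*na)].
// Only the na x na submatrix and the diagonal vector are stored; both index
// vectors are precomputed offline since they never change during execution.
__global__ void combine_gamma(const float* lambda,     // ns*nv*na values
                              const float* psi_sub,    // na*na values
                              const int* lambda_idx,   // precomputed floor(i/na)
                              const int* psi_idx,      // precomputed mod(i, na*na)
                              float* gamma, int n_values) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_values)  // consecutive i -> consecutive gamma[i]: coalesced writes
        gamma[i] = lambda[lambda_idx[i]] * psi_sub[psi_idx[i]];
}
```

For brevity, the sketch reads $\psi$ from a global-memory pointer; in the implementation described above, the submatrix is small enough to be kept in the faster local memory space.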

5. Experiments

5.1. Experimental Setup

The GPU acceleration methods presented in this paper were evaluated using two different approaches: first, using the open source and publicly available datasets inD [33] and openDD [34]; second, acquiring a dataset using the AUTOPIA test track and experimental vehicle.

5.1.1. inD and openDD

The experiments conducted in this section compared the original Matlab version of the algorithm against a pure C++ adaptation and the CUDA version that includes all the work presented in this paper. The maps selected from the datasets can be seen in Figure 9; they were chosen to include two different intersections and two different roundabouts so as to showcase enough interaction between the cars.
Figure 10 shows the number of corridors computed per vehicle along the dataset execution. The number of corridors is chosen instead of the number of cars because one prediction needs to be computed per corridor, thus giving a better picture of the dataset complexity.
All the experiments in this section were executed on an HP ZBook Studio G5 with an Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz, 16 GB RAM DDR4 2667 MHz and an Nvidia Quadro P1000 Mobile with 4 GB VRAM.

5.1.2. Autopia Test Track

The datasets in this section were acquired using a Citroen DS3 modified by the Autopia Group [35] to be fully autonomous as can be seen in Figure 11. The layout of the test track can be seen in Figure 12.
The results of this section were obtained in a Nvidia Drive PX2 that is hosted on the Autopia car. The hardware architecture of the Nvidia Drive PX2 hosts a twin system, each with an ARM Cortex-A57 CPU, an integrated Pascal GPU and a dedicated GP106 GPU. The GPU-accelerated interaction-aware motion prediction was assigned the Tegra SoC B along with its dedicated GPU.

5.1.3. Timing Computation

One of the core parts of this work is the precise measurement of the performance of each implementation. This section describes the inner workings of the three separate timing acquisition systems for Matlab, C++ and CUDA. All the Matlab timings were computed using the tic/toc [36] functions, which query the gettimeofday [37] Linux function directly. The gettimeofday function uses a frequency of $10^6$ Hz [38], which translates into a 1 μs resolution that is considered sufficient for timing the Matlab interaction-aware motion prediction functions, which very rarely take under 10 ms.
For C++, the timing system used was the steady_clock [39] function from the standard chrono library. In Linux, and under the GCC 7.5.0 compiler, steady_clock translates into the clock_gettime [40] function working in monotonic mode, which guarantees an ever-increasing clock that is not affected by changes in the system time-of-day clock and is the most appropriate way to measure elapsed time between functions. The resolution of this system varies depending on the computer characteristics, but an empirical measurement showed that for the computer used in the experiments, the resolution was ∼67.5 ns.
In CUDA, the standard C++ functions are not appropriate, since they measure CPU performance and would only capture the time it takes to launch a kernel, not its execution time. An alternative that keeps using pure CPU timing measurements would be to synchronize after every kernel call, but that would stall the GPU pipeline and affect the overall performance. For timing the GPU execution, CUDA offers the Event API [41], which can measure independent timings for code that executes simultaneously in the GPU. The resolution of this clock is reported to be 0.5 μs, which is fitting for measuring the CUDA implementation, since the execution time never falls below 1 μs for the measured functions.
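A minimal sketch of the Event API measurement pattern used for the CUDA timings is shown below; the kernel and launch configuration are placeholders:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void timed_kernel() { /* placeholder for the measured GPU work */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);        // enqueue the start marker in the stream
    timed_kernel<<<1, 256>>>();    // timed work; the CPU pipeline is not stalled
    cudaEventRecord(stop);         // enqueue the stop marker after the kernel

    cudaEventSynchronize(stop);    // block the CPU only on the stop event
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```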
The rest of this section is structured as follows: an extensive study of the acceleration of the different parts of the algorithm on the inD and openDD datasets is presented in Section 5.2, Section 5.3, Section 5.4 and Section 5.5. Then, the experiments conducted on the Autopia test track are shown in Section 5.6.

5.2. Particle Filter

This section compares the time of execution of the different stages of the particle filter between the three different implementations of the algorithm. Every experiment was run three times on each map and then averaged. Then, to reduce the influence of the specificities of each map, the results across the four maps were averaged. All the experiments in this section were run with a pool size of 20,000 particles.
Figure 13 shows the overall time comparison. In Figure 13a, the box plots for the predict section of the algorithm can be seen. Here, it is interesting to see that even though the predict CUDA implementation has no algorithm modifications, apart from the fact that each particle prediction is computed in parallel, the gains are already visible in terms of computation time. Its box plot is significantly shorter than the Matlab and C++ implementations and shows no outliers, which means that its execution time is predictable and remains constant across different situations and car combinations.
Figure 13b,d show the update and result stages of the algorithm, which are the least time-intensive stages for the three implementations. They show the least variance between maximum and minimum values in the Matlab and C++ implementations, denoting a constant effort demanded by the algorithm regardless of the driving context. As in the other graphs, Matlab shows the largest number of outliers, which the C++ implementation manages to mitigate and CUDA almost eradicates. CUDA manages to outperform the other two implementations by at least a median factor of 8 in both stages.
Finally, Figure 13c shows the resample stage times. Here, C++ shows the biggest spread between the maximum and minimum values, potentially due to slow memory allocations when restructuring the vector that hosts the particles; in contrast, the Matlab matrix access architecture does a more consistent job of creating copies of the high-scoring particles. CUDA shows here the biggest speedup (×21.8 over Matlab and ×12.7 over C++) thanks to the parallel reduction algorithm.

5.2.1. Effective Population

This section compares the quality of the prediction when executing the algorithm on the datasets. Only the C++ and CUDA versions are compared, as the Matlab implementation is not able to run online due to its long execution time.
Table 1 shows the average accumulated weight of all the particles in the particle filter in the four different scenarios. It is possible to see that the CUDA implementation shows a higher accumulated weight across all scenarios. Higher levels of accumulated weight indicate that there is a better match between prediction and reality within the particle filter, which also improves the motion estimation in later stages of the algorithm. The difference is more acute in the roundabout areas, possibly because a faster iterating algorithm is better prepared to deal with predicting the future position of turning vehicles. Indeed, the internal physical model used does not take into account the Ackermann configuration of the vehicle and could therefore overshoot when predicting an orientation with a big time increment.
The impact of this phenomenon would be smaller on longer straight sections that are less demanding on execution times, which is consistent with the higher scores of C++ at the intersection maps.

5.2.2. Scalability

The goal of this section is to evaluate the scalability of the particle filter by comparing the average execution time of the CUDA and C++ implementations while varying its number of particles.
Figure 14 shows the study in which the C++ curves have a consistently higher slope, making the implementation less adaptable to a potential need to increase the number of particles to deal with more complex scenarios or to produce a more accurate prediction. The C++ curves are also non-linear with the slope changing, as can be seen more noticeably in Figure 14b, making it harder to estimate the time investment required to increase the number of particles.

5.3. Motion Prediction

This section describes the time gains obtained by substituting the matrix multiplication in the Markov chain computation, as presented in Section 4.3.1. Figure 15 shows the average time of execution of the motion prediction computation across the four different scenarios. The Matlab and C++ versions include their own naive matrix multiplication optimizations, which consist in creating smaller matrices from the non-zero elements of the full matrices and saving time by multiplying those. The CUDA implementation still manages to outperform the other two across all scenarios. It does so most dramatically over the Matlab version, while the C++ version manages to stay within reasonable execution time limits that could potentially be of use in a simple scenario with a low number of cars and corridors. The prediction horizon chosen for these experiments was 4 s.

5.4. Complete Loop

This section presents the overall speedup produced by the CUDA implementation over the complete execution of the algorithm. Results can be seen in Table 2. The speedup is obtained from parallelizing the most expensive sections of the interaction-aware motion prediction algorithm, such as the matrix multiplications and the parallel prediction of the particles while avoiding the expensive GPU-CPU memory transfers by keeping most of the processing in the GPU VRAM.
Table 3 shows the prediction frequency that each implementation is able to produce on average with each dataset. A minimum rate of prediction production of roughly 5 Hz means that the algorithm is ready to feed a potential path-planning system that runs at that speed (e.g., refs. [42,43]), validating the purpose of these optimizations.

5.5. Baseline Comparison

This section aims to further illustrate the time gain of the proposed GPU accelerations by comparing them to the baseline GPU algorithms. Using the same methodology as in Section 5.2, each experiment was run three times in every dataset and then averaged. In the following, the three major GPU acceleration proposals of the publication are studied: parallel resampling, parallel output generation and memory-aligned matrix combination.
The baseline chosen for the parallel resampling comparison was a combination of the permutation iterator [44] with the vector copy algorithm [45] from Thrust [46]. Thrust is a C++ library that aims to offer standard C++ functionality in parallel algorithms. This experiment was run with 20,000 particles for the particle filter. The results, which can be seen in Figure 16, show the proposed approach outperforming the baseline by a median factor of 24.22. This is due to the importance of having coalesced memory accesses when copying thousands of particles, which are memory-heavy and scale poorly with naive copy methods.
Figure 17 shows the comparison of the proposed parallel output generation using the parallel reduction sequential addressing method versus a baseline that consists in accessing the memory values in parallel and then using the atomicAdd CUDA function [47] to safely reduce a vector. The experiment was run using 20,000 particles for the particle filter. The figure shows a median speedup of 4.56 times over the baseline, which can be explained by the avoidance of serialization that the parallel reduction algorithm is able to perform when using shared memory within the blocks as a middle step to the final reduction.
In Figure 18, the memory-aligned matrix combination baseline comparison is presented. Here, the baseline chosen was the CUDA cusparseScsrgemm [48] sparse matrix multiplication function. The state space has dimensions $n_s = 200$, $n_v = 20$ and $n_a = 6$, which results in matrices $\lambda$ and $\psi$ of size 24,000 × 24,000. The results show that the proposed method achieves a median speedup of 108.61 times over the stock matrix multiplication method. This speedup happens because using an informed combination of the matrices allows for skipping a significant amount of arithmetic computations and also unlocks faster memory accesses by keeping the relatively small $\psi$ submatrix in GPU local memory.
Finally, Figure 19 shows a qualitative comparison of the motion prediction output using the baseline methods and the proposed methods, depicting very similar outputs that support the proposed approach’s validity.

5.6. Nvidia Drive PX2 Tests

This section presents the results from running the algorithm on a Nvidia Drive PX2 that was fed on perception data produced by the sensors and algorithms [49,50] running on the car presented in Figure 11. The test was performed following a car on the Autopia Test Track depicted in Figure 12 and computing its motion prediction for 9 min.
Figure 20a shows the complete loop execution time with the number of possible corridors of the following car overlapped. It illustrates a direct relation between the number of corridors and the execution time. The computing time remains under 200 ms at all times, even when the car has a maximum of nine potential corridors that require the algorithm to produce nine different predictions. Figure 20b shows the time of producing a prediction per corridor, which for this test remains on average at 20 ms.
Figure 21 shows the effective population of the particle filter during the test run on the Autopia Test Track. The horizontal black line at 10,000 shows the resampling threshold of the filter. With 60.3% of the iterations staying above the threshold and therefore requiring no resampling, the algorithm shows proficiency at predicting the pose, speed and orientation of the followed vehicle. It also creates a positive feedback loop, in which the algorithm can save the time of resampling to produce a faster prediction that tends to be more accurate, as it is closer to the measured state of the car.

6. Prediction Evaluation

In order to contextualize the motion prediction algorithm being accelerated, this section presents a study of the prediction error through a comparison of IAMP with a state-of-the-art motion prediction algorithm.
The state-of-the-art algorithm chosen for the comparison is the enhanced graph-based interaction-aware trajectory prediction (GRIP++) model [51], which uses a long short-term memory neural network to produce vehicle motion predictions. This algorithm was selected because its training framework is open source, which offered the possibility of training it on the same cases in which IAMP was evaluated. It is also the algorithm used in [20], so it was deemed coherent to keep the same comparison baseline across publications.
The quantitative prediction metrics used are as follows:
  • ADE (average displacement error): the average $L_2$ distance between the ground truth positions and the weighted average position of the predictions,

    $\text{ADE} = \frac{1}{n} \sum_{i=1}^{n} \sqrt{(x_i^{GT} - \hat{x}_i)^2 + (y_i^{GT} - \hat{y}_i)^2}$ (11)

  • FDE (final displacement error): the $L_2$ distance between the last ground truth position and the weighted average position of the last prediction,

    $\text{FDE} = \sqrt{(x_n^{GT} - \hat{x}_n)^2 + (y_n^{GT} - \hat{y}_n)^2}$ (12)
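For clarity, the two metrics can be computed per vehicle as in the following host-side sketch (types and names hypothetical), given ground truth and weighted-average predicted positions over the same horizon:

```cpp
#include <cmath>
#include <vector>

struct Point { double x, y; };

// ADE: mean L2 distance over the whole prediction horizon, Equation (11).
double ade(const std::vector<Point>& gt, const std::vector<Point>& pred) {
    double sum = 0.0;
    for (size_t i = 0; i < gt.size(); ++i)
        sum += std::hypot(gt[i].x - pred[i].x, gt[i].y - pred[i].y);
    return sum / gt.size();
}

// FDE: L2 distance at the last prediction step only, Equation (12).
double fde(const std::vector<Point>& gt, const std::vector<Point>& pred) {
    return std::hypot(gt.back().x - pred.back().x,
                      gt.back().y - pred.back().y);
}
```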
Given that the output from the IAMP framework is multi-modal, the minimal value from the metrics is utilized for each vehicle. The results for the evolution of the situations in all cases are presented in Table 4.
Figure 22 presents the position error distribution over the prediction horizon for all vehicles in all situations, with the mean value and $\pm 3\sigma$ boundary for both approaches. With IAMP, the average error for the last prediction is 3.53 ± 8.64 m. One factor that affects these values is the resolution of the Markov chain abstraction, since the probability distribution within each state is uniform. The results for the last prediction of GRIP are significantly worse, with a position error of 8.12 ± 24.62 m.
A qualitative comparison of the predictions for the whole prediction horizon for both approaches is presented in Figure 23. The output from IAMP is shown in blue and from GRIP, in red. The ground truth position is displayed in black.
Finally, Table 5 shows the execution time comparison of the motion prediction modules of IAMP versus GRIP. It is possible to see that GRIP has the edge, even though both stay within a comparable range. A possible explanation could come from the fact that GRIP produces a single point of motion prediction while IAMP produces a full probabilistic footprint. Moreover, IAMP produces multi-modal predictions, while GRIP produces a single prediction per vehicle. This causes an execution time trade-off that arguably pays off in the prediction performance based on the results that can be seen in Table 4 and Figure 22.

7. Conclusions

In this paper, a novel set of techniques is presented to bring a computationally expensive motion prediction strategy from theoretical simulation to real-time execution. It exploits the latest GPU architectures and demonstrates its suitability in a real-time setting with real traffic conditions. The proposed acceleration techniques combine existing methods that reduce memory accesses and arithmetic instructions and increase GPU occupancy with novel application-specific techniques that make use of informed combinations to minimize the computation workload. The paper demonstrates that the implemented techniques produce a scalable solution that runs faster across different scenarios and platforms. In future work, multi-GPU techniques will be implemented to exploit the additional integrated GPU on the Nvidia Drive PX2 and the memory regions that are shared with the CPU. Additionally, new features of IAMP, such as the integration of learning-based models [52], will be included.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics12183751/s1, Video S1: GPU-IAMP: execution of the scenarios.

Author Contributions

Conceptualization, J.L.H., V.T., A.A. and J.V.; Data curation, J.L.H., V.T., A.A. and J.V.; Formal analysis, J.L.H., V.T., A.A. and J.V.; Funding acquisition, J.V.; Investigation, J.L.H., V.T., A.A. and J.V.; Methodology, J.L.H., V.T., A.A. and J.V.; Project administration, J.V.; Resources, J.L.H., V.T., A.A. and J.V.; Software, J.L.H. and V.T.; Supervision, A.A. and J.V.; Validation, J.L.H., V.T., A.A. and J.V.; Visualization, J.L.H. and V.T.; Writing—original draft, J.L.H. and V.T.; Writing—review and editing, J.L.H., V.T., A.A. and J.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially funded by MCIN/AEI/10.13039/501100011033 and by ERDF A way of making Europe with the project DISCERN (PID2021-125850OB-I00), and by the Community of Madrid through the SEGVAUTO 4.0-CM Program (S2018-EMT-4362).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, J.S.; Knipling, R.R.; Goodman, M.J. The role of driver inattention in crashes: New statistics from the 1995 Crashworthiness Data System. In Proceedings of the 40th Annual Proceedings of the Association for the Advancement of Automotive Medicine, Vancouver, BC, Canada, 7–9 October 1996; Volume 377, p. 392.
  2. Rathery, A. Traffic flow problems in Europe. Transp. Rev. 1993, 13, 1–23.
  3. Berg, J.; Ihlström, J. The importance of public transport for mobility and everyday activities among rural residents. Soc. Sci. 2019, 8, 58.
  4. Large, R.O.; Breitling, T.; Kramer, N. Driver Shortage and Fluctuation: Occupational and Organizational Commitment of Truck Drivers. Supply Chain Forum Int. J. 2014, 15, 66–72.
  5. Villagra, J.; Clavijo, M.; Díaz-Álvarez, A.; Trentin, V. Motion prediction and risk assessment. In Decision-Making Techniques for Autonomous Vehicles; Elsevier: Amsterdam, The Netherlands, 2023; pp. 61–101.
  6. Gerlein, E.A.; Díaz-Guevara, G.; Carrillo, H.; Parra, C.; Gonzalez, E. Embbedded system-on-chip 3D localization and mapping—eSoC-SLAM. Electronics 2021, 10, 1378.
  7. Hao, C.; Sarwari, A.; Jin, Z.; Abu-Haimed, H.; Sew, D.; Li, Y.; Liu, X.; Wu, B.; Fu, D.; Gu, J.; et al. A hybrid GPU + FPGA system design for autonomous driving cars. In Proceedings of the 2019 IEEE International Workshop on Signal Processing Systems (SiPS), Nanjing, China, 20–23 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 121–126.
  8. Nadales, J.; Manzano, J.; Barriga, A.; Limon, D. Efficient FPGA Parallelization of Lipschitz Interpolation for Real-Time Decision-Making. IEEE Trans. Control Syst. Technol. 2022, 30, 2163–2175.
  9. Haneche, A.; Lachachi, Y.; Niar, S.; Ouarnoughi, H. A GPU enhanced LIDAR Perception System for Autonomous Vehicles. In Proceedings of the 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Västerås, Sweden, 11–13 March 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 232–236.
  10. Xie, X.; Bai, L.; Huang, X. Real-time LiDAR point cloud semantic segmentation for autonomous driving. Electronics 2021, 11, 11.
  11. Shah, H.; Ghadai, S.; Gamdha, D.; Schuster, A.; Thomas, I.; Greiner, N.; Krishnamurthy, A. GPU-Accelerated Collision Analysis of Vehicles in a Point Cloud Environment. IEEE Comput. Graph. Appl. 2022, 42, 37–50.
  12. Mufti, S.; Roberge, V.; Tarbouchi, M. A GPU accelerated path planner for multiple unmanned aerial vehicles. In Proceedings of the 2019 IEEE Canadian Conference of Electrical and Computer Engineering (CCECE), Edmonton, AB, Canada, 5–8 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4.
  13. Chen, Z.; Khemmar, R.; Decoux, B.; Atahouet, A.; Ertaud, J.Y. Real time object detection, tracking, and distance and motion estimation based on deep learning: Application to smart mobility. In Proceedings of the 2019 Eighth International Conference on Emerging Security Technologies (EST), Colchester, UK, 22–24 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6.
  14. Nicely, M.A.; Wells, B.E. Improved parallel resampling methods for particle filtering. IEEE Access 2019, 7, 47593–47604.
  15. Chen, X.; Sun, L. Bayesian temporal factorization for multidimensional time series prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4659–4673.
  16. Wang, P.; Zhang, Y.; Hu, T.; Zhang, T. Urban traffic flow prediction: A dynamic temporal graph network considering missing values. Int. J. Geogr. Inf. Sci. 2023, 37, 885–912.
  17. Althoff, M.; Mergel, A. Comparison of Markov chain abstraction and Monte Carlo simulation for the safety assessment of autonomous cars. IEEE Trans. Intell. Transp. Syst. 2011, 12, 1237–1247.
  18. Hubmann, C.; Becker, M.; Althoff, D.; Lenz, D.; Stiller, C. Decision making for autonomous driving considering interaction and uncertain prediction of surrounding vehicles. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1671–1678.
  19. Poggenhans, F.; Pauls, J.H.; Janosovits, J.; Orf, S.; Naumann, M.; Kuhnt, F.; Mayr, M. Lanelet2: A High-Definition Map Framework for the Future of Automated Driving. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 1672–1679.
  20. Trentin, V.; Artuñedo, A.; Godoy, J.; Villagra, J. Multi-Modal Interaction-Aware Motion Prediction At Unsignalized Intersections. IEEE Trans. Intell. Veh. 2023, 1–17.
  21. Trentin, V.; Artuñedo, A.; Godoy, J.; Villagra, J. Interaction-Aware Motion Prediction at Highways: A Comparison of Three Lane Changing Models. In Proceedings of the Smart Cities, Green Technologies, and Intelligent Transport Systems; Klein, C., Jarke, M., Helfert, M., Berns, K., Gusikhin, O., Eds.; Springer: Cham, Switzerland, 2022; pp. 274–296.
  22. Trentin, V.; Artuñedo, A.; Godoy, J.; Villagra, J. Interaction-Aware Intention Estimation at Roundabouts. IEEE Access 2021, 9, 123088–123102.
  23. Lefèvre, S.; Laugier, C.; Ibañez-Guzmán, J. Intention-Aware Risk Estimation for General Traffic Situations, and Application to Intersection Safety; Research Report RR-8379; INRIA, Hal: Le Chesnay-Rocquencourt, France, 2013.
  24. Derks, I.P.; de Waal, A. A Taxonomy of Explainable Bayesian Networks. In Proceedings of the Artificial Intelligence Research; Gerber, A., Ed.; Springer: Cham, Switzerland, 2020; pp. 220–235.
  25. Arulampalam, M.S.; Maskell, S.; Gordon, N.; Clapp, T. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 2002, 50, 174–188.
  26. Althoff, M. Reachability Analysis and its Application to the Safety Assessment of Autonomous Cars. Ph.D. Thesis, Technische Universität München, München, Germany, 2010.
  27. NVIDIA. CUDA API Documentation. Available online: https://docs.nvidia.com/cuda/cuda-runtime-api/index.html (accessed on 30 June 2022).
  28. NVIDIA. CUDA C Programming Guide. Available online: https://docs.nvidia.com/cuda/cuda-c-programming-guide (accessed on 24 May 2022).
  29. Gil, A.; Segura, J.; Temme, N. Numerical Methods for Special Functions; SIAM: Philadelphia, PA, USA, 2007; pp. 193–197.
  30. Li, T.; Bolic, M.; Djuric, P.M. Resampling methods for particle filtering: Classification, implementation, and strategies. IEEE Signal Process. Mag. 2015, 32, 70–86.
  31. Thrun, S.; Burgard, W.; Fox, D. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents); The MIT Press: Cambridge, MA, USA, 2005.
  32. Cano, A.; Zafra, A.; Ventura, S. Parallel evaluation of Pittsburgh rule-based classifiers on GPUs. Neurocomputing 2014, 126, 45–57.
  33. Bock, J.; Krajewski, R.; Moers, T.; Runde, S.; Vater, L.; Eckstein, L. The inD Dataset: A Drone Dataset of Naturalistic Road User Trajectories at German Intersections. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 1929–1934.
  34. Breuer, A.; Termöhlen, J.A.; Homoceanu, S.; Fingscheidt, T. openDD: A large-scale roundabout drone dataset. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
  35. AUTOPIA. CAR CSIC. Available online: https://autopia.car.upm-csic.es/ (accessed on 30 June 2022).
  36. MATLAB. TIC/TOC Timing Framework. Available online: https://es.mathworks.com/help/matlab/ref/tic.html?lang=en (accessed on 16 April 2023).
  37. Linux. gettimeofday—Linux Manual Page. Available online: https://man7.org/linux/man-pages/man2/gettimeofday.2.html (accessed on 16 April 2023).
  38. MATLAB. Creating More Accurate Time Values. Available online: https://es.mathworks.com/company/newsletters/articles/improvements-to-tic-and-toc-functions-for-measuring-absolute-elapsed-time-performance-in-matlab.html (accessed on 16 April 2023).
  39. STD. steady_clock—Chrono Manual Page. Available online: https://en.cppreference.com/w/cpp/chrono/steady_clock (accessed on 16 April 2023).
  40. Linux. clock_gettime—Linux Manual Page. Available online: https://linux.die.net/man/3/clock_gettime (accessed on 16 April 2023).
  41. NVIDIA. CUDA Event Manual Page. Available online: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html (accessed on 16 April 2023).
  42. Artunedo, A.; Villagra, J.; Godoy, J. Real-time motion planning approach for automated driving in urban environments. IEEE Access 2019, 7, 180039–180053. [Google Scholar] [CrossRef]
  43. Medina-Lee, J.; Artuñedo, A.; Godoy, J.; Villagra, J. Merit-Based Motion Planning for Autonomous Vehicles in Urban Scenarios. Sensors 2021, 21, 3755. [Google Scholar] [CrossRef]
  44. Thrust. Permutation Iterator Documentation. Available online: https://thrust.github.io/doc/classthrust_1_1permutation__iterator.html (accessed on 15 April 2023).
  45. Thrust. Parallel Copy Documentation. Available online: https://thrust.github.io/doc/group__copying_ga3e43fb8472db501412452fa27b931ee2.html (accessed on 15 April 2023).
  46. Thrust. The C++ Parallel Algorithms Library. Available online: https://github.com/NVIDIA/thrust (accessed on 15 April 2023).
  47. NVIDIA. CUDA Programming Guide. Available online: https://docs.nvidia.com/cuda/archive/9.2/cuda-c-programming-guide/index.html#atomicadd (accessed on 15 May 2023).
  48. NVIDIA. cuSPARSE Documentation. Available online: https://docs.nvidia.com/cuda/archive/9.2/cusparse/index.html#cusparse-lt-t-gt-csrgemm (accessed on 10 April 2022).
  49. Jiménez, V.; Godoy, J.; Artuñedo, A.; Villagra, J. Object-level Semantic and Velocity Feedback for Dynamic Occupancy Grids. IEEE Trans. Intell. Veh. 2023, 8, 1–18. [Google Scholar] [CrossRef]
  50. Godoy, J.; Jiménez, V.; Artuñedo, A.; Villagra, J. A grid-based framework for collective perception in autonomous vehicles. Sensors 2021, 21, 744. [Google Scholar] [CrossRef]
  51. Li, X.; Ying, X.; Chuah, M.C. Grip++: Enhanced graph-based interaction-aware trajectory prediction for autonomous driving. arXiv 2019, arXiv:1907.07792. [Google Scholar]
  52. Trentin, V.; Ma, C.; Villagra, J.; Al-Ars, Z. Learning-enabled multi-modal motion prediction in urban environments. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Dearborn, MI, USA, 4–7 June 2023; pp. 1–7. [Google Scholar]
Figure 1. Multi-modal interaction-aware motion prediction representing the predictions at 4.0 s in a given time step of a dataset.
Figure 2. Interaction-aware motion prediction (IAMP) flowchart.
Figure 3. Example of a vehicle’s corridors. The vehicle can either turn left (magenta), go straight (red) or turn right (blue).
Figure 4. Dynamic Bayesian network used for intention estimation [22].
Figure 5. Array of structs conforming the particle data structure.
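To make the layout of Figure 5 concrete, the following is a minimal CUDA C++ sketch of an array-of-structs particle container. The field names, types and the kNumParticles constant are illustrative assumptions, not the authors' exact definition.

    #include <cuda_runtime.h>

    // Hypothetical particle layout (assumed fields, not the paper's exact ones):
    // pose, speed, corridor hypothesis and importance weight.
    struct Particle {
        float x, y;      // position in the map frame [m]
        float theta;     // heading [rad]
        float v;         // longitudinal speed [m/s]
        int   corridor;  // index of the hypothesized corridor
        float weight;    // importance weight
    };

    constexpr int kNumParticles = 20000;  // particle count used in Figure 13

    // Allocates one contiguous array of structs in GPU global memory.
    Particle* allocateParticles() {
        Particle* d_particles = nullptr;
        cudaMalloc(&d_particles, kNumParticles * sizeof(Particle));
        return d_particles;
    }

An array of structs keeps all fields of one particle adjacent in memory, which is convenient for per-particle kernels; a struct-of-arrays alternative would coalesce better when a kernel reads only a single field.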
Figure 6. Low variance resampling.
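Low variance (systematic) resampling, as described in [30,31], draws a single random offset and then selects particles at M equally spaced points of the cumulative weight distribution, so each particle is kept in proportion to its weight with O(M) work. A sequential sketch, reusing the hypothetical Particle struct above and assuming normalized weights, could read:

    #include <random>
    #include <vector>

    std::vector<Particle> lowVarianceResample(const std::vector<Particle>& in) {
        const std::size_t M = in.size();           // assumed M > 0
        std::mt19937 gen{std::random_device{}()};
        std::uniform_real_distribution<float> offset(0.0f, 1.0f / M);

        std::vector<Particle> out;
        out.reserve(M);

        float r = offset(gen);    // single random start in [0, 1/M)
        float c = in[0].weight;   // running cumulative weight
        std::size_t i = 0;
        for (std::size_t m = 0; m < M; ++m) {
            const float u = r + m * (1.0f / M);    // m-th equally spaced point
            while (u > c && i + 1 < M) {           // advance along the CDF
                ++i;
                c += in[i].weight;
            }
            out.push_back(in[i]);                  // keep the selected particle
        }
        return out;
    }

Because the selection points are equally spaced, the chosen indices are monotonically increasing, a property that parallel resampling schemes such as those in [14,30] can exploit.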
Figure 7. Grid-stride memory access.
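The grid-stride pattern of Figure 7 decouples the launch configuration from the problem size: each thread starts at its global index and advances by the total number of launched threads, so a fixed-size grid covers all n elements with coalesced accesses [28]. A minimal illustrative kernel (the scaling operation is only a placeholder):

    // Each thread processes elements id, id + stride, id + 2*stride, ...,
    // where stride is the total number of threads in the grid.
    __global__ void scaleKernel(float* data, int n, float k) {
        const int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
            data[i] *= k;
        }
    }

    // Example launch: a modest fixed grid handles any n.
    // scaleKernel<<<256, 256>>>(d_data, n, 2.0f);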
Figure 8. Parallel reduction with sequential addressing.
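In the sequential-addressing reduction sketched in Figure 8, the number of active threads halves at every step while the shared-memory locations they touch stay contiguous, avoiding bank conflicts. A standard per-block sum kernel illustrating the idea (the per-block results still have to be combined, e.g., in a second pass or with atomicAdd [47]); it assumes blockDim.x is a power of two:

    __global__ void blockSum(const float* in, float* out, int n) {
        extern __shared__ float sdata[];           // one float per thread
        const int tid = threadIdx.x;
        const int i = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < n) ? in[i] : 0.0f;       // load, zero-padding the tail
        __syncthreads();

        // Sequential addressing: active threads read sdata[tid + stride],
        // keeping shared-memory accesses contiguous and conflict free.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) sdata[tid] += sdata[tid + stride];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = sdata[0];  // per-block partial sum
    }

    // Launch with dynamic shared memory sized to the block:
    // blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);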
Figure 9. Maps used in experimentation.
Figure 10. Number of corridors per map during the dataset execution.
Figure 11. Autopia autonomous vehicle.
Figure 12. Autopia test track. Source: Google © 2022 CNES/Airbus, Maxar Technologies.
Figure 13. Particle filter stage time comparison with 20,000 particles.
Figure 14. Execution time of the particle filter with an increasing number of particles in the different scenarios.
Figure 15. Average execution time of the motion prediction computation in the four different scenarios.
Figure 16. Execution time comparison of the parallel resampling.
Figure 17. Execution time comparison of the parallel output generation.
Figure 18. Execution time comparison of the memory-aligned matrix combination.
Figure 19. Qualitative comparison of the prediction output using cuSPARSE matrix multiplication and the proposed method. The blue rectangle represents the studied vehicle and the blue oval-shaped shadows represent its probabilistic motion predictions.
Figure 20. Execution time of the algorithm running on an Nvidia Drive PX2.
Figure 21. Effective particle population of the algorithm running on an Nvidia Drive PX2.
Figure 22. Position error over the predicted horizon.
Figure 23. Comparison of the predictions: IAMP (blue), baseline (red) and ground truth (black). For IAMP, the intensity of the predictions represents the probability of the corridor.
Table 1. Average accumulated weight of the particle filter in four different scenarios.

Implementation | Case 1 | Case 2 | Case 3 | Case 4
C++            | 0.068  | 0.137  | 0.028  | 0.011
CUDA           | 0.152  | 0.250  | 0.174  | 0.135
Table 2. Overall time speedup of the CUDA implementation compared to Matlab and C++.

Scenario | vs. Matlab | vs. C++
Case 1   | 24.536×    | 4.841×
Case 2   | 41.280×    | 4.570×
Case 3   | 57.341×    | 4.664×
Case 4   | 115.098×   | 6.131×
Table 3. Complete-loop predictions per second for each implementation.

Scenario | Matlab [Hz] | C++ [Hz] | CUDA [Hz]
Case 1   | 0.199       | 1.013    | 4.906
Case 2   | 0.207       | 1.870    | 8.549
Case 3   | 0.105       | 1.300    | 6.065
Case 4   | 0.050       | 1.938    | 5.755
Table 4. Prediction metrics: average displacement error (ADE) and final displacement error (FDE).

Approach | Metric | Case 1 | Case 2 | Case 3 | Case 4
IAMP     | ADE    | 1.08   | 1.13   | 1.53   | 1.84
IAMP     | FDE    | 2.42   | 2.80   | 4.20   | 4.98
GRIP     | ADE    | 1.82   | 2.38   | 4.81   | 3.95
GRIP     | FDE    | 4.63   | 6.06   | 12.18  | 10.31
Table 5. IAMP vs. GRIP motion prediction module execution time comparison.

Scenario | IAMP [ms] | GRIP [ms]
Case 1   | 103       | 65
Case 2   | 68        | 65
Case 3   | 76        | 64
Case 4   | 87        | 68
