Spatial-Temporal Attentive LSTM for Vehicle-Trajectory Prediction

Jiang, Rui; Xu, Hongyun; Gong, Gelian; Kuang, Yong; Liu, Zhikang

doi:10.3390/ijgi11070354

by Rui Jiang ¹, Hongyun Xu ¹,*, Gelian Gong ²,³, Yong Kuang ¹ and Zhikang Liu ¹

¹ School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China

² State Key Laboratory of Isotope Geochemistry, Guangzhou Institute of Geochemistry, Chinese Academy of Sciences, Guangzhou 510640, China

³ CAS Center for Excellence in Deep Earth Science, Guangzhou 510640, China

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2022, 11(7), 354; https://doi.org/10.3390/ijgi11070354

Submission received: 10 May 2022 / Revised: 16 June 2022 / Accepted: 20 June 2022 / Published: 21 June 2022

Abstract

Vehicle-trajectory prediction is essential for intelligent traffic systems (ITS), as it can help autonomous vehicles to plan a safe and efficient path. However, it is still a challenging task because existing studies have mainly focused on the spatial interactions of adjacent vehicles regardless of the temporal dependencies. In this paper, we propose a spatial-temporal attentive LSTM encoder–decoder model (STAM-LSTM) to predict vehicle trajectories. Specifically, the spatial attention mechanism is used to capture the spatial relationships among neighboring vehicles and then obtain the global spatial feature. Meanwhile, the temporal attention mechanism is designed to distinguish the effects of different historical time steps on future trajectory prediction. In addition, the motion feature of vehicles is extracted to reveal the influence of dynamic information on vehicle-trajectory prediction, and is combined with the local and global spatial features to represent the integrated features of the target vehicle at each historical moment. The experiments were conducted on public highway trajectory datasets—US-101 and I-80 in NGSIM—and the results demonstrate that our model achieves state-of-the-art prediction performance.

Keywords:

trajectory prediction; spatial-temporal attention mechanisms; autonomous driving; LSTM

1. Introduction

In recent years, autonomous driving has been an emerging and rapidly developing field. It requires self-driving vehicles to adjust their behavior in real time under different traffic environments, for example by switching lanes or slowing down to avoid collision risks. Precise trajectory prediction is a crucial step toward satisfying these requirements; it helps intelligent vehicles better understand their surrounding agents and take safe and effective actions in the next step. Early works mostly used simple dynamical models to generate future trajectories based on the position, speed, and acceleration of the target vehicle, such as the constant velocity (CV) [1] and Kalman filter-based (KF) [2] models. However, these approaches are only suitable for short-term trajectory prediction and relatively simple traffic scenarios, since they mainly focus on the individual historical information of each vehicle and ignore the complex social interactions between vehicles.

Since machine-learning-based methods can effectively capture interactions and learn non-linear relationships from real-world trajectory data, many studies [3,4] have applied them to vehicle-trajectory-prediction tasks, and the experimental results have shown that these methods outperform the traditional models. For example, Tran et al. [5] applied a three-dimensional Gaussian process regression model [6] to identify the maneuvers of vehicles and utilized the Monte Carlo method to generate future trajectories. However, most of these models rely heavily on hand-crafted features, which aim to capture the interactions among vehicles in anticipated scenes. Therefore, their prediction accuracy decreases significantly in uncertain traffic scenarios.

With the rapid development of deep learning, features can be extracted automatically from large amounts of data, which addresses the above problem. In particular, Long Short-Term Memory (LSTM) [7] has shown great success in capturing long-range dependencies for sequence generation and prediction tasks [8]. Therefore, several studies [9,10] have used LSTM as the backbone network to predict the future trajectories of agents. For example, Alahi et al. [11] proposed the Social-LSTM model to generate pedestrian trajectories in crowded scenes and used a fully connected pooling network to share information from a single LSTM. Similarly, Deo et al. [12] introduced the convolutional social pooling layer to capture the spatial interactions of surrounding vehicles and generated maneuver-based future trajectories of vehicles. However, these models still have some limitations because the pooling strategy is restricted by the spatial proximity of agents. To address this issue, many works [13,14,15] have applied the attention mechanism to calculate the relative importance of adjacent agents. However, these approaches are mainly concerned with the spatial interactions among agents at a single time step, and they ignore the temporal dependencies between these local spatial features across the whole historical trajectory.

In this paper, we propose a novel spatial-temporal attentive LSTM encoder–decoder model (STAM-LSTM) for vehicle-trajectory prediction. Compared with existing methods, our model can effectively capture the spatial–temporal interactions and motion features of the target vehicle. It uses the comprehensive feature information of each past time step in the encoding stage and selectively utilizes the valuable features of historical trajectories to generate future trajectories in the decoding stage. The main contributions of this paper can be summarized as follows:

We use a spatial attention mechanism to measure the spatial relationships of nearby vehicles and obtain the global spatial feature of the target vehicle.
We introduce a temporal attention mechanism to assign different weights to the outputs of the encoder, which can capture the relative impacts of different historical moments on future trajectory prediction.
The motion feature is extracted by using velocity and acceleration information. Meanwhile, we aggregate it with local and global spatial features into a comprehensive feature representation of the target vehicle.
Extensive experiments on NGSIM datasets show that our model can improve the accuracy of vehicle trajectory prediction, achieving state-of-the-art performance on the RMSE metric.

The remainder of this paper is organized as follows. In Section 2, some related works are reviewed and discussed in detail. Our proposed STAM-LSTM model is described in Section 3. Then, the contents and results of the experiments are presented in Section 4. Finally, the conclusion is provided in Section 5.

2. Related Works

2.1. Classical Methods for Trajectory Prediction

The classical methods of vehicle-trajectory prediction can be roughly divided into physics-based and maneuver-based models [16]. The former usually depend on the physical and dynamic features of the target vehicle to predict future trajectories, such as constant velocity (CV), constant acceleration (CA), and Kalman filter-based [2] models. Although these models are suitable for short-term (e.g., less than one second) trajectory prediction, they might encounter difficulty in predicting future trajectories caused by drivers’ maneuvers or surrounding vehicles’ actions. Therefore, the latter takes into account the influence of the drivers’ maneuvers (e.g., turn left or right, keep straight) in vehicle-trajectory prediction. It firstly needs to recognize the vehicle’s maneuver. This step can be described as a classification problem using historical positions and motion states of the target vehicle as features; this classifier usually adopts hidden Markov models [3,17], multi-layer perceptrons (MLPs) [4], Bayesian networks [18], and the support vector machine (SVM) [19]. Then, the regression trajectory prediction module is designed to generate maneuver-based trajectories of the vehicles, which includes Gaussian mixture models [3], Monte-Carlo methods [5] and polynomial fitting [20]. However, the performance of these methods is limited due to the fact that they neglect the social interactions between surrounding vehicles.

2.2. LSTM-Based Methods for Trajectory Prediction

The LSTM network is a variant of the Recurrent Neural Network (RNN) [21] that can not only capture long-range dependencies when processing sequence data but also alleviate the vanishing and exploding gradient problems during training [22]. Therefore, LSTM networks have been widely used for sequence tasks, such as activity recognition [23], traffic prediction [24], and trajectory prediction [25,26]. For example, Altché et al. [27] introduced an LSTM network to predict the future trajectories of vehicles on highways. Their results showed that this network achieved better performance than traditional models. However, it neglected the information of neighboring vehicles. Therefore, Deo et al. [28] proposed a maneuver LSTM (M-LSTM) model, which encodes the past tracks of neighboring vehicles and outputs multi-modal trajectory predictions by decoding the encoded context vector and maneuver-encoding vector. Although M-LSTM takes all trajectories as input, it fails to consider the different influences of neighbors on the target vehicle. To address this issue, the CS-LSTM model [12] used the convolutional social pooling layer to generate a social tensor, which includes the spatial interactions of surrounding vehicles in the same scene. Similarly, Zhao et al. [10] proposed a Multi-Agent Tensor Fusion (MATF) model, in which a fully convolutional layer was introduced to learn the social interactions between multiple vehicles and the scene context. However, these methods are only concerned with the spatial feature of the last time step and neglect the temporal relevance of these spatial interactions. Therefore, our STAM-LSTM model extracts the comprehensive feature of vehicles from both the spatial and temporal dimensions, which effectively captures the influence of temporal dependencies on generating the future trajectories of vehicles.

2.3. Attention-Based Methods for Trajectory Prediction

The attention mechanism was first proposed in [29] for machine translation tasks, which can effectively improve the model explainability. Inspired by this idea, attention-based models have been widely used in various areas, such as image generation [30], recommendation systems [31], and sequence prediction [32]. Since a trajectory prediction problem is a typical time series task, some works [33,34] have introduced the attention mechanism to assign appropriate weights to the LSTM hidden states that represent the feature vector of different moments in historical trajectories. Recently, the multi-head attention mechanism was proposed based on the Transformer [35] architecture, which has been widely used in many fields. Among them, Messaoud et al. [14] used the multi-head attention mechanism to evaluate the relative importance between nearby vehicles and extracted different types of social relationships. Conversely, Wu et al. [36] applied the multi-head attention mechanism to capture the complex temporal correlation of each agent independently. More recently, motivated by the fact that the graph convolutional network (GCN) [37] can capture the relative influence and the potential spatial relationships in traffic scenarios, the graph attention network (GAT) [38] has been used in trajectory prediction [39,40,41], extracting the spatial interaction among neighboring agents by assigning different importance to neighbors around the target agent. In this paper, we design a spatial attention layer based on the GAT to extract the social interactions among vehicles and use a temporal attention layer to capture the temporal relationships according to the self-attention mechanism [42].

3. Methods

3.1. Problem Definition

In this paper, the vehicle-trajectory prediction task can be formulated as predicting future trajectories of the target vehicle by using historical trajectories of the target vehicle and its neighbors in a scene. To facilitate the modeling of this problem, we firstly introduce some notations.

The past trajectory of the i-th vehicle from time $t = 1$ to $t = t_{obs}$ can be described by

$$X_i = \{p_i^1, p_i^2, \ldots, p_i^{t_{obs}}\} \tag{1}$$

where

$$p_i^t = (x_i^t, y_i^t, v_i^t, a_i^t) \tag{2}$$

is the state vector of vehicle i at time step t, which includes the position coordinates $(x_i^t, y_i^t)$, speed $v_i^t$, and acceleration $a_i^t$.

The future trajectory of the i-th vehicle can be expressed as

$$Y_i = \{o_i^{t_{obs}+1}, o_i^{t_{obs}+2}, \ldots, o_i^{t_{obs}+t_{pred}}\} \tag{3}$$

where

$$o_i^{\tilde{t}} = (x_i^{\tilde{t}}, y_i^{\tilde{t}}) \tag{4}$$

denotes the predicted coordinates of vehicle i at time step $\tilde{t} = t_{obs}+1, t_{obs}+2, \ldots, t_{obs}+t_{pred}$, and $t_{pred}$ is the time length of the future trajectory.

3.2. Overall Model

Figure 1 illustrates the architecture of the proposed STAM-LSTM model, which consists of two main modules: the feature extraction module and the attention-based encoder–decoder module. In the feature extraction module, the Spatial Attention (SA) layer is designed to extract the global spatial feature of vehicles. Furthermore, the Feature Aggregation Layer (FAL) generates a comprehensive feature representation of the target vehicle from the local and global spatial features and the motion feature. The attention-based encoder–decoder module can be further divided into three components, i.e., the encoder module, the temporal attention layer, and the decoder module. The encoder module encodes the aggregated feature of the historical trajectories into a high-dimensional hidden state tensor. The temporal attention layer assigns different weights to the outputs of the encoder and calculates the weighted context vector $c_i$. The decoder module takes the weighted context vector and random noise r as inputs to predict the future trajectories of the target vehicle.

3.3. Feature Extraction Module

Firstly, we embed the location coordinates $(x_i^t, y_i^t)$ into a high-dimensional space by:

$$e_i^t = \Gamma(x_i^t, y_i^t; W_e) \tag{5}$$

where $\Gamma(\cdot)$ is the embedding function with LeakyReLU non-linearity, $W_e \in \mathbb{R}^{2 \times D_1}$ is the weight matrix of this function, and $D_1$ is the dimension of this feature. $e_i^t$ denotes the local spatial feature of the i-th vehicle at time step t.

Considering the effects of the surrounding agents on the behaviors of vehicles in road networks and the fact that each vehicle can be abstractly represented as a graph node [43], we introduce a spatial attention layer to capture the different interactive relationships among neighboring vehicles based on the GAT [38].

As shown in Figure 2, we take the local spatial features $\{e_1^t, e_2^t, \dots, e_n^t\}$ as the input of this layer, where n is the number of vehicles. Then, we calculate the attention weight $\alpha_{i,j}^t$ to represent the relative importance of the vehicle pair $(i, j)$ as follows:

$$\alpha_{i,j}^t = \frac{\exp(\mathrm{LeakyReLU}(a^T [W_s e_i^t \oplus W_s e_j^t]))}{\sum_{k=1}^{n} \exp(\mathrm{LeakyReLU}(a^T [W_s e_i^t \oplus W_s e_k^t]))} \tag{6}$$

where $\oplus$ denotes the concatenation operation and the summation index k runs over all n vehicles. $W_s$ is the weight parameter, which transforms the local spatial feature $e_j^t$ into a high-dimensional space. $a^T$ is the weight vector of a single-layer feed-forward network, which is used to calculate the attention coefficients. LeakyReLU is a non-linear activation function, and its negative input slope is set to 0.1. The exponential function $\exp$ together with the sum in the denominator performs the softmax normalization.

After obtaining the attention coefficients, we can integrate them with the local spatial features to compute $E_i^t$, the global spatial feature of vehicle i at time step t. It is given by:

$$E_i^t = \phi\left(\sum_{j=1}^{n} \alpha_{i,j}^t e_j^t\right) \tag{7}$$

where $\phi(\cdot)$ is a non-linear activation function.
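Equations (6) and (7) can be illustrated with a small sketch in plain Python. The helper names (`leaky_relu`, `matvec`, `spatial_attention`), the toy weights, and the choice of tanh for $\phi$ are assumptions made for illustration only; the paper does not name $\phi$ or provide an implementation.

```python
import math

def leaky_relu(x, slope=0.1):
    """LeakyReLU with the negative slope of 0.1 stated in the text."""
    return x if x >= 0 else slope * x

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def spatial_attention(e, W_s, a):
    """Return (alpha, E) for the local features e[j] of n vehicles at one time step.

    alpha[i][j] follows Eq. (6); E[i] follows Eq. (7) with phi = tanh (assumed).
    """
    n = len(e)
    We = [matvec(W_s, e_j) for e_j in e]  # W_s e_j for every vehicle
    alpha, E = [], []
    for i in range(n):
        # Scores a^T [W_s e_i (+) W_s e_j], where (+) is list concatenation
        scores = [leaky_relu(sum(a_k * x for a_k, x in zip(a, We[i] + We[j])))
                  for j in range(n)]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [x / total for x in exps]  # softmax over all vehicles
        alpha.append(row)
        # Attention-weighted sum of the local features e_j, then phi
        E.append([math.tanh(sum(row[j] * e[j][d] for j in range(n)))
                  for d in range(len(e[0]))])
    return alpha, E

# Toy example: 3 vehicles, D_1 = 2, W_s keeps 2 dimensions, so a has length 4
e = [[0.1, 0.2], [0.4, -0.3], [-0.5, 0.6]]
W_s = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 0.25, 0.75]
alpha, E = spatial_attention(e, W_s, a)
```

Each row of `alpha` sums to 1, and each `E[i]` has the same dimension $D_1$ as the local features, since Eq. (7) sums the untransformed $e_j^t$.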

Secondly, it has been reported [44] that relative dynamics are also important for representing the social effects of vehicles in high-speed driving scenes. Furthermore, we believe that each vehicle adjusts its own behavior (e.g., accelerating or decelerating, changing direction) according to the global spatial interactions, which include the different effects of surrounding vehicles on its next action. Therefore, we extract the motion feature $m_i^t$ of the target vehicle from its velocity and acceleration:

$$m_i^t = \Theta(v_i^t, a_i^t; W_m) \tag{8}$$

where $\Theta(\cdot)$ is a fully connected network, $W_m \in \mathbb{R}^{2 \times D_2}$ is the network parameter, and $D_2$ is the dimension of the motion feature space.

Finally, the global spatial feature and the motion feature are defined as inputs of the feature aggregation layer. Meanwhile, the local spatial feature is another input, which enables the output of this layer to include local information. The aggregated feature representation $z_i^t$ of vehicle i at historical time step t is described by Equation (9):

$$z_i^t = [E_i^t \odot e_i^t] \oplus m_i^t \tag{9}$$

where $\odot$ represents element-wise multiplication.
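As a minimal sketch of Equation (9), the aggregation can be written with plain Python lists. The function name `aggregate_features` and the toy values are hypothetical. The element-wise product requires the global and local features to share the dimension $D_1$, and the result has $D_1 + D_2$ entries, which matches the input size of the encoder weight matrix in Equation (15).

```python
def aggregate_features(E_it, e_it, m_it):
    """Eq. (9): element-wise product of global and local features, then concatenation."""
    if len(E_it) != len(e_it):
        raise ValueError("global and local spatial features must both have D_1 entries")
    gated = [g * l for g, l in zip(E_it, e_it)]  # E (.) e, element-wise
    return gated + list(m_it)                    # (+) m, concatenation

# Toy example with D_1 = 2 and D_2 = 3
z = aggregate_features([0.5, -1.0], [0.2, 0.4], [3.0, 0.1, 0.0])
```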

3.4. Attention-Based Encoder–Decoder Module

In this paper, we choose an LSTM network as the feature extractor of the encoder–decoder module. As shown in Figure 3, the hidden state h is controlled by the forget gate f, input gate i, output gate o, and cell state c. The working of the LSTM cell can be expressed as follows:

$$f_t = \sigma(W_{Xf} X_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{10}$$

$$i_t = \sigma(W_{Xi} X_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{11}$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{Xc} X_t + W_{hc} h_{t-1} + b_c) \tag{12}$$

$$o_t = \sigma(W_{Xo} X_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{13}$$

$$h_t = o_t \tanh(c_t) \tag{14}$$

where $X_t$ is the input vector, $W_*$ and $b_*$ denote the weight matrices and bias vectors, respectively, and $\sigma(\cdot)$ is the sigmoid activation function.
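To make the gate equations concrete, the following sketch evaluates Equations (10)–(14) for the simplest possible case, a scalar input and a scalar hidden state, so that every weight is a single number. The function name `lstm_step`, the parameter dictionary, and the toy values are assumptions for illustration; a real model would use vector states and a library implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of Eqs. (10)-(14) with scalar input, hidden state, and cell state."""
    f_t = sigmoid(p["W_xf"] * x_t + p["W_hf"] * h_prev + p["W_cf"] * c_prev + p["b_f"])
    i_t = sigmoid(p["W_xi"] * x_t + p["W_hi"] * h_prev + p["W_ci"] * c_prev + p["b_i"])
    c_t = f_t * c_prev + i_t * math.tanh(p["W_xc"] * x_t + p["W_hc"] * h_prev + p["b_c"])
    # The output gate looks at the *updated* cell state c_t, as in Eq. (13)
    o_t = sigmoid(p["W_xo"] * x_t + p["W_ho"] * h_prev + p["W_co"] * c_t + p["b_o"])
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

NAMES = ["W_xf", "W_hf", "W_cf", "b_f", "W_xi", "W_hi", "W_ci", "b_i",
         "W_xc", "W_hc", "b_c", "W_xo", "W_ho", "W_co", "b_o"]
params = {name: 0.5 for name in NAMES}  # arbitrary toy weights

# Run the cell over a short input sequence, carrying (h, c) forward
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.2]:
    h, c = lstm_step(x, h, c, params)
```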

3.4.1. Encoder Module

The LSTM encoder takes the aggregated feature $z_i^t$ of vehicle i at time step t as input and converts it to the hidden state representation $h_{i,enc}^t$ of this encoder. It can be defined as follows:

$$h_{i,enc}^t = \mathrm{ENC\_LSTM}(h_{i,enc}^{t-1}, z_i^t; W_{enc}) \tag{15}$$

where $W_{enc} \in \mathbb{R}^{(D_1 + D_2) \times D_{enc}}$ is the weight matrix of this LSTM network and is shared among all the vehicles, and $D_{enc}$ is the dimension of the LSTM network.

3.4.2. Temporal Attention Layer

Traditionally, the decoder directly takes the output of the encoder at the last time step as the input and generates the future trajectories by using this encoded vector. However, the motion of vehicles is composed of consecutive actions in the real-world scenes, and the current state depends on the comprehensive effect of all previous actions. In other words, the feature of each time step in past trajectories plays a different role in the prediction of future trajectory. Therefore, we introduce a temporal attention layer to measure the unequal impacts of different moments in historical trajectories.

We first use $H_i$ to represent the temporal context sequence of vehicle i, which can be described as

$$H_i = \{h_{i,enc}^1, h_{i,enc}^2, \dots, h_{i,enc}^{t_{obs}}\} \tag{16}$$

where $h_{i,enc}^t$ is obtained by Equation (15), representing the hidden state of vehicle i at time step t.

Then, as illustrated in Figure 4, we take $H_i$ as the input of the temporal attention layer and calculate the attention weight vector $\beta_i$ as follows:

$$\beta_i = \mathrm{softmax}(\tanh(W_t (H_i)^T)) \tag{17}$$

where $\cdot^T$ means transposition, $\tanh$ is an activation function, and $W_t$ is the weight matrix. The weight vector $\beta_i$ includes the different effects of historical information from the temporal dimension, and these computed weights sum to 1 due to the softmax function.

Finally, the weighted temporal context vector $c_i$ of vehicle i can be calculated by

$$c_i = \beta_i H_i \tag{18}$$
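Equations (17) and (18) can be sketched in plain Python as follows. The shapes are an assumption that makes the products well defined: $H_i$ is treated as a $t_{obs} \times D_{enc}$ matrix whose rows are the encoder hidden states, and $W_t$ as a single row of length $D_{enc}$, so that $\beta_i$ has one weight per historical time step. The function name `temporal_attention` and the toy values are hypothetical.

```python
import math

def temporal_attention(H, W_t):
    """Return (beta, c): Eq. (17) weights over time steps and Eq. (18) context vector.

    H   : list of t_obs hidden states, each a list of D_enc floats
    W_t : list of D_enc floats (assumed to be a single-row weight matrix)
    """
    # tanh(W_t H^T): one score per historical time step
    scores = [math.tanh(sum(w * h for w, h in zip(W_t, h_t))) for h_t in H]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    beta = [x / total for x in exps]  # softmax over time steps
    # c = beta H: attention-weighted sum of the hidden states
    c = [sum(beta[t] * H[t][d] for t in range(len(H))) for d in range(len(H[0]))]
    return beta, c

# Toy example: t_obs = 3 time steps, D_enc = 2
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
beta, c = temporal_attention(H, [0.3, -0.1])
```

With an all-zero $W_t$ every time step receives the same weight and $c_i$ reduces to the plain average of the hidden states, which shows what the learned weights add over uniform pooling.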

3.4.3. Decoder Module

The LSTM decoder and the prediction layer are applied to generate the future trajectories of the target vehicle. In the first step, the decoder takes as input the concatenation of the temporal context vector $c_i$ and a random noise vector r, which can boost the robustness of trajectory prediction. The prediction layer uses the hidden state $h_{i,dec}^{\tilde{t}}$ of this decoder to generate the predicted coordinates $o_i^{\tilde{t}}$ of vehicle i at time step $\tilde{t}$. Then, the result $o_i^{\tilde{t}}$ of the latest time step is fed back into the decoder to generate the next prediction $o_i^{\tilde{t}+1}$.

$$h_{i,dec}^{\tilde{t}} = \mathrm{DEC\_LSTM}(h_{i,dec}^{\tilde{t}-1}, c_i \oplus r; W_{dec}) \tag{19}$$

$$o_i^{\tilde{t}} = \Psi(h_{i,dec}^{\tilde{t}}; W_o) \tag{20}$$

where $W_{dec} \in \mathbb{R}^{(D_{enc} + D_r) \times D_{dec}}$ are learned parameters, $D_r$ is the dimension of the random noise, $D_{dec}$ is the dimension of this LSTM network, $\Psi(\cdot)$ is a fully connected network, and $W_o \in \mathbb{R}^{D_{dec} \times 2}$ is the weight matrix of this network.

3.5. Loss Function

In this paper, the model is trained by minimizing the squared Euclidean distance between the predicted coordinates $o_{i,pred}^t$ and the observed coordinates $o_{i,true}^t$ of vehicle i at time step t. The loss function for training can be written as:

$$L_{train} = \frac{1}{N \cdot T} \sum_{i=1}^{N} \sum_{t=t_{obs}+1}^{t_{obs}+t_{pred}} \left\| o_{i,pred}^t - o_{i,true}^t \right\|_2^2 \tag{21}$$

where N is the size of the training set and $T = t_{pred}$ is the length of the future trajectory.
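As a small worked example of Equation (21), the loss can be computed over nested lists of (x, y) points. The function name `train_loss` and the toy trajectories are hypothetical; in practice this would be computed on batched tensors inside the training framework.

```python
def train_loss(pred, true):
    """Eq. (21): squared L2 error averaged over N trajectories and T future steps.

    pred, true: lists of N trajectories, each a list of T (x, y) tuples.
    """
    N = len(pred)
    T = len(pred[0])
    total = 0.0
    for p_traj, t_traj in zip(pred, true):
        for (px, py), (tx, ty) in zip(p_traj, t_traj):
            total += (px - tx) ** 2 + (py - ty) ** 2  # squared Euclidean distance
    return total / (N * T)

# One trajectory (N = 1) with two future steps (T = 2):
# squared errors are 1 and 4, so the loss is (1 + 4) / 2 = 2.5
pred = [[(0.0, 0.0), (1.0, 1.0)]]
true = [[(0.0, 1.0), (1.0, 3.0)]]
loss = train_loss(pred, true)
```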

4. Experiments and Results

4.1. Datasets

In this section, we evaluate the proposed STAM-LSTM model on the public Next-Generation Simulation (NGSIM) vehicle trajectory dataset, which includes two subsets recorded on the US-101 and I-80 highways, respectively. In each subset, the trajectories of all vehicles were recorded at a frequency of 10 Hz for a total of 45 min. Furthermore, each 45 min dataset is divided into three 15 min subsections under mild, moderate, and crowded traffic conditions.

For a fair comparison with the baselines, we follow the experimental settings in [12] and divide the raw trajectories into segments of 8 s, using the first 3 s of each segment as the historical trajectory to predict the vehicle trajectory in the following 5 s. The processed trajectory datasets are randomly split: 70% of the data are selected as the training set, and the remaining 30% are used for validation and testing.
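The segmentation described above can be sketched as follows. At 10 Hz, an 8 s segment contains 80 frames, of which the first 30 form the history and the remaining 50 the prediction target. The function name `split_segments`, the non-overlapping stride, and the absence of any downsampling are assumptions for illustration; the text does not specify how consecutive segments are positioned or whether frames are subsampled.

```python
def split_segments(track, hz=10, obs_s=3, pred_s=5, stride=None):
    """Cut one vehicle track into (history, future) pairs.

    track  : list of per-frame states, sampled at `hz` frames per second
    stride : frames between segment starts (assumed; defaults to non-overlapping)
    """
    obs_len = obs_s * hz            # 30 frames of history
    pred_len = pred_s * hz          # 50 frames to predict
    seg_len = obs_len + pred_len    # 80 frames per 8 s segment
    if stride is None:
        stride = seg_len
    segments = []
    for start in range(0, len(track) - seg_len + 1, stride):
        seg = track[start:start + seg_len]
        segments.append((seg[:obs_len], seg[obs_len:]))
    return segments

# A 20 s toy track (200 frames) yields two non-overlapping 8 s segments
track = [(float(k), 0.0) for k in range(200)]
segments = split_segments(track)
```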

4.2. Metrics

In this paper, we choose the root mean squared error (RMSE) as the metric to evaluate the prediction performance of our STAM-LSTM model. The smaller the RMSE, the smaller the distance between the predicted trajectory and the ground-truth trajectory. The RMSE is computed as

$$RMSE = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \left[ (x_i^T - \tilde{x}_i^T)^2 + (y_i^T - \tilde{y}_i^T)^2 \right]} \tag{22}$$

where M is the total number of test samples, and $(x_i^T, y_i^T)$ and $(\tilde{x}_i^T, \tilde{y}_i^T)$ are, respectively, the ground-truth coordinates and the predicted coordinates of vehicle i at time step T. T is the prediction horizon and varies from 1 s to 5 s in our experiments.
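A minimal sketch of the RMSE at a single prediction horizon is given below. The function name `rmse_at_horizon` and the toy points are hypothetical; the inputs are the ground-truth and predicted positions of M test samples, all taken at the same horizon T.

```python
import math

def rmse_at_horizon(true_pts, pred_pts):
    """Eq. (22): RMSE over M samples at one prediction horizon T.

    true_pts, pred_pts: lists of M (x, y) tuples.
    """
    M = len(true_pts)
    sq = sum((x - xp) ** 2 + (y - yp) ** 2
             for (x, y), (xp, yp) in zip(true_pts, pred_pts))
    return math.sqrt(sq / M)

# Both predictions are off by a 3-4-5 triangle, so the RMSE is 5
value = rmse_at_horizon([(0.0, 0.0), (1.0, 1.0)], [(3.0, 4.0), (4.0, 5.0)])
```

Reporting this value separately for T = 1 s through 5 s gives one RMSE per horizon, which is how the results are compared across models.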

4.3. Implementation Details

During the training process, we set the dimensions of the LSTM encoder ($D_{enc}$), the LSTM decoder ($D_{dec}$), the local spatial feature ($D_1$), the motion feature ($D_2$), and the random noise ($D_r$) to 64, 128, 32, 32, and 8, respectively. The batch size is set to 128, and Adam [45] is the adopted optimizer. The learning rate is set to 0.0001. The entire model is implemented in the PyTorch framework and trained on an NVIDIA GTX 1080Ti GPU.

4.4. Ablation Study

This section describes an ablation study conducted to analyze the components of our STAM-LSTM model, including the spatial attention layer (SA), the temporal attention layer (TA), and the motion feature (MF). To verify the effect of the SA, TA, and MF blocks, we tested two variants, SA-LSTM (without the TA) and STA-LSTM (without the MF), together with our full STAM-LSTM model on the NGSIM datasets, using the root mean squared error (RMSE) to evaluate these models. According to the quantitative results in Table 1, the SA-LSTM model greatly improves the prediction performance compared with the V-LSTM model, which only uses the historical trajectories of the target vehicle, especially in long-term trajectory prediction. This shows that information about surrounding agents improves performance in the vehicle-trajectory-prediction task and that the SA module can effectively extract the spatial interactions among nearby vehicles. Moreover, it indicates that the graph attention network (GAT) is suitable for capturing the social relationships between vehicles on the highway. The STA-LSTM model further improves the prediction accuracy, which indicates that the TA module can capture the temporal relevance of vehicles and assign proper weights to the different time steps of the historical trajectories. Finally, our STAM-LSTM model, which includes SA, TA, and MF, achieves even better performance. The RMSE values at the 1 to 5 s horizons decrease by 23.21%, 15.04%, 13.51%, 9.89%, and 8.95%, respectively, suggesting that the dynamic information is an effective feature in vehicle-trajectory prediction and can better represent the multi-modal features of vehicles under different traffic scenarios.

4.5. Baselines

Constant Velocity (CV) [1]: This model uses the constant-velocity Kalman filter to predict the deterministic trajectory of each vehicle.
Vanilla-LSTM (V-LSTM): The model is based on the LSTM encoder–decoder model, which only uses past trajectories of the target vehicle to predict future trajectories.
Maneuver-LSTM (M-LSTM) [28]: This model applies the encoder to encode historical trajectories of the target vehicle and its neighbors, and the decoder generates the multi-modal trajectory predictions according to the output of the encoder and the maneuver-encoding vector.
Social-LSTM (S-LSTM) [11]: The model uses the fully connected network as the social pooling layer for sharing information and generates the uni-modal future trajectory.
Convolutional Social LSTM (CS-LSTM) [12]: This model utilizes the convolutional social pooling to capture the spatial interactions, and the encoder–decoder module is used to generate the multi-modal trajectory distributions of vehicles.
Multi-Agent Tensor Fusion (MATF) [10]: This model extracts a multi-agent tensor, which includes the scene context and the historical trajectories of multiple vehicles. Then, the GAN network is used to predict the future trajectories of agents.
Multi-Head Attention LSTM (MHA-LSTM) [14]: This model applies the multi-head attention mechanism to capture the high-order spatial-temporal interactions of vehicles.
MHA-LSTM(+f): The model takes the velocity, acceleration, and class information as additional features based on the MHA-LSTM model.

4.6. Quantitative Analysis

The results of our comparison with existing methods are shown in Table 2. All models were tested on the NGSIM dataset and evaluated by the RMSE metric. Firstly, the deep-learning-based models outperform the CV model, which demonstrates that deep learning is well suited to the trajectory-prediction task. Meanwhile, the methods that consider the influence of surrounding vehicles are better than the V-LSTM model. This indicates that information on neighboring agents helps to improve the performance of vehicle-trajectory prediction. Secondly, the interactive methods (S-LSTM, CS-LSTM, MATF, MHA-LSTM, MHA-LSTM(+f), STAM-LSTM) have better prediction performance than the non-interactive models (V-LSTM, M-LSTM), which demonstrates that the social interactions between vehicles are essential for generating more accurate vehicle trajectories. Among them, the attention-based models (MHA-LSTM, MHA-LSTM(+f), STAM-LSTM) have lower RMSE values than the models using the pooling scheme (S-LSTM, CS-LSTM, and MATF), suggesting that the attention mechanism is more suitable for extracting the social relationships between vehicles on a highway. Thirdly, we notice that the additional information (velocity and acceleration) provides effective features and helps to better represent the comprehensive state of vehicles in the trajectory-prediction task. The models that use it (MHA-LSTM(+f) and STAM-LSTM) perform better than the approaches without this type of information.

Finally, compared with the state-of-the-art method MHA-LSTM(+f), our model has a slightly higher RMSE value at the 1 s horizon. We think that the random noise brings uncertainty into the model and thereby affects short-term trajectory prediction. However, our STAM-LSTM achieves higher prediction accuracy at the 2 to 5 s prediction horizons. The reasons can be summarized as follows:

The spatial attention mechanism can extract better spatial interactions between neighboring vehicles, and the graph-based mechanism is suitable for capturing the social relationship in vehicle-trajectory prediction.
The temporal attention mechanism can effectively capture the different influences of different historical time steps and assigns appropriate weight to the relevant feature representation learned by the encoder. Therefore, the decoder can utilize more valuable information for generating the future trajectories of vehicles, especially in long-term trajectory prediction.

4.7. Qualitative Analysis

In this section, we visualize several predicted trajectories under mild, moderate, and crowded traffic scenarios to qualitatively demonstrate the prediction performance of our proposed model. All the results are sampled from the NGSIM dataset. As shown in Figure 5, the yellow car is the target car and the gray cars are its neighboring vehicles. The blue lines represent the historical trajectories of each vehicle. The red lines are the ground-truth trajectories, and the green lines are the trajectories predicted by our model. Overall, our STAM-LSTM model can effectively learn the motion patterns of vehicles under different traffic conditions.

As shown in Figure 5a, the vehicles are driving at a high speed in a mild-traffic scenario. The predicted trajectories are almost the same as the ground truth. In the moderate-traffic condition (Figure 5b), the cars move relatively slowly due to the surrounding neighbors, while in crowded traffic, the motion of vehicles is more complex. The predicted trajectories within the first 3 s are close to the ground-truth trajectories, and our model can learn the trend of the vehicle trajectory for the longer term (4 and 5 s), as shown in Figure 5c. In the above traffic conditions, the cars keep their lanes. To test the prediction performance in a more difficult scenario, Figure 5d shows a lane change: our model can predict the intention to switch lanes and achieves high accuracy in roughly the first 2 s.

5. Conclusions

In this paper, we have proposed a STAM-LSTM model for vehicle-trajectory prediction, which can effectively extract spatial-temporal features and a motion feature representation. In our method, the spatial attention mechanism is used to model the social interactions between the target vehicle and its surrounding neighbors. The temporal attention mechanism is designed to capture temporal relevance by assigning proper weights to the different historical time steps. In addition, we add dynamic information (velocity and acceleration) to the feature vectors, which helps to extract better multi-modal features of vehicles and to make more accurate trajectory predictions. Extensive experiments on the public NGSIM datasets (US-101 and I-80) demonstrate that our model has better prediction performance compared with state-of-the-art methods.

In future work, we would like to verify the prediction performance of our model in more complex traffic scenarios, such as roundabouts. Furthermore, we will apply a graph neural network to represent the topological structure among all vehicles and reduce the complexity of the whole model.

Author Contributions

Conceptualization, Rui Jiang; methodology, Rui Jiang; software, Rui Jiang; validation, Rui Jiang, Hongyun Xu, Yong Kuang and Zhikang Liu; formal analysis, Rui Jiang; investigation, Rui Jiang, Yong Kuang and Zhikang Liu; resources, Hongyun Xu; writing (original draft preparation), Rui Jiang; writing (review and editing), Rui Jiang, Hongyun Xu and Gelian Gong; visualization, Rui Jiang; supervision, Hongyun Xu and Gelian Gong. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 61272403.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm, accessed on 20 November 2021.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Architecture of the proposed STAM-LSTM model.

Figure 2. Illustration of the spatial attention layer.

Figure 3. Structure of LSTM unit.

Figure 4. Illustration of the temporal attention layer.

Figure 5. Visualized prediction results under different traffic scenarios. The historical trajectory, ground-truth trajectory and predicted trajectory are denoted by blue, red, and green lines, respectively.

Table 1. RMSE (m) of the model variants on the NGSIM dataset for prediction horizons of 1–5 s. SA-LSTM is the variant without the temporal attention layer, and STA-LSTM is the variant without the motion feature.

Model Variant	1 s	2 s	3 s	4 s	5 s
V-LSTM	0.68	1.65	2.91	4.46	6.27
SA-LSTM	0.61	1.24	2.01	2.80	3.72
STA-LSTM	0.56	1.13	1.85	2.63	3.53
STAM-LSTM	0.43	0.96	1.60	2.37	3.24
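
As a reading aid for Table 1, the short sketch below computes the relative RMSE reduction of the full STAM-LSTM model over each variant at every horizon. The numbers are copied from the table; the helper name `relative_reduction` is a hypothetical name introduced here, and the percentages are derived arithmetic rather than results reported by the authors.

```python
# RMSE values (m) copied from Table 1; list positions are the 1-5 s horizons.
rmse = {
    "V-LSTM":    [0.68, 1.65, 2.91, 4.46, 6.27],
    "SA-LSTM":   [0.61, 1.24, 2.01, 2.80, 3.72],
    "STA-LSTM":  [0.56, 1.13, 1.85, 2.63, 3.53],
    "STAM-LSTM": [0.43, 0.96, 1.60, 2.37, 3.24],
}

def relative_reduction(baseline, model):
    """Percentage by which `model` lowers RMSE relative to `baseline`."""
    return [100.0 * (b - m) / b for b, m in zip(baseline, model)]

for name in ("V-LSTM", "SA-LSTM", "STA-LSTM"):
    gains = relative_reduction(rmse[name], rmse["STAM-LSTM"])
    print(name, [f"{g:.1f}%" for g in gains])
```

At the 5 s horizon, for example, the reduction relative to V-LSTM is (6.27 - 3.24) / 6.27, or roughly 48%, while the reduction relative to STA-LSTM (the variant without the motion feature) is roughly 8%.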

Table 2. Comparison of RMSE (m) between the baselines and our STAM-LSTM model over the 5 s prediction horizon on the NGSIM dataset (lower is better; the best result at each horizon is marked with *).

Prediction Horizon (s)	CV	V-LSTM	M-LSTM	S-LSTM	CS-LSTM	MATF	MHA-LSTM	MHA-LSTM(+f)	STAM-LSTM
1	0.73	0.68	0.58	0.65	0.61	0.67	0.56	0.41*	0.43
2	1.78	1.65	1.26	1.31	1.27	1.34	1.22	1.01	0.96*
3	3.13	2.91	2.12	2.16	2.09	2.08	2.01	1.74	1.60*
4	4.78	4.46	3.24	3.25	3.10	2.97	3.00	2.67	2.37*
5	6.68	6.27	4.66	4.55	4.37	4.13	4.25	3.83	3.24*
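
For readers who want to reproduce numbers in the style of Table 2, here is a minimal sketch of a per-horizon RMSE computation. It assumes the common definition (the root of the mean squared Euclidean distance between predicted and ground-truth positions at each future time step, averaged over test samples); the function name `rmse_per_horizon` and the array layout are assumptions for illustration, and the paper's exact metric is the one given in Section 4.2.

```python
import numpy as np

def rmse_per_horizon(pred, truth):
    """RMSE at each future time step.

    pred, truth: arrays of shape (N, T, 2) holding (x, y) positions in metres
    for N samples and T future time steps (layout assumed for illustration).
    """
    sq_err = np.sum((pred - truth) ** 2, axis=-1)   # (N, T) squared distances
    return np.sqrt(sq_err.mean(axis=0))             # (T,) one RMSE per step

# Toy check: every prediction is offset by (3, 4) m, a distance of exactly 5 m.
truth = np.zeros((4, 5, 2))
pred = truth.copy()
pred[..., 0] += 3.0
pred[..., 1] += 4.0
print(rmse_per_horizon(pred, truth))  # every entry equals 5.0
```

Reporting one value per horizon, rather than a single average, is what exposes the growth of the error with prediction time that is visible in each column of the table.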

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Jiang, R.; Xu, H.; Gong, G.; Kuang, Y.; Liu, Z. Spatial-Temporal Attentive LSTM for Vehicle-Trajectory Prediction. ISPRS Int. J. Geo-Inf. 2022, 11, 354. https://doi.org/10.3390/ijgi11070354
