SIT: A Spatial Interaction-Aware Transformer-Based Model for Freeway Trajectory Prediction

Abstract: Trajectory prediction is one of the core functions of autonomous driving. Modeling spatial-aware interactions and temporal motion patterns of observed vehicles is critical for accurate trajectory prediction. Most recent works on trajectory prediction utilize recurrent neural networks (RNNs) to model temporal patterns and usually need convolutional neural networks (CNNs) additionally to capture spatial interactions. Although the Transformer, a multi-head attention-based network, has shown notable ability in many sequence-modeling tasks (e.g., machine translation in natural language processing), it has not been explored much in trajectory prediction. This paper presents a Spatial Interaction-aware Transformer-based model, which uses the multi-head self-attention mechanism to capture both the interactions of neighboring vehicles and the temporal dependencies of trajectories. The model applies a GRU-based encoder-decoder module to make the prediction. Besides, unlike methods that consider the spatial interactions only among observed trajectories in both the encoding and decoding stages, our model also considers the potential spatial interactions between future trajectories in decoding. The proposed model was evaluated on the NGSIM dataset. Compared with other baselines, our model exhibited better prediction precision, especially for long-term prediction.


Introduction
In the past few years, there has been increasing interest in autonomous driving, as automated vehicles have the potential to eliminate human error from car accidents, which will help protect drivers and passengers and reduce economic damage. However, there remains a long way to go for autonomous driving to replace human driving completely. The road environment is highly dynamic and complicated due to the interactions among road agents, such as cars, trucks, and pedestrians. For safe and efficient driving, autonomous vehicles need to detect and identify other objects and anticipate and react to how these objects behave in the short-term future as humans do. Therefore, predicting the trajectory of other road agents is fundamental for the autonomous vehicle to make wise decisions.
Trajectory prediction is a rather challenging problem for the following reasons. First, there is an interdependency among vehicles: the behavior of one vehicle affects that of others [1]. For example, a human driver will usually slow down his or her car when the front vehicle is braking. Therefore, to precisely predict a vehicle's trajectory, a trajectory prediction model should also anticipate the trajectories of this vehicle's neighbors and consider the potential future interactions among them. Second, errors accumulate. Trajectory prediction models usually predict a vehicle's next position based on its current and previous positions; as a result, the model accumulates errors at each step, leading to poor performance in long-term trajectory prediction. Third, the trajectory tends to be highly nonlinear over time due to the driver's decisions [2], which poses a severe challenge for both traditional dynamic models and machine learning models.
Most of the recent studies on trajectory prediction use deep learning methods. To model the interactions among vehicles, previous studies have attempted to represent the spatial information of vehicles as lane-based social tensors or graph structures and apply pooling layers to obtain the social context encoding. Although these methods capture the spatial interaction of the historical trajectories of the target vehicle and its neighbors in the encoding stage, they only predict the target vehicle's future trajectory when decoding and ignore the potential future interactions between the target vehicle and its neighbors. While the Transformer [3], a multi-head attention-based network, has outperformed RNNs in many sequence-modeling tasks (e.g., machine translation in natural language processing), it has not been explored much in trajectory prediction. Moreover, previous works usually use two Transformer layers to separately model the temporal dependency of trajectories and the spatial interdependency of vehicles [4,5].
In this paper, we present a spatial interaction-aware Transformer-based model. Unlike the standard Transformer layer, which contains only one multi-head self-attention module, the novel spatial interaction-aware Transformer (SIT) contains two multi-head self-attention modules. Specifically, these two attention modules have two different attention masks, one for capturing temporal dependencies of trajectories and another for modeling spatial interactions among vehicles. The proposed SIT provides a neat and efficient solution to integrate temporal and spatial context information based on the self-attention mechanism only. By stacking multiple SIT layers, our model can capture more complex and abstract temporal and spatial information. Moreover, the proposed model contains a GRU-based encoder-decoder module on top of the SIT layers for making the final prediction. When decoding, at each time step, the decoder accesses the previous step's hidden states of all observed vehicles and uses a multi-head self-attention module to guide the message passing and model the potential future interactions among these vehicles.
We evaluate our method on the public NGSIM US-101 and I-80 datasets. The experimental results show that our method outperforms other baselines with substantial performance improvement. We further conduct ablation studies to demonstrate the superiority of our method over its variants that use the standard Transformer layers or standard GRU encoder-decoder.
The main contributions of this work are summarized as follows:
• A spatial interaction-aware Transformer-based model is proposed to efficiently capture and integrate temporal dependencies of trajectories and spatial interactions among vehicles.
• A decoder that considers message passing for all vehicles is applied to model the potential future interactions among observed vehicles.
Transformers, based on attention mechanisms, have dominated Natural Language Processing (NLP) in recent years [22][23][24][25][26]. Due to the absence of recurrence, this architecture is more capable of long-term dependency modeling and parallelization training than RNNs. Yu et al. [4] apply two separate Transformers to, respectively, extract spatial and temporal interactions among pedestrians. However, the Transformer architecture has not been explored much in vehicle trajectory prediction.

Spatial Interaction Modeling
Conventional approaches [27][28][29] usually predict the future trajectory of the target object only based on its current state and track history. However, in a crowded road environment, relying only on the trajectory history of the target may lead to inaccurate prediction results, especially for long-term predictions [1]. To model the spatial interaction among vehicles or pedestrians, some studies feed the track history of the target and its surrounding objects to the predictor and use CNNs [2,18,19], attention mechanism [4,18,30,31] or GNNs [8,20,21] to implement message passing among these objects.
Alahi et al. [13] connect neighboring LSTMs through the social-pooling strategy, which allows spatially proximal LSTMs to share information with each other. Deo et al. [2] represent neighboring objects by a social tensor and propose a convolutional social pooling to improve the social pooling method proposed in [13].
Compared to the pooling methods, the attention mechanism can estimate the importance of different neighbors to a given object. Zhang et al. [14] propose a motion gate and a pedestrian-wise attention module to adaptively focus on the most useful neighboring information and guide the message passing. Yu et al. [4] capture spatio-temporal interactions by two separate spatial and temporal Transformers.
In a driving environment, we can regard the vehicles or pedestrians and their interactions as a graph in which the nodes and edges, respectively, represent the objects and the spatial interactions among them. As GNNs naturally fit for graph-structured data, they are also applied to address spatial interaction modeling. Li et al. [20] use a graph to represent the interactions of neighboring objects and apply several graph convolutional blocks to extract features. Yu et al. [4] and Pang et al. [5] utilize a spatial Transformer to model the neighboring objects as a graph and apply a Transformer-based message-passing graph convolution to capture the social interactions. Peng et al. [32] utilize social relation attentions to model spatial interactions based on the relative positions of pedestrians. To avoid modeling multi-agent trajectories in the time and social dimensions separately, Yuan et al. [33] propose an Agent-aware Transformer to leverage a sequence representation of multi-agent trajectories by flattening trajectory features across time and agents.
Although these studies recognize the interactions among neighboring objects by modeling their spatial relationships, they only consider the interactions among the observed trajectories and ignore the potential interactions between the future trajectories of the target vehicle and its neighbors in the prediction phase.

Problem Formulation
This work formulates the trajectory prediction problem as predicting the future trajectories of all objects in an observed scene based on their historical trajectories. Considering that it is easier to predict the velocity of an object than to predict its location [20], we feed historical locations and velocities into our model and let the model predict the future velocities. Then, we accumulate the predicted velocities onto the last observed locations to obtain the final location predictions.
As described above, the inputs X of our model are the historical trajectories and velocities of all observed vehicles over t_h time steps:

X = {X_1, X_2, ..., X_{t_h}},

where X_t = {(x_t^i, y_t^i, u_t^i, v_t^i) | i = 1, ..., n} represents the coordinates (x, y) and velocities (u, v) of all vehicles in the observed scene at time t, and n is the number of observed vehicles. The outputs Y of our model are the predicted future velocities of all observed vehicles from time step t_h + 1 to t_h + t_f, where t_f is the prediction horizon:

Y = {Y_{t_h+1}, Y_{t_h+2}, ..., Y_{t_h+t_f}},

where Y_t = {(u_t^i, v_t^i) | i = 1, ..., n}. Following [2,20], the vehicles are observed within 90 feet from the center of the target vehicle. Given a traffic scene with t_h observed frames, the model first preprocesses the raw trajectory data into the input representation X ∈ R^{n×t_h×c}. After two subsequent operations, Embedding and Positional Encoding, the proposed SIT layers capture the temporal dependencies and the spatial interactions. Then, a GRU-based encoder-decoder module makes the final predictions. For each decoding step, the decoder allows message passing between all objects to capture the potential interactions.
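To make the accumulation step concrete, the following sketch (the function name and the 0.2 s sampling interval are illustrative assumptions) converts predicted velocities back into positions by cumulatively summing displacements onto the last observed locations:

```python
import numpy as np

def velocities_to_positions(last_pos, pred_vel, dt=0.2):
    """Accumulate predicted velocities onto the last observed locations.

    last_pos: (n, 2) last observed (x, y) of n vehicles.
    pred_vel: (n, t_f, 2) predicted (u, v) over t_f future steps.
    dt: sampling interval in seconds (0.2 s assumes 5 Hz data).
    Returns (n, t_f, 2) predicted positions.
    """
    disp = np.cumsum(pred_vel * dt, axis=1)  # cumulative displacement
    return last_pos[:, None, :] + disp

last_pos = np.zeros((1, 2))
pred_vel = np.array([[[1.0, 0.0], [1.0, 0.0]]])  # constant velocity along x
pos = velocities_to_positions(last_pos, pred_vel, dt=1.0)
# pos[0] -> [[1, 0], [2, 0]]
```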

Input Representation
Following [20], for subsequent efficient computation, we do not directly feed the raw trajectory data of objects into our model. Given a traffic scene, assuming there are n objects observed in the past t h time steps, we preprocess the raw data into a three-dimensional tensor X ∈ R n×t h ×c , as shown in Figure 1. We set c = 4 to mark an object's coordinate (x, y) and velocity (u, v) at a time step, and normalize all coordinates and velocities to the range of (−1, 1).
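A minimal sketch of this preprocessing step, assuming a simple per-channel min-max scaling (the paper does not specify the exact normalization scheme):

```python
import numpy as np

def normalize_features(X):
    """Scale each channel of X (n, t_h, c) to the range (-1, 1).

    Per-channel min-max scaling over all vehicles and time steps; the
    paper does not specify the scheme, so this is one plausible choice.
    """
    lo = X.min(axis=(0, 1), keepdims=True)
    hi = X.max(axis=(0, 1), keepdims=True)
    return 2.0 * (X - lo) / (hi - lo + 1e-9) - 1.0

X = np.arange(24, dtype=float).reshape(2, 3, 4)  # toy (n=2, t_h=3, c=4) tensor
Xn = normalize_features(X)
```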

Spatial Graph Construction
In traffic scenarios, a vehicle's movement is greatly affected by that of its surrounding vehicles. Therefore, we represent the interdependencies among vehicles as undirected graphs. Specifically, for each observed time step t, we construct an undirected graph G_t = {V_t, E_t}, in which the nodes V_t and the edges E_t, respectively, represent the objects and the spatial interactions among them. The node set at time step t is defined as V_t = {v_t^i | i = 1, 2, ..., n}, while the edge set E_t = {e_t^{ij}} contains an edge between v_t^i and v_t^j if and only if they are neighbors. At each time step t, we consider that a spatial interaction happens only when the current distance between two objects is shorter than a threshold T_close and the two objects are on the same or neighboring lanes, i.e., abs(lane_i − lane_j) ≤ T_lane. For computational efficiency, we represent E_t as an adjacency matrix A_t ∈ R^{n×n}: at each time step t, A_t[i, j] = 1 if e_t^{ij} ∈ E_t and A_t[i, j] = 0 otherwise, where n is the number of observed vehicles. Given n vehicles' observed trajectories with a length of t_h time steps, we obtain the adjacency matrices A = {A_1, A_2, ..., A_{t_h}}.

Each raw input x_t^i is mapped into a d_model-dimensional embedding e_t^i = φ(x_t^i; W_e), where W_e is the embedding weight. This paper uses a multilayer perceptron (MLP) as the embedding network φ.
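The adjacency construction described above can be sketched as follows; the function name is hypothetical, and the self-loop on the diagonal is an added assumption so that each vehicle can attend to itself:

```python
import numpy as np

T_CLOSE = 50.0  # neighboring distance threshold, in feet
T_LANE = 1      # lane difference limit

def build_adjacency(xy, lane):
    """Build the adjacency matrix A_t for one time step.

    xy: (n, 2) coordinates; lane: (n,) integer lane ids.
    A_t[i, j] = 1 iff dist(i, j) < T_CLOSE and |lane_i - lane_j| <= T_LANE.
    The self-loop on the diagonal is an added assumption.
    """
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    lane_ok = np.abs(lane[:, None] - lane[None, :]) <= T_LANE
    A = ((dist < T_CLOSE) & lane_ok).astype(float)
    np.fill_diagonal(A, 1.0)  # let each vehicle attend to itself
    return A

xy = np.array([[0.0, 0.0], [10.0, 0.0], [200.0, 0.0]])
lane = np.array([1, 2, 1])
A = build_adjacency(xy, lane)
# vehicles 0 and 1 are neighbors; vehicle 2 is too far from both
```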

Positional Encoding
Although the Transformer architecture can capture longer sequence dependencies and obtain a massive speed-up in training by avoiding the recurrence mechanism of RNNs, it does not have any sense of order for the elements in a sequence. Consequently, it is vital to incorporate the order of the input elements into the Transformer model, especially when handling time-series data, e.g., trajectory data. Therefore, in this paper, each input embedding e_t^i is time-stamped with its time t by adding a positional encoding vector pos_t to form h_t^i. Both e_t^i and pos_t have the same dimensionality d_model. For simplicity, we initialize the positional encoding vectors as a matrix P ∈ R^{t_h×d_model}, in which each row vector P[t] represents the positional encoding vector of time step t. Thus, h_t^i = e_t^i + P[t]. This ensures a unique time stamp for each historical location of an object. The matrix P is optimized together with the rest of the model during training.
By performing the above two operations on each x_t^i for i ∈ [1, n] and t ∈ [1, t_h], we obtain H ∈ R^{n×t_h×d_model}, which is the input of the first spatial interaction-aware Transformer layer.
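A minimal sketch of the embedding and positional encoding steps, using a single linear projection as a stand-in for the MLP embedding network φ (random weights are shown only for shape-checking):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t_h, c, d_model = 3, 16, 4, 128

X = rng.standard_normal((n, t_h, c))            # preprocessed input tensor
W_e = rng.standard_normal((c, d_model)) * 0.02  # embedding weight (linear stand-in for the MLP)
P = rng.standard_normal((t_h, d_model)) * 0.02  # learnable positional table, one row per time step

E = X @ W_e            # e_t^i = phi(x_t^i; W_e)
H = E + P[None, :, :]  # h_t^i = e_t^i + P[t], broadcast over vehicles
```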
Spatial Interaction-Aware Transformer Layer

Unlike the standard Transformer encoder layer, which is only suited to modeling the temporal dependency, the proposed spatial interaction-aware Transformer (SIT) layer can capture and integrate temporal dependencies of trajectories and spatial interactions among vehicles. As shown in Figure 2, compared to the standard Transformer layer, our SIT additionally contains a Spatial Graph Multi-Head Attention Network, which is used to capture the spatial interactions among nearby vehicles based on the obtained adjacency matrices A. The following content describes how an SIT layer models temporal dependencies of trajectories and spatial interactions among vehicles using the Temporal Multi-Head Attention module and the Spatial Graph Multi-Head Attention Network.

Temporal Multi-Head Attention Module
Similar to the standard Transformer encoder layer, SIT uses a masked multi-head attention module to capture the temporal dependency of each vehicle's trajectory independently. This masked attention module prevents a step from attending to subsequent steps. Given the input H ∈ R^{n×t_h×d_model}, the attention module first computes the query, key, and value matrices:

Q = H W^Q, K = H W^K, V = H W^V,

where W^Q, W^K, W^V are learnable projection weights. Then, the masked attention for vehicle i at time step t is computed as:

Attention_t^i = softmax((Q_t^i (K^i)^T + M) / sqrt(d_k)) V^i,

where the mask M ensures that the current step can only attend to itself and its previous steps. Similarly, we obtain the masked multi-head attention (k heads) for vehicle i at time step t:

T_t^i = f_O(concat(head_1, ..., head_k)), where head_j = Attention_j(t),

where f_O is a fully connected layer that merges the k heads' information. After calculating the multi-head attention T_t^i for each vehicle i ∈ [1, n] and each time step t ∈ [1, t_h], we obtain T ∈ R^{n×t_h×d_model}, which contains the extracted temporal information of the historical trajectories.
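The masked temporal attention above can be sketched for a single head and a single vehicle as follows (the function name is hypothetical; multi-head stacking and the output projection f_O are omitted for brevity):

```python
import numpy as np

def causal_attention(H, W_q, W_k, W_v):
    """Single-head masked temporal attention for one vehicle's trajectory.

    H: (t_h, d) hidden states; each step may attend only to itself
    and to earlier steps, enforced by an upper-triangular mask.
    """
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    t = H.shape[0]
    mask = np.triu(np.ones((t, t)), k=1).astype(bool)  # future positions
    scores[mask] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
H = rng.standard_normal((6, 8))
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = causal_attention(H, W_q, W_k, W_v)
# the first step can only attend to itself, so out[0] equals (H @ W_v)[0]
```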
In contrast, the Spatial Graph Multi-Head Attention Network restricts the spatial message passing so that, at each time step, it happens only between neighboring vehicles.

Spatial Graph Multi-Head Attention Network
Based on the obtained T ∈ R^{n×t_h×d_model} and the adjacency matrices A, a spatial graph multi-head attention network is applied to extract the spatial interactions among the observed vehicles.
The self-attention mechanism can be regarded as message passing on an undirected, fully connected graph. For a time step t, we take the n vehicles' features {h_t^i}_{i=1}^n ∈ R^{n×d_model} from T and compute, for each vehicle i, the query, key, and value vectors:

q_t^i = h_t^i W^Q, k_t^i = h_t^i W^K, v_t^i = h_t^i W^V,

and define the message passing from vehicle j to vehicle i in the fully connected graph as m_{j→i} = (q_t^i · k_t^j) / sqrt(d_k); then the attention at time step t can be calculated as:

Attention_t^i = Σ_{j=1}^{n} softmax_j(m_{j→i}) v_t^j.

However, it is inefficient to regard the spatial interactions among vehicles as a fully connected graph. Therefore, we use the adjacency matrices A in place of the fully connected graph, which ensures that the message passing from vehicle j to vehicle i at time step t happens only when the current distance between the two vehicles is shorter than a threshold T_close and the two vehicles are on the same or neighboring lanes, as shown in Figure 3b. We can then rewrite the attention calculation of vehicle i at time step t:

Attention_t^i = Σ_{j∈N(i)} softmax_j(m_{j→i}) v_t^j,

where N(i) = {j | A_t[i, j] = 1, j ∈ [1, n]} is the neighbor set of vehicle i. Similarly, we obtain the multi-head attention (k heads) of vehicle i at time step t:

S_t^i = f_O(concat(head_1, ..., head_k)), where head_j = Attention_j(i),

where f_O is a fully connected layer that merges the k heads' information. After calculating the multi-head attention S_t^i for each vehicle i ∈ [1, n] and each time step t ∈ [1, t_h], we obtain S ∈ R^{n×t_h×d_model}, which contains the extracted interaction information among the observed vehicles. We stack multiple SIT layers to capture more complex and abstract temporal and spatial information.
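The adjacency-masked spatial attention can be sketched for a single head at one time step as follows; the adjacency matrix is assumed to include self-loops so that every row has at least one permitted neighbor:

```python
import numpy as np

def spatial_graph_attention(Ht, A, W_q, W_k, W_v):
    """Single-head spatial attention among vehicles at one time step.

    Ht: (n, d) vehicle features; A: (n, n) 0/1 adjacency matrix.
    Message passing is restricted to neighbors, i.e. A[i, j] == 1.
    """
    Q, K, V = Ht @ W_q, Ht @ W_k, Ht @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores[A == 0] = -np.inf  # block messages from non-neighbors
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
Ht = rng.standard_normal((4, 8))
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))
# with only self-loops, every vehicle attends solely to itself
out = spatial_graph_attention(Ht, np.eye(4), W_q, W_k, W_v)
```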

Trajectory Prediction Module
We apply a GRU-based encoder-decoder module to predict the future trajectories of all observed vehicles. The outputs of the last SIT layer will be fed into the GRU encoder. At the first decoding step, both the hidden feature of the encoder and the velocities of all objects at the last observed time step are fed into the decoder to predict vehicles' velocities. For the following decoding steps, the decoder takes both the hidden feature of itself and the predicted velocities of all objects at the previous time step as inputs to make the prediction.
However, such a decoding process ignores the potential interactions among the future trajectories of the observed vehicles. To model those potential interactions, at each decoding step, our decoder accesses the previous step's hidden features of the vehicles and uses a multi-head self-attention module to guide the message passing among them. Then, the decoder takes the interacted hidden features instead of the original hidden features as input to make the final prediction, as shown in Figure 1.
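The interaction-aware decoding loop can be sketched as follows. The GRU cell and the output projection are stand-ins passed as callables, and the identity-projection self-attention is a simplification of the multi-head module; all names here are hypothetical:

```python
import numpy as np

def self_attention(H):
    """Unmasked self-attention over vehicles' hidden states (identity projections)."""
    d = H.shape[-1]
    s = H @ H.T / np.sqrt(d)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ H

def decode(h0, v_last, cell, out_proj, t_f):
    """Interaction-aware decoding loop (sketch).

    h0: (n, d) encoder hidden states; v_last: (n, 2) last observed velocities.
    cell(x, h) -> h_next stands in for the GRU cell; out_proj maps hidden
    states to predicted velocities. Both are placeholders in this sketch.
    """
    h, v, preds = h0, v_last, []
    for _ in range(t_f):
        h_mix = self_attention(h)  # message passing among all vehicles
        h = cell(v, h_mix)         # recurrent update on the mixed states
        v = out_proj(h)            # predicted velocities for this step
        preds.append(v)
    return np.stack(preds, axis=1)  # (n, t_f, 2)

# trivial stand-ins, for shape-checking only
preds = decode(np.ones((2, 8)), np.zeros((2, 2)),
               cell=lambda x, h: h, out_proj=lambda h: h[:, :2], t_f=3)
```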

Implementation Details
Following Li et al. [8], we process a traffic scene within ±90 feet, and all vehicles in this scene are observed and their future trajectories predicted.
When constructing the adjacency matrices A, we set T_close = 50 feet. In the spatial interaction-aware Transformer layers, we let d_model = 128; the number of heads of the multi-head attention modules is 4; and the number of SIT layers is 2.
In the GRU-based encoder-decoder module, both the encoder and the decoder are two-layer GRUs. We set the number of hidden units of the GRUs to 60 and apply a tanh activation function to rescale the output of the decoder to the range of (−1, 1).
Our code is implemented using the PyTorch library [34], and we train our model as a regression task. The overall loss is calculated as:

L = (1/t_f) Σ_{t=t_h+1}^{t_h+t_f} ||Y_t^pred − Y_t^gold||^2,

where t_f is the number of time steps to be predicted in the future, and Y_t^pred and Y_t^gold are the predicted positions and the ground truth at time step t, respectively. We train the model using the Adam [35] optimizer with η = 0.001, β_1 = 0.9, and β_2 = 0.999. The learning rate is 0.0001. We set batch_size = 32 during training and apply Teacher Forcing in training to accelerate convergence.

Experimental Setting
This section presents the evaluation of the proposed model. For a fair comparison with other methods, our model was trained and evaluated on two publicly available datasets. We performed the experiments on a desktop running Ubuntu 18.04 with a 2.50 GHz Intel Xeon E5-2678 CPU, 32 GB of memory, and an NVIDIA 1080 Ti graphics card.

Dataset
The proposed model was trained and evaluated using the public NGSIM US-101 and I-80 datasets. Both datasets were captured at 10 Hz over 45 min and split into three 15 min periods. These periods represent mild, moderate, and congested traffic conditions. The two datasets consist of vehicles' trajectories in real freeway traffic. Each vehicle's trajectory was divided into segments of 8 s, where the first 3 s are used as the observed track history and the remaining 5 s are the prediction horizon. Following Deo et al. [2], the trajectory data were down-sampled from 10 Hz to 5 Hz, i.e., five frames per second. The two datasets were merged into one dataset, which was randomly shuffled and divided into training, validation, and test sets at a ratio of 7:1:2. The following experimental evaluations are conducted on the test set. The code for data preprocessing and dataset segmentation can be downloaded at GitHub (https://github.com/nachiket92/conv-social-pooling, accessed on 10 October 2021).
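The downsampling and segmentation described above can be sketched as follows; the non-overlapping segmentation and the function name are assumptions of this sketch:

```python
import numpy as np

def make_segments(track_10hz, t_h=15, t_f=25):
    """Downsample a 10 Hz track to 5 Hz and cut it into 8 s segments.

    At 5 Hz, 3 s of history is 15 frames and 5 s of future is 25 frames.
    track_10hz: (T, 2) positions. Returns a list of (history, future) pairs.
    Non-overlapping segmentation is an assumption of this sketch.
    """
    track = track_10hz[::2]   # 10 Hz -> 5 Hz
    step = t_h + t_f          # 40 frames = 8 s at 5 Hz
    segs = []
    for s in range(0, len(track) - step + 1, step):
        seg = track[s:s + step]
        segs.append((seg[:t_h], seg[t_h:]))
    return segs

segs = make_segments(np.zeros((80, 2)))  # 8 s of 10 Hz data -> one segment
```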

Evaluation Metrics
We use the same evaluation metrics as other methods [2,18] and report our evaluation results in terms of the root mean squared error (RMSE) of the predicted future trajectories for each time step within the 5 s prediction horizon. The RMSE at time step t is calculated as follows:

RMSE_t = sqrt((1/m) Σ_{i=1}^{m} ||Y_t^{pred,i} − Y_t^{gold,i}||^2),

where m is the number of vehicles in the test dataset, and Y_t^{pred,i} and Y_t^{gold,i} are the predicted position and the ground truth of vehicle i at time step t, respectively.
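The per-step RMSE metric can be computed as in the following sketch (the function name is hypothetical):

```python
import numpy as np

def rmse_per_step(Y_pred, Y_gold):
    """RMSE at each future time step, as defined above.

    Y_pred, Y_gold: (m, t_f, 2) predicted and ground-truth (x, y) positions.
    Returns a (t_f,) array of RMSE values in the same unit as the inputs.
    """
    sq = np.sum((Y_pred - Y_gold) ** 2, axis=-1)  # squared Euclidean error
    return np.sqrt(sq.mean(axis=0))               # average over vehicles

# every point off by (1, 1) -> RMSE = sqrt(2) at each of the 5 steps
errs = rmse_per_step(np.ones((4, 5, 2)), np.zeros((4, 5, 2)))
```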

Ablation Experiments on Neighboring Thresholds
As mentioned in Section 4.1.2, we introduce two thresholds to construct the neighboring graph: the neighboring distance threshold T_close and the lane difference limit T_lane.
In this subsection, we conduct two experiments to examine the impacts of different T_close and T_lane values on our model SIT-ID. The range of T_lane we apply in our ablation experiments is [0, 2], while the T_close values we select are 0, 30, 50, 70, and 90 feet. As shown in Figure 4a, when we fix T_lane = 1, T_close = 50 performs better than the other neighboring distance thresholds. From Figure 4b, we can see that the optimal lane difference limit is 1 when T_close = 50. These results indicate that considering too many neighboring vehicles, or none at all, degrades model performance. Based on these observations, we set T_close = 50 feet and T_lane = 1 as our default setting unless specified otherwise.

Ablation Experiments on the Proposed Model
In this subsection, we perform three ablation experiments on the proposed model SIT-ID. First, we compare the proposed SIT layers and the standard Transformer (ST) layers to verify whether our Spatial Graph Multi-Head Attention Network can improve precision by capturing the spatial interactions. ST-GD and SIT-GD both use a standard GRU encoder-decoder module to make predictions. The ST layer used here can only capture the temporal dependency of each vehicle's historical trajectory. As shown in Table 1, the SIT-GD model performs better than the ST-GD model in terms of RMSE values, especially for long-term future predictions. The SIT layers reduced the RMSE_5s value by 25.8% compared to the standard Transformer layers. This result shows that the proposed SIT layer can capture more useful information for trajectory prediction by using the Spatial Graph Multi-Head Attention Network to model the interactions among neighboring vehicles, which verifies the importance of the spatial interactions among vehicles in trajectory prediction. Second, to check the effectiveness of the GRU encoder in our framework, we compare two models: SIT-GD and SIT-WoE. SIT-WoE is the model without the GRU encoder; its GRU decoder directly takes as input the hidden state of the last step of the SIT layers. SIT-GD uses a standard GRU encoder-decoder to make predictions. As shown in Table 1, SIT-GD is slightly better than SIT-WoE; the RMSE_5s values of the two models are 4.40 and 4.48, respectively. This result confirms the effectiveness of the GRU encoder. However, we think the GRU encoder could be removed if we found a better way to utilize the hidden states of the SIT layers, such as adopting attention mechanisms or pooling methods. We leave this for future study.
Third, to validate the effect of considering potential interactions among the observed vehicles' future trajectories in decoding, we contrast the proposed interaction-aware GRU decoder with the standard GRU decoder. SIT-GD and SIT-ID both use two SIT layers to capture temporal and spatial dependencies, but the former uses the standard GRU encoder-decoder to make predictions, while the latter applies a standard GRU encoder and a spatial interaction-aware GRU decoder. As shown in Table 1, the latter further improves the RMSE values of long-term future predictions (e.g., RMSE_4s and RMSE_5s), which substantiates that considering the potential interactions among vehicles in decoding is also essential to trajectory prediction, especially long-term trajectory prediction.
To highlight the importance of modeling the spatial interactions, we report the results of these three models on congested traffic scenes. We consider a traffic scene congested when the number of its observed vehicles is equal to or greater than 12. From Tables 1 and 2, we can see that the models considering the spatial interactions, i.e., SIT-GD and SIT-ID, widened the gap with ST-GD in congested traffic scenes compared to non-congested traffic scenes. In congested traffic scenes, SIT-GD widened the gap from 25.8% to 38.6%, while SIT-ID widened it from 31.7% to 40.3%.

Compared Models
We compare the proposed model to the following baselines: Dynamic and static context-aware attention network (DSCAN) [18]: this method utilizes an attention mechanism to decide which surrounding vehicles are more important to the target vehicle and considers environment information by using a constraint network. Table 3 presents the RMSE values of the compared models. We observe that CV and V-LSTM yield much higher RMSE values than the other models. These two models only use the target vehicle's track history, while the other models utilize the surrounding vehicles' motion information. This result demonstrates that considering inter-vehicle interactions is essential to trajectory prediction. We note that CS-LSTM(M) leads to higher RMSE values than CS-LSTM. As mentioned in [2], this could be partly due to misclassified maneuvers.

Compared Results
We also note that our SIT-ID produces lower RMSE values than S-LSTM, CS-LSTM, and DSCAN, especially for long-term predictions, e.g., RMSE_4s and RMSE_5s. S-LSTM, CS-LSTM, and DSCAN do not consider the potential interactions in decoding. This result shows that considering the potential interactions among vehicles in decoding also significantly impacts trajectory prediction, especially long-term trajectory prediction.

Visualization of Prediction Results
We visualize a good and a bad prediction case selected from the test set in Figure 5a and Figure 5b, respectively. After observing 3 s of history trajectories, our SIT-ID predicts the trajectories over the following 5 s. We use different colors to distinguish different vehicles; the solid lines represent the observed trajectories, while the markers "+" and "•" represent the future ground truth and the predicted results, respectively. The red color corresponds to the car located in the middle, which is the target that CS-LSTM [2] and DSCAN [18] try to predict. The good case shows that our model can precisely predict the trajectories of all vehicles in an observed scene simultaneously. However, as can be seen from the bad case, our model performs poorly in the case of an emergency lane change that happens right after the observation stage. We think this is mainly because the NGSIM dataset contains too few samples with emergency lane changes. Therefore, in the near future, we would like to evaluate our model on other datasets, e.g., the Apollo dataset [10], in which data are captured not only on highways but also in urban areas.
Figure 5. Visualization of SIT-ID's prediction results. (a) A well-predicted example; (b) a poorly predicted example. Different colors represent different vehicles; the solid lines represent the observed trajectories, while the markers "+" and "•" represent the future ground truth and the predicted results, respectively. The red color corresponds to the car located in the middle, which is the target that CS-LSTM [2] and DSCAN [18] try to predict.

Attention Distribution Analysis
The Temporal Multi-Head Attention (TMHA) module and the Spatial Graph Multi-Head Attention Network (SGMA) are based on the attention mechanism. Attention in deep learning can be broadly interpreted as a vector of importance weights, which reflects how strongly one element is correlated with the other elements. Therefore, to further analyze the mechanism of our model, we visualize the attention distributions produced by the TMHA and SGMA of the last SIT layer of our model. Figure 6 shows a sample of temporal attention distributions calculated by the TMHA module. We use k-head attention mechanisms in both the TMHA and SGMA and set k = 4, so there are four different distributions corresponding to the different attention heads. Inspecting the attention distribution of head 2 in Figure 6, we note that for each time step, attention is mainly distributed over the current and the previous few steps, and the farther away in time, the lower the attention weight. This behavior is similar to that of human drivers: when driving, a human driver anticipates the motion of a neighboring vehicle usually based on the recent locations of this vehicle and does not consider its locations from long ago. We use masks to prevent steps from attending to subsequent steps, so the attention weights between a step and subsequent steps are masked to 0. Figure 7 presents a sample of spatial attention distributions calculated by the SGMA. The values in the grid are the Euclidean distances between the corresponding vehicles in feet. We note that the attention weights tend to be roughly symmetrical along the diagonal. Moreover, these weights are inversely related to the Euclidean distances, i.e., a smaller distance usually corresponds to a larger attention weight. This attention distribution is also similar to human behavior; at a given time step, a human driver pays more attention to vehicles closer to him or her.
The above analysis shows that the TMHA and SGMA used in our proposed SIT can effectively capture temporal dependencies of trajectories and spatial interactions of vehicles.

Conclusions
In highly dynamic traffic scenes, a vehicle's subsequent movements are affected by the interactions of its surrounding vehicles. Considering the interactions among vehicles, both in the historical trajectory encoding and the future trajectory decoding stages, is essential to trajectory prediction. Thus, this paper proposes a spatial interaction-aware Transformer-based model. In the encoding stage, the proposed Spatial Interaction-aware Transformer (SIT) layers are utilized to obtain useful context information for trajectory prediction. The SIT layer contains two key modules: the Temporal Multi-Head Attention module and the Spatial Graph Multi-Head Attention Network, which are applied to capture temporal dependencies of trajectories and spatial interactions among vehicles, respectively. In the decoding stage, a GRU-based encoder-decoder module is applied to make the final predictions. To consider the potential future interactions, at each decoding step, the decoder first accesses the last states of all observed vehicles and controls the message passing among them based on the multi-head attention mechanism, then makes a prediction for each vehicle.
The proposed model was evaluated using the public NGSIM US-101 and I-80 datasets. The main advantages of the proposed model are summarized as follows:
• The proposed SIT-based model predicts trajectories more accurately than other baselines, especially for long-term prediction and in highly interactive situations, because it considers interactions among vehicles in both the encoding and the decoding stages.
• The proposed SIT layers can effectively capture and integrate temporal dependencies of trajectories and spatial interactions among vehicles when encoding. In the ablation study, the SIT layers reduced the RMSE_5s value by 25.8% compared to the standard Transformer layers.
Because the datasets used in this work consist only of highway sections, which are simpler than typical traffic scenes, e.g., urban traffic scenes, our results have certain limitations in generalization. Considerably more work will be needed to adapt the model to complex environments and to incorporate traffic information, such as lane types and traffic lights.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: https://ops.fhwa.dot.gov/trafficanalysistools/ngsim.htm, accessed on 25 September 2021.

Conflicts of Interest:
The authors declare no conflict of interest.