MSTT: A Multi-Spatio-Temporal Graph Attention Model for Pedestrian Trajectory Prediction

Zhang, Qingrui; Zhang, Xuxiu; Ye, Zilang; Mi, Jing

doi:10.3390/s25154850


¹ School of Automation & Electrical Engineering, Dalian Jiaotong University, Dalian 116028, China

² School of Mechanical Engineering, Dalian Jiaotong University, Dalian 116028, China

³ Department of Mathematics, University of Padova, 35100 Padova, Italy

* Author to whom correspondence should be addressed.

Sensors 2025, 25(15), 4850; https://doi.org/10.3390/s25154850

Submission received: 23 June 2025 / Revised: 3 August 2025 / Accepted: 4 August 2025 / Published: 7 August 2025

(This article belongs to the Special Issue AI-Driving for Autonomous Vehicles)


Abstract

Accurate prediction of pedestrian movements is vital for autonomous driving, smart transportation, and human–computer interactions. To effectively anticipate pedestrian behavior, it is crucial to consider the potential spatio-temporal interactions among individuals. Traditional modeling approaches often depend on absolute position encoding to discern the positional relationships between pedestrians. Unfortunately, this method overlooks relative spatio-temporal relationships and fails to simulate ongoing interactions adequately. To overcome this challenge, we present a relative spatio-temporal encoding (RSTE) strategy that proficiently captures and analyzes this essential information. Furthermore, we design a multi-spatio-temporal graph (MSTG) modeling technique aimed at modeling and characterizing spatio-temporal interaction data across several individuals over time and space, with the goal of representing the movement patterns of pedestrians accurately. Additionally, an attention-based MSTT model has been developed, which utilizes an end-to-end approach for learning the structure of the MSTG. The findings indicate that an understanding of an individual’s preceding trajectory is crucial for forecasting the subsequent movements of other individuals. Evaluations using two challenging datasets reveal that the MSTT model markedly outperforms traditional trajectory-based modeling methods in predictive performance.

Keywords: pedestrian trajectory prediction; attentional mechanism; graph structure learning

1. Introduction

With the rapid advancement of autonomous driving technology, accurately predicting pedestrian trajectories has become a critical task for ensuring the safety of both vehicles and pedestrians [1,2,3]. In the complex and unpredictable urban environment, the variability and unpredictability of pedestrian interactions pose significant challenges to the perception and decision-making systems of autonomous vehicles. Precise trajectory prediction has been shown to reduce collision rates of autonomous vehicles in high-density urban traffic by 10% to 20% [4] in closed-loop testing.

Predicting the future movements of pedestrians in complex environments presents a significant challenge due to the high subjectivity and randomness of human interactions. Empirical methods explicitly model interactions to predict crowd motion, e.g., a rule-based model [5], a force-based model [6], and an energy-based model [7]. However, because they cannot precisely fit the observed data in dynamic and changing environments, these models generalize poorly, which reduces predictive accuracy during closed-loop testing. In contrast, various methods based on deep neural networks have been proposed for pedestrian interaction modeling by employing social pooling layers [8,9,10,11], graph neural networks (GNNs) [12,13,14,15], and attention mechanisms [16]. While they demonstrate strong expressive power in open-loop testing and exhibit some generalization ability in closed-loop testing, the black-box nature of neural networks limits their interpretability. Exploring the trade-off between model explainability and prediction capability remains a challenging task.

The integration of graph neural networks with Transformers has been rigorously studied in this field [17,18]. While this convergence has led to performance improvements, further research is needed to better understand its connection to pedestrian behavior. Figure 1 illustrates that earlier approaches [15] typically assumed that pedestrian interactions were based on geographical correlations from prior encounters. However, the current focus on spatio-temporal interactions at specific time points—either past or future—renders these methods insufficient for accurately capturing pedestrian interactions. Moreover, most models rely on Transformer-based absolute position encoding to integrate pedestrian data, which limits adaptability, as these models cannot easily adjust their network parameters. This issue is exacerbated by the frequent appearance and disappearance of pedestrians. Some studies [19,20] have adopted an evolutionary strategy for multi-agent trajectory prediction to improve adaptability. However, these approaches are often formulated at discrete time intervals, neglecting the continuous nature of pedestrian interactions and failing to fully represent the dynamic behavior of pedestrians.

We introduce a novel trajectory prediction model, the Multiple-Relative Spatio-Temporal Graph Transformer (MSTT), which is designed to model the complex interactions among pedestrians in the spatio-temporal domain. This model integrates relative spatio-temporal encoding with advanced spatio-temporal modeling techniques to predict pedestrian trajectories accurately, as shown in Figure 2. Unlike traditional Transformer models that rely on absolute position encoding [21], the MSTT employs a relative spatio-temporal encoding strategy, which captures the dynamic spatio-temporal dependencies across various pedestrian nodes during encoding. This approach enables the model to better understand intricate pedestrian interactions, including behaviors that may be hidden or absent, thereby enhancing the model’s flexibility and adaptability. Ultimately, the MSTT introduces a multi-graph fusion technique that captures continuous spatio-temporal interdependence among pedestrians, ensuring that all interactions are modeled simultaneously, rather than treating spatial and temporal relationships separately.

The proposed methodology was evaluated on several challenging datasets, including ETH [22], UCY [23], SDD [24], and the SportVU NBA sports dataset. The experimental results demonstrated that our model significantly improved the accuracy of pedestrian trajectory prediction. Ablation studies of various model components further substantiated its effectiveness. The primary aim of this study was to model the dynamic interactions and temporal dependencies that enhance the accuracy of pedestrian trajectory prediction techniques.

The main contributions of this study are summarized as follows:

We introduce a comprehensive model for predicting pedestrian trajectories, known as the MSTT, which harnesses the dynamic interactions among pedestrians within varied spatio-temporal frameworks to project their forthcoming movements.

We propose a relative spatio-temporal coding methodology that employs the encoding of periodic traits to encapsulate the cyclic nature of spatio-temporal interactions by illustrating the relative distinctions between temporal and spatial dimensions. This enables the model to adeptly manage diverse spatio-temporal intervals and effectively reduce bias towards specific nodes.

We have developed an advanced spatio-temporal graph modeling methodology that evaluates pedestrian interaction links across varied spatio-temporal intervals. This is achieved through the superimposition of spatial graphs from different time points, application of dynamic thresholding using multimodal data, and the ultimate creation of numerous spatio-temporal graphs. These graphs are produced by filtering pivotal occurrences predicated on the intensity of interactions.

Comprehensive experiments conducted on publicly accessible pedestrian trajectory datasets substantiate that our proposed algorithm surpasses the performance of several baseline methodologies, including state-of-the-art algorithms.

2. Related Work

2.1. Multi-Agent Trajectory Prediction

The relationship between autonomous cars and pedestrians has attracted considerable interest lately, especially in mixed urban traffic situations [21]. To improve road safety and maximize the operational effectiveness of autonomous driving systems in these intricate, dynamic situations, it is critical to anticipate pedestrian behavior and how pedestrians will interact with vehicles. To address this challenge, Alahi et al. [15] utilized long short-term memory (LSTM) networks to deduce latent states, which were then disseminated to adjacent pedestrians. PITF [25] incorporates pedestrian behavior and interaction modules, effectively adding visual information to its feature set. STGAT [26] is the first model that integrates graph attention networks (GATs) with LSTM to model pedestrian movements. PECNet [27] employs conditional variational autoencoders (CVAEs) [28] to deduce trajectory endpoints, enhancing the precision and dependability of trajectory forecasts.

Moreover, with the rapid advancement of autonomous driving technology, interactions between pedestrians and autonomous vehicles have become a central focus of research. For example, the IAMPDM method [29] combines deep learning with decision-making frameworks to optimize autonomous vehicle behavior through pedestrian intent recognition. In parallel, cooperative decision-making models [30] leverage cooperative control strategies and model predictive control (MPC) to predict pedestrian motion and adjust vehicle behavior accordingly, thereby improving adaptability and enhancing traffic safety. Furthermore, due to the frequent interactions between vehicles and pedestrians, recent work has focused on optimizing model size to accelerate inference. For instance, DERGCN [19] employs evolving graph convolutional networks to reduce model parameters, thus improving the inference speed. DSDrive [31] introduces a waypoint-based dual-head coordination module to synchronize data structures, optimization objectives, and training procedures.

2.2. Dynamic Graph Neural Networks

In recent years, substantial progress has been achieved in the study of dynamic graph neural networks (DGNNs) and Transformers [32,33]. Dynamic graphs are characterized by the emergence, disappearance, or reconnection of nodes and edges at various intervals of time. DGNNs represent a specialized neural network architecture crafted for dynamic networks; they are adept at capturing the temporal evolution of nodes and edges while encoding the attributes of neighboring nodes throughout this process [34,35,36,37]. Currently, DGNNs conventionally integrate graph neural networks (GNNs) to encapsulate the structural attributes of the graph, alongside Transformers or temporal neural networks to discern temporal patterns. DGNNs can be categorized into discrete and continuous types depending on how they manage graph structures. Discrete approaches employ static GNNs at each time interval to extract information from the dynamic graph, whereas continuous methods integrate the temporal dimension directly into the representation learning of dynamic graphs, thereby facilitating the dynamic learning of spatio-temporal interactions.

2.3. Transformer-Based Trajectory Prediction

Given the proven success of the Transformer architecture in natural language processing (NLP) and computer vision (CV), an increasing number of studies have adopted this framework for trajectory prediction tasks. For example, Yu et al. [16] utilized a spatial Transformer to model spatial relationships among pedestrians, along with a temporal Transformer to capture sequential dependencies. Similarly, Zhou [9] calculated similarity coefficients between node pairs, applied supervised adjacency matrices, and employed a group-aware spatio-temporal Transformer for model training.

In this research context, Transformer-based models such as the knowledge-aware graph Transformer [38], MGFormer [39], and STGSTN [40] effectively handle spatio-temporal interactions via attention mechanisms, demonstrating their potential in complex dynamic environments. While these graph-based Transformers are proficient in modeling long-range dependencies and complex inter-agent relationships, they fall short in capturing the social and spatio-temporal nuances of pedestrian interactions. To address this limitation, we propose the MSTT model, which integrates pedestrian social information and spatio-temporal graph structures with relative positional encoding. This approach significantly enhances trajectory prediction accuracy.

3. Approach

3.1. Problem Formulation

Consider a scenario where there are $N$ pedestrians $p_i$, with $i \in \{1, 2, \dots, N\}$. At a specific time $t$, the position of pedestrian $p_i$ is $l_i^t = (x_i^t, y_i^t)$. The objective is to predict the future trajectory $l_i^t$ of the pedestrian at future times $t \in \{t_{n+1}, t_{n+2}, \dots, t_{pred}\}$ based on the observed positions $l_i^t$ during the time interval $t = 1, \dots, t_n$ and the interactions between pedestrians.

3.2. Trajectory Coding for a Pedestrian

The proposed MSTT model first computes a set of motion-related features for each pedestrian $p_i$ at time $t$, defined as

$$f_i^t = [l_i^t, \Delta x_i^t, \Delta y_i^t, v_i^t, \alpha_i^t, \theta_i^t] \tag{1}$$

where $l_i^t$ signifies the spatial coordinates of pedestrian $p_i$ at time $t$, whereas $\Delta x_i^t$ and $\Delta y_i^t$ indicate the lateral and longitudinal displacements, respectively; $v_i^t$ denotes the instantaneous velocity, $\alpha_i^t$ the acceleration, and $\theta_i^t$ the current bearing. An MLP encoder then converts $f_i^t$ into a fixed-length feature vector $e_i^t$ for further processing by downstream modules. This transformation is described as follows:

$$e_i^t = \sigma(W_2 \, \sigma(W_1 f_i^t + b_1) + b_2) \tag{2}$$

where $W_1$ and $W_2$ denote weight matrices, $b_1$ and $b_2$ represent bias terms, and $\sigma$ signifies the ReLU activation function.
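As a rough illustration of Equations (1) and (2), the sketch below derives the motion features from raw positions and passes them through a two-layer ReLU MLP. The frame interval `dt`, the finite-difference definitions of velocity and acceleration, the hidden width of 16, and the random weights are all assumptions made for the example; the paper does not specify them.

```python
import math
import random


def motion_features(track, dt=0.4):
    """Return f_i^t = [x, y, dx, dy, v, a, theta] for t = 1 .. len(track) - 1.

    track: list of (x, y) positions of one pedestrian.
    dt: seconds between frames (assumed value, not taken from the paper).
    """
    feats = []
    prev_speed = None
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        speed = math.hypot(dx, dy) / dt
        # No earlier speed exists for the first step, so acceleration starts at 0.
        accel = 0.0 if prev_speed is None else (speed - prev_speed) / dt
        bearing = math.atan2(dy, dx)
        feats.append([x1, y1, dx, dy, speed, accel, bearing])
        prev_speed = speed
    return feats


def linear(W, b, x):
    """Affine map W x + b for a row-major weight matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def encode(f, W1, b1, W2, b2):
    """Equation (2): e = ReLU(W2 ReLU(W1 f + b1) + b2)."""
    return relu(linear(W2, b2, relu(linear(W1, b1, f))))


def random_layer(n_out, n_in, rng):
    """Randomly initialised stand-in for trained weights (illustration only)."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


rng = random.Random(0)
W1, b1 = random_layer(16, 7, rng)   # hidden width 16 is an assumption
W2, b2 = random_layer(32, 16, rng)  # output width 32 matches the stated Transformer input size
track = [(0.0, 0.0), (0.4, 0.0), (0.8, 0.3)]
embeddings = [encode(f, W1, b1, W2, b2) for f in motion_features(track)]
```

Each element of `embeddings` plays the role of one $e_i^t$; in a trained model the weights would be learned jointly with the rest of the network.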

3.3. Relative Spatio-Temporal Coding

The positional encoding mechanism inherent in Transformer models is employed to enhance the modeling of spatio-temporal dependencies in pedestrian trajectory prediction. We propose a relative spatio-temporal encoding strategy derived from this mechanism, which concurrently captures the temporal and spatial relationships among pedestrians, as illustrated in Figure 3. This facilitates interactions among nodes with disparate timestamps and geographic locations, significantly improving the model’s capacity to represent spatio-temporal information.

Specifically, two pedestrians with different temporal and spatial positions are defined as two nodes, $i$ and $j$, and their relative time interval is expressed as $\Delta T(i, j) = T_i - T_j$, which serves as an index for deriving the relative temporal encoding $RTE(\Delta T(i, j))$. Note that the training dataset does not cover all possible time gaps; thus, the relative temporal encoding should be capable of generalizing to unseen times and time gaps. Because sine and cosine functions are periodic and smooth, we base our design on the positional encoding mechanism of Transformers and combine fixed sinusoidal functions with a learnable linear projection, denoted as $\text{T-Linear}: \mathbb{R}^d \to \mathbb{R}^d$. This combination forms the relative temporal encoding (RTE), given by

$$Base(\Delta T(i, j), 2k) = \sin\left(\frac{\Delta T(i, j)}{10000^{2k/d}}\right) \tag{3}$$

$$Base(\Delta T(i, j), 2k+1) = \cos\left(\frac{\Delta T(i, j)}{10000^{(2k+1)/d}}\right) \tag{4}$$

$$RTE(\Delta T(i, j)) = \text{T-Linear}(Base(\Delta T(i, j))) \tag{5}$$

where $d$ represents the dimensionality of the positional encoding, $k$ denotes the index of the positional encoding dimension, and $Base$ is the base encoding function that uses sine and cosine functions to encode the time difference $\Delta T(i, j)$, capturing spatio-temporal relationships at different frequencies through these encodings.
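A minimal sketch of Equations (3) to (5) follows. It reads the exponents as $2k/d$ for the sine terms and $(2k+1)/d$ for the cosine terms, which is an interpretation of the printed formulas, and it stands in an identity matrix for the learned T-Linear projection purely so the example is deterministic. The function names are illustrative.

```python
import math


def base_encoding(delta, d=32):
    """Sinusoidal base encoding of a scalar gap, following Equations (3) and (4)."""
    enc = []
    for idx in range(d):
        angle = delta / (10000 ** (idx / d))
        # Even indices (2k) use sine, odd indices (2k + 1) use cosine.
        enc.append(math.sin(angle) if idx % 2 == 0 else math.cos(angle))
    return enc


def t_linear(W, b, x):
    """Linear projection R^d -> R^d; W and b would be learned in practice."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def rte(delta_t, W, b):
    """Equation (5): project the base encoding of the time gap."""
    return t_linear(W, b, base_encoding(delta_t, len(W)))


d = 8
identity = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
zero_bias = [0.0] * d
code_for_gap_3 = rte(3, identity, zero_bias)
```

Because the base functions are defined for any real gap, a time difference never seen in training still receives a well-defined encoding, which is the generalization property the text asks for.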

Similarly, for the source node $i$ and the destination node $j$, their spatial coordinates are denoted as $P_i = (x_i, y_i)$ and $P_j = (x_j, y_j)$, respectively. Based on these coordinates, we define the relative spatial encoding (RSE) using the same sinusoidal encoding scheme as in the temporal case, followed by a learnable linear projection. The RSE is computed as

$$RSE(P_i, P_j) = \text{T-Linear}(Base(P_i - P_j)) \tag{6}$$

Finally, the temporal and spatial encodings are added to the original motion feature vector to form the final representation of the pedestrian node $E_i^t$, where

$$E_i^t = e_i^t + RTE(\Delta T(i, j)) + RSE(P_i, P_j) \tag{7}$$

3.4. Multi-Spatio-Temporal Graph Modeling

To thoroughly encapsulate the spatio-temporal interaction data of pedestrians, we propose a methodology for modeling multi-spatio-temporal graphs that amalgamates both temporal and spatial interactions.

We characterize the multimodal spatio-temporal graph using a binary adjacency matrix that delineates the relationship between current observations and historical events as a directed acyclic graph. At time $t = 2$, the directed edges depicted in Figure 1b suggest a possible unilateral interaction between the pedestrians represented by the source node and those identified by the destination node. This interaction may extend throughout a time interval $\tau = \{0, \dots, t - 1\}$, demonstrating how one individual’s historical behavior can affect another’s subsequent choices. Our hypothesis asserts that each pedestrian $x^t$ is influenced by their prior behavior $x^{0:t-1}$ as well as by other pedestrians at possibly varying temporal instances.

3.4.1. Spatio-Temporal Graph Construction

Based on the relative spatio-temporal encoding constructed above, the interactions between pedestrians at any time $t$ can be represented as a graph in which the pedestrian representations $E_i^t$ serve as the nodes and $R_{ij}^t$ denotes the edges. Thus, the graph structure can be expressed as $G_t = (E_i^t, R_{ij}^t)$. Ultimately, $G_t$ is transformed into the matrix $A_t$ and fed into the model for training.

At time $t$, the graph $G_t$ is converted into a pedestrian interaction matrix $A_t$. Traditional methods rely on the $L_2$ norm to measure the distance between pedestrians; however, this approach is susceptible to the influence of distant pedestrians and thus fails to accurately capture the true strength of interactions. To address this limitation, we adopt the inverse of the Euclidean distance as a more effective metric to define the interaction intensity between pedestrians. The formulas for the $L_2$ norm and the inverse of the Euclidean distance are as follows:

$$A_{L_2}^{ij} = \begin{cases} \|V_t^i - V_t^j\|_2, & \text{if } \|V_t^i - V_t^j\|_2 \neq 0 \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

$$A_t^{ij} = \begin{cases} \dfrac{1}{\|V_t^i - V_t^j\|_2}, & \text{if } \|V_t^i - V_t^j\|_2 \neq 0 \\ 0, & \text{otherwise} \end{cases} \tag{9}$$

where $\|V_t^i - V_t^j\|_2$ denotes the Euclidean distance between two adjacent pedestrians, $i$ and $j$, at a specific time $t$. By extending Equation (9) to the entire scene at time $t$, the interactions among all pedestrians can be systematically assessed in relation to the threshold, thereby facilitating the derivation of the corresponding adjacency matrix of the spatio-temporal graph.
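To make Equation (9) concrete, the following sketch builds the inverse-distance interaction matrix for one frame. The function name and the sample coordinates are made up for illustration.

```python
import math


def inverse_distance_adjacency(positions):
    """Equation (9): A[i][j] = 1 / ||V_i - V_j||_2, or 0 when the distance is 0."""
    n = len(positions)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist = math.dist(positions[i], positions[j])
            if dist != 0:
                A[i][j] = 1.0 / dist
    return A


# Three pedestrians in one frame: the nearby pair gets the larger weight.
frame = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.5)]
A_t = inverse_distance_adjacency(frame)
```

With raw distances (Equation (8)) the far pedestrian at 5 m would receive the largest entry; with the inverse, the pedestrian 0.5 m away dominates, which is the behavior the text argues for.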

3.4.2. Local Judgment

To construct the multi-spatial graph, we first encode the pedestrian’s trajectory features using a triplet of feature vectors, analogous to the standard Transformer framework. For pedestrian $p_i$ at time $t$, the trajectory information is represented using a query vector $q$, a key vector $k$, and a value vector $v$. These vectors are defined as follows:

$$q_i^t = f_q(h_i^t), \quad k_i^t = f_k(h_i^t), \quad v_i^t = f_v(h_i^t) \tag{10}$$

where $q_i^t$ denotes the query vector, $k_i^t$ denotes the key vector, and $v_i^t$ signifies the value vector. The functions $f_q$, $f_k$, and $f_v$ map the input feature $h_i^t$ to the query, key, and value vectors, respectively. Similar to the approach used in self-attention mechanisms, the interaction strength $a_{tt'}^{ij}$ between pedestrian $p_i$ at time $t$ and pedestrian $p_j$ at time $t'$ is expressed by the following formula:

$$a_{tt'}^{ij} = \frac{q_i^t (k_j^{t'})^T}{\sqrt{d_k}} v_i^t \tag{11}$$

where $d_k$ represents the dimensionality of the key vector.

During the local evaluation phase, a dynamic threshold $\theta$ is used to determine significant interactions among pedestrians. The threshold $\theta$ is adaptively adjusted based on the mean and standard deviation of interaction influences across all pedestrians, as shown below:

$$\theta = \mu + w \cdot \sigma \tag{12}$$

where $\mu$ denotes the mean interaction influence, $\sigma$ is the standard deviation, and $w$ is a coefficient controlling the threshold’s sensitivity.

Extending Equation (11) to the entire scenario, interactions between all pedestrians at time $t'$ and pedestrian $i$ at time $t$ are represented in matrix form as $A_t^{ij}$. The element $a_{tt'}^{ij}$ denotes the interaction strength between pedestrian $i$ at time $t$ and pedestrian $j$ at time $t'$.

Through the local judgment formula, the interaction strength between any two pedestrians at different spatio-temporal points can be evaluated, enabling the determination of interactions across varying spatio-temporal contexts. The local judgment formula is given as follows:

$$A_{tt'}^{ij} = \begin{cases} 1, & \text{if } a_{tt'}^{ij} > \theta \\ 0, & \text{otherwise} \end{cases} \tag{13}$$
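The local judgment step can be sketched as below. Two simplifications are assumptions of this example rather than statements about the authors' implementation: the interaction score keeps only the scalar scaled dot product from Equation (11) (the multiplication by the value vector is omitted so that the score can be compared with a scalar threshold), and the standard deviation in Equation (12) is taken as the population standard deviation over all matrix entries.

```python
import math
import statistics


def interaction_score(q, k):
    """Scaled dot product q . k / sqrt(d_k), the scalar part of Equation (11)."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(k))


def dynamic_threshold(scores, w=0.5):
    """Equation (12): theta = mean + w * std over all interaction scores."""
    flat = [s for row in scores for s in row]
    return statistics.fmean(flat) + w * statistics.pstdev(flat)


def local_judgment(scores, w=0.5):
    """Equation (13): keep an edge only where the score exceeds theta."""
    theta = dynamic_threshold(scores, w)
    return [[1 if s > theta else 0 for s in row] for row in scores]


scores = [[0.0, 0.0],
          [0.0, 4.0]]
binary_adjacency = local_judgment(scores)
```

Because the threshold moves with the mean and spread of the scores, a crowded scene with uniformly strong interactions keeps only the edges that stand out relative to that scene.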

3.4.3. Global Judgment

During the global judgment phase, we assess the cumulative impact of all pedestrians at time $t$ on pedestrian $p_i$ at time $t'$:

$$Impact_i^t = \sum_{j=1}^{N} A_{tt'}^{ij} \tag{14}$$

where $N$ denotes the total number of pedestrians at time $t'$, and $A_{tt'}^{ij}$ represents the influence of pedestrian $p_i$ at time $t$ on pedestrian $p_j$ at time $t'$. The time instances are scored based on their influence, and the top $L$ moments with the highest impact are selected. Subsequently, the graphs corresponding to these $L$ moments are superimposed to form a multi-spatial graph consisting of $L$ layers. This process yields an augmented adjacency matrix $G_L$, as illustrated in Figure 4.
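One possible reading of the global judgment step is sketched here for a single target pedestrian: each candidate time is scored with Equation (14), the $L$ highest-scoring times are kept, and their binary graphs are stacked as layers. How scores are aggregated across pedestrians and how ties are broken are not stated in the text, so the per-pedestrian scoring and the earliest-time tie-break below are assumptions, as are the function names.

```python
def impact(A, i):
    """Equation (14) for pedestrian i: number of active edges in row i."""
    return sum(A[i])


def select_top_moments(adjacency_by_time, i, num_layers):
    """Return the num_layers most influential times and their stacked graphs.

    adjacency_by_time: dict mapping a time index to a binary adjacency matrix.
    Ties are broken in favour of the earlier time (an assumption).
    """
    ranked = sorted(adjacency_by_time,
                    key=lambda t: (-impact(adjacency_by_time[t], i), t))
    chosen = sorted(ranked[:num_layers])
    return chosen, [adjacency_by_time[t] for t in chosen]


empty = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
history = {
    0: [[0, 1, 1], [0, 0, 0], [0, 0, 0]],  # impact 2 on pedestrian 0
    1: empty,                              # impact 0
    2: [[0, 1, 0], [0, 0, 0], [0, 0, 0]],  # impact 1
}
times, layers = select_top_moments(history, i=0, num_layers=2)
```

The list `layers` corresponds to the $L$-layer multi-spatial graph; the paper fixes $L = 5$ in its experiments.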

3.5. Neural Networks for Multi-Spatial Graphs

The primary function of this module is to integrate information between previously connected nodes through the use of graph attention networks (GATs), continuously updating the node features. It can be conceptualized as a message-passing architecture within an undirected graph. These networks operate by calculating attention weights for each node relative to its neighbors, thereby enabling the extraction of additional information from the overall structure of the graph.

The multi-spatial graph neural network consists of multiple graph attention layers. Each graph attention layer processes the node features $E_i^t$ and the augmented adjacency matrix $G_L$ as inputs, generating the expected trajectory features shown in Figure 5. The model employs multi-head attention, characterized by four distinct heads represented by dashed, solid, dotted, and dash-dot lines. Upon obtaining various node features, each head consolidates these features to derive the final trajectory feature $h^{STG}$.
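For readers unfamiliar with graph attention, the sketch below implements one single-head layer in the widely used GAT form: project each node, score each neighbor with a LeakyReLU over the concatenated pair, normalize with a softmax, and take the weighted sum. The text does not give the exact layer used in MSTT, so this is a generic illustration; the added self-loop, the LeakyReLU slope of 0.2, and the toy weights are assumptions.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(H, A, W, a):
    """One single-head graph attention layer.

    H: node features (n x d_in); A: binary adjacency (n x n);
    W: projection (d_out x d_in); a: attention vector of length 2 * d_out.
    """
    Z = [[sum(w * h for w, h in zip(row, feat)) for row in W] for feat in H]
    out = []
    for i, zi in enumerate(Z):
        # Attend over connected nodes plus the node itself (self-loop).
        nbrs = [j for j in range(len(Z)) if A[i][j] or j == i]
        logits = [leaky_relu(sum(x * y for x, y in zip(a, zi + Z[j]))) for j in nbrs]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        out.append([sum(e / total * Z[j][c] for e, j in zip(exps, nbrs))
                    for c in range(len(zi))])
    return out


H = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
A = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 0.0, 0.0]  # zero attention vector gives uniform weights
updated = gat_layer(H, A, W, a)
```

A four-head version, as described in the text, would run four such layers with separate `W` and `a` and then concatenate or average their outputs.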

Concurrently, to examine the temporal dependencies of pedestrians and variations in their intentions, the trajectory attributes of an individual pedestrian are recorded over time using a separate Transformer. This Transformer sequentially inputs the trajectory features $E_i^t$ to generate future trajectory features $h^T$. Subsequently, $h^{STG}$ is combined with $h^T$ to yield the final feature $h'$, which is then fed into the MLP decoder to be translated into real-world trajectory coordinates.

However, as the complexity of the model increases, particularly with the application of multi-layer graph attention mechanisms, the potential problem of overfitting arises. To address this, the present study employs the dropout method, where 10% of the neurons are randomly dropped in each layer of the neural network, thereby enhancing the model’s generalization capability. Additionally, early stopping is applied during training to ensure that the model ceases training once the performance on the validation set stops improving.
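The two regularization measures can be sketched in a framework-independent way. The dropout rate of 10% comes from the text; the inverted-dropout rescaling and the patience value are assumptions, since the paper does not state how many non-improving epochs are tolerated.

```python
import random


def dropout(values, p=0.1, rng=random):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=10):  # patience is an assumed value
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.97]]
```

In a training loop, `step` would be called once per epoch with the validation ADE or loss, and training would end the first time it returns `True`.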

4. Experiments

4.1. Datasets and Metrics

To assess the proposed algorithms and models, we performed experimental validation on four datasets and conducted comprehensive analyses of the results. These datasets include ETH/UCY [22,23], SDD [24], and the SportVU NBA sports dataset, which covers NBA game data from the 2015–2016 season. Given the large size of the original dataset, one of its sub-datasets named “Rebounding” was selected for benchmark testing, containing 257,230 twenty-frame trajectories. We then executed simulation experiments and ablation studies using the leave-one-out cross-validation protocol [41] to verify efficacy and deepen our insights. The ETH and UCY datasets capture densely populated settings such as hotels and streets. The SDD dataset supplies pedestrian and vehicle trajectories from complex overhead views. Finally, the SportVU NBA dataset provides detailed movement trajectories of NBA basketball players, featuring extensive player interactions and strategic movements, which are used to rigorously evaluate model performance in a highly dynamic and complex sports environment.

We selected STGformer as the baseline model for our overall performance evaluation, allowing for a comprehensive comparison with state-of-the-art spatio-temporal graph Transformer methods. Since MSTT is built upon the STAR architecture through the addition of new modules, we chose the original STAR model as the baseline for our ablation studies. In this setting, we incrementally added each proposed module to STAR to quantify its individual contribution to the overall performance.

To assess the precision of the trajectories forecasted by the different components and the model, we utilize the following metrics: the average displacement error (ADE) and the final displacement error (FDE). The ADE is the mean Euclidean distance, over all predicted time steps, between the actual position $\hat{x}_i^t$ and the predicted position $x_i^t$. The formula for the ADE is expressed as follows:

$$\sigma_{ADE} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T} \sum_{t=1}^{T} \|\hat{x}_i^t - x_i^t\|_2 \tag{15}$$

Similarly, the FDE represents the Euclidean distance between the actual endpoint location $\hat{x}_i^T$ and the estimated endpoint location $x_i^T$, articulated as

$$\sigma_{FDE} = \frac{1}{N} \sum_{i=1}^{N} \|\hat{x}_i^T - x_i^T\|_2 \tag{16}$$
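Equations (15) and (16) translate directly into code. The sketch below assumes each trajectory is a list of `(x, y)` points and that actual and predicted trajectories are aligned step by step; the sample values are invented for the example.

```python
import math


def ade(actual, predicted):
    """Equation (15): mean per-step Euclidean error, averaged over pedestrians."""
    per_pedestrian = [
        sum(math.dist(a, p) for a, p in zip(act, pred)) / len(act)
        for act, pred in zip(actual, predicted)
    ]
    return sum(per_pedestrian) / len(per_pedestrian)


def fde(actual, predicted):
    """Equation (16): Euclidean error at the final step, averaged over pedestrians."""
    errors = [math.dist(act[-1], pred[-1]) for act, pred in zip(actual, predicted)]
    return sum(errors) / len(errors)


actual = [[(0, 0), (1, 0)], [(0, 0), (0, 0)]]
predicted = [[(0, 0), (1, 2)], [(3, 4), (0, 0)]]
ade_value = ade(actual, predicted)
fde_value = fde(actual, predicted)
```

The second pedestrian illustrates why both metrics are reported: a large early error raises the ADE even though the endpoint, and hence the FDE contribution, is exact.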

4.2. Experimental Details

To ensure comparability with prior work, such as the classical methods Social-GAN [14], STAR [16], and Trajectron++ [10], while maintaining sufficient trajectory information and avoiding the introduction of redundancy, this study employs the first 8 frames as the observation sequence and the subsequent 12 frames as the prediction target sequence. For ETH/UCY, training uses four of the five subsets, while the remaining subset is reserved as the test set for evaluating the model’s accuracy. This procedure is repeated so that each subset serves as the test set exactly once. To ensure fairness, all baseline models adhere to an identical training protocol and are assessed on an Nvidia GTX4050Ti GPU.
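The evaluation protocol described above can be sketched as follows. The subset names are the ones conventionally used for the five ETH/UCY scenes and are an assumption here, since the text only names ZARA2 explicitly; the helper names are illustrative.

```python
OBS_LEN = 8    # observed frames, as stated in the text
PRED_LEN = 12  # frames to predict, as stated in the text

# Conventional names of the five ETH/UCY scenes (assumed, not listed in the text).
SCENES = ["ETH", "HOTEL", "UNIV", "ZARA1", "ZARA2"]


def split_window(track):
    """Split a 20-frame trajectory into observation and prediction parts."""
    if len(track) < OBS_LEN + PRED_LEN:
        raise ValueError("trajectory shorter than 20 frames")
    return track[:OBS_LEN], track[OBS_LEN:OBS_LEN + PRED_LEN]


def leave_one_out(scenes):
    """Yield (training scenes, held-out test scene) so each scene is tested once."""
    for held_out in scenes:
        yield [s for s in scenes if s != held_out], held_out


folds = list(leave_one_out(SCENES))
observed, target = split_window([(float(t), 0.0) for t in range(20)])
```

Reported ETH/UCY numbers are then obtained per held-out scene, with an average over the five folds typically given alongside.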

The model utilizes the Adam optimizer with a batch size of 16 and a learning rate of 0.0015, and it is trained for 300 epochs, with each batch comprising around 256 pedestrian trajectory data points from various time frames. The threshold coefficient w in Equation (12) is set to 0.5 m, a value determined through ablation experiments to optimally balance adjacency sensitivity and robustness. The number of layers L in the multi-relational spatio-temporal graph module is fixed at 5, based on a trade-off analysis between performance and computational cost. All Transformer layers employ an input feature dimension of 32. Furthermore, ablation studies on the attention head count in both GAT and Transformer modules demonstrate that utilizing four attention heads achieves optimal ADE/FDE metrics, consequently improving overall model performance.

4.3. Quantitative Evaluation

Table 1 presents a comparative analysis between the proposed model and existing models across benchmark datasets. Our model demonstrates superior performance in terms of both ADE and FDE metrics, attributable to its multi-relational spatio-temporal network architecture and relative spatio-temporal encoding scheme. These mechanisms effectively capture dynamic pedestrian interactions, yielding a more precise representation of inter-agent influences. On the ZARA2 dataset, our model achieves the optimal results. While ranking second in terms of the ADE and FDE compared to STGformer, it exhibits a more balanced overall performance.

We compare the proposed model with state-of-the-art baselines on the SDD and SportVU NBA datasets. As shown in Table 2, our model achieves the best results on both datasets. On SDD, we obtain an ADE of 3.16 and an FDE of 5.12, representing 41.3% and 42.6% improvements over STGformer [41], respectively. For the Rebounding subset of the NBA data, our method achieves an ADE/FDE of 11.36/13.42, outperforming STGformer by 8.5% and 13.4%, respectively. These results demonstrate our model’s superior capability in capturing complex spatio-temporal interactions across both pedestrian and sports movement scenarios.

In practical applications, particularly within the field of autonomous driving where the accurate prediction of pedestrian behavior is of utmost significance, the inference time of a model is critically important. To assess the effectiveness of the proposed model, a comparison is conducted with the PECNET, STGformer, STAR, and SocialCircle+ models based on the number of parameters and inference time. The proposed model exhibits an increase of $3.8 \times 10^{3}$ parameters relative to PECNET while achieving a reduction of 0.006 s in inference time. When compared to STGformer, the suggested model contains $2.0 \times 10^{3}$ fewer parameters and achieves a decrease in inference time by 0.015 s. In relation to the STAR model, the proposed model integrates an additional $8.3 \times 10^{3}$ parameters, resulting in a 0.015 s increase in inference duration. Lastly, in comparison to the STGformer model, the proposed model features $5.6 \times 10^{3}$ fewer parameters and demonstrates a reduction in inference time of 0.028 s.

Table 3 presents a comparison of five trajectory prediction models on the ETH/UCY dataset in terms of the parameter count, memory usage, and per-trajectory inference latency. Among them, STGformer has the largest number of parameters, corresponding to the highest inference latency, while STAR has the smallest parameter count and achieves the lowest latency. In contrast, the proposed method maintains a relatively low inference latency (0.158) with moderate memory usage (13.5), demonstrating a favorable balance between computational efficiency and resource consumption.

Figure 6 presents a comparison of inference latency between the MSTT and STGformer models under varying numbers of agents. As the number of agents increases from 4 to 128, MSTT exhibits a modest latency increase from 12 ms to 18 ms, while the latency of STGformer rises sharply from 12 ms to 34 ms. This trend demonstrates the superior scalability and stability of MSTT in densely populated scenarios.

4.4. Qualitative Analysis

Pedestrian mobility involves individual interactions that result in intricate behaviors such as following, collision evasion, and navigation. Therefore, precise modeling of these interactions is essential. We conducted a thorough investigation of the MSTT model’s predictive efficacy across various motion patterns and collision avoidance scenarios. Ten example scenarios were selected to assess the MSTT model’s performance compared to the STGformer model, particularly in complex pedestrian interaction scenarios, as shown in Figure 7.

The experimental results demonstrate that both the MSTT and STGformer models efficiently capture pedestrian interactions and generate comparable trajectories in most cases. However, the trajectories forecasted by the MSTT model align more closely with observed behaviors. The MSTT model employs a multi-spatial modeling strategy that thoroughly analyzes interactions among diverse pedestrians and explores social-temporal relationships in more detail, yielding trajectories that are more coherent and fluid than those generated by the STGformer model.

Figure 8 compares the interaction-weight visualizations of the two methods. Notably, the MSTT model generally assigns higher weights to pedestrians that are closer in both time and space. Furthermore, the weight values in the MSTT model vary more than those of STGformer. This variability arises from the relative spatio-temporal encoding and multi-spatio-temporal modeling in MSTT, which allow it to identify important pedestrian interactions more accurately and therefore assign more differentiated weights.

To analyze the limitations of MSTT, we perform a qualitative error analysis by examining representative failure cases and visualizing them in Figure 9. In Figure 9a,b, when pedestrians approach the boundary of the scene, where insufficient historical trajectory and environmental context hinders intent inference, the model produces significantly deviated predictions. In Figure 9c,d, sudden, nonlinear changes in direction or speed similarly degrade the predictive accuracy of the model.

4.5. Ablation Experiments

To evaluate the effectiveness of the relative spatio-temporal encoding (RSTE) and multi-spatio-temporal graph modeling (MSTG), we design incremental ablation experiments based on the STAR model, gradually integrating each component to compare performance. The original STAR model is first used as a baseline, without incorporating any complex encoding or graph modeling. Then, two models are constructed: STAR-R, which includes only RSTE, and STAR-M, which includes only MSTG. All models are tested using the same number of samples. As shown in Table 4, RSTE reduces the ADE and FDE by 0.04 and 0.08, respectively, while MSTG reduces them by 0.03 and 0.10, indicating that both components significantly contribute to performance improvement.
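The reductions quoted above can be checked by averaging the per-scene entries of Table 4. The sketch below copies the per-scene ADE/FDE values from the table and deliberately ignores the printed Avg column; the helper names `scene_mean` and `reduction` are illustrative.

```python
# Per-scene ADE/FDE values copied from Table 4 (Avg column deliberately left out).
TABLE4 = {
    "STAR":     [(0.36, 0.65), (0.22, 0.36), (0.27, 0.56), (0.32, 0.55), (0.28, 0.68)],
    "STAR-R":   [(0.33, 0.52), (0.22, 0.34), (0.25, 0.50), (0.22, 0.45), (0.24, 0.58)],
    "STAR-M":   [(0.34, 0.58), (0.18, 0.28), (0.24, 0.41), (0.25, 0.48), (0.27, 0.54)],
    "STAR-R-M": [(0.24, 0.49), (0.18, 0.29), (0.19, 0.28), (0.21, 0.30), (0.16, 0.19)],
}


def scene_mean(rows):
    """Average ADE and FDE over the scenes, rounded to two decimals."""
    n = len(rows)
    ade = sum(a for a, _ in rows) / n
    fde = sum(f for _, f in rows) / n
    return round(ade, 2), round(fde, 2)


def reduction(baseline, variant):
    """ADE/FDE reduction of a variant relative to a baseline, from the scene means."""
    base_ade, base_fde = scene_mean(TABLE4[baseline])
    var_ade, var_fde = scene_mean(TABLE4[variant])
    return round(base_ade - var_ade, 2), round(base_fde - var_fde, 2)


for name in ("STAR-R", "STAR-M", "STAR-R-M"):
    print(name, scene_mean(TABLE4[name]), "reduction vs STAR:", reduction("STAR", name))
```

With these inputs, the scene means of the STAR row come out as 0.29/0.56 rather than the 0.27/0.53 printed in the Avg column, and it is the recomputed means that reproduce the reductions of 0.04/0.08 (RSTE) and 0.03/0.10 (MSTG) quoted in the text.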

To evaluate the effectiveness of the proposed strategy, a comprehensive set of ablation experiments was conducted to examine the contribution of each submodule on the ETH and UCY datasets, while keeping the configuration of the remaining modules identical to that of the final model. The experimental results are presented in Table 4, where the underlined elements indicate the combinations implemented in the final model. The removal of any component leads to a decline in the accuracy of pedestrian trajectory prediction.

We demonstrate that parameter configurations exert a substantial influence on prediction accuracy. As presented in Table 5, increasing the number of attention heads from 1 to 4 reduces the average $ADE/FDE$ from 0.24/0.35 to 0.20/0.31, a marked improvement. However, increasing the number of heads to eight degrades performance to 0.22/0.37, indicating that four heads achieve the best balance between representational capacity and generalization. For the hyperparameter $w$, the highest accuracy of 0.20/0.31 is achieved when $w = 0.5$, while values of 0.25 or 0.75 degrade performance to 0.24/0.36 and 0.22/0.33, respectively, suggesting that a moderate weighting coefficient better balances local and global relational information. In terms of distance metrics, constructing the adjacency matrix using the inverse of the Euclidean distance yields a performance of 0.20/0.31, surpassing the $L_{2}$-norm method with 0.22/0.33, confirming that the inverse of the Euclidean distance more effectively captures the interaction intensity among pedestrians. These parameter choices improve the average performance over the five evaluation scenarios.
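The two adjacency constructions compared above can be illustrated with a small sketch. It is not the authors' implementation: the zero diagonal (no self-loops), the `eps` term that guards against division by zero, and the example positions are all assumptions made for illustration.

```python
import math


def pairwise_distance(positions):
    """Euclidean distance between every pair of 2D positions."""
    return [[math.dist(p, q) for q in positions] for p in positions]


def inverse_distance_adjacency(positions, eps=1e-6):
    """Edge weight 1 / distance; the diagonal is set to 0 (assumption: no self-loops)."""
    dist = pairwise_distance(positions)
    n = len(positions)
    return [
        [0.0 if i == j else 1.0 / (dist[i][j] + eps) for j in range(n)]
        for i in range(n)
    ]


def l2_adjacency(positions):
    """Edge weight equal to the raw L2 distance, for comparison."""
    return pairwise_distance(positions)


# Hypothetical positions in metres: pedestrian 1 is close to pedestrian 0,
# pedestrian 2 is far away.
positions = [(0.0, 0.0), (1.0, 0.0), (4.0, 3.0)]
A_inv = inverse_distance_adjacency(positions)
A_l2 = l2_adjacency(positions)

print("inverse distance, row 0:", [round(v, 3) for v in A_inv[0]])
print("L2 distance, row 0:     ", [round(v, 3) for v in A_l2[0]])
```

With the inverse-distance weighting, the nearby pedestrian receives the larger edge weight, whereas the raw L2 weighting assigns the larger value to the distant one. This is one intuitive reading of why the inverse form tracks interaction intensity more closely.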

4.6. Optimal Graph Stacking Number Analysis

To evaluate the effect of the spatial layer count L in the multi-space module described in Section 3.4.3, we conducted an ablation study by varying the number of spatial layers. This parametric analysis aimed to identify the hierarchical configuration that best balances predictive performance with computational efficiency. Specifically, while keeping all other parameters fixed, we varied only the number of spatial layers L and recorded two key metrics, $\sigma_{ADE}$ and $\sigma_{FDE}$, for each value of L. At the same time, we monitored the corresponding computational resource consumption to determine the most effective graph stacking depth.

As illustrated in Figure 10a, both the $\sigma_{ADE}$ and $\sigma_{FDE}$ metrics decrease consistently as the number of layers increases. However, an increase in L leads to a quadratic expansion of the adjacency matrix, substantially increasing the time and processing resources required for training and testing. Our data indicate that although the error continues to decrease, the rate of decrease slows as more layers are added, so the gains from additional spatial layers eventually no longer justify the added processing cost. Therefore, as a compromise between computational cost and performance, we select five spatial layers for this study, which offers a significant performance increase without imposing excessive processing demands.
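The quadratic growth mentioned above can be made concrete with a short sketch. It assumes that stacking L graphs of N pedestrians yields one dense (L·N) × (L·N) adjacency matrix; this layout, the function name, and the value of N are assumptions for illustration and are not taken from the paper.

```python
def stacked_adjacency_entries(num_pedestrians, num_layers):
    """Number of entries in a dense (L*N) x (L*N) adjacency matrix."""
    side = num_pedestrians * num_layers
    return side * side


N = 10  # hypothetical number of pedestrians in a scene
base = stacked_adjacency_entries(N, 1)
for L in range(1, 8):
    entries = stacked_adjacency_entries(N, L)
    print(f"L={L}: {entries:5d} entries ({entries // base}x the single-layer size)")
```

Under this assumption, moving from one layer to five multiplies the number of adjacency entries by 25, which is why the diminishing error reduction has to be weighed against a rapidly growing cost.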

To assess the model’s ability to generalize, the number of samples K was systematically increased. The results, presented in Figure 10b, show that both the ADE and FDE decrease as K increases. Moreover, for an equal number of samples, the proposed model outperforms the STGformer model, indicating that it requires fewer samples to achieve a similar error rate. These results suggest that integrating the multi-relational spatio-temporal module effectively captures pedestrian interactions across various spatio-temporal contexts, reducing variance in predicted trajectories and improving both prediction accuracy and generalization performance.
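The dependence of the error on K can be illustrated with the standard definitions of ADE and FDE under a best-of-K evaluation. The sketch below assumes the common protocol in which the lowest error among K sampled trajectories is reported; the paper's own evaluation code is not shown, and the trajectories here are hypothetical.

```python
import math


def ade(pred, truth):
    """Average displacement error: mean L2 distance over all predicted steps."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def fde(pred, truth):
    """Final displacement error: L2 distance at the last predicted step."""
    return math.dist(pred[-1], truth[-1])


def best_of_k(samples, truth):
    """Lowest ADE and lowest FDE among the K sampled trajectories."""
    return min(ade(s, truth) for s in samples), min(fde(s, truth) for s in samples)


# Hypothetical 3-step ground truth and K = 3 sampled predictions.
truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
samples = [
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # constant 1.0 offset
    [(1.0, 0.5), (2.0, 0.5), (3.0, 0.5)],  # constant 0.5 offset
    [(1.0, 0.0), (2.0, 0.0), (3.0, 2.0)],  # exact until the last step
]

for k in range(1, len(samples) + 1):
    print(f"K={k}: best ADE/FDE = {best_of_k(samples[:k], truth)}")
```

Because the minimum is taken over a growing set, the best-of-K error can only stay the same or decrease as K grows. Note that in this sketch the minimum ADE and minimum FDE are selected independently and may come from different samples; some evaluation protocols instead pick a single best sample.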

5. Summary

We propose a pedestrian trajectory prediction model based on multi-relational spatio-temporal graphs (MSTT), which addresses the shortcomings of traditional approaches in capturing dynamic and complex human interactions. Conventional methods often rely on static interaction structures, rendering them insufficient for modeling evolving spatio-temporal dependencies. To overcome this, MSTT integrates relative spatio-temporal encoding with multi-relational graph modeling, thereby enhancing the model’s capacity to represent dynamic inter-agent behaviors. Experimental results show that MSTT achieves substantial accuracy improvements on the SDD and NBA datasets. Although there remains room for further improvement on the ETH and UCY datasets, MSTT exhibits clear advantages over STGformer in terms of inference efficiency, scalability, and parameter economy. Comparative and ablation studies further validate the effectiveness and balanced design of the proposed model.

Author Contributions

Conceptualization, X.Z.; methodology, Q.Z. and J.M.; software, Q.Z.; validation, Z.Y. and Q.Z.; formal analysis, X.Z.; investigation, Z.Y.; resources, Z.Y.; writing—original draft preparation, Q.Z.; writing—review and editing, J.M., X.Z., and Q.Z.; visualization, Q.Z.; supervision, X.Z. and J.M.; project administration, Z.Y.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (grant number U24B20159) and the Department of Science and Technology of Liaoning Province (grant number 2022020594-JH1/108).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Shi, L.; Wang, L.; Long, C.; Zhou, S.; Zhou, M.; Niu, Z.; Hua, G. SGCN: Sparse graph convolution network for pedestrian trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8994–9003.
2. Liu, Y.; Qi, X.; Sisbot, E.A.; Oguchi, K. Multi-agent trajectory prediction with graph attention isomorphism neural network. In Proceedings of the 2022 IEEE Intelligent Vehicles Symposium (IV), Aachen, Germany, 4–9 June 2022; pp. 273–279.
3. Quan, R.; Zhu, L.; Wu, Y.; Yang, Y. Holistic LSTM for pedestrian trajectory prediction. IEEE Trans. Image Process. 2021, 30, 3229–3239.
4. Stoler, B.; Navarro, I.; Jana, M.; Hwang, S.; Francis, J.; Oh, J. SafeShift: Safety-Informed Distribution Shifts for Robust Trajectory Prediction in Autonomous Driving. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV), Jeju Island, Republic of Korea, 2–5 June 2024.
5. Hong, J.; Sapp, B.; Philbin, J. Rules of the Road: Predicting Driving Behavior with a Convolutional Model of Semantic Interactions. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
6. Zhang, W.; Cheng, H.; Johora, F.T.; Sester, M. ForceFormer: Exploring Social Force and Transformer for Pedestrian Trajectory Prediction. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Anchorage, AK, USA, 4–7 June 2023.
7. Pang, B.; Cao, J.; Zhou, H.; Mori, G.; Sigal, L. Trajectory Prediction with Latent Belief Energy-Based Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11814–11824.
8. Sighencea, B.I.; Stanciu, R.I.; Căleanu, C.D. A review of deep learning-based methods for pedestrian trajectory prediction. Sensors 2021, 21, 7543.
9. Zhou, H.; Ren, D.; Xia, H.; Fan, M.; Yang, X.; Huang, H. Ast-gnn: An attention-based spatio-temporal graph neural network for interaction-aware pedestrian trajectory prediction. Neurocomputing 2021, 445, 298–308.
10. Salzmann, T.; Ivanovic, B.; Chakravarty, P.; Pavone, M. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XVIII; Springer: Cham, Switzerland, 2020; pp. 683–700.
11. Kong, W.; Liu, Y.; Li, H.; Wang, C.; Tao, Y.; Kong, X. GSTA: Pedestrian trajectory prediction based on global spatio-temporal association of graph attention network. Pattern Recognit. Lett. 2022, 160, 90–97.
12. Messaoud, K.; Yahiaoui, I.; Verroust-Blondet, A.; Nashashibi, F. Attention based vehicle trajectory prediction. IEEE Trans. Intell. Veh. 2020, 6, 175–185.
13. Lin, L.; Li, W.; Bi, H.; Qin, L. Vehicle trajectory prediction using LSTMs with spatial–temporal attention mechanisms. IEEE Intell. Transp. Syst. Mag. 2021, 14, 197–208.
14. Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social-gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2255–2264.
15. Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971.
16. Yu, C.; Ma, X.; Ren, J.; Zhao, H.; Yi, S. Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part XII; Springer: Cham, Switzerland, 2020; pp. 507–523.
17. Xu, P.; Hayet, J.B.; Karamouzas, I. Socialvae: Human trajectory prediction using timewise latents. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022, Proceedings, Part IV; Springer: Cham, Switzerland, 2022; pp. 511–528.
18. Guo, H.; Liu, Y.; Meng, Q.; Li, J.; Chen, H. Goal-Oriented Pedestrian Trajectory Prediction Considering Spatial-Temporal Interactions. IEEE Trans. Instrum. Meas. 2024, 73, 2532316.
19. Mi, J.; Zhang, X.; Zeng, H.; Wang, L. DERGCN: Dynamic-evolving graph convolutional networks for human trajectory prediction. Neurocomputing 2024, 569, 127117.
20. Lin, X.; Liang, T.; Lai, J.; Hu, J.F. Progressive pretext task learning for human trajectory prediction. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, 29 September–4 October 2024, Proceedings, Part XXX; Springer: Cham, Switzerland, 2024; pp. 197–214.
21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
22. Pellegrini, S.; Ess, A.; Schindler, K.; Van Gool, L. You’ll never walk alone: Modeling social behavior for multi-target tracking. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 261–268.
23. Lerner, A.; Chrysanthou, Y.; Lischinski, D. Crowds by example. Comput. Graph. Forum 2007, 26, 655–664.
24. Robicquet, A.; Sadeghian, A.; Alahi, A.; Savarese, S. Learning Social Etiquette: Human Trajectory Understanding in Crowded Scenes. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part VIII; Springer: Cham, Switzerland, 2016; pp. 549–565.
25. Liang, J.; Jiang, L.; Niebles, J.C.; Hauptmann, A.G.; Fei-Fei, L. Peeking into the future: Predicting future person activities and locations in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 5725–5734.
26. Huang, Y.; Bi, H.; Li, Z.; Mao, T.; Wang, Z. Stgat: Modeling spatial-temporal interactions for human trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Long Beach, CA, USA, 16–17 June 2019; pp. 6272–6281.
27. Mangalam, K.; Girase, H.; Agarwal, S.; Lee, K.H.; Adeli, E.; Malik, J.; Gaidon, A. It is not the journey but the destination: Endpoint conditioned trajectory prediction. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part II; Springer: Cham, Switzerland, 2020; pp. 759–776.
28. Kingma, D.P.; Mohamed, S.; Rezende, D.J.; Welling, M. Semi-Supervised Learning with Deep Generative Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 3581–3589.
29. Varga, B.; Brand, T.; Schmitz, M.; Hashemi, E. Interaction-Aware Model Predictive Decision-Making for Socially-Compliant Autonomous Driving in Mixed Urban Traffic Scenarios. arXiv 2025, arXiv:2503.01852.
30. Varga, B.; Yang, D.; Hohmann, S. Cooperative Decision-Making in Shared Spaces: Making Urban Traffic Safer through Human-Machine Cooperation. arXiv 2023, arXiv:2306.14617.
31. Liu, W.; Liu, P.; Ma, J. DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning. arXiv 2025, arXiv:2505.05360.
32. Xing, Z.; Song, R.; Teng, Y.; Xu, H. DynHEN: A heterogeneous network model for dynamic bipartite graph representation learning. Neurocomputing 2022, 508, 47–57.
33. Dai, J.; Yuan, W.; Bao, C.; Zhang, Z. DGNN: Denoising graph neural network for session-based recommendation. In Proceedings of the 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA), Shenzhen, China, 13–16 October 2022; pp. 1–8.
34. Peng, H.; Wang, H.; Du, B.; Bhuiyan, M.Z.A.; Ma, H.; Liu, J.; Wang, L.; Yang, Z.; Du, L.; Wang, S.; et al. Spatial temporal incidence dynamic graph neural networks for traffic flow forecasting. Inf. Sci. 2020, 521, 277–290.
35. Guan, M.; Iyer, A.P.; Kim, T. Dynagraph: Dynamic graph neural networks at scale. In Proceedings of the 5th ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), Philadelphia, PA, USA, 12 June 2022; pp. 1–10.
36. Luo, W.; Zhang, H.; Yang, X.; Bo, L.; Yang, X.; Li, Z.; Qie, X.; Ye, J. Dynamic heterogeneous graph neural network for real-time event prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 3213–3223.
37. Zhou, L.; Yang, D.; Zhai, X.; Wu, S.; Hu, Z.; Liu, J. GA-STT: Human trajectory prediction with group aware spatial-temporal transformer. IEEE Robot. Autom. Lett. 2022, 7, 7660–7667.
38. Liu, Y.; Zhang, Y.; Li, K.; Worrall, S.; Qiao, Y.; Li, Y.F.; Kong, H. Knowledge-aware Graph Transformer for Pedestrian Trajectory Prediction. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC), Bilbao, Spain, 24–28 September 2023.
39. Chen, H.; Xu, Z.; Yeh, C.M.; Lai, V.; Zheng, Y.; Xu, M.; Tong, H. MGFormer: Masked Graph Transformer for Large-Scale Recommendation. arXiv 2024, arXiv:2405.04028.
40. Chen, W.; Sang, H.; Wang, J.; Zhao, Z. STIGCN: Spatial–temporal interaction-aware graph convolution network for pedestrian trajectory prediction. J. Supercomput. 2024, 80, 10695–10719.
41. Wang, H.; Chen, J.; Pan, T.; Dong, Z.; Zhang, L.; Jiang, R.; Song, X. STGformer: Efficient Spatiotemporal Graph Transformer for Traffic Forecasting. arXiv 2024, arXiv:2410.00385.
42. Wong, C.; Xia, B.; Zou, Z.; You, X. Socialcircle+: Learning the angle-based conditioned interaction representation for pedestrian trajectory prediction. arXiv 2024, arXiv:2409.14984.

Figure 1. (a) depicts the trajectories recorded for several agents during the time from t = 0 to t = 2. Most conventional models presume a static structure; (b) illustrates a multi-spatio-temporal graph framework employed to forecast the trajectories of many actors at time t.

Figure 2. The MSTT model consists of two principal components: the encoder and the decoder. The encoder integrates modules for relative spatio-temporal encoding and multi-spatio-temporal fusion. The module for relative spatio-temporal encoding takes as input the position of pedestrians, their identification, and the frame number. The resultant encoded output is processed by a temporal Transformer to encapsulate pedestrian temporal dynamics and is subsequently introduced into the multi-spatio-temporal fusion module to model interactions and derive the adjacency matrix. The multi-spatio-temporal GAT along with the temporal Transformer are then employed to train and generate interaction state information. The decoder leverages a multi-layer perceptron (MLP) to interpret the interaction data, yielding the corresponding real-world 2D coordinates.

Figure 3. The relative spatio-temporal coding method models the temporal and spatial information of agents. After RTE and RSE processing, the enhanced representations are input into the temporal Transformer and the multi-spatial GNN.

Figure 4. In multiple-spatio-temporal-map modeling, several spatio-temporal maps are generated by stacking maps from distinct time points. This entails computing the weights of pedestrians at different time steps with respect to the current target pedestrian, applying a local judgment to determine whether they exert any influence, and, after a global judgment, selecting the L moments that interact most strongly with the present moment.

Figure 5. We employ a GAT and a Transformer network to jointly capture pedestrian interaction dynamics. The GAT processes spatio-temporal interactions among pedestrians to generate $h^{STG}$, while the Transformer focuses on learning long-term interactions and produces $h^{T}$. Finally, $h^{STG}$ and $h^{T}$ are concatenated and fed into the decoder to generate real-world 2D coordinates.

Figure 6. The inference latency of two models (MSTT and STGformer). The red curve represents the STGformer model, while the blue curve represents the MSTT model.

Figure 7. Qualitative results for the ETH/UCY dataset. For each pedestrian, we show the path history (green line), the future true path (red line), the STGformer prediction results (yellow dashed line), and the prediction results of our MSTT (blue dashed line).

Figure 8. Graph representation of pedestrian interactions. Nodes denote individual pedestrians. Edge colors ranging from purple to yellow correspond to interaction weights computed by the GAT, with the color bar indicating the weight scale. Green dashed lines display historical trajectories across 8 frames while red dashed lines show ground truth trajectories spanning 12 frames.

Figure 9. Qualitative evaluation reveals four characteristic failure modes: (a) peripheral observation artifacts, (b) occlusion-induced deviations, (c) collective motion discontinuities, and (d) high-curvature path miscalculations. Red, green, and blue trajectories denote the ground truth, the observed data, and the MSTT predictions, respectively.

Figure 10. (a) Analysis of various graph stacking numbers and their effect on the MSTT model’s efficacy; (b) evaluation of differing sampling numbers in relation to ADEs (black line) and FDEs (red line) for two models.

Table 1. Performance comparison (ADE/FDE in meters) across methods. Bold values indicate the best-performing model on each dataset. The model uses 8 observed time steps to predict the next 12.

Model	Year	ETH	HOTEL	UNIV	ZARA1	ZARA2	Avg
Social-GAN [14]	2018	0.73/1.48	0.49/1.01	0.41/0.84	0.27/0.56	0.33/0.70	0.45/0.91
PECNet [27]	2019	0.81/1.52	0.72/1.61	0.60/1.26	0.34/0.69	0.42/0.84	0.58/1.18
Trajectron++ [10]	2020	0.67/1.18	0.18/0.28	0.30/0.54	0.25/0.41	0.18/0.32	0.32/0.55
STAR [16]	2020	0.36/0.65	0.21/0.36	0.31/0.62	0.26/0.55	0.22/0.46	0.27/0.53
DERGCN [19]	2023	0.54/1.01	0.23/0.42	0.30/0.63	0.22/0.44	0.20/0.42	0.30/0.58
PPT [20]	2024	0.35/0.51	0.15/0.25	0.13/0.24	0.22/0.39	0.18/0.31	0.21/0.34
STIGCN [40]	2024	0.42/0.58	0.14/0.23	0.17/0.29	0.26/0.45	0.21/0.37	0.24/0.38
SocialCircle+ [42]	2024	0.25/0.42	0.10/0.15	0.24/0.42	0.23/0.38	0.18/0.24	0.20/0.32
STGformer [41]	2024	0.27/0.56	0.11/0.17	0.18/0.23	0.16/0.30	0.17/0.21	0.18/0.29
MSTT(Ours)	-	0.24/0.49	0.18/0.29	0.19/0.28	0.21/0.30	0.16/0.19	0.20/0.31

Table 2. Results of trajectory prediction for SDD and Rebounding using ADE/FDE metrics in pixels comparing our proposed method (MSTT) with related methods from the literature. The lower error indicates better performance, and the best performance is marked in bold.

Dataset	Social-GAN [14]	STAR [16]	DERGCN [19]	STGformer [41]	Ours
SDD	27.23/41.44	7.85/11.85	8.21/10.22	5.38/8.92	3.16/5.12
Rebounding	30.54/47.68	15.65/19.21	14.06/17.63	12.42/15.49	11.36/13.42

Table 3. Analysis of model inference duration and parameter count. The inference time denotes the duration required for one trajectory in the UCY dataset.

Model Name	Parameter Count (M)	Memory Usage (GB)	Inference Latency (ms)
PECNet [27]	25.0	10.5	0.164
SocialCircle+ [42]	30.8	12.6	0.173
STAR [16]	20.5	14.4	0.143
STGformer [41]	34.4	16.4	0.186
Ours	28.8	13.5	0.158

Table 4. Ablation experiments for the model. Each entry contains two numerical values representing the ADE/FDE of the predicted outcomes. Observations are recorded for 8 time steps, while the predicted values pertain to the subsequent 12 time steps.

Model	ETH	HOTEL	ZARA1	ZARA2	UNIV	Avg
STAR	0.36/0.65	0.22/0.36	0.27/0.56	0.32/0.55	0.28/0.68	0.27/0.53
STAR-R	0.33/0.52	0.22/0.34	0.25/0.50	0.22/0.45	0.24/0.58	0.25/0.48
STAR-M	0.34/0.58	0.18/0.28	0.24/0.41	0.25/0.48	0.27/0.54	0.26/0.46
STAR-R-M	0.24/0.49	0.18/0.29	0.19/0.28	0.21/0.30	0.16/0.19	0.20/0.31

Table 5. Accuracy comparison across experimental variants. Each entry reports the ADE/FDE. Values underlined indicate optimal component configurations and bold values mark overall best performance. The evaluation protocol uses 8 observed frames to predict subsequent 12-frame trajectories.

Method	Variants	ETH	HOTEL	UNIV	ZARA1	ZARA2	Avg
w	0	0.42/0.67	0.32/0.45	0.31/0.39	0.28/0.48	0.26/0.24	0.32/0.45
	0.25	0.32/0.54	0.26/0.30	0.21/0.36	0.22/0.35	0.18/0.23	0.24/0.36
	0.5	0.24/0.49	0.18/0.29	0.19/0.28	0.21/0.30	0.16/0.19	0.20/0.31
	0.75	0.28/0.42	0.21/0.32	0.23/0.33	0.23/0.35	0.17/0.22	0.22/0.33
Multi-head	w/o	0.38/0.65	0.28/0.42	0.30/0.45	0.32/0.40	0.26/0.34	0.31/0.45
	1	0.29/0.54	0.21/0.26	0.23/0.45	0.24/0.26	0.21/0.26	0.24/0.35
	2	0.26/0.51	0.18/0.26	0.20/0.46	0.20/0.32	0.17/0.23	0.20/0.36
	4	0.24/0.49	0.18/0.29	0.19/0.28	0.21/0.30	0.16/0.19	0.20/0.31
	8	0.27/0.52	0.20/0.26	0.19/0.48	0.25/0.34	0.18/0.25	0.22/0.37
WeightA	w/o	0.30/0.56	0.25/0.36	0.24/0.32	0.26/0.33	0.18/0.23	0.25/0.36
	$A_{L 2}$	0.26/0.53	0.20/0.30	0.20/0.31	0.24/0.30	0.18/0.22	0.22/0.33
	$A_{t}$	0.24/0.49	0.18/0.29	0.19/0.28	0.21/0.30	0.16/0.19	0.20/0.31

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
