Article

MTGNet: Multi-Agent End-to-End Motion Trajectory Prediction with Multimodal Panoramic Dynamic Graph

1 College of Computer Science and Technology, Changchun University, Changchun 130022, China
2 College of Computer Science and Technology, Jilin University, Changchun 130025, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5244; https://doi.org/10.3390/app15105244
Submission received: 22 February 2025 / Revised: 9 April 2025 / Accepted: 5 May 2025 / Published: 8 May 2025
(This article belongs to the Special Issue Pushing the Boundaries of Autonomous Vehicles)

Abstract

With the rapid development of autonomous driving technology, multi-agent trajectory prediction has become a core foundation of autonomous driving algorithms. Efficiently and accurately predicting the future trajectories of multiple agents is key to evaluating the reliability and safety of autonomous vehicles. Recently, numerous studies have focused on capturing agent interactions in complex traffic scenarios. While most methods adopt agent-centric scene construction, they often rely on fixed scene sizes and incur significant computational overhead. To address this, we propose the Multimodal Transformer Graph Convolutional Neural Network (MTGNet) framework. MTGNet not only constructs a panoramic, fully connected dynamic traffic graph for agents but also dynamically adjusts the size of traffic scenes, enabling accurate and efficient multimodal multi-agent trajectory prediction. In addition, we utilize a graph convolutional neural network (GCN) to process graph-structured data, which captures global relationships while enhancing the focus on local features within the scene, thereby improving the model's sensitivity to local information. Our framework was tested on the Argoverse 2.0 dataset and compared with nine state-of-the-art vehicle trajectory prediction methods, achieving the best performance across all three selected metrics.

1. Introduction

Agent trajectory prediction builds upon vehicle trajectory prediction by expanding the prediction scope to encompass all traffic agents, including pedestrians, motor vehicles, non-motor vehicles, and others. The modeling is performed based on the historical trajectory of the target agent, the traffic scene map, and the behavior of surrounding traffic participants to predict the future trajectories of multiple agents [1,2]. Since real-world traffic scenarios are dynamic and involve multiple interacting agents, autonomous vehicles need to consider not only basic traffic rules (such as traffic lights, lane configurations, speed limits, and lane restrictions) but also capture the interaction relationships among multiple agents in the scene, as well as the temporal dependencies of each agent’s motion, to predict their future trajectories. At the same time, accurately understanding the movement habits and patterns of surrounding intelligent agents is highly challenging, as the intentions and destinations of the multiple agents around them are unknown, and their actions influence each other [3]. Therefore, in complex traffic scenarios involving multiple agents, an agent’s behavior is primarily determined by its intricate interactions with other agents, which interfere with the fundamental laws of motion, making it extremely challenging to predict future trajectories [4].
In the field of traffic scene modeling, various methods offer their own advantages. Among non-deep-learning methods, physics-based prediction remains important [5,6]. Such methods build models around the kinematic and dynamic characteristics of vehicles. For example, the dynamic-model single-trajectory prediction method proposed by Kaempchen et al. [6] is computationally efficient and suitable for scenarios with few constraints; however, it models road-related elements insufficiently, and the uncertainty of state estimation makes long-term prediction unreliable. Machine learning methods have also been widely explored in traffic scene modeling. For instance, the support vector machine [7] can identify lane-changing maneuvers from features such as the steering wheel angle, coordinates, and acceleration of the vehicle; it performs remarkably well in classification problems but poorly in prediction tasks. The hidden Markov model [8] approaches the problem from the perspective of driving behavior patterns, treating the driver's intention as the hidden state and the vehicle's motion parameters as the observation state. After the Baum–Welch algorithm is used to learn the state transition and observation probability matrices, the Viterbi algorithm decodes the most likely driving intention sequence. This supports multimodal prediction but has limited ability to model complex interaction relationships [9].
Deep learning methods have received even more attention in recent years. In previous studies, many scholars have utilized CNN (convolutional neural network) approaches for traffic scene modeling [10,11,12,13,14,15]. Some studies have represented scenes as bird's-eye views [11,13]. Although this approach can model traffic scenes, its computational overhead is very large. To capture the temporal dependency of an agent's own motion, some studies use LSTM (long short-term memory network) [10,12], which is well suited to vehicle trajectory prediction. Kim et al. [12] proposed an LSTM-based method capable of predicting vehicle positions in an occupancy grid at fixed future time steps, thereby achieving trajectory prediction. Additionally, CNNs can extract spatial features, such as the interactions between traffic participants [10]. This inspired researchers such as Deo et al. [10] to combine LSTM and CNN for processing both temporal and spatial information in trajectory prediction. Furthermore, Deo et al. [10] improved upon the work of Kim et al. [12] by enabling the model to output a continuous, multimodal probability distribution for predicting the vehicle's location range within the next 5 s. However, this approach has a critical limitation: it assumes that traffic scenes conform to grid structures to capture temporal dependencies, while the spatial relationships between agents are non-Euclidean [16]. This makes it challenging for LSTM to handle complex spatial relationships in dynamic traffic scenarios. To address these challenges, some studies have proposed vectorization [17,18] methods to model traffic scenarios; however, such methods struggle to make real-time predictions in traffic environments with fast-moving agents. To address the issue that vectorization methods are usually not robust to translation and rotation of the reference frame, recent work has normalized the scene to an agent-centric coordinate system for reasoning [4,19,20], thereby maintaining rotation invariance relative to the agent's global position and orientation [4]. When there are a large number of agents in the scene, re-normalizing the scene and computing the features of each agent incur extremely high computational overhead. The method proposed in [4] exploits the symmetry and hierarchical structure in the multi-agent motion prediction problem, divides the motion prediction task into multiple stages, models these relationships hierarchically, and uses a Transformer to capture dependencies in long time series. However, when processing traffic scenes, this method partitions the scene using a fixed size, which can easily result in unnecessary resource waste. Additionally, in graph-based trajectory prediction studies, some scholars have proposed using a GCN (graph convolutional neural network) [21,22] to model interactions between multiple agents. However, in these initial attempts, the connection strength is treated as a function determined solely by the distance between nodes in the graph [23]; most studies assign edge weights based only on inter-node distance. Moreover, the strength of interactions between agents changes dynamically across time steps. As shown in Figure 1, at Time Step 1, Agent 2 has a stronger influence on Agent 3 than Agent 1 does, even though the distance between Agent 2 and Agent 3 is greater than that between Agent 1 and Agent 3.
However, at Time Step 2, the distances between Agent 1 and Agent 3 and between Agent 2 and Agent 3 become almost identical. Notably, the influence of Agent 1 on Agent 3 increases significantly at this stage. Therefore, defining edge weights solely based on distance is clearly inconsistent with real-world scenarios [16]. In addition, within the same traffic scenario, there are significant differences in the degree of influence exerted by agents in the left and right lanes, as well as those ahead and behind in the same lane, on the trajectory of the target agent. Compared to the agent ahead in the same lane, the influence of the agent behind is significantly smaller. In real-world traffic scenarios, this method is clearly unsuitable for multi-agent future trajectory prediction and will also introduce noise interference [16]. At the same time, ref. [23] proposed an attention-based spatiotemporal graph convolutional network to construct a hierarchical structure consisting of spatial GCN and temporal GCN to address this issue. However, it employs GRU (gated recurrent unit) to predict the future trajectory of the agent. When learning the long-term dependencies of the agent’s historical trajectory, GRU is prone to gradient vanishing or gradient exploding problems, making it difficult for the model to accurately capture key information over long time spans in the historical trajectory.
To address these issues, we propose the Multimodal Transformer Graph Convolutional Neural Network (MTGNet) for multimodal trajectory prediction in complex urban traffic scenarios. First, a panoramic dynamic traffic graph is constructed from multi-dimensional state information such as vehicle position and speed, and high-interaction regions and sparse-interaction regions are divided dynamically. This dynamic scene division strategy automatically adjusts the scene size according to traffic density: in high-interaction regions, the perception radius is reduced to decrease redundant calculations, and in sparse-interaction regions, the scope is expanded to capture potential long-distance interactions. The model adopts an encoder–decoder architecture to achieve end-to-end prediction. The encoder handles the features of the multimodal traffic graph, and the decoder directly generates trajectory predictions. The model employs a Transformer-based encoder to perform temporal alignment and noise filtering on the multimodal data. Compared with temporal models such as LSTM [3,10], the self-attention mechanism of the Transformer can effectively capture the long-term dependencies in drivers' behaviors, filter out the noise of atypical operations, and retain the features of basic driving behaviors. We uniformly process the time offsets of different targets and then obtain positional representations through a linear transformation followed by a position encoder (pos_encoder). For local interaction modeling, MTGNet introduces an encoder based on a Graph Convolutional Network (GCN) to process the data denoised by the Transformer encoder. Compared with directly using fully connected layers [4], MTGNet uses a GCN in high-interaction regions to capture local features among agents: the features of neighboring nodes are aggregated through the message-passing mechanism, and key interaction relationships are dynamically emphasized via attention weights. MTGNet reduces the computational overhead in sparse-interaction regions while maintaining high-precision modeling in high-interaction regions. Finally, to preserve the offset position representation after position encoding as accurately as possible, we fuse it using residual superposition.
The main contributions of this paper are as follows:
  • A Multimodal Transformer Graph Convolutional Neural Network (MTGNet) is proposed. This network dynamically constructs an agent-centric scene based on vectorization. While capturing global features, it utilizes graph convolutional neural networks to focus on local features and employs a Res_Decoder layer to decode the constructed panoramic vector, which better confirms the established panoramic vehicle connection network;
  • A panoramic agent-based fully connected dynamic traffic graph is proposed to solve the resource waste problem caused by building a fixed scene size based on agent-centric methods [4,19,20], and to dynamically divide the traffic scene size. By updating the node features at the current time step through the message-passing mechanism, the changes in the relationships between each agent and other agents at the current time step can be captured, thereby better adapting to the changes in the dynamic environment and local features;
  • We propose using a Transformer-based encoder to capture the long-range dependencies among features and filter out noise by leveraging its self-attention mechanism. This approach circumvents the drawbacks of directly using the direct distance between agents as a weight factor to construct the adjacency matrix. It not only reduces the computational overhead of the model but also makes the adjacency relationships of nodes in the panoramic agent fully connected dynamic traffic graph more consistent with the interaction relationships among agents in real traffic scenarios;
  • The Argoverse 2.0 dataset [24] was used for testing. Argoverse 2.0 is a fully upgraded version of Argoverse 1.0, surpassing it in data scale, diversity, task support, and technical detail.

2. Related Studies

Previous trajectory prediction methods can be categorized into physics-based methods [5,6], classical machine learning-based methods [25], and deep learning-based methods. Among these, early trajectory prediction methods using physics-based approaches typically rely on the kinematic and dynamic characteristics of vehicles (e.g., speed, acceleration). While they excel in short-term prediction, they lack the capability to perform well in complex traffic scenarios [9]. Machine learning-based methods for predicting lane change intentions depend on large amounts of data for training. Although they can effectively identify vehicle behavior, their ability to handle rare events is limited, and the models lack interpretability [9]. Deep learning-based methods optimize the loss function by automatically learning features, making them particularly effective in handling complex scenes and capturing long-term dependencies [9]. Current research demonstrates that deep learning-based models have not only achieved significant success in vehicle trajectory prediction but have also been extended to multi-agent trajectory prediction tasks. We can divide the multi-agent trajectory prediction task into two parts: motion prediction [4] and traffic scene modeling. Early researchers used RNN (recurrent neural network) [26] to predict vehicle trajectories. With the success of LSTM-based methods in capturing sequence data, researchers have widely adopted LSTM to predict the trajectories of vehicles and pedestrians [10,12,27,28], effectively extracting temporal features. Ref. [29] used the traditional three-layer encoder–decoder structure with LSTM as the computational core of the encoder–decoder. However, this structure is not sufficiently sensitive to long-term changes in time series, which may lead to unstable trajectory prediction results. Ref. [3] incorporated the data propagation characteristics of the bidirectional recurrent neural network (Bi-RNN) into the encoder–decoder to form a bidirectional encoder–decoder, improving the sensitivity of the encoder–decoder to time series. We categorize LSTM-based methods into two groups based on whether they can capture the complex interactions between agents. In terms of scene modeling, early LSTM-based methods [12,28] used a single-layer LSTM to process the historical trajectories of agents in the scene to predict their future trajectories, which largely ignored the mutual influence between agents. Ref. [30] proposed learning vehicle interaction relationships in traffic scenes by using a two-layer LSTM, allowing vehicles in the scene to share their states with each other. While the latter approach improves the accuracy of trajectory prediction, explaining and interpreting model decisions remains a challenge. To learn rich representations of traffic scene elements (e.g., lane boundaries, crosswalks, traffic lights), including HD maps and the agent’s past trajectories, many research works use rasterized scenes as model input [11,14,31]. These methods first extract map elements and then render the scene as a bird’s-eye view image using different colors. The historical trajectory of the agent can be processed by rasterizing it into an additional image channel [11] or by using a time series model [10]. However, rasterizing the scene not only incurs significant computational overhead but also makes it difficult to capture the complex interactions between agents using LSTM modeling grids, as the spatial relationships between agents are non-Euclidean.
With the introduction of the Transformer model [32], which was originally used to handle natural language processing (NLP) tasks, and its widespread application in various other fields (such as computer vision [33], target detection [34,35,36], reinforcement learning [37], and bioinformatics [38]), the Transformer model has gained significant attention. At the same time, due to the existence of the attention mechanism, the Transformer model can capture dependencies of both long-term and short-term features simultaneously, making it suitable for solving time series tasks [39]. The latest research proposes a Transformer-based multi-agent trajectory prediction approach [4,40,41] to capture long-term dependencies in trajectory prediction tasks. In terms of traffic scene modeling, recent studies have adopted a vectorization approach, using sparse coding (scenes are usually represented by a small number of “key elements” rather than relying on every detail) to efficiently represent information through key features, greatly reducing the computational overhead caused by rasterization. The vectorization method can represent the complex information in the scene more efficiently and accurately by converting the scene into a set of entities with semantic and geometric information and learning the relationships between them. In short, the method of vectorizing map scenes can effectively process and express the complex geometric shapes and semantic relationships in the scene. The latest research, the HiVT (Hierarchical Vector Transformer) model [4], introduces a hierarchical structure in multi-agent motion prediction, combining vectorized scene representation with the advantages of Transformer in processing long time series data. These advantages enable it to make efficient predictions over a longer time frame and therefore be applicable to complex real-time scenarios. At the same time, GCN performs very well in handling the interactive relationships in multi-agent trajectory prediction tasks in scenes similar to intersections. Therefore, we propose a new framework based on Transformer and GCN to complete the multi-agent trajectory prediction task.

3. Methodology

In this section, we first introduce the overall framework of MTGNet in Section 3.1, followed by the Motion Trajectory Encoder and Global Interaction Encoder in Section 3.2 and Section 3.3, respectively. Finally, we introduce the Decoder in Section 3.4.

3.1. Overview

Figure 2 shows the overall structure of the Multimodal Transformer Graph Convolutional Neural Network (MTGNet). First, in the encoder part, we use a Transformer-based encoder and the Global Interaction Encoder (based on GCN). We employ vectorized scene representation [4,17,18] to learn the relationships between the agent trajectory points and lane vector entities. The denoised agent trajectory features and map features are used to model a dynamic panoramic agent traffic map centered on the intelligent agent. The Global Interaction Encoder is utilized to focus on local features. Second, in the decoder part, the Residual Decoder [4] is used to decode the constructed panoramic vector, enabling the model to better confirm the established panoramic vehicle connection network. Finally, the output of the decoded model is converted from the local coordinate system back to the global coordinate system, and the shape of the output tensor is adjusted for subsequent processing. The result is then fitted with the original position data in the original dataset.

3.2. Motion Trajectory Encoder

As shown in Figure 3, the Motion Trajectory Encoder is composed of Linear_Embedding, Pos_Encoder1D, and Transformer_Layer.
When preprocessing data, although using absolute position [42] or relative position with the initial position as the coordinate origin to predict future trajectories [43] can achieve trajectory prediction within a certain period of time, we argue that using the angle of position transformation and the offset at each time step to predict the next position can provide richer information for our model compared to directly using position coordinates.
When applying position transformation, our model design still requires selecting an initial point as the reference for distance calculation [44]. In this paper, the motion transformations of the first 50 time steps are used to predict the motion of the next 60 time steps. Although it is technically feasible to use the first time step as the coordinate origin, we instead choose the 49th time step as the coordinate origin. This setting not only helps the model understand the step-by-step changes more deeply but also provides strong guidance for subsequent predictions.
Linear_Embedding is used to perform a linear transformation on the input data, followed by the ReLU [45] activation function to introduce nonlinearity and map the data to the latent space.
Specifically, in terms of the model design, we first standardized the temporal offsets of different agents. We transformed the input sequence of position coordinates of agents into a vectorized sequence:
$f_t = [\Delta p_t, \theta_t]$
Here, $\Delta p_t$ represents the displacement vector between adjacent time steps, which captures the changes in motion speed and direction. $\theta_t$ represents the heading angle, defined as the angle between the displacement vector and the coordinate axis:
$\Delta p_t = p_t - p_{t-1}$
$\theta_t = \arctan2(\Delta p_t[1], \Delta p_t[0])$
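To make this vectorization concrete, the following minimal sketch converts a sequence of absolute positions into the per-step displacement and heading features defined above; the function name, tensor shapes, and use of PyTorch are illustrative assumptions rather than the authors' code.

```python
import torch

def vectorize_trajectory(positions: torch.Tensor) -> torch.Tensor:
    """Convert absolute positions into per-step motion vectors [dx, dy, heading].

    positions: (T, 2) tensor of (x, y) coordinates for one agent.
    Returns: (T - 1, 3) tensor of [delta_x, delta_y, heading_angle].
    """
    delta_p = positions[1:] - positions[:-1]           # displacement between adjacent steps
    theta = torch.atan2(delta_p[:, 1], delta_p[:, 0])  # heading angle via arctan2
    return torch.cat([delta_p, theta.unsqueeze(-1)], dim=-1)

# Example: a short trajectory moving mostly along the x-axis.
traj = torch.tensor([[0.0, 0.0], [1.0, 0.0], [2.0, 0.1]])
print(vectorize_trajectory(traj))  # headings close to 0 rad
```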
Then, we obtain the position information of the agent through a linear transformation and the addition of a position encoder (Pos_Encoder1D). At the same time, to preserve the offset position representation after position encoding as accurately as possible, we further use residual superposition to fuse the information, thereby optimizing the position representation. Pos_Encoder1D uses a fixed trigonometric function form for encoding. First, the number of channels is processed to ensure that it is even:
$C' = \lfloor C/2 \rfloor \times 2$
Here, $C$ is the original number of channels.
$C'$ is the number of channels after processing.
$\lfloor \cdot \rfloor$ represents rounding down.
Then, the inverse frequency factor $\mathrm{inv\_freq}_i$ is calculated [32]:
$\mathrm{inv\_freq}_i = \dfrac{1}{10000^{2i/C'}}, \quad i \in \{0, 2, \ldots, C'-2\}$
Here, $i$ is the index used for calculating the inverse-frequency factor; it starts from 0, increments with a step size of 2, and goes up to $C'-2$.
$\mathrm{inv\_freq}_i$ represents the $i$-th inverse-frequency factor.
$2i/C'$ is the exponent, which determines the variation pattern of the inverse-frequency factor.
The position index $p$ is scaled by the inverse frequency to construct the position encoding matrix $PE_p$; the final shape is $(x, C)$:
$\mathrm{sin\_inp}_{p,i} = p \cdot \mathrm{inv\_freq}_i$
Here, $\mathrm{sin\_inp}_{p,i}$ is the intermediate value computed from the position index $p$ and the inverse frequency factor $\mathrm{inv\_freq}_i$; it serves as the input to the subsequent sine and cosine calculations that construct the position encoding matrix $PE_p$:
$PE_p = [\sin(\mathrm{sin\_inp}_{p,0}), \cos(\mathrm{sin\_inp}_{p,0}), \sin(\mathrm{sin\_inp}_{p,1}), \cos(\mathrm{sin\_inp}_{p,1}), \ldots]$
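The sketch below gives one possible implementation of Pos_Encoder1D following the standard sinusoidal formulation referenced above; the function name, the zero-padding of an odd final channel, and the shapes are assumptions.

```python
import torch

def pos_encoder_1d(seq_len: int, channels: int) -> torch.Tensor:
    """Fixed sinusoidal 1D positional encoding; returns a (seq_len, channels) tensor."""
    c_prime = (channels // 2) * 2                                   # force an even channel count
    inv_freq = 1.0 / (10000 ** (torch.arange(0, c_prime, 2).float() / c_prime))
    pos = torch.arange(seq_len).float().unsqueeze(1)                # (seq_len, 1)
    sin_inp = pos * inv_freq                                        # (seq_len, c_prime // 2)
    pe = torch.zeros(seq_len, channels)
    pe[:, 0:c_prime:2] = torch.sin(sin_inp)                         # even channels: sine
    pe[:, 1:c_prime:2] = torch.cos(sin_inp)                         # odd channels: cosine
    return pe

print(pos_encoder_1d(seq_len=50, channels=128).shape)  # torch.Size([50, 128])
```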
Following the linear embedding layer and position encoding layer, the Transformer layer acts as the encoder layer to process the encoded features. Here, a traditional multi-head self-attention model is utilized for feature extraction. The goal of this step is to eliminate “noise” in the agent’s trajectory—specifically, to remove behaviors that do not align with fundamental actions. Thus, the module’s function is to reconstruct model features and reduce data noise, ensuring the input to subsequent graph convolution layers is refined and noise-filtered.
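As a rough illustration of how these pieces fit together, the following sketch combines the linear embedding, positional encoding, and a Transformer encoder. The 128-dimensional latent space and four attention heads follow Section 4.1.3, while the class name, the number of encoder layers, and the other details are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MotionTrajectoryEncoder(nn.Module):
    """Linear_Embedding + positional encoding + Transformer_Layer, per Section 3.2."""

    def __init__(self, in_dim: int = 3, d_model: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU())  # Linear_Embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)   # multi-head self-attention stack

    def forward(self, motion_vectors: torch.Tensor, pos_enc: torch.Tensor) -> torch.Tensor:
        # motion_vectors: (batch, T, in_dim) per-step [dx, dy, heading] features
        # pos_enc: (T, d_model) fixed sinusoidal encoding (e.g., from Pos_Encoder1D)
        h = self.embed(motion_vectors)
        h = h + pos_enc.unsqueeze(0)      # residual superposition of positional information
        return self.encoder(h)            # denoised temporal features

enc = MotionTrajectoryEncoder()
x = torch.randn(8, 49, 3)                 # 8 agents, 49 motion vectors each
pe = torch.zeros(49, 128)                 # placeholder; in practice the sinusoidal encoding above
print(enc(x, pe).shape)                    # torch.Size([8, 49, 128])
```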

3.3. Global Interaction Encoder

As shown in Figure 4, the Global Interaction Encoder uses GCN to extract the denoised features and construct a fully connected dynamic traffic map for panoramic agents. We adopt an agent-centric approach to construct traffic scenarios. To avoid resource waste and enhance computational efficiency, we introduce a strategy to dynamically divide the size of traffic scenarios. This strategy enables the dynamic adjustment of the graph’s scale and connection mode based on the number of agents at different times or in various scenarios, thereby reducing unnecessary computational overhead.
We treat each extracted agent as a node and employ the GCN-based Global Interaction Encoder module to reconstruct the edges between different agents. Initially, we construct a fully connected subgraph for each node, and then progressively connect these subgraphs to form a complete traffic trajectory graph. Our goal is to build a fully connected graph based on a predefined connection sequence. This process not only efficiently utilizes computational resources but also ensures comprehensive modeling of the interactions between each node in the traffic scenario.
In the Global Interaction Encoder, we use the Dynamic Graph Builder to treat agents as node entities and extract their feature vectors (such as position, motion state, and interaction attributes) as node attributes. Specifically, the traffic graph $G_t$ at time step $t$ is defined as:
$G_t = (V_t, E_t)$
$V_t = \{v_1^t, v_2^t, \ldots, v_N^t\}$
Here, $V_t$ represents the set of nodes, and each node represents a traffic participant.
$N$ represents the number of nodes in the graph at time step $t$.
$E_t$ represents the set of edges, which characterizes the interactions between nodes. We rely on the attention mechanism to determine these interactions, which enables us to more accurately capture the complex and changeable interactions among traffic participants. After constructing the graph structure, in order to describe the relationships between nodes and the features of edges in more detail, we further construct edge attributes for each edge in the fully connected traffic graph. We use scaled dot-product attention to dynamically calculate the edge weights $e_{i,j}^t$:
$e_{i,j}^t = \dfrac{c_i^t \cdot c_j^t}{\sqrt{d}} \cdot \exp\!\left(-\dfrac{\mathrm{dist}(i,j)}{50}\right)$
Here, $\sqrt{d}$ is the scaling factor and $d$ is the dimension of the feature vector, which prevents problems such as gradient explosion in subsequent calculations.
$\exp(-\mathrm{dist}(i,j)/50)$ is the spatial decay term.
$\mathrm{dist}(i,j)$ is the physical distance between the traffic participants represented by node $i$ and node $j$.
$c_i^t \cdot c_j^t$ is the dot product of the feature vectors of node $i$ and node $j$.
The symbol "$\cdot$" between the two terms denotes multiplication. The feature vector of node $i$ at time step $t$ is:
$c_i^t = [x_i^t, y_i^t, v_i^t, a_i^t, \theta_i^t, \mathrm{lane}_i^t]$
Here, $x_i^t$ and $y_i^t$ represent the coordinates of node $i$ at time step $t$.
$v_i^t$ and $a_i^t$ represent the velocity and acceleration of node $i$ at time step $t$, respectively.
$\theta_i^t$ represents the heading angle.
$\mathrm{lane}_i^t$ represents the lane ID. In this way, the edge features can express the displacement and relative motion between agents, providing effective inputs for the feature extraction of the graph neural network.
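A minimal sketch of the edge-weight computation above, assuming an (N, d) feature matrix and (N, 2) positions per time step; the function name and the 50 m decay length taken from the equation are illustrative.

```python
import torch

def edge_weights(features: torch.Tensor, positions: torch.Tensor, decay: float = 50.0) -> torch.Tensor:
    """Scaled dot-product edge weights with spatial decay.

    features:  (N, d) feature vectors c_i, one row per agent.
    positions: (N, 2) global (x, y) coordinates of the agents.
    Returns an (N, N) matrix of edge weights e_ij.
    """
    d = features.shape[-1]
    dot = features @ features.T / (d ** 0.5)   # scaled dot product c_i · c_j / sqrt(d)
    dist = torch.cdist(positions, positions)   # pairwise Euclidean distances
    return dot * torch.exp(-dist / decay)      # spatial decay term

feats = torch.randn(5, 6)                      # [x, y, v, a, heading, lane_id] per agent
pos = feats[:, :2]
print(edge_weights(feats, pos).shape)          # torch.Size([5, 5])
```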
We distinguish high-interaction scenarios from sparse-interaction scenarios according to the agent density $\rho_t$ and the interaction intensity $I_t$:
$\rho_t = \dfrac{N}{L \times D}$
$I_t = \dfrac{1}{N} \sum_{i=1}^{N} \sum_{j \in \mathcal{N}_i} e_{i,j}^t$
Here, $\rho_t$ represents the agent density within a region of $(L \times D)$ m². $L$ is the length and $D$ is the width of the rectangular region used for calculating the agent density.
$I_t$ represents the strength of the interaction between agents, where $\mathcal{N}_i$ denotes the set of neighbors of node $i$.
$e_{i,j}^t$ represents the edge weight between node $i$ and node $j$ at time step $t$.
When $\rho_t$ exceeds a predefined density threshold $\rho_{\mathrm{threshold}}$ and $I_t$ exceeds a predefined interaction intensity threshold $I_{\mathrm{threshold}}$, the scenario is defined as a high-interaction scenario. These thresholds can be dynamically adjusted according to the data characteristics, enabling the model to adapt more effectively to various real-world traffic or agent-based scenarios.
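The scene-partitioning rule can be sketched as follows; the threshold values and the treatment of all other nodes as neighbors are assumptions, not values reported in the paper.

```python
import torch

def is_high_interaction(edge_w: torch.Tensor, region_l: float, region_d: float,
                        rho_thr: float = 0.02, i_thr: float = 0.5) -> bool:
    """Classify a scene using agent density and interaction intensity.

    edge_w: (N, N) edge-weight matrix, e.g., from the previous sketch.
    region_l, region_d: length and width (in metres) of the rectangular region.
    """
    n = edge_w.shape[0]
    rho = n / (region_l * region_d)                       # agents per square metre
    neighbor_sum = edge_w.sum(dim=1) - edge_w.diagonal()  # sum over neighbors j != i
    intensity = neighbor_sum.mean()                       # I_t = (1/N) * sum_i sum_j e_ij
    return rho > rho_thr and intensity.item() > i_thr

w = torch.rand(12, 12)
print(is_high_interaction(w, region_l=40.0, region_d=15.0))
```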
To handle the different weights and local relationships of nodes in the graph, we use a graph convolutional network (GCN) instead of a simple fully connected layer [4]. GCN can not only capture the global relationships between nodes but also emphasize more important node interactions in local neighborhoods, avoiding the practice of treating all nodes equally in the fully connected layer. The node features are combined with the edge features through the adjacency matrix to effectively extract local and global information. Each graph convolution layer propagates information through the message-passing mechanism to update the node features. It can capture the mutual relationships between each agent and other agents. As an important extension of the message-passing mechanism in graph neural networks, the Graph Attention Network (GAT) proposed by Veličković et al. [46] further emphasizes the attention mechanism on neighboring nodes during the message-passing process, which also reflects the diversity and development of the message-passing mechanism in the field of graph neural networks. The message transmission process between node i and its neighbor node j can be expressed as:
$m_{ij} = \sigma\!\left(W_f[h_i, h_j, e_{ij}] \odot \mathrm{softplus}\!\left(W_s[h_i, h_j, e_{ij}]\right)\right)$
Here, $m_{ij}$ represents the message from node $i$ to node $j$.
$W_f$ and $W_s$ are the two linear transformation weight matrices applied to the concatenated node and edge features.
$h_i$ and $h_j$ are the feature vectors of nodes $i$ and $j$, respectively.
$e_{ij}$ is the edge feature between node $i$ and node $j$:
$e_{ij} = c_j - c_i$
Here, $c_i$ and $c_j$ are the feature vectors of nodes $i$ and $j$, respectively. In this way, edge features express the displacement and relative motion between agents, providing effective input for the feature extraction of the graph neural network.
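A minimal sketch of this message function, assuming an element-wise product between the two linear branches inside the sigmoid (the operator is not explicit in the text); layer dimensions and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMessage(nn.Module):
    """Message m_ij = sigma(W_f z ⊙ softplus(W_s z)) with z = [h_i, h_j, e_ij]."""

    def __init__(self, node_dim: int, edge_dim: int, out_dim: int):
        super().__init__()
        in_dim = 2 * node_dim + edge_dim
        self.w_f = nn.Linear(in_dim, out_dim)
        self.w_s = nn.Linear(in_dim, out_dim)

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor, e_ij: torch.Tensor) -> torch.Tensor:
        z = torch.cat([h_i, h_j, e_ij], dim=-1)          # concatenated node and edge features
        return torch.sigmoid(self.w_f(z) * F.softplus(self.w_s(z)))

msg = GatedMessage(node_dim=128, edge_dim=6, out_dim=128)
h_i, h_j, e_ij = torch.randn(1, 128), torch.randn(1, 128), torch.randn(1, 6)
print(msg(h_i, h_j, e_ij).shape)   # torch.Size([1, 128])
```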
After the graph convolution layers, we perform subsequent calculations on each node in turn and reconnect the nodes into a new panoramic vehicle connection network, which records the initial position and the offset of the final position.

3.4. Decoder

The Res_Decoder layer plays a pivotal role in the model as a decoder, tasked with decoding the constructed panoramic vectors and transforming the results from a local coordinate system to a global coordinate system. This process provides an accurate global reference for the connections and relative positions between vehicles, thereby optimizing the performance of the entire network. The Res_Decoder layer achieves this by processing the input panoramic vectors in multiple stages through Residual Prediction Modules, leveraging residual connections and feature transformations to ensure precise capture of inter-vehicle relationships and positional information.
  • Panoramic Vector Decoding and Multi-Stage Processing
In the workflow of the Res_Decoder layer, the input panoramic vectors are first processed in stages by multiple Residual Prediction Modules. Each module refines the decoding process through residual connections and feature transformations, ensuring that the output at each stage captures increasingly detailed feature information. This multi-stage design not only enhances the model’s representational capacity but also makes the decoding process more robust and efficient. The core of each Residual Prediction Module is the residual connection, expressed as:
$h_i = f(h_{i-1}) + h_{i-1}$
where:
$h_{i-1}$ is the input to the $i$-th module.
$f(h_{i-1})$ represents the feature transformation obtained through linear transformations, normalization, and activation functions.
$h_i$ is the output of the $i$-th module.
Each Residual Prediction Module includes a Group Normalization layer to stabilize and accelerate training. The normalization process is defined as:
$y = \dfrac{x - \mu}{\sigma} \cdot \gamma + \beta$
where:
$x$ is the input feature.
$\mu$ and $\sigma$ are the mean and standard deviation computed over groups.
$\gamma$ and $\beta$ are learnable scaling and shifting parameters.
$y$ is the normalized output.
  • Multi-Stage Processing
Assuming there are N Residual Prediction Modules, the decoding process can be described as:
$h_0 = x_{\mathrm{input}}$
$h_i = \mathrm{ResidualPredictionModule}(h_{i-1}), \quad i = 1, 2, \ldots, N$
where:
$h_0$: represents the initial input, which is the panoramic vector received by the Res_Decoder layer.
$h_i$: represents the output of the $i$-th Residual Prediction Module.
$N$: represents the total number of Residual Prediction Modules.
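To make the decoder stages concrete, the sketch below stacks N Residual Prediction Modules of the form h_i = f(h_{i-1}) + h_{i-1}, each with Group Normalization, as described above. The hidden size, number of groups, number of modules, output head, and layer ordering inside f are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ResidualPredictionModule(nn.Module):
    """One decoder stage: h_i = f(h_{i-1}) + h_{i-1}, with Group Normalization."""

    def __init__(self, dim: int = 128, groups: int = 8):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GroupNorm(groups, dim),   # y = ((x - mu) / sigma) * gamma + beta over groups
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.f(h) + h             # residual connection

class ResDecoder(nn.Module):
    """Multi-stage decoding: h_0 = x_input, h_i = ResidualPredictionModule(h_{i-1})."""

    def __init__(self, dim: int = 128, num_modules: int = 3, out_dim: int = 2 * 60):
        super().__init__()
        self.stages = nn.ModuleList([ResidualPredictionModule(dim) for _ in range(num_modules)])
        self.head = nn.Linear(dim, out_dim)   # maps features to 60 future (x, y) offsets

    def forward(self, x_input: torch.Tensor) -> torch.Tensor:
        h = x_input
        for stage in self.stages:
            h = stage(h)
        return self.head(h)

decoder = ResDecoder()
panoramic_vec = torch.randn(16, 128)          # 16 agents, 128-dim panoramic features
print(decoder(panoramic_vec).shape)           # torch.Size([16, 120])
```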
During the decoding process, the Res_Decoder layer not only extracts and decodes features but also maps the decoded results from the local coordinate system to the global coordinate system by incorporating agent-specific transformations (e.g., translation and rotation). This mapping allows the model's output to be directly aligned with the ground truth data in the original dataset, simplifying subsequent analysis and evaluation. By transforming the coordinates into a global reference frame, the Res_Decoder layer provides a unified framework for comparing and analyzing predictions directly in the global space. This design not only improves the model's accuracy but also enhances the overall performance of the network, particularly in multi-agent collaborative tasks, where it enables precise descriptions of inter-vehicle relative positions and connections.
Furthermore, the output shape of the Res_Decoder layer is adjusted to facilitate downstream processing. In short, although we pass the transformation information of each agent during the decoding process, the goal of the loss function is to directly produce end-to-end results. Therefore, after recalculating the global positions, the results are aligned with the original positional data in the dataset. Leveraging the computational power of neural networks, the model can be efficiently deployed on edge devices without the need for post hoc global coordinate fitting.
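The local-to-global mapping can be sketched as a per-agent rotation followed by a translation; argument names, shapes, and the exact form of the stored transformations are assumptions.

```python
import torch

def local_to_global(pred_local: torch.Tensor, origin: torch.Tensor, heading: torch.Tensor) -> torch.Tensor:
    """Map agent-centric predictions back to global coordinates.

    pred_local: (N, T, 2) predicted trajectories in each agent's local frame.
    origin:     (N, 2) global position of each agent's local origin.
    heading:    (N,) rotation angle of each local frame (radians).
    """
    cos, sin = torch.cos(heading), torch.sin(heading)
    # Per-agent 2D rotation matrices, shape (N, 2, 2).
    rot = torch.stack([torch.stack([cos, -sin], dim=-1),
                       torch.stack([sin, cos], dim=-1)], dim=-2)
    rotated = torch.einsum('nij,ntj->nti', rot, pred_local)   # rotate each future step
    return rotated + origin.unsqueeze(1)                      # then translate

pred = torch.randn(8, 60, 2)        # 8 agents, 60 future steps
print(local_to_global(pred, torch.zeros(8, 2), torch.zeros(8)).shape)  # torch.Size([8, 60, 2])
```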

4. Experimental Results and Analysis

In this section, Section 4.1 first explains the specific details of the experiment. Section 4.2 presents the comparison of the performance of this experiment with other models, including qualitative and quantitative evaluations. Finally, Section 4.3 conducts ablation experiments to study the criticality and effectiveness of each module.

4.1. Experimental Settings

4.1.1. Datasets

We evaluated the experimental results on the Argoverse 2.0 Motion Forecasting Dataset. Compared with Argoverse 1.0, Argoverse 2.0 adds more scenes and annotation data, adopts a richer HD map format, provides more detailed map information, and supports more complex autonomous driving tasks. The Argoverse 2.0 dataset contains 250k 11 s scenes with a sampling rate of 10 Hz. For the training set and validation set, the first five seconds of each scene are used as model input, and the remaining six seconds are used as the basis for model prediction. For the test set, only the first five seconds are provided [47].

4.1.2. Evaluation Indicators

  • MR (Miss Rate)
The Miss Rate (MR) is a metric employed to assess the accuracy of predicted trajectories. It is typically computed by determining whether the Final Displacement Error (FDE) between the predicted trajectory and the ground-truth trajectory exceeds a predefined threshold. The mathematical formula for MR is as follows:
$MR = \dfrac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left(FDE_{\mathrm{best}}^{(i)} > d\right)$
Here, $N$ represents the total number of samples.
$FDE_{\mathrm{best}}^{(i)}$ represents the final displacement error (FDE) of the best predicted trajectory for the $i$-th sample; the definition of "best" depends on whether k = 1 or k = 6, as described below.
$\mathbb{I}(\cdot)$ is the indicator function, which equals 1 if the condition inside the parentheses is true and 0 otherwise.
$d$ represents the threshold for determining a miss (Argoverse 2.0 defaults to $d$ = 2.0 m).
MR (k = 1) indicates that only the top predicted trajectory is considered. If the deviation between the predicted trajectory and the ground truth exceeds the threshold, it is classified as a miss.
MR (k = 6) denotes that the optimal trajectory is selected from six predicted trajectories. If the FDEs of all six trajectories exceed the threshold, it is classified as a miss.
  • Brier-minFDE (k = 6)
The Argoverse 2.0 [24] motion prediction challenge adopts brier-minFDE (k = 6) as its evaluation metric. Brier-minFDE is computed by adding the squared difference (1.0 − p)² to the endpoint L2 distance, where p is the probability associated with the best predicted trajectory. This metric is designed to assess the quality of probabilistic predictions and is widely used to evaluate the reliability of multimodal trajectory prediction models, as it integrates both the calibration of the predicted probabilities and the trajectory error (FDE). The mathematical formula used for brier-minFDE (k = 6) is as follows:
$\mathrm{Brier\text{-}minFDE}(k{=}6) = \dfrac{1}{N} \sum_{i=1}^{N} \left( \sum_{k=1}^{6} p_k^i \cdot \left(FDE_k^i\right)^2 \right)$
Here, $N$ represents the total number of samples.
$p_k^i$ denotes the normalized probability of the $k$-th trajectory for the $i$-th sample.
$FDE_k^i$ is the Final Displacement Error (FDE) of the $k$-th trajectory for the $i$-th sample, computed as the Euclidean distance between the predicted endpoint and the ground-truth endpoint (unit: meters).
The squared term $(FDE_k^i)^2$ penalizes large errors more severely, encouraging the model to reduce significant deviations.
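For clarity, the following sketch computes both metrics as defined above: MR with threshold d, and a direct transcription of the brier-minFDE (k = 6) formula as given in the equation. Array shapes and function names are assumptions.

```python
import numpy as np

def miss_rate(pred_endpoints: np.ndarray, gt_endpoints: np.ndarray, d: float = 2.0) -> float:
    """MR: fraction of samples whose best-of-K endpoint error exceeds d metres.

    pred_endpoints: (N, K, 2) predicted endpoints (K = 1 or 6); gt_endpoints: (N, 2).
    """
    fde = np.linalg.norm(pred_endpoints - gt_endpoints[:, None, :], axis=-1)  # (N, K)
    return float((fde.min(axis=1) > d).mean())

def brier_min_fde_k6(probs: np.ndarray, pred_endpoints: np.ndarray, gt_endpoints: np.ndarray) -> float:
    """Direct transcription of the brier-minFDE (k = 6) formula given above.

    probs: (N, 6) normalized trajectory probabilities.
    """
    fde = np.linalg.norm(pred_endpoints - gt_endpoints[:, None, :], axis=-1)  # (N, 6)
    per_sample = (probs * fde ** 2).sum(axis=1)                               # sum_k p_k * FDE_k^2
    return float(per_sample.mean())

# Toy example with 100 samples and 6 hypotheses per sample.
preds = np.random.randn(100, 6, 2) * 3.0
gts = np.zeros((100, 2))
probs = np.full((100, 6), 1 / 6)
print(miss_rate(preds, gts), brier_min_fde_k6(probs, preds, gts))
```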

4.1.3. Experimental Details

The proposed model is implemented in the PyTorch 2.5.1 library and trained on an NVIDIA RTX 3080 GPU with 100 GB memory (NVIDIA, Santa Clara, CA, USA). MTGNet uses the training set of the Argoverse 2.0 Motion Forecasting Dataset [24] for training and employs the Adam algorithm [47] as the optimizer with a weight decay of 0.01 to optimize the network parameters. The Transformer encoder in the model is configured with four attention heads and a 128-dimensional latent feature space to encode contextual dependencies. The model is trained for 100 epochs. The learning rate is adjusted step-by-step: starting at 1 × 10−3, it is reduced to 1 × 10−4 at the 10th epoch, increased back to 1 × 10−3 at the 20th epoch, and then reduced to 1 × 10−4 again at the 68th epoch. Finally, the model generates 60 future prediction steps to achieve long-term trajectory prediction.
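A sketch of this training configuration in PyTorch, assuming the stepwise schedule is realized with LambdaLR; the placeholder model and training loop are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 2)                                   # placeholder for MTGNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)

def lr_multiplier(epoch: int) -> float:
    """Piecewise schedule: 1e-3, then 1e-4 at epoch 10, 1e-3 at epoch 20, 1e-4 at epoch 68."""
    if epoch < 10 or (20 <= epoch < 68):
        return 1.0      # learning rate 1e-3
    return 0.1          # learning rate 1e-4

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_multiplier)

for epoch in range(100):
    # ... one training epoch over the Argoverse 2.0 training set would go here ...
    scheduler.step()
```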
Our pre-training epoch is set to 10 and is conducted on the Train dataset.
The training progress over 10 epochs is illustrated in Figure 5. Both the pre-training loss and training loss exhibit a consistent downward trend, indicating that the model is effectively learning [48] and improving its performance. Initially, the pre-training loss is significantly higher than the training loss, reflecting the model's need for optimization at the early stages. As training progresses, the pre-training loss decreases steadily, with the most notable reduction occurring in the first few epochs. Similarly, the training loss shows a gradual decline, reaching a much lower value by the final epoch. This consistent reduction in both pre-training and training losses demonstrates the model's ability to converge effectively [48], suggesting that the training process successfully minimizes the loss function and enhances the model's performance.

4.2. Performance Comparison

4.2.1. Quantitative Evaluation

To verify the effectiveness of the model, we compare MTGNet with nine trajectory prediction models. As shown in Table 1, MTGNet achieves the best performance in three evaluation metrics on the Argoverse 2.0 dataset, outperforming nine state-of-the-art multimodal trajectory prediction methods.

4.2.2. Qualitative Evaluation

As shown in Table 1 and Figure 6, our method outperforms nine state-of-the-art trajectory prediction methods and achieves the best performance in three evaluation metrics. Lower values of MR (k = 6), MR (k = 1), and brier-minFDE (k = 6) indicate better performance. THOMAS [2] uses a CNN-LSTM architecture, where LSTM processes time series data and convolutional social pooling captures the interdependencies of all vehicle movements in the scene. However, in complex traffic scenarios, using grids to model lanes is clearly unreasonable, and the limitations of LSTM lead to unsatisfactory results. The query-centric encoding paradigm of QCNet [19] establishes a local spacetime coordinate system for each scene element, enabling computation reuse and improving inference efficiency. However, it may insufficiently model global spacetime dependencies in complex interaction scenarios. Although its factorized attention mechanism reduces computational complexity from $O(AT^2)$ to $O(AT)$, significantly minimizing redundant calculations, it may also lose deep interaction information across time steps and agents, thereby affecting the model's ability to capture long-term dependencies. For example, in scenarios with dense traffic flow or highly interactive agent trajectories, this local encoding strategy may fail to fully integrate dynamic correlations across the global scope, limiting prediction accuracy. In contrast, MTGNet uses a Transformer to process time series data and reduce noise, and employs a GCN instead of a CNN to process graph-structured data in high-interaction scenarios, enabling it to better capture the interactions between agents. MTGNet (ours) matches the best value on the MR (k = 1) metric and achieves the best results on the MR (k = 6) and brier-minFDE (k = 6) metrics.
As shown in Table 2, MTGNet (ours) outperforms HiVT [4] in both prediction accuracy and parameter efficiency. On the brier-minFDE (k = 6) metric, MTGNet achieves 1.94, surpassing HiVT's 2.01, which highlights its superior ability to capture critical interaction features. In terms of model size, MTGNet employs 2148 K parameters, approximately 16% fewer than HiVT's 2546 K, demonstrating a more compact architecture. In terms of computational complexity, HiVT reduces multi-agent interaction complexity to $O(NT^2 + TNK + NL)$ [4] via spatiotemporal decomposition and local region constraints, where $N$ is the number of agents, $T$ is the number of time steps, $K$ ($K < N$) is the restricted neighbor interaction range, and $L$ ($L < T$) is the limited local time-step range. However, its fixed-scenario design incurs redundant computations within predefined boundaries. By contrast, MTGNet adopts a dynamic strategy: first, a Transformer encoder denoises the input data to filter out irrelevant driving-behavior noise, ensuring that subsequent interaction modeling focuses on critical dynamic features; it then narrows down to the subset of actually interacting agents (number $n$, $n < N$) and subgraph edges ($e$), achieving a complexity of $O(nD + eD)$ (where $D$ is the feature dimension) and avoiding indiscriminate global calculations. Trained on a single NVIDIA RTX 3080 GPU for 128 epochs, MTGNet requires 18.4 min per epoch, significantly faster than HiVT's 23.1 min per epoch, an efficiency gain attributed to its dynamic interaction modeling that minimizes redundant computations. Despite the faster training speed, MTGNet maintains superior prediction accuracy on the Argoverse 2.0 validation set, showcasing a balance of efficiency and performance through its denoising preprocessing and adaptive complexity design.

4.3. Ablation Test

For the different modules in our proposed model, we conducted ablation experiments to evaluate their effectiveness.
To further validate the effectiveness of each component in MTGNet, an ablation study is conducted, as shown in Table 3. When only employing the Transformer encoder and fully-connected layers (neither “Dynamic partition scene” nor “Fully connected → GCN” is adopted), the brier-minFDE (k = 6) value is 2.24. This baseline result reflects the performance of a conventional structure without these two optimizations. When “Dynamic partition scene” is adopted alone, the value drops to 2.15. This demonstrates that dynamic scene partitioning enables the model to focus on relevant scene elements, effectively filtering out irrelevant information and improving performance. When “Fully connected → GCN” is adopted alone, the value is 2.20. This indicates that replacing fully-connected layers with GCN introduces a more structured interaction modeling, yet without dynamic scene partitioning, its performance improvement is restricted due to handling unnecessary global interactions. Notably, when both “Dynamic partition scene” and “Fully connected → GCN” are adopted, the brier-minFDE (k = 6) reaches the lowest value of 1.99. This outcome highlights that the synergy of dynamic scene partitioning and GCN-based interaction modeling significantly enhances the model’s capability to capture critical features and minimize prediction errors, and verifies the indispensable role of these two components in optimizing the model’s prediction accuracy.

5. Conclusions

This paper presents the Multimodal Transformer Graph Convolutional Neural Network (MTGNet), a novel framework that integrates a Transformer and a Graph Convolutional Network (GCN) to address multi-agent motion prediction in complex traffic scenarios. MTGNet captures long-term temporal dependencies and spatial interactions through a dynamic scene partitioning method based on panoramic fully connected agent traffic graphs, a vectorization strategy that reduces computational overhead, and a Transformer-based encoder for noise filtering. Taking agents' historical trajectories, scene maps, and auxiliary information as input, MTGNet achieves state-of-the-art performance on the Argoverse 2.0 dataset.
The proposed framework demonstrates substantial practical value in autonomous driving systems, where it enables safe and efficient maneuver planning by accurately predicting surrounding agents’ motions; in intelligent traffic management, where it optimizes traffic flow dynamics to alleviate congestion; and in advanced traffic surveillance, where it supports precise trajectory tracking for accident prevention and violation detection. Through its hybrid architecture combining Transformer’s sequential modeling with GCN’s spatial reasoning, dynamic scene adaptation, and noise-robust feature extraction, MTGNet not only outperforms existing baselines across key metrics but also establishes a robust baseline for future research. It offers generalizable solutions for integrating heterogeneous data (trajectories, scene geometry, agent attributes) in spatio-temporal prediction tasks, inspiring innovations in intelligent transportation systems and multi-agent motion analysis.

Author Contributions

Y.D. conceived and designed the study, conducted a comprehensive literature review on multi-agent systems, developed the deep learning model, and drafted the initial manuscript. Y.Z. contributed to the study’s conception, was responsible for the selection and analysis of relevant datasets, and participated in manuscript writing. X.Z. handled data processing and optimization, as well as model evaluation and result analysis. X.S. reviewed and edited the manuscript, offering critical feedback on the methodology and interpretation of results. S.W. supervised the research and provided guidance on the development of the deep learning framework. Q.W. managed the project and secured the necessary resources. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Industrial Key Core Technology Research Project of the Jilin Provincial Department of Science and Technology (Grant No. 20220201154GX).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were used in this study. These data can be found here: https://www.argoverse.org/av2.html#download-link (accessed on 30 August 2024).

Acknowledgments

We would like to express our deepest gratitude to all those who have contributed to the completion of this research and the writing of this paper.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
MTGNet	Multimodal Transformer Graph Convolutional Neural Network
GCN	Graph Convolutional Neural Network
CNN	Convolutional Neural Network
LSTM	Long Short-Term Memory network
RNN	Recurrent Neural Network
Bi-RNN	Bidirectional Recurrent Neural Network
NLP	Natural Language Processing
HiVT	Hierarchical Vector Transformer
GRU	Gated Recurrent Unit
MR	Miss Rate

References

  1. Nayakanti, N.; Al-Rfou, R.; Zhou, A.; Goel, K.; Refaat, K.S.; Sapp, B. Wayformer: Motion forecasting via simple & efficient attention networks. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
  2. Gilles, T.; Sabatini, S.; Tsishkou, D.; Stanciulescu, B.; Moutarde, F. Thomas: Trajectory heatmap output with learned multi-agent sampling. arXiv 2021, arXiv:2110.06607. [Google Scholar]
  3. Wei, C.; Hui, F.; Zhao, X.; Fang, S. Real-time Simulation and Testing of a Neural Network-based Autonomous Vehicle Trajectory Prediction Model. In Proceedings of the 2022 18th International Conference on Mobility, Sensing and Networking (MSN), Guangzhou, China, 14–16 December 2022; IEEE: Piscataway, NJ, USA, 2022. [Google Scholar]
  4. Zhou, Z.; Ye, L.; Wang, J.; Wu, K.; Lu, K. Hivt: Hierarchical vector transformer for multi-agent motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  5. Lin, C.-F.; Ulsoy, A.G.; LeBlanc, D.J. Vehicle dynamics and external disturbance estimation for vehicle path prediction. IEEE Trans. Control Syst. Technol. 2000, 8, 508–518. [Google Scholar]
  6. Kaempchen, N.; Schiele, B.; Dietmayer, K. Situation assessment of an autonomous emergency brake for arbitrary vehicle-to-vehicle collision scenarios. IEEE Trans. Intell. Transp. Syst. 2009, 10, 678–687. [Google Scholar] [CrossRef]
  7. Mandalia, H.M.; Salvucci, M.D.D. Using support vector machines for lane-change detection. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Orlando, FL, USA, 26–30 September 2005; SAGE Publications: Los Angeles, CA, USA, 2005. [Google Scholar]
  8. Berndt, H.; Emmert, J.; Dietmayer, K. Continuous driver intention recognition with hidden markov models. In Proceedings of the 2008 11th International IEEE Conference on Intelligent Transportation Systems, Beijing, China, 12–15 October 2008; IEEE: Piscataway, NJ, USA, 2008. [Google Scholar]
  9. Huang, Y.; Du, J.; Yang, Z.; Zhou, Z.; Zhang, L.; Chen, H. A survey on trajectory-prediction methods for autonomous driving. IEEE Trans. Intell. Veh. 2022, 7, 652–674. [Google Scholar] [CrossRef]
  10. Deo, N.; Trivedi, M.M. Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  11. Chai, Y.; Sapp, B.; Bansal, M.; Anguelov, D. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv 2019, arXiv:1910.05449. [Google Scholar]
  12. Kim, B.; Kang, C.M.; Kim, J.; Lee, S.H.; Chung, C.C.; Choi, J.W. Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; IEEE: Piscataway, NJ, USA, 2017. [Google Scholar]
  13. Hong, J.; Sapp, B.; Philbin, J. Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  14. Cui, H.; Radosavljevic, V.; Chou, F.-C.; Lin, T.-H.; Nguyen, T.; Huang, T.-K.; Schneider, J.; Djuric, N. Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
  15. Zamboni, S.; Kefato, Z.T.; Girdzijauskas, S.; Norén, C.; Dal Col, L. Pedestrian trajectory prediction with convolutional neural networks. Pattern Recognit. 2022, 121, 108252. [Google Scholar] [CrossRef]
  16. Sadid, H.; Antoniou, C. Dynamic Spatio-temporal Graph Neural Network for Surrounding-aware Trajectory Prediction of Autonomous Vehicles. IEEE Trans. Intell. Veh. 2024, early access. [Google Scholar] [CrossRef]
  17. Liang, M.; Yang, B.; Hu, R.; Chen, Y.; Liao, R.; Feng, S.; Urtasun, R. Learning lane graph representations for motion forecasting. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings Part II 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  18. Gao, J.; Sun, C.; Zhao, H.; Shen, Y.; Anguelov, D.; Li, C.; Schmid, C. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  19. Zhou, Z.; Wang, J.; Li, Y.-H.; Huang, Y.-K. Query-centric trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  20. Ridel, D.; Deo, N.; Wolf, D.; Trivedi, M. Scene compliant trajectory forecast with agent-centric spatio-temporal grids. IEEE Robot. Autom. Lett. 2020, 5, 2816–2823. [Google Scholar] [CrossRef]
  21. Mo, X.; Huang, Z.; Xing, Y.; Lv, C. Multi-agent trajectory prediction with heterogeneous edge-enhanced graph attention network. IEEE Trans. Intell. Transp. Syst. 2022, 23, 9554–9567. [Google Scholar] [CrossRef]
  22. Su, Y.; Du, J.; Li, Y.; Li, X.; Liang, R.; Hua, Z.; Zhou, J. Trajectory forecasting based on prior-aware directed graph convolutional neural network. IEEE Trans. Intell. Transp. Syst. 2022, 23, 16773–16785. [Google Scholar] [CrossRef]
  23. Li, H.; Ren, Y.; Li, K.; Chao, W. Trajectory Prediction with Attention-Based Spatial–Temporal Graph Convolutional Networks for Autonomous Driving. Appl. Sci. 2023, 13, 12580. [Google Scholar] [CrossRef]
  24. Wilson, B.; Qi, W.; Agarwal, T.; Lambert, J.; Singh, J.; Khandelwal, S.; Pan, B.; Kumar, R.; Hartnett, A.; Pontes, J.K. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv 2023, arXiv:2301.00493. [Google Scholar]
  25. Joseph, J.; Doshi-Velez, F.; Huang, A.S.; Roy, N. A Bayesian nonparametric approach to modeling motion patterns. Auton. Robot. 2011, 31, 383–400. [Google Scholar] [CrossRef]
  26. Lee, N.; Choi, W.; Vernaza, P.; Choy, C.B.; Torr, P.H.; Chandraker, M. Desire: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  27. Ma, X.; Tao, Z.; Wang, Y.; Yu, H.; Wang, Y. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transp. Res. Part C Emerg. Technol. 2015, 54, 187–197. [Google Scholar] [CrossRef]
  28. Xin, L.; Wang, P.; Chan, C.-Y.; Chen, J.; Li, S.E.; Cheng, B. Intention-aware long horizon trajectory prediction of surrounding vehicles using dual LSTM networks. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; IEEE: Piscataway, NJ, USA, 2018. [Google Scholar]
  29. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  30. Hou, L.; Xin, L.; Li, S.E.; Cheng, B.; Wang, W. Interactive trajectory prediction of surrounding road users for autonomous driving using structural-LSTM network. IEEE Trans. Intell. Transp. Syst. 2019, 21, 4615–4625. [Google Scholar] [CrossRef]
  31. Salzmann, T.; Ivanovic, B.; Chakravarty, P.; Pavone, M. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings Part XVIII 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  33. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  34. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  35. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  36. Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse detr: Efficient end-to-end object detection with learnable sparsity. arXiv 2021, arXiv:2111.14330. [Google Scholar]
  37. Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling. Adv. Neural Inf. Process. Syst. 2021, 34, 15084–15097. [Google Scholar]
  38. Jumper, J.; Evans, R.; Pritzel, A.; Green, T.; Figurnov, M.; Ronneberger, O.; Tunyasuvunakool, K.; Bates, R.; Žídek, A.; Potapenko, A. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589. [Google Scholar] [CrossRef] [PubMed]
  39. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in time series: A survey. arXiv 2022, arXiv:2202.07125. [Google Scholar]
  40. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  41. Guo, C.; Fan, S.; Chen, C.; Zhao, W.; Wang, J.; Zhang, Y.; Chen, Y. Query-Informed Multi-Agent Motion Prediction. Sensors 2024, 24, 9. [Google Scholar] [CrossRef] [PubMed]
  42. Bharilya, V.; Kumar, N. Machine learning for autonomous vehicle’s trajectory prediction: A comprehensive survey, challenges, and future research directions. Veh. Commun. 2024, 46, 100733. [Google Scholar] [CrossRef]
  43. Min, H.; Xiong, X.; Wang, P.; Zhang, Z. A Hierarchical LSTM-Based Vehicle Trajectory Prediction Method Considering Interaction Information. Automot. Innov. 2024, 7, 71–81. [Google Scholar] [CrossRef]
  44. Hu, Z.; Brilakis, I. Matching design-intent planar, curved, and linear structural instances in point clouds. Autom. Constr. 2024, 158, 105219. [Google Scholar] [CrossRef]
  45. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010. [Google Scholar]
  46. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  47. Wang, Y.; Zhou, H.; Zhang, Z.; Feng, C.; Lin, H.; Gao, C.; Tang, Y.; Zhao, Z.; Zhang, S.; Guo, J. Tenet: Transformer encoding network for effective temporal flow on motion prediction. arXiv 2022, arXiv:2207.00170. [Google Scholar]
  48. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1. [Google Scholar]
  49. Cui, A.; Casas, S.; Wong, K.; Suo, S.; Urtasun, R. Gorela: Go relative for viewpoint-invariant motion forecasting. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
  50. Zhang, C.; Sun, H.; Chen, C.; Guo, Y. Banet: Motion forecasting with boundary aware network. arXiv 2022, arXiv:2206.07934. [Google Scholar]
Figure 1. Illustration of dynamic interaction.
Figure 2. Multimodal Transformer Graph Convolution Neural Network (MTGNet).
Figure 3. (a) Structure of Motion Trajectory Encoder; (b) structure of Transformer_Layer.
Figure 4. (a) Structure of Global Interaction Encoder; (b) structure of CGConv module.
Figure 5. Training loss during pre-training.
Figure 6. Results of MTGNet and nine state-of-the-art models on the Argoverse 2.0 validation set for three widely used evaluation metrics (MR (k = 6) ↓, MR (k = 1) ↓, Brier-minFDE (k = 6) ↓). “↓” indicates that smaller values are better [2,46,47,48].
Table 1. Results for three widely used evaluation metrics (MR (k = 6) ↓, MR (k = 1) ↓, Brier-minFDE (k = 6) ↓) on the Argoverse 2.0 test set, where k is the number of predicted trajectories used to compute each metric. “↓” indicates that smaller values are better; the best results are in bold.
Method & Rank        MR (k = 6) ↓    MR (k = 1) ↓    Brier-minFDE (k = 6) ↓
MTGNet (ours)        0.17            0.60            1.99
VI LaneIter          0.19            0.60            2.00
GoRela [49]          0.22            0.66            2.01
OPPred [50]          0.18            0.61            2.03
QCNet [19]           0.21            0.60            2.14
THOMAS [2]           0.20            0.64            2.16
Autowise.AI (GNA)    0.29            0.71            2.45
vilab                0.29            0.71            2.47
LGU                  0.37            0.73            2.77
drivingfree          0.49            0.72            3.03
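For reference, the metrics in Table 1 follow the standard Argoverse conventions: minFDE is the smallest final-displacement error among the k predicted trajectories, MR is the fraction of scenarios whose minFDE exceeds a fixed miss threshold, and Brier-minFDE adds the penalty (1 − p)² for the probability p assigned to the best hypothesis. The sketch below is not the authors’ evaluation code; the array layout, function name, and the 2.0 m miss threshold are assumptions based on the public Argoverse 2.0 benchmark.

```python
import numpy as np

def argoverse_metrics(pred, probs, gt, miss_threshold=2.0):
    """Minimal sketch of Argoverse-style metrics (assumed shapes).

    pred:  (N, k, T, 2) k predicted trajectories per scenario
    probs: (N, k)       predicted probability of each trajectory
    gt:    (N, T, 2)    ground-truth future trajectory
    """
    # Final displacement error of every hypothesis, shape (N, k).
    fde = np.linalg.norm(pred[:, :, -1, :] - gt[:, None, -1, :], axis=-1)

    best = fde.argmin(axis=1)                      # index of best hypothesis per scenario
    min_fde = fde[np.arange(len(fde)), best]       # minFDE over the k hypotheses
    miss_rate = (min_fde > miss_threshold).mean()  # MR: share of scenarios missing by > 2 m

    # Brier-minFDE: penalize low confidence in the best hypothesis.
    p_best = probs[np.arange(len(probs)), best]
    brier_min_fde = (min_fde + (1.0 - p_best) ** 2).mean()

    return {"minFDE": min_fde.mean(), "MR": miss_rate, "Brier-minFDE": brier_min_fde}

# Toy usage: 2 scenarios, k = 6 hypotheses, 60 future time steps.
rng = np.random.default_rng(0)
pred = rng.normal(size=(2, 6, 60, 2))
probs = np.full((2, 6), 1 / 6)
gt = rng.normal(size=(2, 60, 2))
print(argoverse_metrics(pred, probs, gt))
```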
Table 2. Comparison of prediction accuracy (Brier-minFDE (k = 6) ↓) and parameter count between MTGNet and HiVT (fully connected) on the Argoverse 2.0 validation set. The best results are in bold.
Method           Brier-minFDE (k = 6) ↓    #Param
MTGNet (ours)    1.94                      2148 K
HiVT [4]         2.01                      2546 K
Table 3. Quantitative evaluation of the ablation experiments. k is the number of predicted trajectories used to compute the evaluation metric. “↓” indicates that smaller values are better.
Method    Dynamic Partition Scene    Fully Connected → GCN    Brier-minFDE (k = 6) ↓
MTGNet                                                        2.24
                                                              2.15
                                                              2.20
                                                              1.99
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
