Goal-Guided Graph Attention Network with Interactive State Refinement for Multi-Agent Trajectory Prediction

The accurate prediction of the future trajectories of traffic participants is crucial for enhancing the safety and decision-making capabilities of autonomous vehicles. Modeling social interactions among agents and revealing the inherent relationships is crucial for accurate trajectory prediction. In this context, we propose a goal-guided and interaction-aware state refinement graph attention network (SRGAT) for multi-agent trajectory prediction. This model effectively integrates high-precision map data and dynamic traffic states and captures long-term temporal dependencies through the Transformer network. Based on these dependencies, it generates multiple potential goals and Points of Interest (POIs). Through its dual-branch, multimodal prediction approach, the model not only proposes various plausible future trajectories associated with these POIs, but also rigorously assesses the confidence levels of each trajectory. This goal-oriented strategy enables SRGAT to accurately predict the future movement trajectories of other vehicles in complex traffic scenarios. Tested on the Argoverse and nuScenes datasets, SRGAT surpasses existing algorithms in key performance metrics by adeptly integrating past trajectories and current context. This goal-guided approach not only enhances long-term prediction accuracy, but also ensures its reliability, demonstrating a significant advancement in trajectory forecasting.


Introduction
Trajectory prediction stands as one of the most challenging aspects of autonomous driving, requiring the model to accurately predict the trajectories of traffic participants (e.g., vehicles, pedestrians, cyclists, etc.) surrounding the autonomous vehicle [1].This process involves numerous potential variables.The emergence of high-definition maps (HD maps) and datasets from sensors has propelled research in this field [2,3].Combining map information and sensor data has become an effective strategy to improve prediction accuracy, albeit at the cost of increasing the complexity of the prediction process [4].Effectively leveraging this information poses a central challenge in the field of trajectory prediction.
To achieve rapid and accurate trajectory prediction, motion models based on vehicle physics characteristics have been widely adopted [5,6].These methods primarily rely on the vehicle's past motion states (position, velocity, acceleration, etc.) and employ filtering and optimization techniques to predict future maneuver strategies, such as Kalman filters (KFs), Dynamic Bayesian Networks (DBNs), and Hidden Markov Models (HMMs) [7][8][9].In comparison to models based on vehicle kinematics, those based on deep learning are increasingly gaining popularity [10,11].They excel in extracting hidden dependencies from contextual time steps, and are particularly adept at capturing long-term features in trajectories.Alahi et al. [12] pioneered the use of Social-LSTM, incorporating a social pooling layer to explore interactions among pedestrians.Inspired by this research, many researchers have started deploying similar model architectures to model social interactions among agents.Nachiket et al. [13] introduced convolutional operations in the social pooling layer, achieving promising experimental results in predicting vehicle trajectories.Sheng et al. [14] proposed a Graph-based Spatiotemporal Convolutional Network (GSTCN) that utilizes graph convolutional networks to handle spatial interactions.Given the breakthroughs of Transformers [15] in natural language processing and their wide-spread adoption for predicting agent behavior due to their long-term predictive capabilities [16,17], they efficiently address the memory problem of handling long sequences with attention mechanisms.The model can directly associate the entirety of input data sequences and context vectors, rather than being limited to the association with the last hidden states [18].Syed et al. [19] introduced the Spatiotemporal Graph Transformer (STGT) model, which uses CNN models for processing environmental image features and employs Transformers for sequence prediction.Mercat et al. [20], by introducing self-attention mechanisms considering interactions between vehicles, successfully achieved trajectory prediction for multiple vehicle agents.Roger et al. [21] proposed the AutoBots model, which, through the use of social multi-head self-attention (MHSA) modules, efficiently performs single-pass forward inference for the entire future scene, demonstrating high performance in handling complex traffic scenarios with multi-agent interactions.The adoption of deep learning models like Transformers and MHSA modules has significantly advanced multi-agent trajectory prediction in complex traffic scenarios.However, challenges such as the need for real-time adaptability, integrating dynamic environmental conditions, and reducing model complexity without sacrificing accuracy, remain areas ripe for future research.
With the advent of large-scale datasets, the introduction of high-definition maps (HD maps) and sensor data has brought about breakthroughs in the field of trajectory prediction [22,23].In the past, predicting vehicle trajectories primarily relied on the physical properties of vehicles, such as historical trajectories, speed, acceleration, and the relative distances to surrounding vehicles.While these methods perform well in simple traffic environments, the trajectories of vehicles in complex road scenarios are influenced not only by surrounding vehicles (SVs), but also by lane guidance and spatial constraints imposed by road boundaries.Therefore, the inclusion of HD maps around the vehicle and its sensor data in the scope of trajectory prediction has become increasingly necessary.
To enable neural networks to handle HD maps, some studies rasterize map data and then apply convolutional neural networks (CNNs) to extract features from it [24,25].Casas et al. [26] utilized a CNN detector to extract features from rasterized maps.Hong et al. [27] employed high-precision 3D perception and detailed semantic environmental maps.They encoded semantic information through spatial grid encoding and used deep convolutional models to integrate complex scene contexts.Rasterizing map data aims to obtain long-range information along the lane direction, requiring a relatively large perception field, which may lead to significant computational resource waste [28].Additionally, rasterization processing may result in information loss.Simply inputting map data into the model might not effectively capture complex information such as road structures, traffic signs, signals, etc.Therefore, there is a need for a deeper integration of vehicle trajectories and maps to address these challenges.
Another option is to use vectorized map features.VectorNet [29] vectorizes HD maps and agents' trajectories, employing graph neural networks to depict interactions between traffic participants and road environment, as well as interactions among the participants themselves.It can effectively capture complex interactions between traffic participants and structural information in the road environment, avoiding information loss introduced by rasterization, thus providing a more comprehensive map representation.MTR++ [30] uses a local self-attention mechanism to capture essential local structural information in vectorized road maps.This enables the system to more accurately understand and predict the future movements of multiple agents in complex traffic environments.Liang et al. proposed LaneGCN [31], which constructs a map node graph and uses a multi-step graph neural network to encode the map, considering the road's topological structure.This approach clarifies the interactions among traffic participants and more accurately represents their connection to map structures.However, due to the diverse ways in which actors move, fixed-size strides cannot effectively model distant-related map features, thereby limiting predictive performance.
To address the challenges mentioned above, this paper introduces an innovative trajectory prediction framework, which we refer to as "SRGAT" (a goal-guided and interactionaware state refinement graph attention network), building upon the LaneGCN [31] baseline proposed by Liang et al.This framework integrates high-definition maps and vehicle dynamics, employing lane graph convolution operators to capture complex traffic scenarios.The social encoding component combines 1D CNN with FPN to extract interactions between vehicles, utilizing a multi-head self-attention mechanism to further understand social relationships.The model estimates potential target points for vehicles and refines predictions by incorporating deep contextual information, enhancing prediction accuracy.The decoder utilizes a recursive feed-forward network and multi-head attention decoder layers to iteratively predict multimodal future trajectories.Each trajectory is based on a learnable seed parameter matrix and comes with an associated confidence score, allowing the model to consider probability distributions.By integrating high-definition maps and agent state information, dynamic interaction features from the social encoder, as well as advanced target estimation and trajectory generation strategies, our model ensures high-precision trajectory prediction in complex traffic situations.
The contributions of this paper can be summarized as follows: 1.
Leveraging prior research, our study introduces SRGAT, a cutting-edge trajectory prediction framework that innovatively merges HD maps with dynamic vehicle data.This method not only addresses the challenge of fixed stride by adapting to vehicle context and environmental objectives, but also comprehensively evaluates the road environment's influence on trajectory forecasts.2.
Our model significantly boosts the accuracy of trajectory predictions in intricate traffic scenarios by exploiting HD maps' spatial constraints and vehicles' dynamic states, effectively addressing the challenge of dynamic goal estimation.

3.
By introducing a dual-branch multimodal prediction architecture that generates multiple potential future trajectories and assigns a confidence score to each, the model's accuracy and diversity in trajectory prediction in complex traffic situations are significantly enhanced.It increases both the accuracy and variety of the predictions.4.
We conducted evaluations of the proposed model on both Argoverse and nuScenes datasets and engaged in a detailed comparison with the current state-of-the-art methods.The results demonstrate that our model exhibits substantial performance improvements over these methods across a range of critical performance metrics.
The remainder of this paper is organized as follows: Section 2 defines the problem of trajectory prediction.Section 3 details the data processing methods.Section 4 describes the network structure of the algorithm and presents the training details.Section 5 discusses the experimental setup, results, and comparisons with existing methods.Finally, conclusions are drawn in Section 6.

Problem Formulation
We assume that by observing traffic participants and their environment, it is possible to capture precise historical motion paths and high-resolution map data in a two-dimensional coordinate system.Specifically, for the ith agent in all n traffic participants, we can collect a series of state observations S i obs = [s i −T+1 , s i −T+2 , . . ., s i 0 ], i ∈ [0, n − 1] within a certain time frame {−T + 1, −T + 2, . . ., 0}, where s i t = (x i t , y i t , v i x,t , v i y,t , a i , o i t ), consisting of the x, y coordinate, velocity, agent type and orientation in a global Cartesian coordinate as features at time step t.A corresponding high-definition map m was added to establish a complete scene.The scene information set m = {Y, A} can be divided into a lane node feature matrix Y and an adjacency matrix set {A i } i∈{pre,suc,left,right} .The node matrix Y = (x j , y j , heading j , turn j , tra f j , intersect j ), j ∈ N node represents the lane geometry feature and adjacency matrix represents the topology connections between different nodes.The meaning of each matrix in m will be further explained in Section 3. The goal of this research is to predict the agent's future states S out = [s 1 , s 2 , . . ., s t ].With the current map, past states of the agent, and the states of other agents known, we aim to define the probability distribution of the agent's future states, p(S out |m, S obs , S O obs ), where S O obs indicates the observed states of other agents and S ego obs stands for ego agents when i = 0. Our model offers K modes of potential future trajectory sets for each agent.

Data Preprocessing
In the process of map information preprocessing, we first transform the map metadata from the Argoverse dataset into a vectorized map data representation.This approach primarily represents the map data as a graph structure, aiming to characterize a set of lanes and their connectivity.The course of the roads is represented by the two-dimensional coordinates of discretized road centerlines.To better utilize the relationship between the road and the ego vehicle, we adopt a 2D Cartesian coordinate system with the ego vehicle at the origin and the forward direction as the x-axis.The two-dimensional coordinates of the centerlines serve as the nodes of the graph, forming a series of 2D bird's eye viewpoints arranged according to lane direction.For any two directly accessible lanes, we define four types of connectivity-predecessor, successor, left neighbor, and right neighbor-and encode these connections as edge information in the graph, as shown in Figure 1.Specifically, lanes that can be directly accessed are considered neighboring predecessors and successors, while left and right neighbors are defined as the spatially nearest lane nodes on adjacent lanes to the left and right, measured by Euclidean distance.We represent lane nodes with V ∈ R N×2 (can be seen as part of Y), where N node is the number of lane nodes, and the ith row of V represents the BEV coordinates of the ith node.We use four adjacency matrices {A i } i∈{pre,suc,left,right} to represent the connectivity, where A i ∈ R N×N .The element in the jth row and kth column of In the preprocessing of trajectory information, participants located within a specific Euclidean distance from the ego vehicle (which is set to 10 m in our experiments) are exclusively considered to reduce the algorithm's complexity.To better leverage the relationship between participants and the road (especially the road centerlines) in the environment, all participant trajectory position information need to be represented in terms of relative position to nearby road centerlines.The input for participant n is a series of relative displacements: , where v t n represents the state vector, taking into account the participant's type (pedestrian, vehicle, cyclist).This representation method further utilizes the symmetry of the problem and avoids the disruption of learning due to changes in the ego vehicle's absolute position coordinates during subsequent training.

Structure of SRGAT Model 4.1. Model Framework
This section presents the network architecture utilized for trajectory prediction, as illustrated in Figure 2. It comprises three main components: the Encoder, Goal Areas Estimation, and Trajectory Decoding and Generation.The Encoder encompasses a scenario encoder and a social encoder.The scenario encoder processes high-precision map data, converting them into relevant information such as road layouts and geometric shapes.The social encoder processes the historical trajectories of vehicles, using a one-dimensional convolutional network to extract interaction features, and employing Transformer layers to assimilate behavioral patterns of different vehicles.The Goal Area Estimation predicts vehicle destinations using Transformer encoder embeddings to improve precision.The GoICrop mechanism refines this by focusing on crucial trajectory segments, reducing the impact of poor anchor generation on stability.The Trajectory Decoder then constructs the predicted paths, integrating these estimates with dynamic behavior for accurate forecasts.

Encoder
In this section, the main task is to acquire the historical states and environmental features of traffic participants within a scene, learn structured map representations, and then fuse the information of traffic participants with HD map data.

Scenario Encoder
To initiate the encoding process of scenarios, we first input lane map data into a graph convolutional network for feature extraction.It considers the size, orientation, and position of the lane graph nodes when encoding their information, leading to a defined set of features for these nodes: where MLP denotes a multi-layer perceptron with subscripts indicating shape and location.Additionally, v i represents the position of the ith lane node, v start i and v end i are the coordinates of the start and end points of node i, respectively, and y i is the feature of the ith graph node.To implicitly capture the directional information carried between graph nodes and seize the long-distance road information relied upon by vehicles during their journey, we employed the LaneConv operator: where F is the aggregated graph node feature, C is the dilation size, and A i and W i are the adjacency matrices and weight matrices corresponding to the ith type of connection, respectively.A pre , A suc , A left , and A right represent connections from a node to its immediately preceding, succeeding, left, and right nodes, respectively.A k c pre denotes the k power of A pre , allowing lane graph node features to propagate information along the lane graph for k c steps, where k is a hyperparameter.With the utilization of graph convolutional networks and LaneConv, the scenario encoder effectively captures spatial relationships and directional information between lane nodes, enabling the creation of informative graph node features.

Social Encoder
To encode the social interactions among traffic participants, which are crucial for understanding and predicting their behavior, we employ a one-dimensional convolutional neural network (1D CNN).As shown in Figure 3, the network processes the trajectory information of dynamic objects S obs .The 1D CNN architecture consists of multiple scales of convolutional layer groups, each featuring two residual blocks with a stride of 2, enabling the model to capture a wide range of temporal patterns.A feature pyramid network (FPN) [32] is utilized to integrate feature maps of different scales, obtaining the final feature tensor through additional residual blocks.The above process can be expressed as follows: Each convolutional layer in the network employs a 3 × 3 kernel size, outputting 128 feature channels, followed by layer normalization and ReLU activation function.This configuration is designed to capture the nuances of dynamic object behavior in detail.After encoding the scene and trajectory information, we relatively obtain a 2D feature matrix A emb from Formula (4), where each row A emb,i represents the features of the i participant, and a 2D feature matrix Y as mentioned above, where each row Y i represents the features of the ith lane node.To merge social and environmental information, we utilize the Transformer Fusion Layer.This layer combines the dynamic impact of actors on lanes and the real-time feedback of lanes on actor behavior, subsequently enhancing the interaction between actors and the map through a Transformer encoder.Specifically, we first combine the features of actors A emb,i with those of surrounding lane nodes Y i through a spatial attention mechanism, forming a weighted feature representation W i , enriched with each lane node's characteristics.Then, this module integrates the updated features Y ′ i of lane nodes with the features A emb,i of actor nodes to reflect the real-time impact of lane information on actor behavior.This process not only captures the current traffic state of lane nodes, but also encodes the influence of traffic flow on these lanes.Consequently, it ensures the model's ability to comprehensively understand and predict the potential behaviors of actors in specific traffic environments.

Goal Areas' Estimation
To accurately capture and predict the complexity of driving behaviors, it is crucial to avoid oversimplification and accuracy loss in trajectory prediction.We are committed to optimizing our model, ensuring that the predicted trajectories closely match the actual trajectories.After processing through the encoder, the data flow toward the Transformer Fusion Layer, which includes a module for locating target areas.Due to the stochastic and multimodal nature of driving behaviors, we employ multi-target prediction when locating target areas.Initially, we select the most confident target point from the predicted target set as the predicted goal, considering the vehicle's motion history and current driving environment.Which can be expressed as GP = FFN(TransformerEncoder(type_encoding(A emb , Y))) (6) where GP is a 2D matrix in which GP i can be seen as a representation of the likely intended endpoint of the prediction trajectories.Before being processed by the Transformer encoder, A emb and Y were type-encoded the same way as mentioned in ViLT [33].This process approximates the vehicle's destination with the highest probability.By cropping the map to create a target area, we ensure that the vehicle's final position is more likely to be within this optimized, relatively smaller area.This approach addresses the uncertainty in endpoint prediction and enhances the accuracy of the prediction, making it more likely for the vehicle's actual position to appear within this smaller area of focus, rather than relying on potentially fluctuating target points.
Based on these predicted target points, we implicitly construct a model of future interactions among actors using the GoICrop [28] technique, which can be expressed as follows: while GP ′ ∈ R B×K×2 is the feature of the ith actor, B refers to batch size and K is the number of prediction modes, ϕ i is layer normalization process, W i serves as weights, and y j is the jth lane node feature.This process serves as spatial distance-based attention and updates the goal area lane nodes' features back to the actors, enhancing the model's ability to capture complex driving behaviors.
This Region of Interest (ROI) filtering method allows us to precisely determine potential target points for each actor and predict their possible future interactions.This strategy significantly enhances the model's ability to understand and predict actor behaviors in complex traffic scenarios.
Finally, we use the updated vehicle features GP ′ as input to generate K confidence scores as matrix C ∈ R B×K .Both of them will be used in the decoding process to predict final future trajectories.Similar to LaneGCN, it can be generated simply by MLP: LinearRes() = GN(Linear(ReLU(GN(Linear()))) (10) where GN( ) stands for GroupNorm.A dual-branch multimodal prediction architecture is employed, where one branch is responsible for estimating possible trajectories, and the other for assigning confidence scores to these trajectories.Figure 4 is a schematic diagram illustrating the model's prediction of vehicle trajectories at an intersection and their associated confidence scores, where the green lines represent the predicted trajectories, the green stars indicate the predicted goals and the confidence scores for different trajectories are denoted by the numerical values along the paths.The following loss function is used to assess and optimize the accuracy of the predicted trajectories: where α 1 , β 1 , and ρ 1 serve as weights.By combining multi-target prediction and GoICrop technology, the model can precisely predict future trajectories among multiple possibilities.

Trajectory Decoding and Generation
The decoder of our model is designed to leverage the high-dimensional features extracted at earlier stages and generate accurate multimodal future trajectory predictions.This approach ensures the model's capability to anticipate various potential pathways, enhancing the reliability of the predictions.To produce K ∈ N distinct predictions from the same input scenario, the model initially employs K learnable seed parameter matrices T) , where T denotes the prediction time steps, and i ∈ {1, . . ., K}.By replicating each Q i across the agent dimension, a new input tensor of dimension R (d k ,M,T) is created, enabling the model to generate specific predicted trajectories for each agent at every time step, thereby effectively realizing diversified trajectory predictions for the same scenario.
The decoding process begins with handling the time dimension, employing a multihead attention-based decoder (MABD) layer to process the encoder's output GP ′ and the encoded seed parameters Q i .For n agents, this process can be denoted as where H ′ 0 is the output tensor, MHSA represents the multi-head self-attention mechanism, and rFFN is a residual feed-forward network.These components work together, allowing the MABD layer to efficiently process and decode time-series data, generating precise future trajectory predictions for each agent.
To ensure social consistency among the set elements in future scenarios, it is essential to process each time slice of H ′ 0 .Specifically, for the agent state set H ′ 0τ at some future time step τ, the decoder processes each element h ′ 0τ using a multi-head attention block (MAB) layer.
The decoder repeats these operations L dec times, with each iteration that updates the output tensor H ′ 0 progressively refining the prediction for each agent at future time steps.In decoding, different learnable seed parameters Q i and additional context information m i are used, repeating c times, resulting in a four-dimensional tensor O ∈ R (d k ×M×T×c) , containing all possible predictions.Finally, this output tensor can be element-wise processed through a ReLU activation function to produce the final output representation.

Training Details
The training process is divided into two stages: the target prediction stage and the regression stage.During the target prediction stage, we have adopted K mode endpoints' estimations , where g k n,end is the k-th predicted goal coordinates and c k n,end is the k-th predicted goal confidence of the n-th actor.Our objective is to identify the positive target whose Euclidean distance to the ground truth trajectory endpoint is minimized.We employ a sum of classification and regression losses to train this stage.Given a predicted target, we aim to find the positive target with the minimum Euclidean distance to the ground truth trajectory endpoint.For classification, we utilize a maximum margin loss: where N represents the total number of traffic participants, and ϵ = 0.2 is the margin boundary.For the regression task, a smooth L1 loss is applied to all positive trajectory prediction steps: where a kn ,end is the ground truth BEV coordinates of the n actor's trajectory endpoint, kn denotes the n element, and reg() is the smooth L1 loss.Additionally, we attempt to incorporate a "single-target prediction" module at the midpoint of each trajectory to aggregate map features, assisting in the prediction of the endpoint target and overall trajectory.Similarly, for each actor, a residual MLP is applied to regress a middle target.The loss for this module is given by where a * (n,end) represents the ground truth BEV coordinates at the midpoint of the n actor's trajectory.The total loss for the target prediction stage is we set the weights of α 1 , β 1 , and ρ 1 to 1, 0.2, and 0.1 in the experimental phase.In the regression stage, for classification, we employ a boundary loss L cls similar to the one used in the target prediction stage.For regression tasks, the smooth L1 loss is similarly utilized and applied to all positive trajectory prediction steps: where a k (n,t) represents the predicted positive BEV coordinates of actor n at time step t, while a * (n,t) represents a ground truth one.Moreover, to emphasize the importance of the endpoint, we introduce a loss term that imposes a penalty at the endpoint: The final training loss function is a weighted sum of these loss terms:

Experiment Setup
All the program tasks were conducted on Python 3.9, and the deep learning framework was based on PyTorch version 1.13.We train our model on a computer system equipped with an Intel(R) Xeon(R) Platinum 8358P CPU and an NVIDIA A40 GPU.Our predictive framework was evaluated on the extensive Argoverse motion prediction dataset, which provides trajectories for agent vehicles alongside high-definition map data.This dataset encompasses over 324,557 scenarios collected from Pittsburgh and Miami, segmented into training, validation, and test sets with 205,942, 39,472, and 78,143 samples, respectively.All training and validation scenarios consist of five-second sequences sampled at 10 Hz.In the trajectory prediction challenge hosted by Argoverse, the first 2 s of historical trajectory data are made available.Given the initial two-second observations, the Argoverse motion prediction challenge entails predicting the future three-second movement of agent vehicles.The dataset furnishes actor data as trajectories spanning 20 time steps; map data include a set of lane centerlines and their connectivity.
In addition to the Argoverse dataset, we also conducted evaluations on the nuScenes Prediction dataset [34], a self-driving car dataset collected in Boston and Singapore.It contains 1000 scenes, each lasting 20 s, with ground truth annotations and HD maps.Vehicles in nuScenes have manually annotated 3D bounding boxes, which are published at 2 Hz.The prediction task involves using the previous 2 s of object history and the map to predict the next 6 s.We employed the standard split from the nuScenes software (version 1.3) kit for our tests.
After data preprocessing, the relevant input features and desired outputs were extracted from both the training sets of Argoverse and nuScenes.The parameters of the network model were set according to the specifications outlined in Table 1.

Evaluation Metrics
This study evaluates experimental outcomes based on fundamental forecasting parameters and assessment metrics, focusing on unimodal (K = 1) and multimodal (K = 6) prediction outcomes.In instances where the model generates more than K trajectories, only the predictions with the top K probability scores are considered.The evaluation metrics include the minimum average displacement error (minADE), minimum final displacement error (minFDE), Brier minimum final displacement error (brierFDE), and miss rate (MR).The details of each metric are as follows.
Average displacement error (minADE) measures the average accuracy of predictions by calculating the average Euclidean distance between the ground truth trajectory and the best trajectory out of K-predicted trajectories.The formula for ADE is Here, N is the number of predicted trajectories, T pred is the prediction duration, T is the observation duration, ŷ is the predicted position at time t which derives from O (O = {y k n } k∈[0,k−1] ), and y gt t is the ground truth position at time t.Final displacement error (minFDE) evaluates the accuracy of the predicted trajectory at the final moment of the prediction period by measuring the Euclidean distance between the last point of the predicted trajectory and the last point of the true trajectory.Its formula is The symbols used here carry the same meaning as those in the ADE formula.Brier minimum final displacement error (brier-minFDE) is similar to FDE but incorporates a penalty term related to the accuracy of the predicted probabilities in calculating the error.This metric considers not only the final displacement error of the prediction, but also the probability accuracy of the predicted trajectory.The miss rate (MissRateK,2) imposes penalties solely on predictions that deviate by more than 2 m from the ground truth.The offroad rate quantifies the proportion of predictions that fall outside the road boundaries.

Performance Comparison to Other Methods
We conducted an extensive comparison of our SRGAT model against a broader range of state-of-the-art trajectory prediction methods on the Argoverse motion prediction benchmark [2].As presented in Table 2, SRGAT demonstrates superior performance over existing methods, notably TNT [22], LaneRCNN [35], LaneGCN [31], and the newly compared models in [23,25,36,37].The detailed analysis reveals that SRGAT consistently achieves lower average offset error and higher long-term prediction accuracy, indicating its robustness in diverse traffic scenarios.
Notably, the comparison with LaneGCN, which serves as our primary benchmark, highlights the effectiveness of our approach.Our model achieves significant improve-ments of 15%, 22%, 21%, 15%, and 13% in minADE 6 , minFDE 6 , brierFDE 6 , minADE 1 , and minFDE 1 , respectively.These improvements can be attributed to the innovative use of the Social Relationship Graph Attention Network (SRGAT), which effectively captures dynamic interactions among agents in traffic, providing a more accurate prediction of their future trajectories.Our model's profound comprehension of the dynamics between traffic participants and HD maps is facilitated by constructing a comprehensive map node graph coupled with a multi-layer graph neural network strategy.Leveraging HD map data alongside the Transformer network's aptitude for identifying long-range dependencies, it can skillfully forecast a range of potential objectives and Points of Interest (POIs).Employing a dual-branch, multimodal prediction framework, SRGAT, not only generates multiple viable future pathways linked to these POIs, but also accurately evaluates their likelihood.This holistic integration of technologies ensures SRGAT achieves significant improvements in trajectory prediction accuracy compared to previous models, effectively enhancing our understanding and forecasting of complex traffic interactions.
Furthermore, we offer both quantitative metrics and qualitative insights to understand the model's performance better.Through visual comparisons in specific scenarios, SRGAT not only accurately predicts trajectories, but also adapts to complex interactions, demonstrating its significant advantages over conventional models.

Ablation Study
We conducted a detailed ablation study on the validation set to assess the impact of each component within our model.Taking the LaneGCN model as a baseline, we add other components progressively.Firstly, to enhance the model's understanding of social interactions among traffic participants and avoid the inefficiencies observed during the inference process in graph neural networks (such as LaneGCN), we integrated an independent attention mechanism as the social interaction encoder in our model.In Table 3, 'Social-En' represents the social interaction encoder, and the '-' symbol indicates its replacement with a three-layer FPN.Secondly, to better leverage road features and enhance the model's understanding of the interaction between participants and the HD maps, we constructed a graph of map nodes and utilized a multi-step graph neural network to encode vectorized map information.Subsequently, we integrated the dynamic impact of participants on lanes with the real-time feedback of lanes on participant behavior using a Transformer Fusion Layer.These processes are denoted as Scene-En, with "-" indicating the substitution with feature extraction using a 32-layer CNN from rasterized map data.When combining the above two modules, we observe a performance improvement of over 54% on minFDE 6 , indicating the complementary effects of these modules and their importance in enhancing model performance.To improve training efficiency and model quality, we also employed learnable seed parameters, denoted as L-Seed.The introduction of seed parameters also makes an important contribution to further improving model performance.To better demonstrate the model's effectiveness in complex traffic scenarios, we visualized the prediction results.As illustrated, orange represents the actual location of the target vehicle, blue signifies the ego vehicle, and purple indicates the relevant other traffic participants.The red line depicts the actual trajectory (ground truth) of the target vehicle, while the green lines represent the multimodal predictive trajectories generated by our model, green stars indicate the predicted goals,each with a corresponding confidence level.From Figure 5a,b, we can observe that the model made accurate predictions about the target vehicle's direction of travel at the intersection.In Figure 5c, our model effectively utilized the interactions among surrounding traffic participants to generate accurate trajectory predictions for the target vehicle.These details demonstrate our model's advanced capability in understanding and adapting to complex traffic scenarios, accurately predicting vehicle behaviors, and providing reasonable predictions among various potential behavioral choices.These results underscore the predictive power and accuracy of our model in complex traffic situations.

Conclusions
This article presents SRGAT, a cutting-edge trajectory prediction model for predicting vehicle trajectories in advanced autonomous driving applications, leveraging highdefinition map data and vehicle dynamics.Its unique architecture, which combines a Transformer network with a dual-branch multimodal prediction mechanism, enables it to effectively capture complex traffic scenarios and predict future vehicle movements with high precision.The use of goal area estimation strengthens the model's ability to generate muti-mode trajectories and support effective use of road context.The integration of map data enhances the model's contextual understanding, while the attention mechanism and learnable seed parameters improve prediction diversity and training efficiency.Through comprehensive testing on the Argoverse dataset, our model demonstrates superior performance over existing methods.The results highlight SRGAT's advancement in trajectory prediction, showcasing its enhanced accuracy, reliability, and efficiency in predicting traffic movements.

Figure 1 .
Figure 1.Visualization of data preprocessing for the SRGAT model.The left side presents the structured data inputs for agents.The right side demonstrates the conversion of map metadata into a vectorized road network graph, oriented on a 2D Cartesian plane with the ego vehicle at the center.

Figure 2 .
Figure 2. Schematic of the SRGAT model for vehicle trajectory prediction.The Encoder phase assesses the scene via agent nodes and transforms lane graphs into vectorized maps.Subsequent processing by the scenario and social encoders refines these data.Goal Areas Estimation then projects potential destinations, incorporating context into actor dynamics with the help of auxiliary losses.The culmination is the Trajectory Output, which synthesizes predictions using Multi-Agent Behavior (MAB) and Time Encoding (MABD) informed by initial seed parameters.

Figure 3 .
Figure 3. Encoding process of the SRGAT model, capturing past trajectories and environmental features for traffic scenario analysis.

Figure 4 .
Figure 4. Predictive trajectory paths with associated confidence scores for an autonomous vehicle at an intersection.The encoding process captures past trajectories and environmental features for traffic scenario analysis.

Figure 5 .
Figure 5. Qualitative results of SRGAT model.(a) Model predicts the vehicle's left turn.(b) Model predicts straight movement.(c) Model utilizes traffic interactions for accurate trajectory predictions.

Table 1 .
The configuration of model parameters.

Table 2 .
Results on Argoverse (upper set) and nuScenes (lower set) motion forecasting dataset.The "-" denotes that this result was not reported in their paper.

Table 3 .
The ablation study of SRGAT model.