1. Introduction
The safety decision-making of autonomous driving systems heavily relies on accurate multi-agent trajectory prediction. It fundamentally requires modeling two key aspects: capturing the spatiotemporal dependencies of individual agents and representing the complex interactions among multiple traffic participants, especially in highly dynamic urban environments. Among existing approaches, the Hierarchical Vector Transformer (HiVT) has emerged as a representative baseline in multi-agent trajectory prediction due to its efficient vectorized representation and hierarchical feature aggregation mechanism. HiVT achieves a favorable balance between prediction accuracy and inference efficiency on mainstream benchmark datasets such as Argoverse [1,2]. Nevertheless, practical deployments reveal that HiVT still exhibits room for further optimization. In particular, feature extraction in the encoder stage lacks explicit fine-grained refinement of local spatiotemporal information, while the transition of features from global interaction modeling to trajectory decoding does not fully exploit complementary information across multiple interaction scales. These limitations lead to subtle performance bottlenecks in key evaluation metrics, including minimum Average Displacement Error (minADE), minimum Final Displacement Error (minFDE), and Miss Rate (MR).
Recent advances in trajectory prediction have substantially improved modeling accuracy by leveraging more expressive network architectures and sophisticated interaction modeling strategies. However, these improvements often come at the cost of increased model complexity, making it difficult to deploy such methods in real-time autonomous driving systems with constrained computational resources.
In practice, many learning-based trajectory prediction models face a persistent trade-off between prediction accuracy and parameter efficiency. Performance gains are frequently achieved by increasing model depth or width, which leads to higher computational overhead and reduced scalability. As a result, achieving consistent accuracy improvements while maintaining a lightweight model design remains an open challenge. In this work, a “significant” increase in parameter count is defined as an increase exceeding 10%.
This challenge motivates the exploration of feature refinement strategies that enhance representational quality without relying on substantially larger models, forming the basis of the proposed approach.
Building on this motivation, this paper proposes a lightweight enhanced trajectory prediction framework termed PFR-HiVT. Without modifying the original hierarchical backbone of HiVT, the proposed framework improves feature representation through a two-stage enhancement strategy. First, encoder-side feature enhancement is introduced by incorporating the Feature Enhancement Module (FEM) and the Attention Enhancement Module (AEM), which explicitly refine local spatiotemporal features and provide more informative representations for subsequent interaction modeling. Second, the global interaction stage is redesigned by introducing a Progressive Feature Refinement Global Interactor, which systematically integrates three lightweight modules—the Simple Feature Refinement Module (SFR), the Lightweight Gate Module (LG), and the Residual Connection Module (RC)—to refine and reweight globally interacted features before trajectory decoding. All proposed modules adopt lightweight designs and introduce only approximately 230 k additional parameters compared to the original HiVT-128. Extensive experiments on the Argoverse 1.1 dataset demonstrate that the proposed framework consistently improves minADE, minFDE, and MR, while maintaining the real-time inference capability of the baseline model.
The main contributions of this work can be summarized as follows:
A two-stage feature enhancement strategy is introduced for multi-agent trajectory prediction. It combines two lightweight, insertion-based encoder-level modules, which perform feature refinement rather than attention reweighting prior to global interaction, with a Progressive Feature Refinement Global Interactor, while preserving the original HiVT backbone architecture.
A lightweight global interaction refinement mechanism is developed, integrating feature refinement, gating, and residual learning in a unified manner, enabling performance improvements with limited additional parameter cost.
Extensive experiments are conducted on the Argoverse 1.1 dataset, where the proposed framework achieves a minADE of 0.703, a minFDE of 1.041, and an MR of 0.112, corresponding to improvements of approximately 2.7%, 2.5%, and 1.0% over the baseline HiVT model, respectively.
The remainder of this paper is organized as follows. Section 2 reviews related work on multi-agent trajectory prediction. Section 3 presents the proposed methodology, including the overall architecture and the introduced enhancement modules. Section 4 describes the experimental setup and reports both ablation and comparative results. Finally, Section 5 concludes the paper and discusses limitations and future work.
2. Related Work
Trajectory prediction for dynamic agents has become increasingly important in autonomous driving applications [2]. Existing trajectory prediction approaches can be broadly categorized into three classes: physics-based methods, classical machine learning-based methods, and neural network-based methods.
Physics-based methods model agent motion using kinematic or dynamic constraints derived from physical laws, such as constant velocity, constant acceleration, or vehicle dynamics models [3,4]. These approaches typically require limited training data and offer strong interpretability, making them computationally efficient and suitable for simple driving scenarios. However, due to their reliance on handcrafted assumptions, physics-based methods struggle to model complex interactions, multimodal behaviors, and long-term dependencies in dense traffic environments.
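As a concrete illustration of the simplest physics-based model mentioned above, the following minimal sketch extrapolates a track under a constant-velocity assumption. The function name `constant_velocity_forecast`, the toy positions, and the 0.1 s time step are illustrative assumptions, not part of any cited method.

```python
def constant_velocity_forecast(history, horizon, dt=0.1):
    """Extrapolate a 2D track, assuming the last observed velocity stays constant."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    # Velocity estimated from the last two observed positions.
    vx, vy = (x1 - x0) / dt, (y1 - y0) / dt
    return [(x1 + vx * dt * k, y1 + vy * dt * k) for k in range(1, horizon + 1)]


# Two observed positions sampled 0.1 s apart (toy values).
history = [(0.0, 0.0), (1.0, 0.5)]
forecast = constant_velocity_forecast(history, horizon=3)
print(forecast)
```

Such a model needs no training data, but it cannot represent interactions or multiple plausible futures, which is the limitation noted above.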
Classical machine learning-based methods aim to learn motion patterns from data by fitting probabilistic models or handcrafted features. Representative approaches include Gaussian processes [5], Gaussian mixture models [6], hidden Markov models [7], and conditional random fields. For example, Gaussian processes have been used to model pedestrian and vehicle trajectories with uncertainty-aware predictions, while Gaussian mixture models and hidden Markov models have been applied to capture multimodal motion patterns and maneuver switching behaviors in urban driving scenarios. These methods provide a principled probabilistic formulation and improve flexibility compared to purely physics-based models. Nevertheless, their representational capacity is limited by feature design and model assumptions, making it difficult to scale to highly dynamic, multi-agent urban traffic scenarios.
With the availability of large-scale driving datasets, neural network-based methods have become the dominant paradigm for trajectory prediction. Existing frameworks leverage Convolutional Neural Networks (CNN) [8,9], Recurrent Neural Networks (RNN) [10], Long Short-Term Memory (LSTM) networks [11], Graph Neural Networks (GNN) [12,13], or hybrid architectures to capture spatial, temporal, and relational patterns in agent motion. These approaches significantly improve prediction accuracy and scalability compared to traditional methods.
Social interactions play a crucial role in traffic scenarios, as the future motion of each agent is strongly influenced by surrounding participants. To model such interactions, many methods incorporate social pooling mechanisms, Graph Neural Networks [13,14,15,16], or attention-based interaction modeling [17,18,19,20,21]. Social LSTM [11] explicitly models inter-agent interactions by aggregating hidden states based on spatial proximity. Inspired by the success of Transformer models in natural language processing and computer vision [22,23,24], recent works apply Transformers to trajectory prediction to capture long-range spatial-temporal dependencies and agent–map interactions [25,26,27].
More recently, a number of advanced Transformer-based models have been proposed to further improve trajectory prediction performance. For example, QCNet introduces a query-centric attention mechanism to enhance interaction modeling among agents [28]. Similarly, MTR++ [29] adopts a Transformer-based architecture with guided intention querying and symmetric scene modeling to capture multi-agent interactions and multimodal future behaviors. By explicitly incorporating intention-level representations into the prediction process, MTR++ further improves trajectory forecasting accuracy in complex traffic scenarios, while highlighting the effectiveness of attention-based interaction modeling. These methods achieve strong performance on large-scale benchmarks, demonstrating the rapid evolution of Transformer-based trajectory prediction.
Among these methods, HiVT proposes a hierarchical vectorized Transformer architecture [30] with a two-stage design consisting of a Local Encoder and a Global Interactor. The Local Encoder captures temporal dependencies within individual agent trajectories, while the Global Interactor models spatial interactions among agents. This hierarchical structure enables efficient multi-scale feature learning and avoids the high computational cost of all-to-all message passing. However, despite its effectiveness, HiVT and other Transformer-based approaches primarily process features using standard Transformer blocks composed of self-attention and feed-forward layers, without introducing explicit feature enhancement or re-calibration mechanisms. In particular, the Local Encoder outputs are directly forwarded to the Global Interactor without intermediate refinement, which may propagate redundant or weakly informative features into the global interaction stage. Moreover, the features produced by the Global Interactor are directly decoded into future trajectories, without any progressive feature refinement, gating, or adaptation modules to re-adjust globally interacted representations for the prediction task. These architectural characteristics limit the model’s ability to progressively strengthen informative spatiotemporal features across different stages of the network.
To address these limitations, a multi-module collaborative feature enhancement framework is proposed to systematically improve feature representation and interaction modeling. Building upon the hierarchical structure of HiVT, the proposed method introduces two encoder-side feature enhancement modules and a Progressive Feature Refinement Global Interactor, enabling progressive refinement of features before and after global interaction. Through the synergistic integration of multiple lightweight enhancement modules, the proposed framework achieves improved prediction accuracy while maintaining computational efficiency.
3. Methodology
3.1. Framework
The HiVT model adopts a hierarchical architecture consisting of three main components: a Local Encoder, a Global Interactor, and a Decoder. The Local Encoder encodes local spatiotemporal features of individual agents, the Global Interactor models interactions among multiple agents at the scene level, and the Decoder generates multimodal trajectory predictions.
Although HiVT achieves a favorable trade-off between prediction accuracy and inference efficiency, its feature propagation pipeline follows a direct sequential design, in which the outputs of the Local Encoder are directly forwarded to the Global Interactor and subsequently decoded without intermediate feature refinement or re-calibration [30]. From a network design perspective, the absence of explicit intermediate transformation or refinement layers may allow redundant or weakly informative features to be propagated across stages, thereby reducing the effective representational capacity of subsequent interaction modeling. In particular, the features generated by the Local Encoder primarily encode local spatiotemporal patterns and are not explicitly filtered or re-weighted before global interaction, which may limit their ability to support effective long-range dependency modeling and context-aware interaction reasoning. To mitigate these limitations, two lightweight enhancement modules are introduced prior to the Global Interactor: the Feature Enhancement Module (FEM) and the Attention Enhancement Module (AEM). FEM focuses on refining local spatiotemporal representations, while AEM incorporates global contextual dependencies via self-attention. Together, these modules provide more informative and expressive feature representations for downstream interaction modeling.
Furthermore, in the original HiVT framework, the output of the Global Interactor is directly fed into the Decoder. Such a design limits the exploitation of rich multimodal interaction features and may result in suboptimal feature quality for high-precision trajectory prediction. To overcome this limitation, a Progressive Feature Refinement Global Interactor (PFR-Global Interactor) is introduced, which integrates three complementary lightweight modules: the Simple Feature Refinement (SFR) module, the Lightweight Gate (LG) module, and the Residual Connection (RC) module. These modules are tightly coupled within the Global Interactor to progressively refine interaction features, selectively emphasize informative components, and stabilize feature propagation through residual learning.
By embedding progressive refinement directly into the interaction modeling stage, the proposed PFR-Global Interactor enhances feature quality before decoding while preserving the original hierarchical structure of HiVT. The refined features are subsequently passed to the Decoder for multimodal trajectory prediction. The overall framework of the proposed PFR-HiVT model is illustrated in Figure 1.
3.2. Feature Enhancement Module
In the Local Encoder, the feature extraction process consists of three key modules: the Agent–Agent Interaction module, the Temporal Dependency module, and the Agent–Line Interaction module. The Agent–Agent Interaction module is designed to model the interaction relationships between traffic agents; the Temporal Dependency module captures the motion patterns along the temporal dimension; and the Agent–Line Interaction module establishes the correlation between road structures and agent motion. The feature tensor generated by this processing pipeline, denoted as $x \in \mathbb{R}^{N \times D}$, contains a comprehensive representation of the central agent and its surrounding environment, where $N$ represents the number of agents and $D$ denotes the feature dimension. However, these features may contain noise and redundant information. To address this, the Feature Enhancement Module (FEM) is introduced, which performs feature distillation through multi-layer feature transformations and regularization operations. The flow of features within this module is described by the following equation:
Here, $x \in \mathbb{R}^{N \times D}$ denotes the input feature tensor produced by the Local Encoder, where $N$ is the number of agents and $D$ is the feature dimension. The Feature Enhancement Module operates on each agent feature independently, applying the same set of learnable parameters to all agents.
The weight matrices $W_1$ and $W_2$, and the bias vectors $b_1$ and $b_2$, are shared across agents and applied row-wise to $x$. $\mu$ and $\sigma^2$ denote the mean and variance computed along the feature dimension for Layer Normalization. In Equation (4), Dropout is applied with a probability $p$. $\alpha$ is a scalar hyperparameter controlling the residual scaling.
After the above transformations, the result is added to the original Local Encoder output $x$ through a residual connection weighted by $\alpha$. The residual scaling factor $\alpha$ is set to 0.1 to control the contribution of the enhancement branch and stabilize training. This value was selected through a small-scale hyperparameter search on the validation set: among the candidate values 0.05, 0.1, and 0.2, $\alpha = 0.1$ provided the most stable optimization behavior and consistently favorable performance across evaluation metrics.
For convenience in subsequent descriptions, the above process is summarized as:
Two linear transformations enable the module to re-combine heterogeneous spatiotemporal features into more informative representations for trajectory prediction. The LayerNorm layer is employed to align feature distributions across different agents and time steps, mitigating distribution shifts caused by diverse motion patterns and interaction dynamics. The Dropout layer regularizes the enhancement branch and reduces overfitting to specific motion patterns, thereby improving generalization in complex traffic scenarios. Finally, the residual connection preserves low-level geometric and kinematic information while injecting refined contextual features, ensuring motion continuity and robustness in long-horizon trajectory prediction.
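The FEM description above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ordering Linear, LayerNorm, activation, Dropout, Linear is one plausible reading of the text, the ReLU activation is assumed, LayerNorm is shown without learnable scale and shift, and the helper names (`linear`, `layer_norm`, `dropout`, `fem`) are hypothetical.

```python
import math
import random


def linear(x, W, b):
    """Affine map for one agent's feature vector: W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector with its own mean and variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def dropout(x, p, training, rng):
    """Inverted dropout; acts as the identity when training is False."""
    if not training or p == 0.0:
        return list(x)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]


def fem(features, W1, b1, W2, b2, alpha=0.1, p=0.1, training=False, rng=random):
    """Apply the same enhancement branch to every agent (row), then add it back scaled by alpha."""
    out = []
    for x in features:
        h = linear(x, W1, b1)
        h = layer_norm(h)
        h = [max(0.0, v) for v in h]  # ReLU is an assumed activation
        h = dropout(h, p, training, rng)
        h = linear(h, W2, b2)
        out.append([x_i + alpha * h_i for x_i, h_i in zip(x, h)])
    return out


# Toy example with N = 2 agents and D = 4 features.
D = 4
identity = [[1.0 if i == j else 0.0 for j in range(D)] for i in range(D)]
zeros = [[0.0] * D for _ in range(D)]
bias = [0.0] * D
agents = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 0.0, 2.0]]
enhanced = fem(agents, identity, bias, identity, bias)
unchanged = fem(agents, identity, bias, zeros, bias)
```

With the second layer zeroed out, the branch contributes nothing and the input passes through unchanged, which shows how the scaled residual keeps the original features as the dominant signal.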
3.3. Attention Enhancement Module
The Local Encoder primarily focuses on local interactions and lacks the ability to model long-range dependencies between agents. As a result, the model may not fully capture the mutual influence between agents in complex traffic scenarios. To address this issue, the Attention Enhancement Module (AEM) is introduced. The AEM captures global dependencies between agents by incorporating a multi-head self-attention mechanism, thereby providing each agent with global contextual information. AEM consists of a multi-head self-attention layer and a feature fusion layer. The input to this module is the output of the previous module, an $N \times D$ feature tensor, where $N$ represents the number of agents and $D$ is the feature dimension.
This input first passes through a multi-head self-attention layer, where self-attention is computed. The learnable parameter matrices are $W_Q$, $W_K$, and $W_V$.
Here, the projection matrices $W_Q$, $W_K$, and $W_V$ map the input features into query, key, and value representations. In Equation (9), embeddim denotes the embedding dimension of the attention mechanism, which is set to $D$. With 8 attention heads, each head operates on a subspace of dimension $D/8$. The number of attention heads is set to 8 based on a small-scale hyperparameter comparison on the validation set. Of the candidate values 4 and 8, using 8 heads provided more stable optimization behavior and consistently better performance while maintaining computational efficiency under the lightweight design constraint.
$W_O$ is the output projection matrix.
The original input features and the attention output features are concatenated and passed as input to the feature fusion network. The structure of the feature fusion network is exactly the same as that of the previously described Feature Enhancement Module. After obtaining the output, a weighted residual connection is applied with the original input of the module.
This module fuses the original features with the attention features, learning richer interaction representations through the feature fusion layer. It thereby supplies higher-quality input features to the subsequent PFR-Global Interactor, enhancing the model’s ability to understand complex traffic scenarios and improving prediction accuracy. The structures of the two modules are shown in Figure 2.
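The attention step of the AEM can be illustrated with a small, dependency-free sketch of multi-head scaled dot-product self-attention across agents. It is a sketch under assumptions: projections are applied as right-multiplications without bias terms, the function names are hypothetical, and the fusion network and residual connection that follow in the AEM are not shown.

```python
import math


def matmul(A, B):
    """Multiply an (n x k) list-of-lists by a (k x m) list-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product self-attention across agents (rows of X)."""
    D = len(X[0])
    assert D % num_heads == 0, "feature dimension must be divisible by the head count"
    d_head = D // num_heads
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    heads = [[] for _ in X]  # concatenated head outputs, one row per agent
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        for i, q in enumerate(Q):
            scores = [sum(a * b for a, b in zip(q[s], k[s])) / math.sqrt(d_head) for k in K]
            weights = softmax(scores)
            heads[i] += [sum(w * v[s][j] for w, v in zip(weights, V)) for j in range(d_head)]
    return matmul(heads, Wo)


# Toy example: N = 3 agents, D = 4 features, 2 heads, identity projections.
D, num_heads = 4, 2
identity = [[1.0 if i == j else 0.0 for j in range(D)] for i in range(D)]
agents = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0], [2.0, 2.0, 0.0, 1.0]]
attended = multi_head_self_attention(agents, identity, identity, identity, identity, num_heads)
# Concatenate original and attention features, as described for the fusion input.
fusion_input = [x + a for x, a in zip(agents, attended)]
```

With $D = 128$ and 8 heads, each head would operate on a 16-dimensional subspace; the toy sizes here are chosen only to keep the example readable.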
3.4. PFR-Global Interactor
In the original HiVT framework, the Global Interactor directly outputs interaction-aware features that are subsequently sent to the Decoder. Although effective, this design limits the refinement and selective utilization of rich multimodal interaction features. To address this limitation, the Global Interactor is redesigned as a Progressive Feature Refinement Global Interactor (PFR-Global Interactor), which incorporates a sequence of lightweight refinement mechanisms to enhance feature quality prior to decoding.
Specifically, the proposed PFR-Global Interactor integrates three tightly coupled refinement components in a progressive manner: the Simple Feature Refinement (SFR) module, the Lightweight Gate (LG) module, and the Residual Connection (RC) module. Rather than serving as an independent pre-decoder processing block, these modules collectively form the internal refinement pipeline of the Global Interactor, enabling gradual optimization of interaction features while preserving the original hierarchical structure of HiVT.
Simple Feature Refinement (SFR) Module. The SFR module employs a two-layer MLP (multi-layer perceptron) to model nonlinear transformations of the interaction features produced by the Global Interactor. Its purpose is to perform fine-grained feature adjustment with minimal computational overhead, thus enhancing feature quality while maintaining efficiency.
The residual coefficient is set to 0.05 to ensure that refinement introduces only a controlled adjustment, thus preserving the core interaction information. Since interaction features aggregated from multiple agents may contain noise and redundancy, placing the SFR module at the beginning of the PFR pipeline enables early-stage denoising and stabilization, providing a cleaner feature representation for subsequent refinement stages.
Lightweight Gate (LG) Module. Following initial refinement, the LG module introduces a learnable gating mechanism to selectively regulate feature propagation. It consists of a single MLP layer followed by a nonlinear activation function.
In this formulation, • denotes element-wise multiplication and $\sigma$ represents the Sigmoid function. The LG module learns importance-aware gating weights that suppress irrelevant or noisy components while emphasizing salient interaction features. This selective filtering mechanism ensures that informative features are retained within the PFR-Global Interactor and effectively transmitted to the Decoder.
Residual Connection (RC) Module. As the final refinement stage within the PFR-Global Interactor, the RC module further enhances feature representations through learnable residual weighting, implemented via a two-layer MLP.
Here, the residual weight is implemented as a learnable scalar parameter and is initialized to a small value (0.1) to ensure stable optimization at the early training stage. It is trained jointly with all other network parameters via backpropagation to adaptively control the contribution of the refined features. This residual learning strategy facilitates stable gradient propagation and effective feature reuse. By serving as the final stage of the progressive refinement pipeline, the RC module enables the PFR-Global Interactor to produce more discriminative, robust, and noise-reduced interaction features, which are subsequently passed to the Decoder for accurate multimodal trajectory prediction.
The overall architecture of the proposed PFR-Global Interactor is illustrated in Figure 3.
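The three refinement stages described in this section can be sketched for a single agent feature vector as follows. This is an assumed composition based on the prose only: the ReLU inside the two-layer MLPs, the exact placement of each residual, and all function names (`sfr`, `lg`, `rc`, `pfr_refine`) are hypothetical, and the RC weight is shown as a plain float although the text describes it as a learnable scalar initialized to 0.1.

```python
import math


def linear(x, W, b):
    """Affine map W x + b for one feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def mlp2(x, W1, b1, W2, b2):
    """Two-layer MLP; the ReLU between layers is an assumption."""
    hidden = [max(0.0, v) for v in linear(x, W1, b1)]
    return linear(hidden, W2, b2)


def sfr(x, params, coeff=0.05):
    """Simple Feature Refinement: small residual adjustment (coefficient 0.05)."""
    return [x_i + coeff * r_i for x_i, r_i in zip(x, mlp2(x, *params))]


def lg(x, W, b):
    """Lightweight Gate: element-wise product of x with a sigmoid gate."""
    gate = [1.0 / (1.0 + math.exp(-z)) for z in linear(x, W, b)]
    return [x_i * g_i for x_i, g_i in zip(x, gate)]


def rc(x, params, weight=0.1):
    """Residual Connection: weighted residual around a two-layer MLP."""
    return [x_i + weight * r_i for x_i, r_i in zip(x, mlp2(x, *params))]


def pfr_refine(x, sfr_params, lg_params, rc_params, rc_weight=0.1):
    """Progressive refinement in the order SFR, LG, RC."""
    return rc(lg(sfr(x, sfr_params), *lg_params), rc_params, rc_weight)


# Toy check with all-zero weights: SFR and RC leave x unchanged, the gate equals 0.5.
D = 2
zeros_w, zeros_b = [[0.0] * D for _ in range(D)], [0.0] * D
mlp_zero = (zeros_w, zeros_b, zeros_w, zeros_b)
x = [2.0, -4.0]
refined = pfr_refine(x, mlp_zero, (zeros_w, zeros_b), mlp_zero)
```

The zero-weight case makes the role of each stage visible: the two residual stages reduce to the identity, while the gate alone rescales the features.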
3.5. Multimodal Future Decoder
We follow the original HiVT framework [30] and directly adopt its decoder architecture without any modification. The decoder is briefly summarized below for completeness. Given the inherent multimodality of an agent’s future motion, we parameterize the distribution of future trajectories as a mixture model where each mixture component is a Laplace distribution. The prediction is performed simultaneously for all agents. For each agent $i$ and each component $f$, the local and global features are fused and then used as input to an MLP, which outputs the position of the agent at each future timestep in the local coordinate frame, along with its associated uncertainty. The output tensor of each regression head has leading dimensions $[F, N, H]$, where $F$ is the number of mixture components, $N$ is the number of agents in the scene, and $H$ is the number of future timesteps predicted. Additionally, we use another MLP layer followed by a Softmax function to generate the mixture coefficients for each agent, which have the shape $[N, F]$.
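To make the mixture parameterization concrete, the sketch below evaluates the negative log-likelihood of one 2D position at a single timestep under a Laplace mixture whose coefficients come from a softmax. It assumes independent Laplace densities per coordinate; the function names and toy values are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax producing mixture coefficients."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def laplace_nll(y, mu, b):
    """Negative log-likelihood of a point under independent per-axis Laplace densities."""
    return sum(math.log(2.0 * b_k) + abs(y_k - mu_k) / b_k for y_k, mu_k, b_k in zip(y, mu, b))


def mixture_nll(y, mus, bs, logits):
    """Negative log-likelihood of y under a Laplace mixture with softmax coefficients."""
    pis = softmax(logits)
    likelihood = sum(pi * math.exp(-laplace_nll(y, mu, b)) for pi, mu, b in zip(pis, mus, bs))
    return -math.log(likelihood)


# Toy example with F = 2 components for one agent at one timestep.
y = (1.0, 0.0)
mus = [(1.0, 0.0), (0.0, 1.0)]   # predicted positions per component
bs = [(0.5, 0.5), (0.5, 0.5)]    # predicted scales (uncertainty) per component
logits = [0.0, 0.0]              # equal mixture coefficients after softmax
nll = mixture_nll(y, mus, bs, logits)
```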
3.6. Training
To enhance the diversity of the multiple trajectory hypotheses, this study introduces a diversity loss function [31]. This loss term is optimized during training only for the best trajectory among the $F$ trajectories generated by the model. Specifically, the error between the predicted positions of the $F$ mixture components and the ground-truth positions of each agent at each timestep is first computed. Then, the errors for all future timesteps are summed to form an error matrix of dimension $[F, N]$, where $F$ represents the number of trajectories and $N$ represents the number of agents. Based on this matrix, the optimal trajectory for each agent (i.e., the trajectory with the minimum error) is determined by selecting the minimum value for each column. Finally, the loss function is composed of the regression loss $\mathcal{L}_{reg}$ and the classification loss $\mathcal{L}_{cls}$, which are summed with equal weighting. The regression loss follows the formulation in [30] and is integrated into the total loss function as $\mathcal{L} = \mathcal{L}_{reg} + \mathcal{L}_{cls}$.
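The best-trajectory selection described above can be sketched as follows. The sketch assumes the per-timestep error is the Euclidean distance and uses hypothetical names (`error_matrix`, `best_modes`); it shows only the selection step, not the regression or classification losses themselves.

```python
import math


def error_matrix(pred, gt):
    """pred[f][n][t] and gt[n][t] are (x, y) points; returns an F x N matrix of summed L2 errors."""
    return [
        [sum(math.dist(p, g) for p, g in zip(pred_fn, gt_n)) for pred_fn, gt_n in zip(pred_f, gt)]
        for pred_f in pred
    ]


def best_modes(errors):
    """Column-wise argmin: index of the minimum-error trajectory for each agent."""
    num_modes, num_agents = len(errors), len(errors[0])
    return [min(range(num_modes), key=lambda f: errors[f][n]) for n in range(num_agents)]


# Toy example: F = 2 trajectories, N = 2 agents, H = 2 future timesteps.
gt = [[(1.0, 0.0), (2.0, 0.0)], [(0.0, 1.0), (0.0, 2.0)]]
pred = [
    [[(1.0, 0.0), (2.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]],  # mode 0 fits agent 0
    [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 1.0), (0.0, 2.0)]],  # mode 1 fits agent 1
]
errors = error_matrix(pred, gt)
selected = best_modes(errors)
```

Only the selected trajectory of each agent would then enter the regression term, while the classification term scores the mixture coefficients.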
4. Experiment
4.1. Experiment Setup
The performance of the prediction framework was evaluated on the large-scale Argoverse 1.1 motion prediction dataset. This dataset provides agent trajectories and high-definition map data, consisting of 323,557 real driving scenarios, which are split into training, validation, and test sets with sample sizes of 205,942, 39,472, and 78,143, respectively. The training and validation sets consist of 5 s sequences (with a sampling frequency of 10 Hz), while only the first 2 s of the trajectory data are publicly available for the test set. Based on these 2 s of initial observations, the Argoverse motion prediction challenge requires predicting the agent’s motion state for the subsequent 3 s.
Evaluation Metrics: The model’s performance is evaluated using standard motion prediction metrics, including:
Minimum Average Displacement Error (minADE): From the multiple trajectories predicted by the model, the trajectory with the smallest endpoint error is selected as the optimal prediction. The average 2D planar distance (in meters) between the positions of this trajectory and the ground truth trajectory at all future timesteps is then computed;
Minimum Final Displacement Error (minFDE): The Euclidean distance between the position of the optimal predicted trajectory at the final timestep and the ground truth endpoint;
Miss Rate (MR): The proportion of test scenes in which the distance between the ground truth trajectory endpoint and the optimal predicted endpoint exceeds 2.0 m.
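The three metrics defined above can be computed with a short sketch like the one below. It follows the definitions in the text (best trajectory chosen by endpoint error, 2.0 m miss threshold); the function names and toy scenes are illustrative assumptions.

```python
import math


def evaluate_scene(preds, gt, miss_threshold=2.0):
    """preds: candidate trajectories of (x, y) points; gt: ground-truth points."""
    # The optimal prediction is the candidate with the smallest endpoint error.
    best = min(preds, key=lambda traj: math.dist(traj[-1], gt[-1]))
    min_fde = math.dist(best[-1], gt[-1])
    min_ade = sum(math.dist(p, g) for p, g in zip(best, gt)) / len(gt)
    return min_ade, min_fde, min_fde > miss_threshold


def aggregate(scenes):
    """Average minADE and minFDE over scenes; MR is the fraction of missed scenes."""
    results = [evaluate_scene(preds, gt) for preds, gt in scenes]
    n = len(results)
    return (
        sum(r[0] for r in results) / n,
        sum(r[1] for r in results) / n,
        sum(r[2] for r in results) / n,
    )


# Two toy scenes sharing the same ground truth.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
preds_a = [[(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)], [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]]
preds_b = [[(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]]
min_ade, min_fde, miss_rate = aggregate([(preds_a, gt), (preds_b, gt)])
```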
The model was trained for 64 epochs on a GeForce RTX 4070 Ti Super GPU using the AdamW optimizer [32]. Details of the experimental equipment are given in Table 1. The batch size was set to 32, the initial learning rate was set to
, and the weight decay and dropout rates were set to
and 0.1, respectively. The learning rate was decayed using a cosine annealing scheduler [33]. We did not use techniques such as ensemble methods or data augmentation. We conducted experiments using a model with 128 hidden units per layer and refer to it as PFR-HiVT. Throughout this paper, the baseline model is consistently defined as HiVT-128.
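For reference, the closed form commonly used for cosine annealing over a 64-epoch run is sketched below. The value of `base_lr` is a placeholder chosen for illustration and is not the setting used in the experiments; the minimum learning rate of zero is likewise an assumption.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Closed-form cosine annealing from base_lr at epoch 0 down to min_lr at total_epochs."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * epoch / total_epochs))


base_lr = 1e-3       # placeholder value, not the paper's learning rate
total_epochs = 64    # training length stated in the text
schedule = [cosine_annealed_lr(e, total_epochs, base_lr) for e in range(total_epochs + 1)]
```

The schedule starts at `base_lr`, passes through roughly half of it at the midpoint, and reaches `min_lr` at the final epoch.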
4.2. Ablation Study
To analyze the contribution of different components in the proposed framework, comprehensive ablation experiments are conducted on the Argoverse 1.1 validation set. The ablation study is designed at two levels: group-wise ablation and full ablation validation, aiming to evaluate both the stage-wise effectiveness and the collaborative behavior of the proposed modules.
4.2.1. Group-Wise Ablation
In the group-wise ablation setting, the enhancement components are organized according to their functional roles and insertion stages in the network. Specifically, two groups are considered: (1) the encoder-stage enhancement modules, including the Feature Enhancement Module (FEM) and the Attention Enhancement Module (AEM); (2) the PFR-Global Interactor, which augments the original Global Interactor with progressive feature refinement mechanisms. This grouping allows us to assess the independent impact of enhancements applied at different stages of the model.
The results of the group-wise ablation experiment are reported in Table 2. When only the encoder-stage enhancement modules (FEM + AEM) are enabled, the model achieves a minADE of 0.714. Compared with the full configuration, this corresponds to a performance degradation of approximately 1.5%, while still significantly outperforming the original HiVT baseline. This indicates that explicit feature enhancement at the encoder stage effectively improves the quality of local spatiotemporal representations and provides a stronger foundation for subsequent interaction modeling.
When only the PFR-Global Interactor is applied, the model achieves a minADE of 0.722, which is identical to the baseline performance. It is worth noting that PFR-GI alone does not improve minADE compared to the baseline, which is consistent with its design objective. Rather than introducing new predictive information, PFR-GI is designed to progressively refine and reweight interaction features that have already been enhanced by the encoder-side modules. Consequently, when applied in isolation, PFR-GI preserves baseline-level accuracy while providing marginal improvements in minFDE and MR by stabilizing long-horizon feature propagation. Its effectiveness is most apparent when jointly integrated with encoder-stage enhancement, where it complements earlier feature refinement and contributes to the overall performance gains of the full PFR-HiVT framework.
When both groups are jointly applied, i.e., under the full configuration of the proposed method, the model achieves the best overall performance, with a minADE of 0.703, a minFDE of 1.041, and an MR of 0.112. Compared with using only FEM + AEM or only the PFR-Global Interactor, the full configuration yields relative improvements of approximately 1.5% and 2.7% in minADE, respectively. These results clearly demonstrate that enhancement modules deployed at different stages of the network are complementary and can work synergistically, progressively improving feature quality from local encoding to global interaction modeling.
4.2.2. Full Ablation Validation
To further analyze the contribution of each component in the proposed framework, a full ablation study is conducted on the Argoverse 1.1 validation set. In this experiment, the complete configuration with FEM, AEM, and the PFR-Global Interactor (PFR-GI) enabled is treated as the reference setting. The performance impact is then evaluated by selectively disabling individual components, including FEM, AEM, and the internal sub-modules of PFR-GI. The results are summarized in Table 3.
Effectiveness of PFR-Global Interactor as a whole. When only the PFR-GI is enabled without encoder-stage enhancement (FEM and AEM), the model achieves a minADE of 0.722, which is comparable to the baseline performance. This indicates that PFR-GI alone is not sufficient to significantly improve prediction accuracy without high-quality input features. In contrast, when FEM and AEM are jointly enabled together with PFR-GI, the model reaches the best performance (minADE 0.703, minFDE 1.041, MR 0.112), demonstrating that PFR-GI is most effective when operating on enhanced feature representations.
Contribution of encoder-stage enhancement modules. Enabling FEM or AEM individually leads to noticeable improvements over the baseline, with minADE reduced to 0.719 and 0.716, respectively. When both FEM and AEM are activated simultaneously, the performance further improves to a minADE of 0.714, indicating that these two modules provide complementary benefits at the encoder stage. This confirms that explicit feature enhancement and attention-based refinement before global interaction are crucial for improving feature quality.
Ablation of internal components within PFR-GI. To investigate the internal structure of the PFR-Global Interactor, its sub-modules are further removed one at a time while keeping FEM and AEM enabled. Removing the Simple Feature Refinement (SFR) module results in a slight performance degradation (minADE 0.714), suggesting that SFR provides additional but relatively moderate refinement benefits. Removing the Lightweight Gate (LG) module leads to a more pronounced performance drop (minADE 0.728, MR 0.117), indicating that feature selection and noise suppression play an important role in maintaining prediction accuracy. The most significant degradation is observed when removing the Residual Connection (RC) module, where minADE increases to 0.739 and MR rises to 0.116. This highlights the critical role of residual connections in stabilizing feature propagation and preserving discriminative information within PFR-GI.
Overall, the full ablation study demonstrates that FEM and AEM are essential for producing high-quality input features, while the PFR-Global Interactor effectively exploits these features through progressive refinement, gating, and residual learning. The synergy between encoder-stage enhancement and the structured design of PFR-GI is key to achieving the best overall performance.
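To make the ablation comparison easier to scan, the following minimal sketch tabulates how each configuration differs from the baseline. The metric values are copied from Table 3; the configuration labels and the helper name `deltas_vs` are illustrative choices, not part of the original work.

```python
# Validation-set results copied from Table 3: (minADE, minFDE, MR).
# The dictionary keys are illustrative labels for the table rows.
ABLATION = {
    "baseline (HiVT-128)": (0.722, 1.067, 0.113),
    "FEM only": (0.719, 1.071, 0.115),
    "AEM only": (0.716, 1.062, 0.112),
    "PFR-GI only": (0.722, 1.064, 0.112),
    "FEM + AEM": (0.714, 1.074, 0.115),
    "w/o SFR": (0.714, 1.043, 0.111),
    "w/o LG": (0.729, 1.091, 0.117),
    "w/o RC": (0.739, 1.079, 0.116),
    "full PFR-HiVT": (0.703, 1.041, 0.112),
}
METRICS = ("minADE", "minFDE", "MR")


def deltas_vs(reference: str) -> dict:
    """Return metric differences (config - reference); negative means better."""
    ref = ABLATION[reference]
    return {
        name: tuple(round(value - base, 3) for value, base in zip(values, ref))
        for name, values in ABLATION.items()
    }


if __name__ == "__main__":
    for name, delta in deltas_vs("baseline (HiVT-128)").items():
        print(f"{name:20s}", dict(zip(METRICS, delta)))
```

Passing `"full PFR-HiVT"` as the reference instead shows the cost of removing each sub-module relative to the complete model.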
4.3. Quantitative Results
4.3.1. Comparison with Baseline
As shown in Table 4, our improved model outperforms the baseline model in all metrics: minADE decreases from 0.722 to 0.703 (a 2.7% improvement), minFDE decreases from 1.067 to 1.041 (a 2.5% improvement), and MR decreases from 0.113 to 0.112. It is noteworthy that these improvements were achieved with an increase of only 230.5 k parameters, demonstrating the efficiency of our lightweight design.
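The relative improvements can be reproduced from the tabulated values with a short calculation. The sketch below uses the three-decimal numbers from Table 4; the function name `relative_reduction` is an illustrative choice. Computed from these rounded values, the reductions come out at roughly 2.6%, 2.4%, and 0.9%, so small differences from the percentages quoted in the text may stem from rounding of the underlying metrics (an assumption, since the unrounded values are not reported).

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Rounded validation-set values from Table 4: (HiVT-128, PFR-HiVT).
TABLE_4 = {
    "minADE": (0.722, 0.703),
    "minFDE": (1.067, 1.041),
    "MR": (0.113, 0.112),
}

if __name__ == "__main__":
    for metric, (base, ours) in TABLE_4.items():
        print(f"{metric}: {relative_reduction(base, ours):.2f}% lower")
```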
Figure 4 presents a logarithmic scale comparison of the parameter counts for all modules to clearly illustrate the magnitude differences. The number of parameters in the five newly added modules (FEM, AEM, SFR, LG, RC) is significantly smaller than that of the baseline modules. The total number of parameters in the new modules (230.5 k) accounts for only 8.7% of the total parameters of HiVT-128 (2.63 M), reflecting the lightweight nature of our design. The smallest module (LG) has only 16.5 k parameters, while the largest new module (AEM) has only 115 k parameters, both of which are far fewer than the baseline modules, such as the local encoder (1.4 M) and Global Interactor (1.1 M).
Several activation statistics are computed to characterize the behavior of different modules. Specifically, the L2 norm mean, standard deviation, and maximum are computed over the feature dimension to quantify the magnitude and dispersion of feature representations. The activation mean, standard deviation, and maximum are computed over all elements of the output feature tensor to reflect the overall response intensity and variability of each module. These metrics jointly provide a quantitative measure of feature strength, distribution spread, and selective amplification behavior across different stages of the network.
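As a minimal sketch of how such statistics could be computed, the NumPy function below summarizes one module output. It assumes the output is a two-dimensional array of shape (number of agents, feature dimension), uses the population standard deviation, and takes raw rather than absolute activations; none of these details is specified in the text, and the function name is hypothetical.

```python
import numpy as np


def activation_statistics(features: np.ndarray) -> dict:
    """Summarize a module output of shape (num_agents, feature_dim)."""
    # L2 norm of each agent's feature vector, taken over the feature dimension.
    norms = np.linalg.norm(features, axis=-1)
    return {
        "l2_norm_mean": float(norms.mean()),
        "l2_norm_std": float(norms.std()),
        "l2_norm_max": float(norms.max()),
        # Element-wise statistics over the whole output tensor.
        "act_mean": float(features.mean()),
        "act_std": float(features.std()),
        "act_max": float(features.max()),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dummy_output = rng.normal(size=(32, 128))  # synthetic stand-in for a module output
    for key, value in activation_statistics(dummy_output).items():
        print(f"{key}: {value:.4f}")
```

Because the norm statistics aggregate per-vector magnitudes while the activation statistics aggregate individual elements, a module can raise the activation maximum without a matching rise in the mean L2 norm.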
Figure 5, Figure 6 and Figure 7 visualize these activation statistics for different modules in the proposed architecture. Based on these metrics, Figure 5 and Figure 6 show that the PFR-Global Interactor exhibits higher activation mean and maximum values compared to other modules. Rather than indicating indiscriminate feature amplification, this behavior reflects the selective enhancement of informative interaction features introduced by the progressive refinement, gating, and residual weighting mechanisms. Notably, while the activation mean and maximum are significantly increased, the L2 norm mean remains comparable to earlier modules, suggesting that PFR-GI does not uniformly magnify all features but instead amplifies a subset of salient feature dimensions. The increased activation standard deviation further indicates a more discriminative feature distribution, where important interaction cues are emphasized while less informative components are suppressed. This selective amplification is consistent with the design of the Lightweight Gate and Residual Connection modules, which adaptively regulate feature propagation. These results suggest that PFR-GI refines and re-weights globally interacted features into a more expressive and discriminative representation, which is complementary to the encoder-stage enhancement and contributes to the overall performance gain when integrated into the full PFR-HiVT framework.
4.3.2. Computation Cost Analysis
To evaluate the computational overhead introduced by the proposed refinement modules, we compare the parameter count and inference time of PFR-HiVT against the HiVT-128 baseline. All measurements are conducted on a single NVIDIA GeForce RTX 4070 Ti Super GPU using identical batch size (64) and input settings. Only the forward pass is measured, excluding loss computation and metric updates.
As shown in Table 5, PFR-HiVT introduces an additional 230.5 k parameters, corresponding to an increase of approximately 8.7% over the baseline. In terms of inference time, PFR-HiVT incurs only a marginal overhead, with an average forward time of 46.612 ms per batch compared to 46.262 ms for HiVT-128, resulting in a relative increase of approximately 0.76%. These results indicate that the proposed progressive feature refinement modules add negligible runtime cost while preserving near-real-time feasibility.
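The measurement protocol described above can be sketched with a small framework-agnostic timing harness. This is an illustration under stated assumptions rather than the authors' benchmarking script: `forward` stands for any callable that runs the model on one batch, the warm-up count is an arbitrary choice, and on a GPU a device synchronization before reading the clock would be needed for the timing to include asynchronous kernels.

```python
import time
from typing import Callable, Iterable


def mean_forward_time_ms(forward: Callable[[object], object],
                         batches: Iterable[object],
                         warmup: int = 5) -> float:
    """Average wall-clock time per batch in milliseconds, forward pass only."""
    batch_list = list(batches)
    for batch in batch_list[:warmup]:  # warm-up runs are not timed
        forward(batch)
    timed = batch_list[warmup:]
    if not timed:
        raise ValueError("need more batches than warm-up iterations")
    start = time.perf_counter()
    for batch in timed:
        forward(batch)  # on a GPU, synchronize the device after this call
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(timed)


def relative_overhead(baseline: float, candidate: float) -> float:
    """Percentage increase of `candidate` over `baseline`."""
    return 100.0 * (candidate - baseline) / baseline


if __name__ == "__main__":
    # Values taken from Table 5 and the reported 230.5 k added parameters.
    print(f"time overhead:  {relative_overhead(46.262, 46.612):.2f}%")
    print(f"param overhead: {relative_overhead(2.63e6, 2.63e6 + 230.5e3):.2f}%")
```

Excluding loss computation and metric updates, as the text specifies, corresponds to timing only the `forward` call inside the loop.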
4.3.3. Comparison with SOTA
As shown in Table 6, on the Argoverse 1.1 test set PFR-HiVT achieves the lowest minADE (0.761) and minFDE (1.131) among the listed methods, while its MR (0.125) is higher than that of several baselines. The improvement in minADE indicates more accurate primary trajectory estimation, reflecting the effectiveness of the proposed feature enhancement and refinement strategy.
Compared with Wayformer [34], PFR-HiVT attains a marginally lower minADE (0.761 vs. 0.767) and, according to Table 6, a lower minFDE (1.131 vs. 1.162), while Wayformer achieves a lower MR (0.118 vs. 0.125). This suggests that PFR-HiVT focuses more on improving the accuracy of the dominant motion mode, whereas Wayformer exhibits stronger multimodal coverage.
DCMS [35] achieves a lower miss rate than PFR-HiVT (0.109 vs. 0.125), indicating stronger coverage of uncertain or long-term motion patterns. PFR-HiVT, in turn, achieves slightly lower minADE (0.761 vs. 0.766) and minFDE (1.131 vs. 1.135), indicating more accurate estimation of the primary trajectory. The comparison therefore highlights a trade-off: DCMS prioritizes covering diverse, higher-uncertainty motion patterns, whereas PFR-HiVT prioritizes dominant-mode accuracy over long-tail coverage.
GANet [36] exhibits a similar trend, achieving a lower MR (0.118) than PFR-HiVT but higher minADE (0.806) and minFDE (1.161), further highlighting this accuracy–coverage trade-off among lightweight multimodal predictors.
TPCN++ [37] reports slightly higher minADE and minFDE (0.779 and 1.167) but a lower MR (0.116), whereas Scene Transformer [20] reports higher minADE and minFDE (0.802 and 1.232) with the same MR (0.125). The TPCN++ result further illustrates the trade-off between improving dominant trajectory accuracy and covering diverse motion hypotheses.
Overall, PFR-HiVT demonstrates a modest improvement in minADE and competitive performance in minFDE, while maintaining a balanced trade-off between trajectory accuracy and multimodal coverage. Its relatively higher MR compared to some baselines reflects the inherent challenge of simultaneously improving primary-mode accuracy and long-tail uncertainty modeling in lightweight architectures.
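The accuracy versus coverage trade-off discussed above follows from how the three metrics are defined over a set of K candidate trajectories. The sketch below shows one common way to compute them for a single agent. It assumes that the best mode is selected by endpoint error and that a miss is an endpoint error above 2.0 m, which is the usual Argoverse 1 convention; the source does not spell out either detail, and some implementations select the best mode by average error instead.

```python
import numpy as np


def best_of_k_metrics(pred: np.ndarray, gt: np.ndarray,
                      miss_threshold: float = 2.0) -> dict:
    """Best-of-K metrics for one agent.

    pred: (K, T, 2) candidate trajectories; gt: (T, 2) ground truth.
    """
    dist = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) per-step errors
    fde = dist[:, -1]                                # endpoint error of each mode
    best = int(np.argmin(fde))                       # mode with the closest endpoint
    return {
        "minADE": float(dist[best].mean()),
        "minFDE": float(fde[best]),
        "missed": bool(fde[best] > miss_threshold),
    }


def miss_rate(per_agent: list) -> float:
    """Fraction of agents whose best endpoint error exceeds the threshold."""
    return sum(m["missed"] for m in per_agent) / len(per_agent)


if __name__ == "__main__":
    gt = np.zeros((30, 2))
    pred = np.zeros((6, 30, 2))
    pred[:, :, 0] = np.linspace(0.5, 3.0, 6)[:, None]  # six synthetic lateral offsets
    print(best_of_k_metrics(pred, gt))
```

Under these definitions, a model can lower minADE and minFDE by sharpening its best mode while MR stays flat, because MR only changes when an endpoint crosses the miss threshold.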
4.3.4. Qualitative Results
Figure 8 presents the qualitative results of our model on the Argoverse 1.1 validation set. These figures illustrate the predicted trajectories in various traffic scenarios, demonstrating the effectiveness of our approach. The light blue curves represent lane centerlines, the blue circle marks the trajectory start point, the green curve is the ground truth (real trajectory), the red curve is the predicted trajectory, and the green and red stars indicate the ground truth endpoint (GT End) and predicted endpoint (Pred End), respectively. The visualizations cover diverse scenarios, including lane intersections, curved lanes, and multiple parallel lanes.
To provide a more concrete interpretation of the qualitative results, several representative scenarios in Figure 8a–f are analyzed.
In Figure 8a, the predicted trajectory closely follows the ground-truth path when executing a left-turn maneuver at a multi-branch intersection. The prediction preserves the correct curvature trend and remains within the target lane geometry, avoiding incorrect straight or downward branches. Despite a small endpoint deviation, the predicted trajectory remains on the correct road segment, indicating stable intention recognition.
Figure 8b illustrates a near-straight driving scenario with subtle lane curvature. The predicted trajectory remains aligned with the centerline and closely overlaps with the ground truth over the entire horizon, demonstrating robust short- and mid-term prediction stability in low-curvature environments.
Figure 8c,e,f present left-turn scenarios in complex intersections with multiple feasible outgoing branches. In all three cases, the model correctly selects the left-turn branch consistent with the ground truth and avoids drifting toward competing straight or incorrect side branches. Although moderate endpoint deviations are observed, the predicted trajectories remain within the geometric bounds of the correct target lanes. This suggests that the model captures turning intent and maintains consistent branch selection in multi-modal road layouts.
Figure 8d depicts an intersection scenario: the predicted trajectory closely follows the global direction and curvature of the ground-truth path and remains aligned with the correct lane without drifting toward adjacent crossing branches.
Overall, across diverse intersection geometries and maneuver types, the model demonstrates stable intention recognition and branch-consistent trajectory generation. While endpoint deviations are occasionally present, they primarily manifest as amplitude errors rather than structural failures such as incorrect branch selection or opposite-direction turning. These qualitative results indicate that the proposed framework produces geometrically consistent and lane-aligned predictions in complex road topologies.
Figure 9, Figure 10 and Figure 11 illustrate the temporal evolution of displacement error for the proposed method in comparison with the HiVT-128 baseline on three representative validation sequences. The displacement error (in meters) is plotted as a function of the prediction time step, and each figure reports the error progression over 30 future steps together with the corresponding mean displacement error for both methods.
Figure 9 shows a scenario in which both methods exhibit increasing error as the prediction horizon extends. The proposed method maintains a lower error trajectory than the baseline over most time steps, resulting in a lower mean displacement error (0.896 m vs. 1.203 m). This example highlights a case where progressive feature refinement leads to a slower accumulation of prediction error over time.
Figure 10 depicts a more challenging scenario characterized by larger displacement errors for both methods. While both curves show a rising trend, the proposed method yields a lower mean error (1.408 m vs. 2.096 m) and a comparatively smoother growth pattern, indicating improved temporal stability under more uncertain motion dynamics.
Figure 11 presents a constrained setting with relatively small overall errors. The proposed method consistently produces lower displacement errors across all time steps, achieving a mean error of 0.433 m compared to 0.924 m for the baseline.
These examples are intended to complement the quantitative evaluation by visualizing how prediction errors evolve over time on individual sequences, rather than to demonstrate overall performance superiority. They provide qualitative insight into the dynamic behavior of error accumulation and long-term drift under different motion patterns. Due to space constraints, only three representative sequences from the validation set are visualized.
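The error curves in these figures can be understood as the per-step Euclidean distance between a predicted trajectory and the ground truth, with the reported mean being the average over the 30 steps. The sketch below illustrates this on synthetic data; it assumes a single predicted trajectory per sequence (which mode is plotted is not stated in the text), and the function name is illustrative.

```python
import numpy as np


def displacement_error_curve(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Per-step Euclidean error in meters; pred and gt have shape (T, 2)."""
    return np.linalg.norm(pred - gt, axis=-1)


if __name__ == "__main__":
    steps = np.arange(1, 31, dtype=float)               # 30 future steps
    gt = np.stack([steps, np.zeros(30)], axis=-1)       # synthetic straight path
    drift = np.stack([np.zeros(30), 0.02 * steps], axis=-1)
    pred = gt + drift                                   # synthetic lateral drift
    curve = displacement_error_curve(pred, gt)
    print("final-step error (m):", curve[-1])
    print("mean displacement error (m):", curve.mean())
```

Plotting `curve` against the step index for two models on the same sequence yields the kind of comparison shown in the figures.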
5. Conclusions
In this paper, PFR-HiVT is proposed as a lightweight and effective multi-agent trajectory prediction framework that enhances the feature extraction and feature transition processes of the HiVT model through progressive feature refinement. Instead of modifying the original hierarchical backbone, a set of carefully designed enhancement components is introduced, including the Feature Enhancement Module (FEM), the Attention Enhancement Module (AEM), and a unified PFR-Global Interactor, which integrates feature refinement, gating, and residual learning mechanisms. This design enables systematic improvement of feature representation quality and information flow across different stages of the network while preserving the efficiency advantages of the original HiVT architecture.
All enhancement modules are lightweight and introduce only 230.5 k additional parameters, accounting for approximately 8.7% of the total parameters of HiVT-128, making the proposed method suitable for resource-constrained autonomous driving scenarios. Experimental results on the Argoverse 1.1 validation set demonstrate that PFR-HiVT achieves a minADE of 0.703, a minFDE of 1.041, and an MR of 0.112, corresponding to improvements of 2.7%, 2.5%, and 1.0%, respectively, over the HiVT-128 baseline. These gains indicate that progressive feature refinement can effectively enhance both short-term accuracy and long-horizon trajectory stability without relying on increased model complexity. These percentage improvements are computed with respect to the HiVT-128 baseline on the validation set; the corresponding test-set results are reported separately in Table 6.
The ablation studies analyze the impact of individual modules and their combinations in the proposed framework. Results indicate that the encoder-stage enhancement modules (FEM and AEM) improve local spatiotemporal feature representation, whereas the PFR-Global Interactor contributes to more stable feature transitions before the decoding stage. In addition, the full configuration achieves the lowest minADE and minFDE among the evaluated variants, indicating that the integration of multiple modules provides complementary benefits compared to using each module in isolation.
Despite the promising results, several limitations of the proposed PFR-HiVT framework should be acknowledged.
First, the proposed progressive feature refinement strategy is instantiated and evaluated on the HiVT backbone. Although HiVT is a representative and competitive architecture, the effectiveness of the refinement paradigm on other trajectory prediction backbones has not been empirically validated in this work.
Second, the refinement design introduces a trade-off between dominant-mode accuracy and multimodal coverage. While PFR-HiVT consistently improves minADE and minFDE, the miss rate is not always superior to all baseline methods that explicitly emphasize uncertainty coverage. This reflects an inherent balance between refining primary trajectory estimates and preserving diverse motion hypotheses.
Third, qualitative analysis indicates that endpoint deviations may still occur in complex multi-branch intersections and long-term predictions. Although progressive refinement enhances feature quality and interaction modeling, it does not fully resolve long-term uncertainty in highly multimodal traffic scenarios.
Based on the experimental findings, improving generalization across data splits emerges as the most important direction for future work. The observed performance gap between the validation and test results on Argoverse 1.1 indicates that the current refinement strategy may still be sensitive to distributional shifts.
A secondary direction is to further reduce endpoint deviations in long-term predictions, particularly in complex multi-branch intersections, as revealed by the qualitative analysis. This suggests that incorporating more adaptive or context-aware refinement mechanisms may help better handle long-term uncertainty.
Finally, extending the proposed refinement framework to other trajectory prediction backbones remains a longer-term direction. While the current study focuses on HiVT, exploring broader applicability is left for future investigations.
Author Contributions
Conceptualization, Y.B. and Z.L.; methodology, Z.L.; software, Z.L.; validation, Z.L., Y.G. and Y.S.; formal analysis, Z.L.; investigation, Z.L.; resources, Z.L.; data curation, Z.L.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L. and Y.G.; visualization, Y.S.; supervision, Y.B.; project administration, Y.B.; funding acquisition, Y.B. and Z.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Inner Mongolia Education Department Project (Grant No. JY20230118), the Key R&D and Achievement Transformation Program in the Public Welfare Field during the 14th Five-Year Plan of Inner Mongolia (Grant No. 2023YFSH0003), the Scientific Research Project of Colleges and Universities in Inner Mongolia Autonomous Region (Grant No. NJZY22387), and the Scientific Research Project of Inner Mongolia University of Technology (Grant No. ZY202018). The APC was funded by these projects.
Data Availability Statement
All data generated and analyzed during this study are included in this article. For further details, please contact the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AEM | Attention Enhancement Module |
| CNN | Convolutional Neural Network |
| FEM | Feature Enhancement Module |
| GNN | Graph Neural Network |
| GT End | Ground Truth Endpoint |
| LG | Lightweight Gate Module |
| LSTM | Long Short-Term Memory |
| minADE | Minimum Average Displacement Error |
| minFDE | Minimum Final Displacement Error |
| MLP | Multilayer Perceptron |
| MR | Miss Rate |
| PFR-GI | Progressive Feature Refinement Global Interactor |
| Pred End | Prediction Endpoint |
| RC | Residual Connection Module |
| RNN | Recurrent Neural Network |
| SFR | Simple Feature Refinement Module |
References
- Argoverse Motion Forecasting Competition. Available online: https://eval.ai/web/challenges/challenge-page/454/leaderboard/1279 (accessed on 25 December 2025).
- Chang, M.-F.; Lambert, J.; Sangkloy, P.; Singh, J.; Bak, S.; Hartnett, A.; Wang, D.; Carr, P.; Lucey, S.; Ramanan, D.; et al. Argoverse: 3D Tracking and Forecasting with Rich Maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2019.
- Sun, N.; Xu, N.; Guo, K.; Han, Y.; Wang, L. Research on vehicle trajectory fusion prediction based on physical model and driving intention recognition. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2023, 239, 239–254.
- Schöller, C.; Aravantinos, V.; Lay, F.; Knoll, A. What the constant velocity model can teach us about pedestrian motion prediction. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV); IEEE: New York, NY, USA, 2018; pp. 1–6.
- Trautman, P.; Krause, A. Unfreezing the robot: Navigation in dense, interacting crowds. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2010; pp. 797–803.
- Schneider, N.; Gavrila, D.M. Pedestrian path prediction with recursive Bayesian filters. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC); IEEE: New York, NY, USA, 2017; pp. 1–6.
- Bennewitz, M.; Burgard, W.; Cielniak, G.; Thrun, S. Learning motion patterns of people for compliant robot motion. Int. J. Robot. Res. 2005, 24, 31–48.
- Cai, G.; Liu, L.; Zhang, C.; Zhou, Y. Algorithm for prediction of the 6G vehicle trajectory based on the GNN-LSTM-CNN network. J. Xidian Univ. 2023, 50, 50–60.
- Deo, N.; Trivedi, M.M. Convolutional social pooling for vehicle trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: New York, NY, USA, 2018; pp. 1549–1557.
- Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 64–67.
- Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social LSTM: Human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2016.
- Gu, X.; Wang, J.; Cheng, D.; Li, C.; Huang, Q. Efficient multi-target vehicle trajectory prediction based on multi-scale graph convolution. Pattern Anal. Appl. 2025, 28, 12.
- Liang, M.; Yang, B.; Hu, R.; Chen, Y.; Liao, R.; Feng, S.; Urtasun, R. Learning lane graph representations for motion forecasting. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2020; pp. 1–8.
- Mohamed, A.; Qian, K.; Elhoseiny, M.; Claudel, C. Social-STGCNN: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020.
- Huang, Y.; Bi, H.; Li, Z.; Mao, T.; Wang, Z. STGAT: Modeling spatial-temporal interactions for human trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2019.
- Casas, S.; Gulino, C.; Liao, R.; Urtasun, R. SPAGNN: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2020.
- Gao, J.; Sun, C.; Zhao, H.; Shen, Y.; Anguelov, D.; Li, C.; Schmid, C. VectorNet: Encoding HD maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2020; pp. 11525–11533.
- Liu, Y.; Zhang, J.; Fang, L.; Jiang, Q.; Zhou, B. Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2021.
- Mercat, J.; Gilles, T.; El Zoghby, N.; Sandou, G.; Beauvois, D.; Pita Gil, G. Multi-head attention for multi-modal joint vehicle motion forecasting. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2020.
- Ngiam, J.; Caine, B.; Vasudevan, V.; Zhang, Z.; Chiang, H.-T.L.; Ling, J.; Roelofs, R.; Bewley, A.; Liu, C.; Venugopal, A.; et al. Scene transformer: A unified architecture for predicting multiple agent trajectories. In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022.
- Ye, L.; Wang, Z.; Chen, X.; Wang, J.; Wu, K.; Lu, K. GSAN: Graph self-attention network for learning spatial-temporal interaction representation in autonomous driving. IEEE Internet Things J. 2021, 9, 9190–9204.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS); Curran Associates: New York, NY, USA, 2017.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021.
- Giuliari, F.; Hasan, I.; Cristani, M.; Galasso, F. Transformer networks for trajectory forecasting. In Proceedings of the 25th International Conference on Pattern Recognition (ICPR); IEEE: New York, NY, USA, 2020.
- Yu, C.; Ma, X.; Ren, J.; Zhao, H.; Yi, S. Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2020.
- Yuan, Y.; Weng, X.; Ou, Y.; Kitani, K. AgentFormer: Agent-aware transformers for socio-temporal multi-agent forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021.
- Zhou, Z.; Wang, J.; Li, Y.; Huang, Y. Query-Centric Trajectory Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 17863–17873.
- Shi, S.; Jiang, L.; Dai, D.; Schiele, B. MTR++: Multi-Agent Motion Prediction with Symmetric Scene Modeling and Guided Intention Querying. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3955–3971.
- Zhou, Z.; Ye, L.; Wang, J.; Wu, K.; Lu, K. HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: New York, NY, USA, 2022; pp. 8813–8823.
- Thiede, L.A.; Brahma, P.P. Analyzing the variety loss in the context of probabilistic trajectory prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2019.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
- Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
- Nayakanti, N.; Al-Rfou, R.; Zhou, A.; Goel, K.; Refaat, K.S.; Sapp, B. Wayformer: Motion forecasting via simple & efficient attention networks. arXiv 2022, arXiv:2207.05844.
- Ye, M.; Xu, J.; Xu, X.; Wang, T.; Cao, T.; Chen, Q. Bootstrap motion forecasting with self-consistent constraints. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 8470–8480.
- Wang, M.; Zhu, X.; Yu, C.; Li, W.; Ma, Y.; Jin, R.; Ren, X.; Ren, D.; Wang, M.; Yang, W. GANet: Goal Area Network for motion forecasting. arXiv 2022, arXiv:2209.09723.
- Ye, M.; Cao, T.; Chen, Q. TPCN: Temporal point cloud networks for motion forecasting. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 11313–11322.
- Wang, J.; Messaoud, K.; Liu, Y.; Gall, J.; Alahi, A. Forecast-PEFT: Parameter-efficient fine-tuning for pre-trained motion forecasting models. arXiv 2024, arXiv:2407.19564.
Figure 1.
Overall framework.
Figure 2.
The structures of FEM and AEM.
Figure 3.
The structure of the Progressive Feature Refinement Global Interactor.
Figure 4.
This figure shows the comparison of the number of parameters for the baseline model modules and the newly added modules on a logarithmic scale.
Figure 5.
Module Activation Statistics (Raw Values) across different modules. The heatmap shows the raw activation values for each metric (L2 norm, activation mean, etc.) within the modules.
Figure 6.
Normalized Module Activation Statistics. This heatmap presents the activation values normalized for each metric, allowing a direct comparison of module behavior across various performance indicators.
Figure 7.
Comparison of Module Activation Mean and Norm Mean. This figure compares the mean activation values and L2 norm statistics for each module, highlighting differences in the activation dynamics.
Figure 8.
Qualitative results of our model on the Argoverse 1.1 validation set. The predicted trajectories in various scenarios are shown, with the ground truth and predicted paths visualized. (a,c,e,f) Scene of vehicles turning at intersections; (b,d) Scene of vehicles changing lanes near intersections.
Figure 9.
Error over time comparison between our method and the baseline. Our method shows a significant reduction in displacement error across time steps, with an average error of 0.896 m compared to the baseline’s 1.203 m.
Figure 10.
Displacement error progression over time for our method versus the baseline. Our method consistently outperforms the baseline, maintaining a lower error trajectory with an average error of 1.408 m compared to the baseline’s 2.096 m.
Figure 11.
Performance comparison of our method and the baseline over 30 time steps. Our method maintains a significantly lower displacement error (0.433 m) compared to the baseline (0.924 m).
Table 1.
Experimental Setup Details.
| Component | Specification |
|---|---|
| GPU | NVIDIA GeForce RTX 4070 Ti Super |
| CPU | Intel Core i5-14600KF |
| RAM | 32 GB DDR5 |
| Storage | 2 TB SSD |
| Operating System | Ubuntu 20.04 |
| CUDA Version | 12.8 |
| Deep Learning Framework | PyTorch 2.1.0 |
Table 2.
Group-wise Ablation Experiment conducted on the Argoverse 1.1 validation set.
| FEM + AEM | PFR-GI | minADE | minFDE | MR |
|---|---|---|---|---|
| | | 0.722 | 1.067 | 0.113 |
| ✓ | | 0.714 | 1.074 | 0.115 |
| | ✓ | 0.722 | 1.064 | 0.112 |
| ✓ | ✓ | 0.703 | 1.041 | 0.112 |
Table 3.
Full ablation validation results on the Argoverse 1.1 validation set. The first row reports the performance of the baseline HiVT model. w/o XXX denotes that the corresponding sub-module is removed from the PFR-Global Interactor.
| FEM | AEM | PFR-GI | w/o SFR | w/o LG | w/o RC | minADE | minFDE | MR |
|---|---|---|---|---|---|---|---|---|
| | | | | | | 0.722 | 1.067 | 0.113 |
| ✓ | | | | | | 0.719 | 1.071 | 0.115 |
| | ✓ | | | | | 0.716 | 1.062 | 0.112 |
| | | ✓ | | | | 0.722 | 1.064 | 0.112 |
| ✓ | ✓ | | | | | 0.714 | 1.074 | 0.115 |
| ✓ | ✓ | | ✓ | | | 0.714 | 1.043 | 0.111 |
| ✓ | ✓ | | | ✓ | | 0.729 | 1.091 | 0.117 |
| ✓ | ✓ | | | | ✓ | 0.739 | 1.079 | 0.116 |
| ✓ | ✓ | ✓ | | | | 0.703 | 1.041 | 0.112 |
Table 4.
Comparison with the Baseline Model on the Argoverse 1.1 Validation Set.
| Model | minADE | minFDE | MR |
|---|---|---|---|
| PFR-HiVT | 0.703 | 1.041 | 0.112 |
| HiVT-128 | 0.722 | 1.067 | 0.113 |
| HiVT-64 | 0.743 | 1.136 | 0.124 |
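The three columns above are the multimodal metrics named in the Introduction. As a rough illustration of how they are commonly computed for one agent with K candidate trajectories, the sketch below takes the minimum ADE and minimum FDE over the candidates and flags a miss when the best endpoint error exceeds 2.0 m, the threshold used by the Argoverse 1 benchmark. Two points are assumptions: the function names are hypothetical, and the minimum ADE is taken independently of the minimum FDE, whereas some evaluation code instead reports the ADE of the candidate with the smallest endpoint error. This is not the authors' evaluation code.

```python
import math

MISS_THRESHOLD_M = 2.0  # Argoverse 1 convention: endpoint error above 2.0 m is a miss


def _dist(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def scenario_metrics(modes, truth):
    """Return (minADE, minFDE, missed) for one agent.

    modes: list of K candidate trajectories, each a list of (x, y) points.
    truth: ground-truth trajectory with the same number of points.
    """
    ades = [sum(_dist(p, t) for p, t in zip(m, truth)) / len(truth) for m in modes]
    fdes = [_dist(m[-1], truth[-1]) for m in modes]
    return min(ades), min(fdes), min(fdes) > MISS_THRESHOLD_M


def miss_rate(missed_flags):
    """Fraction of agents whose best candidate still misses the endpoint."""
    return sum(missed_flags) / len(missed_flags)


# Hypothetical example with K = 2 candidates (not data from the paper).
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
modes = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 3.0)],  # accurate early, ends 3 m away
    [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)],  # ends 1 m away
]
min_ade, min_fde, missed = scenario_metrics(modes, truth)
```

Dataset-level values such as those in the table would then be averages of `min_ade` and `min_fde` over all evaluated agents, with `miss_rate` applied to the per-agent flags.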
Table 5.
Computation cost comparison between HiVT-128 and PFR-HiVT.
| Model | Params (M) | Inference Time (ms/Batch) | Inference Time Overhead |
|---|---|---|---|
| HiVT-128 | 2.63 | 46.262 | – |
| PFR-HiVT | 2.86 | 46.612 | +0.76% |
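The overhead figure can be checked directly from the values in the table. The short sketch below recomputes the relative change in inference time and, for comparison with the 10% threshold that the Introduction uses to define a significant parameter increase, the relative change in parameter count. The helper name `relative_change_pct` is hypothetical; the numbers are the ones reported in Table 5.

```python
def relative_change_pct(baseline, new):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Values from Table 5.
params_growth = relative_change_pct(2.63, 2.86)       # parameters, in millions
latency_growth = relative_change_pct(46.262, 46.612)  # inference time, ms per batch

# Threshold for a "significant" parameter increase, as defined in the Introduction.
SIGNIFICANT_PARAM_INCREASE_PCT = 10.0
within_budget = params_growth < SIGNIFICANT_PARAM_INCREASE_PCT
```

The inference-time change rounds to the 0.76% listed in the table, and the parameter count grows by roughly 8.7%, which stays below the 10% threshold.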
Table 6.
Quantitative results on the Argoverse 1 motion forecasting leaderboard.
| Model | minADE | minFDE | MR |
|---|---|---|---|
| HiVT-64 | 0.807 | 1.243 | 0.139 |
| HiVT-128 | 0.774 | 1.169 | 0.126 |
| Wayformer [34] | 0.767 | 1.162 | 0.118 |
| DCMS [35] | 0.766 | 1.135 | 0.109 |
| GANet [36] | 0.806 | 1.161 | 0.118 |
| TPCN++ [37] | 0.779 | 1.167 | 0.116 |
| Forecast-FT [38] | 0.842 | 1.271 | 0.138 |
| Scene-Transformer [20] | 0.802 | 1.232 | 0.125 |
| PFR-HiVT | 0.761 | 1.131 | 0.125 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.