A Proposal-Aware Proactive Encoding Framework for Trajectory Prediction in Autonomous Driving

Liu, Hongkun; Liu, Xuetao; Liu, Ziyi

doi:10.3390/electronics15112435

by Hongkun Liu ¹, Xuetao Liu ² and Ziyi Liu ¹,*

¹ School of Artificial Intelligence, University of Science and Technology Beijing, Beijing 100083, China

² School of Artificial Intelligence, Beijing University of Technology, Beijing 100124, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2435; https://doi.org/10.3390/electronics15112435

Submission received: 14 April 2026 / Revised: 25 May 2026 / Accepted: 27 May 2026 / Published: 2 June 2026

Abstract

Trajectory prediction plays a crucial role in autonomous driving by forecasting the future trajectories of agents to support safe and efficient decision-making. Most existing methods that adopt an encoder–decoder architecture have achieved remarkable success, where the scene encoder extracts contextual representations from agents’ history trajectories and lane segments. However, this architecture remains fundamentally constrained by the blind encoder. Specifically, the scene encoder of models extracts contextual information without foresight, leading to significant semantic pollution from proposal-irrelevant context, thereby degrading the prediction performance. To rectify this model deficiency, we propose ProFocus, a proactive encoding framework that reformulates the trajectory prediction model architecture via an anticipatory feedback loop. ProFocus generates the potential proposals in the nascent stage layers, utilizing them as attentional priors to dynamically modulate the scene encoding process. In addition, to optimize the information flow within the attention mechanism and reduce irrelevant context interference in attention distributions, we introduce spatio-temporal focal attention (STFA). By implementing a relation-conditioned sharpening operator through a spatio-temporal relation-controlled softmax, STFA adaptively recalibrates the attention distribution according to related dependencies. Comprehensive evaluations on the Argoverse 1 dataset and INTERACTION dataset validate that ProFocus attains competitive performance across miss rate (MR), minimum average displacement error (minADE) and minimum final displacement error (minFDE), while maintaining a real-time inference speed of 16 ms on an RTX 3090. The results from our ablation studies demonstrate that ProFocus reduces MR, minFDE, and minADE by 2.80%, 2.52%, and 1.41% relative to the baseline, respectively. Furthermore, qualitative visualizations also corroborate that ProFocus exhibits robust performance in diverse traffic scenarios.

Keywords:

trajectory prediction; motion prediction; motion forecasting; autonomous driving; deep learning

1. Introduction

In autonomous driving systems, predicting the future trajectories of agents is a foundational capability, serving as the linchpin for safety-critical decision-making and motion planning. The trajectory prediction module typically takes as input the historical trajectories of traffic participants together with structured lane segments, which are generated by upstream perception and mapping systems. The prevailing trajectory prediction paradigm fundamentally hinges on the dynamic interaction between query proposals and contextual representations. Therefore, the ability to extract proposal-relevant contextual features is paramount for ensuring the robustness of trajectory predictions.

In recent years, many trajectory prediction methods have followed the encoder–decoder architecture. VectorNet [1], LaneGCN [2], LaneRCNN [3], and Trajectron++ [4] encode agents’ history trajectories and lane segments via a hierarchical graph structure, establishing actor–map and actor–actor interactions. HiVT [5], Scene Transformer [6], Wayformer [7], and MTR [8] introduce multi-head attention into the scene encoder, enhancing the representation along both temporal and spatial dimensions. Methods like ForecastMAE [9] and DMotion [10] improve the model’s semantic representation and enhance discriminability through self-supervised learning. However, these paradigms are fundamentally constrained by a blind encoder mechanism, in which the scene encoder extracts contextual information without prospective foresight. Because the encoder relies solely on backpropagated losses to implicitly learn feature relevance, the contextual representation is prone to being semantically polluted by proposal-irrelevant scenario context.

The first row of Figure 1 shows the decoder attention distribution under a blind encoder without proposal-aware guidance. Here, the decoder attention tends to spread over context regions that are not well aligned with the target proposal, particularly in complex intersections, leading to diffuse and potentially misleading contextual aggregation. This dispersed encoding causes the model to be distracted by irrelevant contexts and even by regions that are inconsistent with traffic rules, thereby weakening its focus on proposal-relevant cues and degrading the accuracy of future trajectory prediction. To provide a quantitative analysis, we introduce a metric named spatial attention relevance (SAR). For each scene, we first rank all context tokens in descending order of attention weight and then retain the leading tokens until their cumulative attention mass reaches 50%. We then calculate the proportion of these high-attention tokens that fall within 5 m and 10 m of the corresponding proposals. Under the blind encoder, the SAR of agent attention within 5 m and 10 m is 0.2053 and 0.3523, and the SAR of lane attention within 5 m and 10 m is 0.6559 and 0.8087. These results indicate that agent attention remains relatively dispersed, while lane attention is more concentrated around the proposals but still leaves room for improvement. Although the recent methods PBP [11] and FINet [12] incorporate potential proposals into query proposal embedding initialization or decoder interaction modules, they fail to address the limitations of the scene encoder and overlook the critical necessity of proactive encoding, leaving the framework’s encoder layer isolated from anticipated future proposals.
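The SAR procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `spatial_attention_relevance`, the toy inputs, and the choice to measure a token's distance to a proposal as the distance to its nearest waypoint are all assumptions, since the text does not specify how the distance is computed.

```python
import math


def spatial_attention_relevance(weights, positions, proposal, radius, mass=0.5):
    """Fraction of high-attention tokens lying within `radius` metres of a proposal.

    weights   : attention weight of each context token (at least one, positive total)
    positions : (x, y) position of each context token
    proposal  : list of (x, y) waypoints of the proposal trajectory
    """
    total = sum(weights)
    # Rank tokens by attention weight, largest first.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)

    # Keep the leading tokens until their cumulative attention mass reaches `mass`.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= mass:
            break

    # Assumption: token-to-proposal distance is the distance to the nearest waypoint.
    def dist_to_proposal(p):
        return min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in proposal)

    near = sum(1 for i in kept if dist_to_proposal(positions[i]) <= radius)
    return near / len(kept)


# Toy scene: four context tokens and one straight proposal (hypothetical values).
weights = [0.4, 0.3, 0.2, 0.1]
positions = [(1.0, 0.0), (8.0, 7.0), (2.0, 1.0), (30.0, 5.0)]
proposal = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]

sar_5 = spatial_attention_relevance(weights, positions, proposal, radius=5.0)
sar_10 = spatial_attention_relevance(weights, positions, proposal, radius=10.0)
```

In this toy scene the two highest-weight tokens carry 70% of the attention mass, so only they are retained; one lies within 5 m of the proposal and both lie within 10 m.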

To bridge this gap, we propose ProFocus, a proposal-aware encoding framework that reformulates the trajectory prediction architecture via an anticipatory feedback loop. Distinct from previous methods, ProFocus generates potential proposals in the early layers, and we employ an attention mechanism to integrate these potential proposals into the scene encoder. Specifically, we treat potential proposals as the key and value, and history trajectories and lane segments as the query, allowing the attention mechanism to facilitate effective interaction between potential proposals and contextual representations, thereby fortifying the proposal-relevant contextual representations for accurate trajectory prediction. Furthermore, to mitigate irrelevant context in attention distributions, we propose spatio-temporal focal attention, which sharpens attention through a spatio-temporal relation-controlled softmax, thereby selectively amplifying the relevant representations. In addition, we also apply STFA to the decoder’s contextual interaction module, facilitating more reliable feature aggregation with spatio-temporal dependencies. The second row of Figure 1 shows the prediction and attention results obtained by ProFocus with proposal-aware encoding. ProFocus produces more accurate future trajectory predictions, and the decoder yields a more concentrated attention distribution around proposal-consistent regions, leading to cleaner contextual aggregation and reduced distraction from irrelevant scenario context. Moreover, we also evaluate the SAR of ProFocus within 5 m and 10 m of the proposals. The SAR values for agent attention are 0.2143 and 0.3651, respectively, while those for lane attention are 0.7717 and 0.9057. Compared with the baseline, both improve, with the larger gain on lane attention: agent attention remains relatively dispersed, whereas lane attention becomes more concentrated around proposal-relevant regions.

ProFocus employs an encoder–decoder architecture, which is divided into two main components: the potential proposals generator, which generates coarse future trajectory proposals, and the proposal-aware trajectory refiner, which refines these proposals through interaction with the contextual representation of proposal-aware scene encoding. ProFocus incorporates regression and classification objectives, where the regression loss penalizes geometric deviation between predicted and ground-truth trajectories, while the classification loss learns confidence scores to suppress implausible proposals. Furthermore, ProFocus achieves competitive performance on the Argoverse 1 dataset and INTERACTION dataset in critical metrics such as MR, minADE, and minFDE, and shows strong adaptability to diverse traffic scenarios. The main novelty of this work lies in the proposal-aware proactive encoding mechanism, which generates coarse potential proposals at an early stage and feeds them to guide the encoding of lane segments and agents’ historical trajectories, instead of encoding the scene in a blind manner. Moreover, the proposed spatio-temporal focal attention further improves context aggregation by adaptively sharpening attention distributions according to spatio-temporal relations. We highlight the following key contributions in our paper:

(1) ProFocus framework: We introduce ProFocus, a proposal-aware trajectory prediction framework that consists of a potential proposals generator and a proposal-aware trajectory refiner. The potential proposals generator generates potential trajectory proposals, while the proposal-aware trajectory refiner leverages these proposals to guide scene encoding. This design effectively strengthens the model’s ability to capture proposal-relevant spatio-temporal trajectories and lane segments, thereby facilitating the accuracy and robustness of trajectory prediction.

(2) Spatio-temporal focal attention: We propose STFA, which dynamically adjusts attention distributions via a spatio-temporal relation-controlled softmax, guiding the model to prioritize attention on the relevant contextual representations while mitigating noisy attention, thereby enhancing the accuracy of the trajectory prediction model.

(3) Superior performance: Our framework demonstrates competitive results on the Argoverse 1 and INTERACTION benchmarks across MR, minFDE, and minADE. The results from our ablation studies demonstrate that ProFocus reduces MR, minFDE, and minADE by 2.80%, 2.52%, and 1.41% relative to the baseline, respectively. Qualitative visualizations also demonstrate that ProFocus predicts accurate and reliable future trajectories in diverse driving scenarios.

2. Related Work

2.1. Trajectory Prediction

Early trajectory prediction methods [1,2,3,13,14,15] for autonomous driving leverage graph-based architectures to model interactions between history trajectories and lane segments. VectorNet [1] pioneers a hierarchical graph neural network that encodes individual lane segments and history trajectories as vectorized polylines, then aggregates them to capture high-order interactions among all scene components. Building upon structured map representations, LaneGCN [2] introduces a dedicated lane graph to preserve road topology and extends graph convolution with multiple adjacency matrices for long-range lane dependencies. LaneRCNN [3] advances this graph-centric paradigm by learning a local lane graph ROI for each agent’s history trajectory and lane segment, enabling efficient message passing among these per-actor subgraphs within a shared global graph. Recent trajectory prediction models have improved towards attention-based encoder–decoder architectures [6,8,16,17,18,19,20]. Encoders use multi-head self-attention to encode heterogeneous inputs of history trajectory and lane segment features over space and time dimensions, facilitating long-range dependencies to emerge. AgentFormer [16] proposes an agent-aware Transformer that simultaneously models the history trajectories and lane segments interactions of all agents by flattening them and applying self-attention across the combined sequence. Scene Transformer [6] takes a scene-centric approach with a unified attention network that combines features across road graph elements, inter-agent interactions, and time steps, producing consistent multi-agent future predictions in a single forward pass. CMTT [20] proposes a convolutional Transformer network for vehicle trajectory prediction in urban traffic scenarios, which combines convolutional modules for local motion-pattern extraction with Transformer-based modeling of long-range spatio-temporal dependencies. 
MTR [8] employs a Transformer encoder–decoder architecture with a set of learnable intention queries, each responsible for predicting a distinct modality of the agent’s future trajectories. QCNet [17] adopts a two-stage decoding approach, in which the first stage generates coarse trajectory proposals using anchor-free queries in a recurrent manner and then refines these trajectories with a second set of anchor-based queries. MTR and QCNet ignore the role of potential future proposals in guiding the encoding process. Although methods such as PBP [11] and FINet [12] integrate potential proposals into the query initialization and interaction module of the decoder, they still suffer from architectural myopia with the scene encoder.

ProFocus follows an attention-based encoder–decoder framework and regards potential trajectories as guidance for incorporating the contextual representation. Specifically, we introduce STFA into the encoder for proposal-aware trajectories and lane segments feature extraction, thereby better aligning with human anticipatory perception in real driving.

2.2. Attention Mechanism in Trajectory Prediction

In trajectory prediction, another important research direction is to design an adaptive attention mechanism that selectively aggregates informative context while suppressing distractors. Trajformer [21] proposes local self-attentive contexts for social interaction reasoning, improving the saliency of contextual features. LAformer [22] further reduces map-induced noise by estimating lane likelihoods densely over time and selecting only the top-k highly probable lane segments, thereby filtering out irrelevant lane tokens before attention-based decoding. MTR [8] advocates locality-preserving attention by restricting encoder self-attention to locally connected neighborhoods, improving efficiency and reducing spurious global interactions. Existing approaches typically rely on static neighborhoods or heuristic token selection, and thus fail to dynamically modulate the attention distribution according to spatio-temporal relations.

ProFocus addresses this challenge by introducing a spatio-temporal aware focal attention that adaptively sharpens attention distributions, amplifying proposal-induced focal context while suppressing irrelevant attention.

3. Method

We propose an encoder–decoder trajectory prediction framework that explicitly leverages potential future proposals to guide subsequent scene encoding. As shown in Figure 2, the overall architecture of ProFocus consists of two sequential components: a potential proposals generator and a proposal-aware trajectory refiner. The potential proposals generator produces a set of coarse potential future trajectories based exclusively on lane segments and agents’ history trajectories. The proposal-aware trajectory refiner encodes the scene conditioned on these potential future proposals and outputs refined multimodal trajectories with improved accuracy and interaction consistency. Moreover, to reduce the attention assigned to irrelevant context during interactions, we introduce spatio-temporal focal attention. Specifically, it adaptively sharpens the attention distribution based on spatio-temporal relations, allowing the model to concentrate on the relevant history trajectories and lane segments, and preventing performance degradation caused by attention distraction.

3.1. Spatio-Temporal Focal Attention

As shown in Figure 3, the proposed spatio-temporal focal attention serves as an alternative to the standard attention layer and reduces distraction from irrelevant contextual information in the attention probabilities. Given an input X, we project it to form the query representation Q, while a separate input Y is mapped to both the key representation K and the value representation V:

Q = X W^{Q}, K = Y W^{K}, V = Y W^{V},    (1)

where W^{Q}, W^{K}, W^{V} are the learnable parameters. To obtain an attention distribution according to the spatio-temporal relation, we introduce a spatio-temporal relation-controlled softmax:

STFA (Q_{i}, K_{j}, V_{j}) = softmax (\frac{Q_{i} K_{j}^{⊤}}{P (δ_{i, j}) \sqrt{d}}) V_{j},    (2)

where \sqrt{d} serves as a normalization term that aligns the scale of similarity scores across different embedding dimensions, preventing attention collapse. The control parameter P (δ_{i, j}) is adaptively determined according to the spatio-temporal relation δ_{i, j}, rather than being treated as a fixed hyperparameter. A multi-layer perceptron (MLP) is introduced to encode the relative spatio-temporal attributes between the query token and the corresponding key token. We map these attributes to a D-dimensional feature space and then project the features to the attention head dimension in order to adaptively derive the control parameter, ensuring that it dynamically matches the spatio-temporal attributes of the interaction between query and key:

\tilde{P} (δ_{i, j}) = MLP (MLP (δ_{i, j})),    (3)

P (δ_{i, j}) = clip (\tilde{P} (δ_{i, j}), γ_{\min}, γ_{\max}),    (4)

where γ_{\min} and γ_{\max} denote the lower and upper bounds of P (δ_{i, j}), controlling the minimum and maximum modulation strength. γ_{\min} prevents overly unstable scaling, while γ_{\max} avoids excessive smoothing. The spatio-temporal relationship δ_{i, j} between the key token K_{j} and the query token Q_{i} is encoded as

δ_{i, j} = [Δ p_{i j}, Δ θ_{i j}, Δ ϕ_{i j}, Δ t_{i j}],    (5)

where Δ p_{i j} denotes relative position, Δ θ_{i j} denotes relative orientation, Δ ϕ_{i j} denotes relative heading angle, and Δ t_{i j} denotes temporal displacement. Specifically, when P (δ_{i, j}) > 1, the relative differences between logits are reduced, leading to a flatter distribution that encourages exploration of global features. In contrast, when P (δ_{i, j}) < 1, the differences between logits are amplified, creating a peaked distribution where higher logits dominate the probability mass, enabling the model to focus on fine-grained features. This adaptive attention mechanism, distinct from standard attention layers in existing methods, enables a dynamic trade-off between global exploration and local exploitation.
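The effect of the control parameter on the attention distribution can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the per-pair values in `temps` stand in for the unclipped MLP output, which is not modeled here, and the bounds `gamma_min=0.5` and `gamma_max=2.0` are illustrative values rather than the paper's settings. The helper name `stfa_weights` is hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def stfa_weights(q, keys, temps, gamma_min=0.5, gamma_max=2.0):
    """Attention weights of one query over all keys with a per-pair control parameter.

    temps[j] plays the role of the unclipped control parameter for the pair (i, j).
    In the paper it is produced by an MLP over the relation delta_ij; here it is
    supplied directly (assumption). gamma_min and gamma_max are illustrative.
    """
    d = len(q)
    logits = []
    for k, t in zip(keys, temps):
        p = clip(t, gamma_min, gamma_max)          # Equation (4)
        score = sum(a * b for a, b in zip(q, k))   # dot product of query and key
        logits.append(score / (p * math.sqrt(d)))  # scaled logit of Equation (2)
    return softmax(logits)


q = [1.0, 0.0]
keys = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

standard = stfa_weights(q, keys, [1.0, 1.0, 1.0])  # P = 1: ordinary scaled dot-product
sharp = stfa_weights(q, keys, [0.5, 0.5, 0.5])     # P < 1: more peaked
flat = stfa_weights(q, keys, [2.0, 2.0, 2.0])      # P > 1: flatter
clipped = stfa_weights(q, keys, [0.01, 0.01, 0.01])  # clipped up to gamma_min
```

With these inputs the weight on the best-matching key is largest for `sharp`, intermediate for `standard`, and smallest for `flat`, matching the behavior described for P < 1 and P > 1. The `clipped` case shows the lower bound at work: a very small raw value yields the same weights as P = 0.5.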

The output of the spatio-temporal focal attention is subsequently passed through a position-wise feed-forward network (FFN). Residual connections and layer normalization are applied after both the attention and the FFN modules, enabling stable optimization while progressively refining proposal-aware contextual representations. Finally, the spatio-temporal focal attention output is integrated into an STFA block:

H = LN (Q + STFA (Q, K, V)),    (6)

O = LN (H + FFN (H)),    (7)

where LN (\cdot) denotes layer normalization. The residual formulation ensures stable gradient flow and mitigates representation drift, allowing the network to iteratively refine spatio-temporally grounded features.

3.2. Potential Proposals Generator

The potential proposals generator is composed of a scene encoder and a potential proposals decoder.

3.2.1. Scene Encoder

In the scene encoder, the road polyline encoder and history trajectory encoder are employed to extract structured representations from lane segments and history trajectories, respectively.

Road polyline encoder. In the road polyline encoder, each lane segment is encoded as a node in a graph, whose attributes capture essential geometric characteristics of the lane, including its position l_{p}, heading orientation l_{h}, and physical length l_{g}. These attributes are concatenated and embedded into a latent feature space using a two-layer multilayer perceptron. A lane segment feature is represented as M_{L}:

M_{L} = MLP (Concat (l_{p}, l_{h}, l_{g})) .    (8)

To explicitly encode road connectivity and topology, the map graph establishes edges between lane nodes based on adjacency as well as predecessor–successor relationships. The lane features are first processed by a multi-head self-attention (MHSA) module to model intra-lane dependencies and consistency and are subsequently refined through a multi-head cross-attention (MHCA) module that aggregates features from topologically related neighboring lanes. Let M_{L}^{i} denote the feature of the current lane node and M_{L}^{n} denote the set of neighboring lane features. The final lane representation E_{L} is obtained as:

E_{L} = MHCA (MHSA (M_{L}^{i}), M_{L}^{n}, M_{L}^{n}) .    (9)

Agent trajectory encoder. The history trajectory encoder captures agent representations from their trajectories over the past T time steps. Specifically, given the historical positions \{c_{t}\}_{t = 0}^{T}, each trajectory is represented as a sequence of displacement vectors between consecutive positions:

n_{t} = c_{t} - c_{t - 1}, t = 1, \dots, T .    (10)

These displacement vectors are concatenated to form a unified temporal representation:

N_{H} = [n_{1}, n_{2}, \dots, n_{T}] .    (11)

The resulting representation is then projected into a high-dimensional latent space through a stacked multi-layer perceptron, yielding the historical trajectory embedding:

E_{H} = MLP (MLP (N_{H})) .    (12)

By utilizing displacement-based encoding, the representation captures motion patterns while remaining invariant to global coordinate systems.
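The displacement encoding of Equations (10) and (11) can be illustrated directly. The sketch below is a minimal example with hypothetical positions; it checks only that shifting the global origin leaves the displacement vectors unchanged. A rotation of the global frame would still rotate the vectors, so full frame invariance would need an additional step such as an agent-centric normalization, which is an assumption not stated in this section.

```python
def displacements(points):
    """Return n_t = c_t - c_{t-1} for t = 1..T, given positions c_0..c_T."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]


# Hypothetical history of one agent (T = 3, so four positions c_0..c_3).
history = [(0.0, 0.0), (1.0, 0.5), (2.5, 1.0), (4.0, 1.0)]

# The same motion observed in a coordinate system with a different origin.
shifted = [(x + 100.0, y - 50.0) for x, y in history]

n_h = displacements(history)
n_h_shifted = displacements(shifted)
```

Both calls return the same three displacement vectors, which is why the encoder input does not depend on where the global origin is placed.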

3.2.2. Potential Proposals Decoder

In the potential proposals decoder, we employ a contextual interaction module that applies STFA to model the interaction between proposals and contextual representations. Specifically, the proposal embeddings serve as queries, while lane segment features and history trajectory features are separately treated as keys and values, enabling the decoder to incorporate contextual representation.

Agent interaction module. The agent interaction module constructs interactions between query proposals and contextual representations. Given the proposal embedding E_{Q} and historical trajectory embeddings E_{H}, the proposal attends to agents’ history trajectories through STFA. By setting E_{H} as keys and values and E_{Q} as queries, the module aggregates proposal-relevant history trajectories, which are further processed by a feed-forward block to enhance representation capacity. The output of the history-aware proposal representation R_{H} is calculated as follows:

R_{H} = FFN (STFA (E_{Q}, E_{H}, E_{H})) .    (13)

Map interaction module. The map interaction module captures interactions between trajectory proposals and lane segments. Each proposal embedding E_{Q} interacts with lane segment features E_{L} using the STFA, where lane features E_{L} serve as keys and values. The aggregated proposal representation is subsequently refined through the feed-forward block. The output of the map-aware proposal representation R_{M} is calculated as follows:

R_{M} = FFN (STFA (E_{Q}, E_{L}, E_{L})) .    (14)

Proposal refinement module. The proposal refinement module integrates context-aware proposal representations from history trajectory interaction and map interaction. Specifically, the history trajectories interaction representation R_{H} and the lane segments interaction representation R_{M} are fused via element-wise summation to obtain a unified proposal representation:

R_{T} = R_{H} + R_{M} .    (15)

The mode attention module aims to enhance proposal representations by capturing dependencies across the temporal dimension, interacting agents, and output modes. Given the fused proposal feature R_{T}, MHSA is applied sequentially along the time, agent, and mode dimensions, and each attention block is followed by a feed-forward layer to enhance representation:

U = F_{γ} (F_{μ} (F_{β} (R_{T}))) .    (16)

Each operator F_{k}, k \in \{β, μ, γ\}, is defined as a multi-head self-attention block followed by a feed-forward network:

F_{k} (R) = FFN (MHSA_{k} (R, R, R)), k \in \{β, μ, γ\},    (17)

where R denotes the input proposal representation of each attention module. MHSA_{β} captures temporal dependencies across the time horizon, MHSA_{μ} models interactions among neighboring agents, and MHSA_{γ} interacts over multiple output modes, enabling structured and interpretable multimodal trajectory refinement.

Prediction header. The prediction header is responsible for predicting multi-modal potential proposal trajectories from the proposal embeddings. Given the output of the mode attention module as U, the trajectory prediction branch outputs K potential proposals:

\hat{y}^{pro} = T (U),    (18)

where T (\cdot) denotes a multi-layer perceptron. In parallel, a confidence scoring branch predicts a confidence score for each potential proposal:

\hat{s}^{pro} = C (U),    (19)

where C (\cdot) is implemented as a multi-layer perceptron.

3.3. Proposal-Aware Trajectory Refiner

The proposal-aware trajectory refiner is composed of a proposal-aware scene encoder and a trajectory decoder.

3.3.1. Proposal-Aware Scene Encoder

The proposal-aware scene encoder comprises a potential proposals encoder, a proposal-aware road polyline encoder, and a proposal-aware agent trajectory encoder.

Potential proposals encoder. The potential proposals encoder is designed to encode potential proposal trajectories that provide high-level guidance for scene encoding. Each potential proposal trajectory is represented as a sequence of points \{p_{t}\}_{t = 0}^{F}, from which we compute relative displacement vectors between consecutive points:

d_{t} = p_{t} - p_{t - 1}, t = 1, \dots, F .    (20)

These displacement vectors are concatenated to form a path representation:

V_{P} = [d_{1}, d_{2}, \dots, d_{F}] .    (21)

The representation is then projected into a latent feature space through a multi-layer perceptron, yielding the potential proposal embedding:

D_{P} = MLP (V_{P}) .    (22)

The proposal representations capture the geometric structure of potential proposal trajectories while remaining invariant to global coordinate systems.

Proposal-aware road polyline encoder. The proposal-aware road polyline encoder lets lane segment representations interact with potential proposal trajectories to emphasize relevant lane segments. Lane segments are first encoded through a multi-head self-attention module and a multi-head cross-attention module that capture topological relationships among neighboring lanes. Let F_{L}^{i} denote the current lane features, and F_{L}^{n} denote the features of the topological neighbors of lane i. We obtain structure-aware lane embeddings, following the same form as Equation (9), by:

G_{L} = MHCA (MHSA (F_{L}^{i}), F_{L}^{n}, F_{L}^{n}) .    (23)

To incorporate potential proposals, we treat the lane embeddings G_{L} as queries and the proposal embeddings D_{P} as keys and values, and we perform proposal-aware interaction using STFA:

G_{L P} = FFN (STFA (G_{L}, D_{P}, D_{P})) .    (24)

G_{L P} serves as the proposal-aware map representation, which emphasizes lane segments aligned with plausible potential proposals while suppressing irrelevant lane segment features.

Proposal-aware agent trajectory encoder. The proposal-aware agent trajectory encoder extends the agent trajectory encoder by leveraging potential future trajectory proposals to guide the encoding of history trajectories. The raw history trajectories are projected into a high-dimensional latent space through a stacked multi-layer perceptron:

G_{H} = MLP (MLP (N_{H})) .    (25)

To incorporate proposal guidance, the history trajectory embedding is further updated by interacting with potential proposal embeddings D_{P} via STFA, where G_{H} serves as the query and D_{P} serves as keys and values:

G_{H P} = FFN (STFA (G_{H}, D_{P}, D_{P})) .    (26)

This proposal-aware encoding emphasizes history trajectories that are consistent with proposal trajectories while suppressing irrelevant history trajectories.
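The direction of the proposal-aware interaction (context tokens as queries, proposals as keys and values) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses plain scaled dot-product cross-attention, omits the learned projections, the STFA control parameter, and the FFN, and the toy embeddings and the helper name `cross_attend` are hypothetical.

```python
import math


def cross_attend(queries, keys, values):
    """Plain scaled dot-product cross-attention (STFA control parameter omitted)."""
    d = len(queries[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(logits)
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        # Weighted sum of the value vectors, one output row per query token.
        out.append([sum(wi / z * v[c] for wi, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out


# Toy embeddings: three lane tokens (standing in for G_L) and two proposals (D_P).
lane_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
proposal_embeddings = [[4.0, 0.0], [0.0, 4.0]]

# Context as query, proposals as key and value: each lane token is updated
# with the proposal features it matches most closely.
update = cross_attend(lane_embeddings, proposal_embeddings, proposal_embeddings)
```

The output has one row per lane token, so the encoder keeps its context layout while each token absorbs proposal information. The first lane token, aligned with the first proposal, is dominated by that proposal's features, while the third token, equally similar to both, receives an even mix. The same pattern applies with history trajectory embeddings as queries.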

3.3.2. Proposal-Aware Trajectory Decoder

The trajectory decoder is composed of an agent interaction module and a map interaction module, followed by a trajectory refinement module. The query proposal embeddings G_{Q} are initialized with the potential proposal trajectories generated by the potential proposals generator.

Agent interaction module. Given a proposal embedding G_{Q} and proposal-aware history trajectory embeddings of agents G_{H P}, the proposals attend to the agent history trajectory through STFA:

S_{H P} = FFN (STFA (G_{Q}, G_{H P}, G_{H P})) .    (27)

Map interaction module. Each proposal embedding G_{Q} interacts with proposal-aware map representations G_{L P} using STFA, where G_{L P} serves as keys and values. The map interaction representation is subsequently updated through a feed-forward block:

S_{M P} = FFN (STFA (G_{Q}, G_{L P}, G_{L P})) .    (28)

Proposal refinement module. The proposal refinement module consolidates representations derived from agent trajectory interaction and map interaction. The agent trajectory interaction features S_{H P} and the map interaction features S_{M P} are combined through element-wise summation:

S_{T P} = S_{H P} + S_{M P} .    (29)

Building upon the fused representation, a mode attention module is employed to progressively refine features by modeling correlations along the temporal dimension, interacting agents, and multimodal trajectories. MHSA blocks are applied sequentially across these dimensions, with each block followed by a feed-forward network:

Z = F_{γ} (F_{μ} (F_{β} (S_{T P}))),    (30)

where each operator F_{k}, k \in \{β, μ, γ\}, has the same form as in Equation (17).

Prediction header. The prediction header outputs final multimodal trajectory predictions based on the refined proposal representations. For each proposal embedding Z, a trajectory prediction branch predicts future trajectories:

\hat{y}^{ref} = V (Z),    (31)

where V (\cdot) denotes a multi-layer perceptron and i represents the layer of the decoder. Simultaneously, a confidence estimation branch outputs a confidence score for each trajectory mode:

\hat{s}^{ref} = D (Z),    (32)

where D (\cdot) is a multi-layer perceptron.

3.4. Training Objective

Our model is trained in two stages using a multi-component loss function that jointly supervises trajectory regression and confidence prediction within each stage. We train sequentially mainly for optimization stability: we first train the potential proposals generator to produce plausible coarse future trajectories, and then use these proposals to guide the proposal-aware trajectory refiner. The training objective for each prediction stage is defined as

L_{total}^{r} = L_{reg}^{r} + L_{cls}^{r}, r \in \{pro, ref\} .    (33)

3.4.1. Regression Loss

The best mode is selected according to the accumulated displacement error of the proposal trajectory:

$$k^{*} = \underset{k \in \{1, \dots, K\}}{\arg\min} {\| {\hat{y}}_{k}^{r} - y \|}_{2}, \quad r \in \{pro, ref\}, \tag{34}$$

where $K$ denotes the number of modes, ${\hat{y}}_{k}^{r}$ denotes the predicted trajectory of the $k$-th proposal mode, and $y$ denotes the corresponding ground-truth trajectory. The regression loss is computed on the selected mode for branch $r$, which denotes either the coarse proposal or the refined trajectory:

$$L_{reg}^{r} = ℓ_{Huber}({\hat{y}}_{k^{*}}^{r}, y), \quad r \in \{pro, ref\}, \tag{35}$$

where $ℓ_{Huber}(\cdot)$ denotes the 2D Smooth L1 loss summed over the coordinate dimension.
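The best-mode selection and the regression loss can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the Smooth L1 threshold `beta=1.0`, the reading of the accumulated displacement error as a sum of per-step L2 errors, and the averaging over time steps are all assumptions made for illustration.

```python
import math

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 penalty for one scalar residual (beta=1.0 is an assumed threshold)."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta

def accumulated_error(traj, gt):
    """Sum of per-step L2 displacement errors between two 2D trajectories."""
    return sum(math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(traj, gt))

def select_best_mode(pred_modes, gt):
    """Index k* of the mode with the smallest accumulated displacement error."""
    return min(range(len(pred_modes)), key=lambda k: accumulated_error(pred_modes[k], gt))

def regression_loss(pred_modes, gt):
    """Smooth L1 on the selected mode: summed over x and y, averaged over time steps (assumed reduction)."""
    k_star = select_best_mode(pred_modes, gt)
    per_step = [
        smooth_l1(px, gx) + smooth_l1(py, gy)
        for (px, py), (gx, gy) in zip(pred_modes[k_star], gt)
    ]
    return k_star, sum(per_step) / len(per_step)

# Toy example with K = 2 modes and 3 future steps.
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
modes = [
    [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)],  # drifts away from the ground truth
    [(1.0, 0.1), (2.0, 0.1), (3.0, 0.1)],  # stays close to the ground truth
]
k_star, loss = regression_loss(modes, gt)
```

Only the selected mode receives a regression penalty in this sketch, so the remaining modes are free to cover other plausible futures.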

3.4.2. Classification Loss

The classification loss $L_{cls}^{r}$ is defined as the cross-entropy loss between the predicted mode probability distribution ${\hat{s}}_{k}^{r}$ and the one-hot mode label $s_{k}$:

$$L_{cls}^{r} = - \sum_{k = 1}^{K} s_{k} \log {\hat{s}}_{k}^{r}, \quad r \in \{pro, ref\}, \tag{36}$$

where ${\hat{s}}_{k}^{r}$ denotes the predicted confidence of the $k$-th mode, and $s_{k}$ denotes the corresponding one-hot mode label.
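As a rough illustration of the classification term and the per-stage total, the sketch below assumes that the confidence branch produces unnormalized scores that are passed through a softmax, and that the one-hot label marks the best mode $k^{*}$. Both points, along with the function names, are assumptions rather than details confirmed by the text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of mode scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_loss(mode_logits, k_star):
    """Cross-entropy against a one-hot label placed at index k_star."""
    probs = softmax(mode_logits)
    # With a one-hot label, the sum over modes reduces to a single term.
    return -math.log(probs[k_star])

def total_loss(reg_loss, cls_loss):
    """Per-stage objective: regression term plus classification term."""
    return reg_loss + cls_loss

# Six modes with uniform scores: the loss equals log(6).
uniform = classification_loss([0.0] * 6, k_star=2)
# Raising the score of the best mode lowers the loss.
confident = classification_loss([0.0, 0.0, 3.0, 0.0, 0.0, 0.0], k_star=2)
stage_loss = total_loss(0.005, confident)
```

The same two functions would be evaluated once per stage, that is, once for the proposal branch and once for the refinement branch.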

4. Experiments

4.1. Experimental Setup

4.1.1. Dataset

We evaluate ProFocus on the Argoverse 1 motion forecasting dataset [23] and the INTERACTION dataset [24].

Argoverse 1. The training split contains 205,942 sequences and the validation split contains 39,472 sequences. Each sequence provides a 2-s history and a 3-s future for all tracked agents sampled at 10 Hz, resulting in 20 historical steps and 30 future steps. The goal is to predict 6 candidate future trajectories for the target agent given the observed scene context. Argoverse 1 additionally provides high-definition vector maps with lane centerlines, connectivity, and turn semantics, which we use as structured map context.

INTERACTION. The INTERACTION dataset contains 12,000 scenarios used for training and 2000 for validation, with all trajectories recorded at a sampling rate of 10 Hz. Each scenario includes a 1-s observation history together with 3 s of future trajectory annotations.

4.1.2. Metrics

All metrics are computed over 6 predicted trajectories. minADE is the average L2 distance between the best predicted trajectory and the ground-truth trajectory over all time steps, while minFDE measures the error only at the final time step. MR is the percentage of scenarios in which none of the predicted trajectories ends within 2.0 m of the ground-truth end point.
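The three metrics can be sketched in a few lines of Python. This is a hedged illustration, not the official benchmark code: it assumes that the best trajectory for minADE is the one with the lowest average error, and that a miss is decided by the end-point distance with a 2.0 m threshold. All function names are hypothetical.

```python
import math

def displacement(p, q):
    """L2 distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def min_ade(pred_modes, gt):
    """Lowest average displacement error over all predicted modes."""
    return min(
        sum(displacement(p, g) for p, g in zip(traj, gt)) / len(gt)
        for traj in pred_modes
    )

def min_fde(pred_modes, gt):
    """Lowest final-step displacement error over all predicted modes."""
    return min(displacement(traj[-1], gt[-1]) for traj in pred_modes)

def miss_rate(scenes, threshold=2.0):
    """Fraction of scenes whose best end point is farther than the threshold."""
    misses = sum(1 for pred_modes, gt in scenes if min_fde(pred_modes, gt) > threshold)
    return misses / len(scenes)

gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
hit_scene = ([[(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],
              [(0.0, 0.0), (1.0, 0.0), (2.0, 3.0)]], gt)
miss_scene = ([[(0.0, 0.0), (1.0, 0.0), (2.0, 5.0)]], gt)

ade = min_ade(*hit_scene)
fde = min_fde(*hit_scene)
mr = miss_rate([hit_scene, miss_scene])
```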

4.1.3. Implementation Details

The number of layers in ProFocus is set to 2, and the hidden size of the MLP layer is set to 128. Training is divided into two sequential stages: in the first stage, we train the potential proposal generator; in the second stage, we train the proposal-aware trajectory refiner, which takes the generator's outputs as guidance to achieve proposal-aware trajectory refinement. The Adam optimizer is used with an initial learning rate of $3.5 \times 10^{-4}$ and a step-wise decay strategy in which the learning rate is halved every 10 epochs, starting from the 26th epoch. ProFocus has a total of 4.53 million parameters, and inference is tested with a batch size of 16. Table 1 reports the average latency for one driving scenario, with data preprocessing, candidate generation, and batched inference latency listed separately. Under an average scene density of 22.71 agents and 176.23 lanes on Argoverse 1, the preprocessing time is 1.36 ms per scene, the candidate generation stage takes 10.24 ms per scene, and batched inference takes 16.41 ms per scene. On INTERACTION, under an average scene density of 11.90 agents and 62.47 lane tokens, the average preprocessing time is 0.08 ms per scene, the candidate generation stage takes 7.99 ms per scene, and the average batched inference time is 10.49 ms per scene.
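One plausible reading of the learning-rate schedule is sketched below. It assumes 1-indexed epochs, a constant rate through epoch 25, a first halving at epoch 26, and further halvings at epochs 36, 46, and so on. The function name and this exact interpretation are assumptions, since the text does not spell out the boundary behaviour.

```python
def learning_rate(epoch, base_lr=3.5e-4, first_decay_epoch=26, step=10, factor=0.5):
    """Learning rate for a 1-indexed epoch under the assumed step schedule."""
    if epoch < first_decay_epoch:
        return base_lr
    # Number of halvings applied so far, counting the one at first_decay_epoch.
    num_decays = (epoch - first_decay_epoch) // step + 1
    return base_lr * factor ** num_decays

# Rates at a few representative epochs.
schedule = {epoch: learning_rate(epoch) for epoch in (1, 25, 26, 35, 36, 46)}
```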

4.2. Main Results

4.2.1. Argoverse 1

We evaluate the performance of the ProFocus framework on the Argoverse 1 validation set and compare it with representative trajectory prediction methods. As shown in Table 2, ProFocus demonstrates competitive performance across metrics, validating the effectiveness of the proposed proposal-aware framework. To make the comparison more transparent, we additionally summarize the experimental configurations of the compared methods, including the prediction stage, input format, map encoder type, and the number of output modes. Although the detailed implementation and training recipes of different methods are not completely identical, most compared methods follow the standard Argoverse 1 evaluation protocol with six predicted trajectory modes.

ProFocus achieves the lowest MR of 0.07, outperforming SIMPL by 12.5%, which indicates that ProFocus generates more reliable multimodal trajectory predictions with fewer failed cases. In terms of end-point accuracy, ProFocus delivers the best minFDE of 0.88 m, surpassing LAformer by 4.3%. This result suggests that proposal-aware scene encoding can better localize the final position of the target agent by emphasizing proposal-relevant contextual information. ProFocus also achieves a minADE of 0.64 m, ranking among the top-performing methods and matching LAformer while outperforming most other baselines. These results indicate that ProFocus improves end-point accuracy and maintains competitive trajectory-level accuracy.

Table 3 provides an accuracy-efficiency comparison with representative baseline methods on the Argoverse 1 validation set. The comparison includes parameter size, total latency, MR, minFDE, and minADE. The results show that ProFocus achieves the best prediction accuracy among the compared methods while maintaining a practical total latency of 28.01 ms on an RTX 3090. Although SIMPL has a lower latency, ProFocus provides better prediction accuracy, reducing MR from 0.08 to 0.07 and minFDE from 0.95 m to 0.88 m. Compared with HiVT-128, ProFocus is faster while also achieving better prediction performance. These results demonstrate that ProFocus provides a favorable trade-off between prediction accuracy and deployment efficiency.
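The arithmetic behind the reported figures can be checked with a short script. It uses only numbers quoted in the text: the three Argoverse 1 latency components from Section 4.1.3 and the SIMPL versus ProFocus values given above. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, ours):
    """Relative reduction of a lower-is-better metric, in percent."""
    return (baseline - ours) / baseline * 100.0

# Per-scene latency components on Argoverse 1, in milliseconds.
argoverse_latency_ms = {
    "preprocessing": 1.36,
    "candidate_generation": 10.24,
    "batched_inference": 16.41,
}
total_ms = round(sum(argoverse_latency_ms.values()), 2)

# Relative gains over SIMPL from the quoted metric values.
mr_gain = round(relative_reduction(0.08, 0.07), 1)
fde_gain = round(relative_reduction(0.95, 0.88), 1)
```

The sum of the three components reproduces the 28.01 ms total latency, and the MR reduction from 0.08 to 0.07 corresponds to the 12.5% improvement over SIMPL quoted in the text.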

4.2.2. INTERACTION

We further conduct experiments on the INTERACTION validation set to examine the generalization ability of ProFocus in highly interactive scenarios. Table 4 reports both the experimental configurations and the performance comparison on the INTERACTION validation set. ProFocus achieves the best results on all reported metrics, highlighting the effectiveness of the proposed framework in complex interactive traffic scenarios. Notably, compared with FJMP, which also adopts a two-stage graph-based framework with 6 output modes, ProFocus reduces MR from 0.0810 to 0.0659, corresponding to an 18.6% improvement. This suggests that our method can better avoid failed predictions in challenging traffic interactions. In terms of end-point accuracy, ProFocus attains the lowest minFDE of 0.5682 m, surpassing FJMP by 8.8%. In addition, ProFocus records the best minADE of 0.1801 m, improving upon FJMP by 6.7%, which reflects its stronger capability in modeling the complete future motion process. Overall, these findings indicate that ProFocus can learn more discriminative spatio-temporal interaction representations and yield more precise and dependable prediction results on the INTERACTION benchmark.

4.3. Ablation Study

In this section, we provide a comprehensive ablation analysis of ProFocus from multiple perspectives, including the contribution of key components, the design of proposal-aware encoding branches, the effectiveness of STFA, the proposal-relevant attention distribution, robustness to proposal perturbations, and the sensitivity to the number of generated proposals. These experiments are designed to clarify how proposal-aware guidance and relation-adaptive attention contribute to contextual representation learning and final trajectory refinement. Unless otherwise specified, all ablation experiments are conducted using one-sixth of the training set and the full validation set of Argoverse 1.

4.3.1. Component Study

In the component study, we evaluate the contribution of key modules. Specifically, we analyze: (i) incorporating the proposal-aware encoder into the framework, and (ii) replacing the standard multi-head attention with the proposed STFA. These ablations quantify how proposal-aware encoding and STFA affect overall forecasting performance.

Table 5 compares the effects of the proposal-aware encoder and STFA through component-level ablations. Introducing proposals into the encoder brings consistent gains over the baseline, reducing minADE from 0.7039 to 0.6966, MR from 0.0833 to 0.0820, and minFDE from 0.9994 to 0.9865. Replacing standard multi-head attention with STFA also yields measurable improvements, most notably reducing minFDE to 0.9910, indicating that STFA better captures interactions with history trajectories and lane segments. When both modules are enabled, the full ProFocus model achieves the best overall performance, attaining an MR of 0.0810, minFDE of 0.9742, and minADE of 0.6940, consistently outperforming all ablated variants. These results validate the complementary roles of the proposal-aware encoder and STFA in proposal-aware contextual representation learning and spatio-temporal interaction modeling.

4.3.2. Proposal-Aware Encoder

To analyze the contributions of the proposal-aware history encoder and the proposal-aware map encoder, we conduct module-wise ablation experiments by incorporating each component independently; Table 6 compares their individual contributions. Introducing proposal trajectories into either the map encoder or the history encoder consistently improves performance over the baseline. Specifically, enabling the proposal-aware map encoder reduces MR from 0.0833 to 0.0815, minFDE from 0.9994 to 0.9833, and minADE from 0.7039 to 0.6966, demonstrating that the proposal-aware map encoder helps eliminate map-inconsistent futures and improves end-point localization. Enabling the proposal-aware history encoder yields comparable gains, reducing MR to 0.0821, minFDE to 0.9801, and minADE to 0.6979, suggesting that proposal-aware encoders provide discriminative cues for intent realization. The full proposal-aware encoder achieves the best overall results with an MR of 0.0810, a minFDE of 0.9742, and a minADE of 0.6940, indicating complementary benefits from its two branches. These results show that both the proposal-aware map encoder and the proposal-aware history encoder contribute positively to performance, and their combination enables more effective proposal-relevant contextual representation learning.

4.3.3. Spatio-Temporal Focal Attention

We investigate how different fusion strategies affect spatio-temporal interaction modeling and compare our proposed spatio-temporal focal attention (STFA) with commonly used alternatives. Specifically, we evaluate five fusion patterns for contextual aggregation: a baseline design without explicit fusion, feature concatenation, standard multi-head attention, temperature-based attention, and our STFA, which sharpens attention weights through relation-aware modulation to suppress irrelevant interactions.

Table 7 compares the different fusion strategies for contextual aggregation. Naive concatenation degrades performance, increasing MR to 0.0925 and slightly worsening minADE, indicating that simply stacking features introduces irrelevant information and weakens discriminative interaction cues. Standard multi-head attention improves MR and minFDE over the baseline, suggesting that learned attention is beneficial for fusing heterogeneous context. Temperature-based attention improves minADE over the baseline but does not bring consistent gains in MR and minFDE. In contrast, STFA achieves the best results across all metrics, reducing MR to 0.0811 and improving minFDE and minADE to 0.9821 and 0.6945, respectively. These gains demonstrate that relation-aware attention sharpening provides more selective and robust context aggregation than conventional fusion, leading to more accurate intent inference and end-point localization under multi-modal uncertainty.

4.3.4. Spatio-Temporal Focal Attention in Encoder

Table 8 analyzes the influence of different STFA range bounds in the encoder. These bounds determine the range of attention scaling in proposal-aware encoding, thereby controlling the modulation strength. Setting the bounds to $γ_{\min} = 0.1$ and $γ_{\max} = 2.0$ achieves the best overall results, attaining the lowest MR together with the best minFDE and minADE. This observation indicates that a sufficiently broad range enables the attention mechanism to adaptively emphasize relevant interactions while preserving the capacity to represent diverse motion patterns. When the range is narrowed, performance degrades across all metrics, suggesting that restrictive bounds reduce the expressiveness of the encoder and weaken its ability to model heterogeneous spatio-temporal relations. Based on these results, we adopt $γ_{\min} = 0.1$ and $γ_{\max} = 2.0$ as the default encoder configuration in all subsequent experiments. These results demonstrate that the effectiveness of STFA in the encoder depends not only on relation-aware modulation itself, but also on providing a sufficiently expressive scaling range for adaptive attention allocation.

4.3.5. Spatio-Temporal Focal Attention in Decoder

We further assess the effectiveness of spatio-temporal focal attention when applied in the decoder. We fix the $γ$ range of the encoder STFA to $γ_{\min} = 0.1$ and $γ_{\max} = 2.0$, while varying the decoder range to assess its impact on forecasting performance. Table 9 shows how different decoder STFA ranges affect forecasting performance. Using the same wide range in the decoder leads to a higher displacement error, with a minFDE of 0.9756. In contrast, adopting a narrower decoder range of $(0.7, 1.3)$ achieves a minFDE of 0.9632 and reduces failure cases, achieving an MR of 0.0807. These results suggest that the focal attention of the decoder is particularly sensitive to value scaling; overly large value ranges may induce more fine-grained iterations in the decoder, ultimately degrading the final predictions.
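To make the role of the range bounds concrete, the sketch below clips a raw modulation value to $[γ_{\min}, γ_{\max}]$ and uses it to scale attention logits before the softmax. This multiplicative form is only an assumed stand-in for the STFA operator defined in the method section, and `modulated_softmax` is a hypothetical name; the point is solely to show how a wide range permits stronger sharpening than a narrow one.

```python
import math

def clip(value, lower, upper):
    """Clamp value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))

def modulated_softmax(logits, raw_modulation, gamma_min=0.1, gamma_max=2.0):
    """Softmax over logits scaled element-wise by clipped modulation values.

    Assumed form: each logit is multiplied by its clipped modulation value,
    so values above 1 sharpen the distribution and values below 1 flatten it.
    """
    scaled = [clip(p, gamma_min, gamma_max) * z for z, p in zip(logits, raw_modulation)]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
strong = [5.0, 5.0, 5.0]  # raw modulation well above either upper bound

wide = modulated_softmax(logits, strong, gamma_min=0.1, gamma_max=2.0)    # clipped to 2.0
narrow = modulated_softmax(logits, strong, gamma_min=0.7, gamma_max=1.3)  # clipped to 1.3
```

Under this assumed form, the wide range concentrates more weight on the top-scoring token than the narrow range does.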

4.3.6. Analysis of Proposal-Relevant Attention

We calculate the proportion of high-attention tokens that fall within 5 m and 10 m of the corresponding proposals, which we denote as SAR. A higher SAR value indicates that more attention is assigned to proposal-relevant spatial regions.

Table 10 compares the SAR values of the baseline and ProFocus for both agent and lane attention. ProFocus improves SAR in both cases. For the agent context, SAR increases from 0.2053 to 0.2143 within 5 m and from 0.3523 to 0.3651 within 10 m. For the lane context, the improvement is larger: SAR increases from 0.6559 to 0.7717 within 5 m and from 0.8087 to 0.9057 within 10 m. These results indicate that the attention becomes more concentrated around proposal-relevant regions after introducing proposal-aware encoding.
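A ratio of this kind could be computed as in the following sketch. Several details are assumptions rather than facts from the text: high-attention tokens are taken to be the `top_k` tokens by attention weight, each token is reduced to a single 2D position, and a token counts as proposal-relevant when it lies within the radius of any proposal waypoint.

```python
import math

def spatial_ratio(token_positions, attention, proposal_points, radius, top_k):
    """Share of the top_k highest-attention tokens lying within radius of any proposal point."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)[:top_k]

    def near_proposal(index):
        tx, ty = token_positions[index]
        return any(math.hypot(tx - px, ty - py) <= radius for px, py in proposal_points)

    return sum(1 for i in ranked if near_proposal(i)) / len(ranked)

tokens = [(0.0, 0.0), (3.0, 0.0), (9.0, 0.0), (40.0, 0.0)]
weights = [0.4, 0.3, 0.2, 0.1]
proposal = [(0.0, 0.0), (1.0, 0.0)]

ratio_5m = spatial_ratio(tokens, weights, proposal, radius=5.0, top_k=3)
ratio_10m = spatial_ratio(tokens, weights, proposal, radius=10.0, top_k=3)
```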

We then conduct an experiment to verify the influence of attention distributions on the final prediction results. For each validation scene, we rank the tokens according to their attention scores and mask 10% of the tokens in each scene. We compare three masking strategies: high-attention token masking, low-attention token masking, and random token masking. Table 11 shows that masking high-attention tokens leads to a substantial degradation in prediction performance, increasing minADE from 0.6940 to 0.7998, minFDE from 0.9742 to 1.2605, and MR from 0.0810 to 0.1327. In contrast, masking low-attention tokens produces moderate changes, with minADE, minFDE, and MR remaining close to the no-masking baseline. These results demonstrate that tokens assigned high attention have a direct impact on the accuracy of trajectory prediction, while low-attention tokens contribute much less to the final prediction.
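The three masking strategies can be sketched as a token-selection routine. The function name, the rounding of the 10% budget, and the seeded random generator are assumptions for illustration; how the selected tokens are actually removed from the model input is not shown.

```python
import random

def mask_indices(attention, fraction=0.10, strategy="high", seed=0):
    """Pick token indices to mask according to one of three strategies."""
    n_mask = max(1, round(len(attention) * fraction))
    order = sorted(range(len(attention)), key=lambda i: attention[i])  # ascending attention
    if strategy == "high":
        chosen = order[-n_mask:]
    elif strategy == "low":
        chosen = order[:n_mask]
    elif strategy == "random":
        chosen = random.Random(seed).sample(order, n_mask)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(chosen)

# Ten tokens, so a 10% budget masks exactly one token.
scores = [0.05, 0.30, 0.01, 0.20, 0.02, 0.10, 0.03, 0.15, 0.04, 0.10]
high = mask_indices(scores, strategy="high")
low = mask_indices(scores, strategy="low")
rand = mask_indices(scores, strategy="random", seed=42)
```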

4.3.7. Analysis of Robustness to Proposal Errors

To simulate poor-quality proposals, we conduct an additional perturbation experiment on the proposals generated by the first stage. Specifically, we add zero-mean Gaussian noise with standard deviations of 0.1 m, 0.3 m, 0.5 m, 1.0 m, and 2.0 m to the generated proposals and compare the results with the baseline. Table 12 reports the performance of ProFocus under different levels of first-stage proposal perturbation. The results indicate that ProFocus remains stable under moderate proposal perturbations. When the noise standard deviation increases from 0 to 0.5 m, minADE increases from 0.6940 to 0.6989, minFDE from 0.9742 to 0.9820, and MR from 0.0810 to 0.0825, showing only marginal degradation. Even under stronger perturbation with a standard deviation of 1.0 m, the degradation remains moderate, with minADE, minFDE, and MR reaching 0.7061, 0.9954, and 0.0846, respectively. A clearer performance drop appears when the standard deviation reaches 2.0 m, where the first-stage proposals are substantially corrupted, increasing minADE to 0.7194, minFDE to 1.0157, and MR to 0.0875. These results suggest that ProFocus does not rely on perfectly accurate first-stage proposals and has a certain degree of robustness to proposal noise, although extremely poor proposals can still affect the refinement stage.
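The perturbation itself is simple to reproduce in outline. The sketch below assumes that the noise is drawn independently for every waypoint and for each of the x and y coordinates, which the text does not state explicitly; `perturb_proposals` is a hypothetical helper name.

```python
import random

def perturb_proposals(proposals, sigma, seed=0):
    """Add zero-mean Gaussian noise (standard deviation sigma, in metres) to every waypoint."""
    rng = random.Random(seed)
    return [
        [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in traj]
        for traj in proposals
    ]

proposals = [
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)],
    [(0.0, 0.0), (1.0, -0.5), (2.0, -1.0)],
]

# Noise levels used in the robustness experiment, plus the unperturbed case.
perturbed = {sigma: perturb_proposals(proposals, sigma, seed=1)
             for sigma in (0.0, 0.1, 0.3, 0.5, 1.0, 2.0)}
```

Each perturbed set would then replace the first-stage output as guidance for the refiner, leaving the rest of the pipeline unchanged.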

4.3.8. The Number of Generated Proposals

Table 13 analyzes the sensitivity of ProFocus to the number of generated proposals $K$ in the first stage. In ProFocus, the first-stage proposals are not directly treated as final predictions, but serve as anticipatory guidance for the proposal-aware scene encoder and the subsequent trajectory refinement process. Therefore, $K$ controls not only the diversity of coarse future hypotheses, but also the amount of proposal information introduced into the refinement stage. As shown in Table 13, setting $K = 6$ achieves the best overall performance, with MR, minFDE, and minADE of 0.0810, 0.9742, and 0.6940, respectively. Increasing $K$ to 9 or 12 does not bring further improvement. This indicates that simply increasing the number of proposals may introduce redundant or less reliable candidate hypotheses, which can weaken the focusing effect of proposal-aware refinement. Based on these results, we adopt $K = 6$ as the default setting, which provides a better balance between proposal diversity, refinement stability, and prediction accuracy.

4.3.9. Visualization

The attention distribution analysis of ProFocus. We further investigate whether ProFocus improves the contextual representations, so that the decoder can better capture proposal-relevant spatio-temporal cues and assign higher attention to tokens that are more pertinent to the target agent’s plausible future behavior. Specifically, we visualize the relationship between attention value and distance for the attention distribution of two kinds of context: lane segments and surrounding agents.

Figure 4 visualizes the relationship between lane-segment attention and distance to the target agent. Compared with the baseline without ProFocus, both ProFocus with MHA in the encoder and ProFocus with STFA in the encoder assign higher attention to relevant lane segments and exhibit a sharper attenuation in the long-distance regions. This suggests that ProFocus enhances the ability of the contextual representation to emphasize spatially closer and proposal-relevant lane context, while suppressing attention to less relevant lane segments. Moreover, ProFocus with STFA further strengthens this distance-aware attention allocation, indicating that the proposed STFA improves the selectivity of contextual aggregation and reduces distraction from proposal-irrelevant lane-segment tokens.

Figure 5 visualizes the relationship between agent attention and distance to the target agent. Compared with the baseline without ProFocus, both ProFocus with MHA in the encoder and ProFocus with STFA in the encoder assign higher attention to surrounding agents in close proximity to the target agent, while exhibiting a more pronounced decay of attention as the distance increases. This indicates that ProFocus improves the ability of the contextual representation to emphasize socially relevant neighboring agents that are more likely to influence the target agent’s future motion, while suppressing attention to distant agents with weaker interaction relevance. Furthermore, ProFocus with STFA yields a slightly sharper attenuation for long-range agents, suggesting that STFA further enhances the selectivity of social context aggregation and reduces distraction from proposal-irrelevant neighboring agents.

These results demonstrate that ProFocus improves the representational quality of contextual features, especially by strengthening the encoding of context that is more relevant to the target agent. This enhanced contextual representation enables the decoder to form a more reasonable attention distribution, assigning greater attention to proposal-relevant and target-related contextual representations while reducing interference from weakly relevant context. Consequently, ProFocus facilitates more effective aggregation of informative spatio-temporal cues for accurate future trajectory prediction.

Qualitative visualization results. Figure 6 compares the qualitative prediction results of SIMPL, LAformer, and ProFocus across four driving scenarios: straight driving with a curbside pull-over, straight-forward lane changing, turning left, and turning right. The first column of Figure 6a shows the straight-driving scenario with a curbside pull-over intention. When the vehicle exhibits an intention to pull over toward the roadside, other methods fail to capture this behavior, whereas ProFocus accurately predicts the pull-over trajectory. In the second column of Figure 6a, the lane-change case highlights a limitation of other methods, which misalign with the lane-change intention, leading to end-point errors. In contrast, ProFocus correctly infers the lane-change intent and accurately localizes the end-point position, yielding trajectories that align closely with the ground truth. As shown in the first column of Figure 6b, which represents a left-turn driving scenario, ProFocus produces future trajectories with more accurate heading changes and end-point positions, reflecting a better comprehension of the agent’s behavior and the underlying lane geometry. In the complex intersection case, as shown in the second column of Figure 6b, ProFocus not only predicts the right-turn intent and end-point position accurately but also outputs predicted trajectories that are more consistent with lane topology and driving rules. Benefiting from proposal-aware scene encoding and STFA, ProFocus emphasizes historical trajectories and lane segments that are relevant to potential future proposals, leading to more accurate future trajectory predictions.

5. Conclusions

This paper investigates a limitation of conventional encoder–decoder trajectory prediction frameworks, where proposal-agnostic scene encoders may lead the interaction module to allocate attention to proposal-irrelevant regions, thereby limiting prediction accuracy. To address this issue, we present ProFocus, a proposal-aware trajectory forecasting framework that explicitly uses early-stage potential proposals to guide scene encoding, so that the model assigns more attention to proposal-relevant context and reduces prediction errors. The framework is composed of a potential proposal generator that produces coarse candidate futures from historical trajectories and lane segments, and a proposal-aware trajectory refiner that encodes the scene under proposal guidance to extract proposal-relevant context for subsequent refinement. To mitigate distraction from irrelevant context in attention distributions, we introduce spatio-temporal focal attention, which modulates attention logits with a spatio-temporal relation-controlled softmax. This mechanism enables relation-adaptive attention sharpening, promoting fine-grained interactions around proposal-relevant regions while suppressing irrelevant context. ProFocus achieves competitive performance on the Argoverse 1 and INTERACTION datasets, delivering consistent reductions in MR, minFDE, and minADE, while maintaining robust qualitative behavior across diverse traffic scenarios.

Despite the promising performance of ProFocus in trajectory prediction across driving scenarios, the framework still exhibits limitations. Firstly, the current evaluation is conducted on Argoverse 1 and INTERACTION, and we have not yet completed experiments on long-horizon benchmarks such as Argoverse 2 and Waymo Open Motion. These datasets contain longer temporal horizons, more diverse driving scenarios, and stronger domain shifts, which are important for further validating the transferability and robustness of the proposed framework. Secondly, ProFocus may still face challenges in extremely rare traffic rule-violating scenarios, such as leaving the drivable area, illegal multi-lane U-turns, or sudden stop-and-reverse maneuvers. These long-tail cases remain difficult due to their low occurrence frequency and highly uncertain behavioral patterns.

In future work, we will prioritize extending ProFocus to Argoverse 2 and Waymo Open Motion to further examine its generalization ability under longer temporal horizons and more diverse driving scenarios. We will also investigate richer multimodal inputs and more robust proposal refinement mechanisms to improve prediction reliability under long-tail and safety-critical conditions.

Author Contributions

Conceptualization, H.L. and X.L.; methodology, H.L.; software, H.L.; validation, H.L. and X.L.; formal analysis, H.L.; investigation, Z.L.; resources, H.L.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, H.L.; visualization, Z.L.; supervision, H.L.; project administration, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant (62402034).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions of the study are included in the article, and further inquiries can be directed to the corresponding author.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant (62402034), and was also supported by the High-performance Computing Platform of University of Science and Technology Beijing. The authors also thank all those who contributed to the academic and technical environment in which this work was completed.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

Indices and sets
Symbol	Definition
$i, j$	Index for query and key tokens
k	Index for attention module type or trajectory mode
$k^{*}$	Best-matching trajectory mode selected by proposal error
r	Prediction branch indicator, where $r \in {pro, ref}$
t	Index of time steps
c	Index of historical trajectory coordinates
S	Number of decoder layers
T	Historical observation horizon
F	Future prediction horizon
K	Number of predicted trajectory modes
d	Feature embedding dimension
$β, μ, γ$	Temporal, agent, and mode attention dimensions

Parameters
Symbol	Definition
X	Input feature for generating query representation
Y	Input feature for generating key and value representations
Q	Query representation in attention calculation
K	Key representation in attention calculation
V	Value representation in attention calculation
$W^{Q}$	Learnable projection matrix for query representation
$W^{K}$	Learnable projection matrix for key representation
$W^{V}$	Learnable projection matrix for value representation
$P (δ_{i, j})$	Spatio-temporal relation controlled attention modulation parameter
$\tilde{P} (δ_{i, j})$	Unbounded modulation value before clipping
$γ_{\min}$	Lower bound of the parameter
$γ_{\max}$	Upper bound of the parameter
$δ_{i j}$	Spatio-temporal relationship between key token $K_{j}$ and query token $Q_{i}$
$r_{i j}$	Encoded relative spatio-temporal relation vector
$Δ p_{i j}$	Relative position between query and key tokens
$Δ θ_{i j}$	Relative orientation between query and key tokens
$Δ ϕ_{i j}$	Relative heading angle between query and key tokens
$Δ t_{i j}$	Temporal displacement between query and key tokens
$l_{p}$	Lane segment position attribute
$l_{h}$	Lane segment heading orientation attribute
$l_{g}$	Lane segment physical length attribute
$M_{L}$	Lane segment feature representation
$E_{L}$	Final lane representation in the scene encoder
$n_{t}$	Displacement vector of historical trajectory at time step t
$N_{H}$	Concatenated historical displacement representation
$E_{H}$	Historical trajectory embedding
$E_{Q}$	Proposal embedding in the potential proposal decoder
$R_{H}$	History-aware proposal representation
$R_{M}$	Map-aware proposal representation
$R_{T}$	Fused proposal representation
U	Output of the mode attention module in the proposal generator
$p_{t}$	Point of potential proposal trajectory at time step t
$d_{t}$	Relative displacement vector of proposal trajectory at time step t
$V_{P}$	Concatenated proposal displacement representation
$D_{P}$	Potential proposal embedding
$G_{L}$	Structure-aware lane embedding
$G_{L P}$	Proposal-aware map representation
$G_{H}$	History trajectory embedding in the proposal-aware encoder
$G_{H P}$	Proposal-aware history trajectory representation
$G_{Q}$	Query proposal embedding in the proposal-aware decoder
$S_{H P}$	Agent-history interaction representation in the refiner
$S_{M P}$	Map interaction representation in the refiner
$S_{T P}$	Fused representation in the proposal-aware refinement module
Z	Refined proposal representation before prediction
${\hat{y}}^{r}$	Predicted trajectory from branch r
$y$	Ground-truth trajectory
${\hat{s}}^{r}$	Predicted confidence distribution from branch r
s	One-hot mode label for the ground truth
$L_{total}^{r}$	Total training objective for branch r
$L_{reg}^{r}$	Regression loss for branch r
$L_{cls}^{r}$	Classification loss for branch r
$ℓ_{Huber} (\cdot)$	2D Smooth L1 loss for trajectory regression

References

1. Gao, J.; Sun, C.; Zhao, H.; Shen, Y.; Anguelov, D.; Li, C.; Schmid, C. VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11522–11530.
2. Liang, M.; Yang, B.; Hu, R.; Chen, Y.; Liao, R.; Feng, S.; Urtasun, R. Learning Lane Graph Representations for Motion Forecasting. In Proceedings of the Computer Vision–ECCV 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 541–556.
3. Zeng, W.; Liang, M.; Liao, R.; Urtasun, R. LaneRCNN: Distributed Representations for Graph-Centric Motion Forecasting. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2021; pp. 532–539.
4. Salzmann, T.; Ivanovic, B.; Chakravarty, P.; Pavone, M. Trajectron++: Dynamically-Feasible Trajectory Forecasting with Heterogeneous Data. In Proceedings of the Computer Vision–ECCV 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 683–700.
5. Zhou, Z.; Ye, L.; Wang, J.; Wu, K.; Lu, K. HiVT: Hierarchical Vector Transformer for Multi-Agent Motion Prediction. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2022; pp. 8813–8823.
6. Ngiam, J.; Caine, B.; Vasudevan, V.; Zhang, Z.; Chiang, H.T.L.; Ling, J.; Roelofs, R.; Bewley, A.; Liu, C.; Venugopal, A.; et al. Scene Transformer: A Unified Architecture for Predicting Future Trajectories of Multiple Agents. In Proceedings of the International Conference on Learning Representations (ICLR 2022), Virtual, 25–29 April 2022; Available online: https://openreview.net/forum?id=Wm3EA5OlHsG (accessed on 25 May 2026).
7. Nayakanti, N.; Al-Rfou, R.; Zhou, A.; Goel, K.; Refaat, K.S.; Sapp, B. Wayformer: Motion Forecasting via Simple and Efficient Attention Networks. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2023; pp. 2980–2987.
8. Shi, S.; Jiang, L.; Dai, D.; Schiele, B. Motion Transformer with Global Intention Localization and Local Movement Refinement. In Proceedings of the Advances in Neural Information Processing Systems 35, New Orleans, LA, USA, 28 November–9 December 2022; Neural Information Processing Systems Foundation; pp. 6531–6543.
9. Cheng, J.; Mei, X.; Liu, M. Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2023; pp. 8645–8655.
10. Liu, H.; Liu, H.; Fan, B.; Xu, J. DMotion: Diverse Modalities Alignment Enhanced Motion Prediction for Autonomous Driving. IEEE Trans. Comput. Soc. Syst. 2025, 12, 3349–3364.
11. Afshar, S.; Deo, N.; Bhagat, A.; Chakraborty, T.; Shao, Y.; Buddharaju, B.R.; Deshpande, A.; Cui, H. PBP: Path-based Trajectory Prediction for Autonomous Driving. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2024; pp. 12927–12934.
12. Li, S.; Liu, C.; Xu, X.; Yeo, S.Y.; Yang, X. Future-Aware Interaction Network for Motion Forecasting. In Proceedings of the 2025 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2025; pp. 7505–7515.
13. Choi, D.; Yim, J.; Baek, M.; Lee, S. Machine Learning-Based Vehicle Trajectory Prediction Using V2V Communications and On-Board Sensors. Electronics 2021, 10, 420.
14. Sighencea, B.I.; Stanciu, I.R.; Căleanu, C.D. D-STGCN: Dynamic Pedestrian Trajectory Prediction Using Spatio-Temporal Graph Convolutional Networks. Electronics 2023, 12, 611.
15. Hortelano, J.L.; Trentin, V.; Artuñedo, A.; Villagra, J. GPU-Accelerated Interaction-Aware Motion Prediction. Electronics 2023, 12, 3751.
16. Yuan, Y.; Weng, X.; Ou, Y.; Kitani, K. AgentFormer: Agent-Aware Transformers for Socio-Temporal Multi-Agent Forecasting. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 9793–9803.
17. Zhou, Z.; Wang, J.; Li, Y.H.; Huang, Y.K. Query-Centric Trajectory Prediction. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2023; pp. 17863–17873.
18. Da, C.; Qian, Y.; Zeng, J.; Wei, X.; Zhang, F. ADSAP: An Adaptive Speed-Aware Trajectory Prediction Framework with Adversarial Knowledge Transfer. Electronics 2025, 14, 2448.
19. Su, H.; Wang, N.; Wang, X. Collision Risk Assessment of Lane-Changing Vehicles Based on Spatio-Temporal Feature Fusion Trajectory Prediction. Electronics 2025, 14, 3388.
20. Li, A.; Xu, Z.; Pan, Y.; Zhang, J.; Chen, N.; Yu, H.; Chen, Y.; Li, Y. A Convolutional Transformer Network for Vehicle Trajectory Prediction in Urban Traffic Scenarios. J. Transp. Eng. Part A Syst. 2026, 152, 04025132.
21. Bhat, M.; Francis, J.; Oh, J. Trajformer: Trajectory Prediction with Local Self-Attentive Contexts for Autonomous Driving. In Proceedings of the Machine Learning for Autonomous Driving Workshop at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2020; Available online: https://ml4ad.github.io/files/papers2020/Trajformer:%20Trajectory%20Prediction%20with%20Local%20Self-Attentive%20Contexts%20for%20Autonomous%20Driving.pdf (accessed on 25 May 2026).
22. Liu, M.; Cheng, H.; Chen, L.; Broszio, H.; Li, J.; Zhao, R.; Sester, M.; Yang, M.Y. LAformer: Trajectory Prediction for Autonomous Driving with Lane-Aware Scene Constraints. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); IEEE: New York, NY, USA, 2024; pp. 2039–2049.
23. Chang, M.F.; Lambert, J.; Sangkloy, P.; Singh, J.; Bak, S.; Hartnett, A.; Wang, D.; Carr, P.; Lucey, S.; Ramanan, D.; et al. Argoverse: 3D Tracking and Forecasting With Rich Maps. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2019; pp. 8740–8749.
24. Zhan, W.; Sun, L.; Wang, D.; Shi, H.; Clausse, A.; Naumann, M.; Kümmerle, J.; Königshof, H.; Stiller, C.; de La Fortelle, A.; et al. INTERACTION Dataset: An International, Adversarial and Cooperative Motion Dataset in Interactive Driving Scenarios with Semantic Maps. arXiv 2019, arXiv:1910.03088.
25. Gu, J.; Sun, C.; Zhao, H. DenseTNT: End-to-End Trajectory Prediction from Dense Goal Sets. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2021; pp. 15283–15292.
26. Liu, Y.; Zhang, J.; Fang, L.; Jiang, Q.; Zhou, B. Multimodal Motion Prediction with Stacked Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2021; pp. 7573–7582.
27. Da, F.; Zhang, Y. Path-Aware Graph Attention for HD Maps in Motion Prediction. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2022; pp. 6430–6436.
28. Zhang, L.; Li, P.; Chen, J.; Shen, S. Trajectory Prediction with Graph-based Dual-scale Context Fusion. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS); IEEE: New York, NY, USA, 2022; pp. 11374–11381.
29. Bhattacharyya, P.; Huang, C.; Czarnecki, K. SSL-Lanes: Self-Supervised Learning for Motion Forecasting in Autonomous Driving. In Proceedings of the 6th Conference on Robot Learning (CoRL), Auckland, New Zealand, 14–18 December 2022; PMLR: Cambridge, MA, USA, 2023; pp. 1793–1805. Available online: https://proceedings.mlr.press/v205/bhattacharyya23a.html (accessed on 25 May 2026).
30. Park, D.; Ryu, H.; Yang, Y.; Cho, J.; Kim, J.; Yoon, K.J. Leveraging Future Relationship Reasoning for Vehicle Trajectory Prediction. In Proceedings of the International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, 1–5 May 2023; Available online: https://openreview.net/forum?id=CGBCTp2M6lA (accessed on 25 May 2026).
31. Aydemir, G.; Akan, A.K.; Güney, F. ADAPT: Efficient Multi-Agent Trajectory Prediction with Adaptation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2023; pp. 8261–8271.
32. Choi, S.; Kim, J.; Yun, J.; Choi, J.W. R-Pred: Two-Stage Motion Prediction Via Tube-Query Attention-Based Trajectory Refinement. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2023; pp. 8491–8501.
33. Zhang, L.; Li, P.; Liu, S.; Shen, S. SIMPL: A Simple and Efficient Multi-Agent Motion Prediction Baseline for Autonomous Driving. IEEE Robot. Autom. Lett. 2024, 9, 3767–3774.
34. Huang, Z.; Li, Y.; Li, D.; Mu, Y.; Qin, H.; Zheng, N. Post-interactive Multimodal Trajectory Prediction for Autonomous Driving. Transp. Res. Part C Emerg. Technol. 2025, 179, 105271.
35. Casas, S.; Gulino, C.; Suo, S.; Luo, K.; Liao, R.; Urtasun, R. Implicit Latent Variable Model for Scene-Consistent Motion Forecasting. In Proceedings of the Computer Vision–ECCV 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 624–641.
36. Zhao, H.; Gao, J.; Lan, T.; Sun, C.; Sapp, B.; Varadarajan, B.; Shen, Y.; Shen, Y.; Chai, Y.; Schmid, C.; et al. TNT: Target-driveN Trajectory Prediction. In Proceedings of the 2020 Conference on Robot Learning (CoRL 2020), Virtual, 16–18 November 2020; PMLR: Cambridge, MA, USA, 2021; pp. 895–904. Available online: https://proceedings.mlr.press/v155/zhao21b.html (accessed on 25 May 2026).
37. Gilles, T.; Sabatini, S.; Tsishkou, D.; Stanciulescu, B.; Moutarde, F. THOMAS: Trajectory Heatmap Output with Learned Multi-Agent Sampling. In Proceedings of the International Conference on Learning Representations (ICLR 2022), Virtual, 25–29 April 2022; Available online: https://openreview.net/forum?id=QDdJhACYrlX (accessed on 25 May 2026).
38. Rowe, L.; Ethier, M.; Dykhne, E.H.; Czarnecki, K. FJMP: Factorized Joint Multi-Agent Motion Prediction over Learned Directed Acyclic Interaction Graphs. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2023; pp. 13745–13755.

Figure 1. Visualization of contextual attention distributions without and with ProFocus.

Figure 2. Overview of the ProFocus framework, which consists of two components: the potential proposal generator (top) and the proposal-aware trajectory refiner (bottom). The generator first encodes the scene context with the scene encoder and then produces coarse-grained potential future trajectory proposals through the potential proposal decoder. The refiner uses these proposals to guide the proposal-aware scene encoder in encoding spatio-temporal context, and then refines the proposals into accurate, plausible trajectories via the proposal-aware trajectory decoder.

Figure 3. Illustration of the spatio-temporal aware focal attention. The query, key, and value embeddings are first projected by linear layers. A learnable control parameter, derived from the spatio-temporal relation, is introduced to adaptively scale the attention logits before the softmax operation.
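
The mechanism described in this caption can be illustrated with a minimal single-head sketch in plain Python. It is not the authors' implementation: the linear projections are omitted, and the per-key scalar scale, the softplus mapping, and the names `relations`, `w`, and `b` are all assumptions chosen for illustration.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)); the result is always positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_attention(query, keys, values, relations, w, b):
    """Single-head attention whose logits are rescaled per key.

    relations[j] is a hypothetical feature vector describing the
    spatio-temporal relation between the query and key j. The
    hypothetical learnable parameters w and b map it to a positive
    scale that multiplies the logit before the softmax.
    """
    d = len(query)
    logits = []
    for key, rel in zip(keys, relations):
        score = sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        scale = softplus(sum(wi * ri for wi, ri in zip(w, rel)) + b)
        logits.append(scale * score)
    weights = softmax(logits)
    dim = len(values[0])
    output = [sum(a * v[i] for a, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Toy usage: the same scores with a small and a large relation-derived scale.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.5, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
relations = [[1.0], [1.0]]
_, flat_weights = focal_attention(query, keys, values, relations, w=[0.0], b=0.0)
_, sharp_weights = focal_attention(query, keys, values, relations, w=[4.0], b=0.0)
```

In this toy setup a larger scale concentrates more weight on the best-matching key, which is the sharpening behaviour the caption attributes to the control parameter.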

Figure 4. Relationship between lane segment attention and distance to the target agent. ProFocus-MHA and ProFocus-STFA (ProFocus with MHA and with STFA in the encoder, respectively) show a sharper decay of attention with increasing distance than the baseline, indicating a stronger focus on nearby and proposal-relevant lane segments. ProFocus-STFA achieves the best attention distribution as distance increases.

Figure 5. Relationship between agent attention and distance to the target agent. ProFocus-MHA and ProFocus-STFA show stronger attention to nearby agents and a clearer distance-aware decay than the baseline, indicating improved focus on socially relevant neighboring agents. ProFocus-STFA achieves the best attention distribution as distance increases.

Figure 6. Qualitative comparison of multi-modal trajectory predictions. Rows correspond to SIMPL, LAformer, and ProFocus. The red circle denotes the target agent, blue polylines indicate the history trajectory, green curves represent predicted trajectories, and the purple star marks the ground truth.

Table 1. Runtime details on Argoverse 1 and INTERACTION.

Dataset	Agents	Lanes	Preprocess	Candidate Generation	Batched Inference
Argoverse 1	22.71	176.23	1.36 ms	10.24 ms	16.41 ms
INTERACTION	11.90	62.47	0.08 ms	7.99 ms	10.49 ms

Table 2. Experimental configurations and performance of the compared methods on the Argoverse 1 validation set.

Method	MR ↓	minFDE ↓	minADE ↓	Stages	Input Format	Map Encoder	Output Modes
DenseTNT [25] (ICCV 2021)	0.10	1.05	0.73	1	Graph	GNN	6
mmTransformer [26] (CVPR 2021)	0.11	1.15	0.71	1	Sequence	MLP	6
PAGA [27] (ICRA 2022)	0.09	1.02	0.69	1	Graph	GNN	6
HiVT [5] (CVPR 2022)	0.09	0.96	0.66	1	Graph	GNN	6
DSP [28] (IROS 2022)	0.09	0.98	0.69	1	Graph	GNN	6
SSL-Lanes [29] (CoRL 2022)	0.09	1.01	0.70	1	Graph	GNN	6
FRM [30] (ICLR 2023)	–	0.99	0.68	1	Graph	GNN	6
ADAPT [31] (ICCV 2023)	0.08	0.95	0.67	1	Sequence	MLP	6
R-Pred [32] (ICCV 2023)	0.09	0.95	0.66	2	Sequence	MLP	6
PBP [11] (ICRA 2024)	0.10	1.01	–	2	Graph	GNN	6
SIMPL [33] (RAL 2024)	0.08	0.95	0.66	1	Sequence	MLP	6
LAformer [22] (CVPRW 2024)	–	0.92	0.64	2	Sequence	MLP	6
Pioformer [34] (Transp. Res. Part C 2025)	0.09	0.95	0.66	3	Graph	GNN	6
FINet [12] (ICCV 2025)	0.09	0.95	0.59	2	Sequence	Mamba	6
ProFocus (Ours)	0.07	0.88	0.64	2	Graph	GNN	6

↓ indicates that lower values are better. “–” indicates that the corresponding metric was not reported in the original paper or official source under the same Argoverse 1 validation setting. Bold and underlined values indicate the best and second-best performance among the compared methods, respectively.
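
As a reading aid for the metric columns, the sketch below computes minADE, minFDE, and MR for multi-modal predictions in plain Python. It rests on assumptions: each minimum is taken independently over the modes, and a sample counts as a miss when its minFDE exceeds a threshold, set here to the 2.0 m commonly used for Argoverse 1. The function names are hypothetical and the official benchmark tooling may differ in detail.

```python
import math

def min_ade_fde(modes, gt):
    """Return (minADE, minFDE) for one agent.

    modes: K candidate trajectories, each a list of (x, y) points.
    gt:    ground-truth trajectory with the same number of points.
    """
    ades, fdes = [], []
    for traj in modes:
        errs = [math.dist(p, g) for p, g in zip(traj, gt)]
        ades.append(sum(errs) / len(errs))  # average displacement error
        fdes.append(errs[-1])               # final displacement error
    return min(ades), min(fdes)

def miss_rate(samples, threshold=2.0):
    """Fraction of (modes, gt) samples whose minFDE exceeds the threshold."""
    misses = sum(1 for modes, gt in samples if min_ade_fde(modes, gt)[1] > threshold)
    return misses / len(samples)

# Toy usage with two modes over three time steps.
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
modes = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)],
]
min_ade, min_fde = min_ade_fde(modes, gt)
mr = miss_rate([(modes, gt)])
```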

Table 3. Accuracy-efficiency comparison on the Argoverse 1 validation set. All latency measurements are conducted on a GeForce RTX 3090.

Method	Params (M) ↓	Latency (ms) ↓	MR ↓	minFDE ↓	minADE ↓
HiVT-128 [5]	2.56	45.60	0.09	0.96	0.66
SIMPL [33]	1.80	12.69	0.08	0.95	0.66
ProFocus (Ours)	4.53	28.01	0.07	0.88	0.64

↓ indicates that lower values are better. Bold and underlined values indicate the best and second-best performance among the compared methods, respectively.
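
A quick arithmetic check relates this table to Table 1: the three Argoverse 1 stage timings listed there add up to the ProFocus latency reported here. Reading the latency column as an end-to-end figure is an inference from the numbers, not a statement in the source, and the dictionary keys are illustrative names.

```python
# Argoverse 1 stage timings from Table 1, in milliseconds.
stages_ms = {
    "preprocess": 1.36,
    "candidate_generation": 10.24,
    "batched_inference": 16.41,
}

# Round to two decimals to avoid floating-point noise in the sum.
total_ms = round(sum(stages_ms.values()), 2)
print(f"Sum of stages: {total_ms} ms")
```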

Table 4. Experimental configurations and performance of the compared methods on the INTERACTION validation set.

Method	MR ↓	minFDE ↓	minADE ↓	Stages	Input Format	Map Encoder	Output Modes
ILVM [35] (ECCV 2020)	0.1980	0.8400	–	1	Graph	GNN	6
TNT [36] (PMLR 2021)	–	0.6700	0.2100	3	Graph	GNN	6
SceneTransformer [6] (ICLR 2022)	0.1180	0.8400	–	1	Sequence	Transformer	6
THOMAS [37] (ICLR 2022)	0.1180	0.7600	–	2	Raster	CNN	6
FJMP [38] (CVPR 2023)	0.0810	0.6230	0.1930	2	Graph	GNN	6
ProFocus (Ours)	0.0659	0.5682	0.1801	2	Graph	GNN	6

↓ indicates that lower values are better. “–” indicates that the corresponding metric was not reported in the original paper or official source under the same INTERACTION validation setting. Bold and underlined values indicate the best and second-best performance among the compared methods, respectively.

Table 5. Component study of ProFocus on the Argoverse 1 dataset.

Baseline	Proposal-Aware Encoder	Spatio-Temporal Aware Focal Attention	MR ↓	minFDE ↓	minADE ↓
✓			0.0833	0.9994	0.7039
✓	✓		0.0820	0.9865	0.6966
✓		✓	0.0829	0.9910	0.7031
✓	✓	✓	0.0810	0.9742	0.6940

↓ indicates that lower values are better. ✓ indicates that the corresponding component is used in the model. Bold values indicate the best performance among the compared methods.
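
The relative gain of the full model over the baseline can be recomputed from the first and last rows of this table. The short sketch below does that; the helper name `relative_reduction` is an illustrative choice.

```python
# First row (baseline only) and last row (both components) of Table 5.
baseline = {"MR": 0.0833, "minFDE": 0.9994, "minADE": 0.7039}
full_model = {"MR": 0.0810, "minFDE": 0.9742, "minADE": 0.6940}

def relative_reduction(base, new):
    """Percentage by which `new` is lower than `base`."""
    return 100.0 * (base - new) / base

reductions = {
    name: relative_reduction(baseline[name], full_model[name])
    for name in baseline
}
for name, pct in reductions.items():
    print(f"{name}: {pct:.2f}% lower than the baseline")
```

Because the table entries are rounded to four decimals, percentages recomputed this way can differ slightly from figures derived from unrounded results.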

Table 6. Ablation study of the proposal-aware encoder.

Baseline	Proposal-Aware Map Encoder	Proposal-Aware History Encoder	MR ↓	minFDE ↓	minADE ↓
✓			0.0833	0.9994	0.7039
✓	✓		0.0815	0.9833	0.6966
✓		✓	0.0821	0.9801	0.6979
✓	✓	✓	0.0810	0.9742	0.6940

↓ indicates that lower values are better. ✓ indicates that the corresponding component is used in the model. Bold values indicate the best performance among the compared methods.

Table 7. Fusion pattern experiment.

Fusion Pattern	MR ↓	minFDE ↓	minADE ↓
Baseline	0.0833	0.9994	0.7039
Concatenate	0.0925	0.9930	0.7046
Multi-head attention	0.0815	0.9893	0.7032
Temperature-based attention	0.08399	0.9923	0.6994
Spatio-temporal focal attention	0.0811	0.9821	0.6945