Disentangling Interaction and Intention for Long-Tail Pedestrian Trajectory Prediction

Yang, Chengkai; Liu, Jincheng; Dong, Xingping

doi:10.3390/computers15030186


by Chengkai Yang ¹,², Jincheng Liu ¹,² and Xingping Dong ¹,²,*

¹ School of Computer Science, Wuhan University, Wuhan 430072, China

² National Engineering Research Center for Multimedia Software, Institute of Artificial Intelligence, Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, China

* Author to whom correspondence should be addressed.

Computers 2026, 15(3), 186; https://doi.org/10.3390/computers15030186

Submission received: 23 February 2026 / Revised: 8 March 2026 / Accepted: 10 March 2026 / Published: 12 March 2026

(This article belongs to the Section AI-Driven Innovations)


Abstract

Pedestrian trajectory prediction remains a challenging task, particularly in long-tail scenarios where goal distributions are sparse and inter-agent behaviors are uncertain. In this work, we propose to disentangle the trajectory prediction task into two complementary components: interaction modeling and intention modeling. For interaction modeling, we introduce an adaptive meta-strategy that proactively extracts latent and rare-yet-critical interaction patterns often overlooked by conventional trajectory-only approaches. For intention modeling, we propose Continuous Waypoint Slot-Driven Prototypical Contrastive Learning (PCL). It adapts prototype learning to the multi-modal reality where conventional PCL fails to model diverse and continuous goal distributions. Capitalizing on the complementary strengths of both components, we orchestrate a unified frequency-based fusion module that seamlessly integrates interaction and intention modeling, yielding enhanced overall prediction accuracy. In particular, our method is model-agnostic and can be seamlessly incorporated into a wide range of existing prediction frameworks. Extensive experiments on several datasets demonstrate that our approach not only achieves consistent performance gains in standard settings, but also significantly alleviates degradation on hard or long-tail trajectory samples.

Keywords: pedestrian trajectory prediction; abnormal interaction; PCL; long-tail goals; frequency-based; model-agnostic

1. Introduction

Trajectory prediction, essential for intelligent systems [1,2,3], has evolved from kinematic models [4,5,6] and traditional ML [7,8,9] to data-driven deep learning approaches [9,10,11,12,13,14] that better capture complex interactions. A recent study [15] employed contrastive learning to mitigate accuracy degradation in rare trajectories.

Although these studies [15,16] concentrate on the hard case issue, they rarely analyze the two fundamental sources of long-tail behaviors: abnormal social interactions and rare goal intentions. Consequently, by applying a monolithic autoencoder to inherently continuous trajectories without disentangling these underlying causes, their approaches suffer from limited reconstruction quality [17] and produce ill-suited discrete representations. Furthermore, their evaluation, often restricted to a single algorithm, leaves generalization to other representative methods unclear. As a result, they struggle to capture rare events while maintaining accuracy on common cases. Such rare samples often stem from either abnormal inter-agent interactions or uncommon goals that require distinct modeling strategies. To address this, we propose explicitly decoupling interaction and intention into modular components, allowing independent adaptation and improved generalization across both frequent and rare scenarios.

In modeling social interactions, we address two key limitations: the lack of explicit, interpretable uncertainty representation beyond Gaussian noise [18] and the failure to capture rare-but-critical abnormal behaviors (e.g., collision avoidance). To overcome these, a richer representation using distance (to describe interaction intensity), speed (to describe motion intent), and angle (to describe social structure) is necessary, since it provides a more physically meaningful and dynamic description than a static snapshot of relative positions. As shown in Figure 1a, pedestrians that move at extremely high speeds or remain stationary can indicate abnormal interactions to which an ego agent should pay more attention. In intention modeling, although many studies generate goal candidates to assist intention modeling, prior methods [15,16,19] do not emphasize the continuity and long-tail characteristics of continuous trajectory goals, such as those in Figure 1b, resulting in limited interpretability and performance. In this work, we propose an approach that jointly considers interactions and intentions.

Meteorologists predict future weather patterns by analyzing satellite imagery that captures the evolving formations of anomalous cloud masses rather than examining the distribution of individual droplets [20]. Inspired by this high-level perspective, our work focuses on extracting abnormal social interactions and intentions by identifying structured patterns rather than modeling each in isolation. To address the challenges in modeling pedestrian intention and interaction in complex scenarios, we propose the following approach: (1) We first decouple social interaction from intention. (2) Social interactions are mapped to a Distance–Velocity–Angle (DVA) space modeled as multivariate normal distributions. A Gaussian Mixture Model (GMM) classifies interactions, enabling offline generation of an abnormal interaction database for scenario-specific aggregation. (3) An enhanced PCL framework incorporates a GMM-based module that generates pseudo-labels of rare intentions to supervise motion prediction while maintaining continuity, which eliminates ill-suited discrete representation and the need for training a separate autoencoder. (4) To ensure robust performance across both challenging cases and the entire dataset, we integrate (2) and (3) into a frequency-sensitive decoder which regards rare intentions predicted by (3) as goal-driven to generate complete trajectories. Our method demonstrates consistent improvements over several recent representative trajectory prediction models on multiple benchmarks.

The contributions of our work are summarized as follows: (1) a framework that decouples interactions and intentions for motion forecasting; (2) a DVA space that models interaction uncertainty, coupled with offline GMM-based preprocessing for efficient extraction and aggregation of abnormal interactions; (3) an improved prototypical contrastive learning method for rare intentions under continuous goal labels; (4) a frequency-sensitive decoder that combines abnormal interactions and intentions and can be seamlessly connected to existing methods. To our knowledge, this is the first work to jointly study the uncertainty and long-tail behavior of two important factors (interactions and intentions). Extensive experiments and in-depth analysis confirm that our method consistently outperforms several recent representative methods.

2. Related Work

Social Interaction Modeling. Social pooling methods [21,22,23] propagate neighboring pedestrians’ temporal features to the target agents. NADP [24] is a decoupled pedestrian trajectory prediction network that uses a near-aware attention module to extract core spatiotemporal features for prediction. While early attention-based approaches [11,23,25] employ factorized structures $O(N^{2})$ to model social interactions, recent methods like T-MLSTG [26] (GNN-based), PMITra [27] (GNN-based), QCNet [28] (GNN-based), IA-STDGNN [29] (DGNN-based), and FJMP [30] (directed graphs) exhibit strong structural biases. SocialCircle [31] innovatively adopts angle-space aggregation, inspired by marine echo localization, but its uniform averaging of meta-components in angle space may overlook critical abnormal interactions in another feature dimension, which may be a key focus for motion prediction.

Intention Capturing in Trajectory Prediction. Agents have uncertain intentions (e.g., a person standing at an intersection can either go straight or turn left, as long as they behave in accordance with traffic rules). Most of the present methods regard the future trajectory of the ego agent as an M mixture distribution, where M denotes ’multi-modalities’. To model such uncertainty, PGP [18] decomposes it into lateral variability of anchor-based map candidates and longitude variability, which can be regarded as the random noise of normal distribution. MELON [32] decouples trajectory decoding into specialized modules with adaptive spatiotemporal uncertainty quantification and a streaming prediction scheme, achieving state-of-the-art performance on complex urban traffic datasets. PPT [33] uses a progressive learning architecture, modeling short-term goals and long-term goals in two stages. Current generative approaches [13,25,34,35,36,37,38] suffer from unreliable uncertainty modeling due to their uninterpretable noise sampling, highlighting the need for robust heuristic rules.

Long-Tail Distribution in Trajectory Prediction. Manual class balancing fails with increasing categories, inducing long-tail distributions. Previous studies [39,40] have addressed this imbalance for categorical targets. Prototypical contrastive learning (PCL) [41,42] addresses long-tailed data distributions by learning underlying features through instance discrimination, as demonstrated in previous work [41,43,44]. Pedestrian trajectories also exhibit long-tail distributions (e.g., turns vs. straight paths) owing to inertial motion in unchanged scenes. FEND [15] proposes a future enhanced contrastive learning framework and a hypernetwork to recognize these long-tailed patterns. TrACT [16] proposes to incorporate richer information on training dynamics into a prototypical contrastive learning framework. Hi-SCL [19] tackles long-tailed trajectory prediction with hierarchical wave-semantic contrastive learning. Unlike prior methods [15,16,19] that uniformly amplify minority patterns and distort the majority feature space, our approach selectively extracts long-tail signals as a plugin module, preserving the core representations and accuracy of normal data. Because pedestrian trajectory prediction is a continuous multi-modal regression problem, applying traditional prototype learning methods to it directly is not feasible. Prior methods [15,16,19] often ignore the continuity imbalance of trajectory intention distributions, while applying a direct binning method such as [45] to the trajectory prediction task generates memory-intensive $O(N^{2})$ 2D grids, making the precise calibration of bin size critical as well. Moreover, grid-based discretization of 2D coordinates fails to preserve the underlying data structure in clustering tasks, as rigid spatial partitioning disregards intrinsic density variations.

3. Method

3.1. Problem Formulation

Denote the past trajectory of ego pedestrian $i$ with $t_{h}$ timesteps as $O^{i} = (p_{1}^{i}, p_{2}^{i}, \dots, p_{t_{h}}^{i})$, where each 2D point is $p_{t}^{i} = (x_{t}^{i}, y_{t}^{i})$. We aim to forecast their future trajectory $F^{i} = (p_{t_{h} + 1}^{i}, p_{t_{h} + 2}^{i}, \dots, p_{t_{h} + t_{f}}^{i})$ based on $O^{i}$ and the past trajectories of all their $N_{a}$ neighbors $O^{/i} = \{O^{j} \mid 1 \leq j \leq N_{a}\}$, where $j \in \mathrm{neighbor}(i)$. ‘Neighbors’ refers to non-ego agents (e.g., optionally filtered by distance). The motion forecasting task is to find an optimal model $\theta^{*} = \arg\max_{\theta} P(F^{i} \mid O^{i}, O^{/i})$. In real-world scenarios, pedestrians usually decide their trajectories in a three-step manner that can be decomposed into a realistic cognitive-behavioral process: risk perception → goal formation → motion execution. Previous studies [46,47,48] disentangle the latent space and construct a Bayesian Network in deep learning tasks. In our work, we disentangle lateral variations into a social interaction part $z_{social}$ ($z_{soc}$) and an individual intention part $z_{intention}$ ($z_{int}$), and we factorize $P(F^{i} \mid O^{i}, O^{/i})$ via Equation (1):

$P(F^{i} \mid O^{i}, O^{/i}) = \int_{z_{soc}} \int_{z_{int}} P(z_{soc} \mid O^{i}, O^{/i}) \cdot P(z_{int} \mid z_{soc}) \cdot P(F^{i} \mid z_{int}, z_{soc}) \, dz_{soc} \, dz_{int} .$

(1)

Equation (1) illustrates the decision-making process: A pedestrian first observes the surrounding environment $z_{soc}$, which subsequently informs and refines their intended goal $z_{int}$. The future trajectory ($F^{i}$) is then generated based on the combination of this contextual information and the refined goal. Figure 2a–c shows our overall framework, corresponding to the three terms in Equation (1). The key symbol definitions in the Methods section are given in Table 1.

3.2. Abnormal Social Interaction Modeling

We focus on how to model the social interaction part $z_{soc}$, which is used to simulate risk perception.

DVA Multivariate Gaussian Space. Previous studies [31,49] treat the social interactions of neighbors with ego pedestrian $i$ as social meta-components $R_{meta}^{i} = \{r_{meta}^{i \leftarrow j} \mid 1 \leq j \leq N_{a}\}$. In this work, we construct meta-components as $r_{meta}^{i \leftarrow j} = \{r_{dis}^{i \leftarrow j}, r_{vel}^{j}, r_{\theta}^{i \leftarrow j}\}$.

Relative Distance $r_{dis}^{i \leftarrow j}$. Neighbor agents exert different influences on the target agent depending on their current distances from the ego one. Formally, for any neighbor j of ego pedestrian i,

$r_{dis}^{i \leftarrow j} = {\| p_{t_{h}}^{i} - p_{t_{h}}^{j} \|}_{2} .$

(2)

Absolute Velocity $r_{vel}^{j}$. Not only do high-velocity neighbors pose a greater threat, but static agents can too, especially if they are close by, as this may prompt the target pedestrian to take proactive avoidance measures. Formally, for any neighbor j of an ego pedestrian i,

$r_{vel}^{j} = {\| p_{t_{h}}^{j} - p_{0}^{j} \|}_{2} .$

(3)

Relative Angle $r_{\theta}^{i \leftarrow j}$. Pedestrians often take advantage of the relative orientation angle to judge their surroundings (e.g., if there are crowds to the north). Formally, for any neighbor j of an ego pedestrian i,

$r_{\theta}^{i \leftarrow j} = \mathrm{atan2}(y_{t_{h}}^{i} - y_{t_{h}}^{j}, x_{t_{h}}^{i} - x_{t_{h}}^{j}) .$

(4)
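As a minimal sketch of Equations (2)–(4), the three meta-component features can be computed from two observed tracks as follows. The function name `dva_meta_component` and the list-of-tuples track format are illustrative assumptions, and the first observed point is used in place of $p_{0}^{j}$.

```python
import math

def dva_meta_component(ego_track, nbr_track):
    """Return (distance, velocity, angle) of neighbor j relative to ego i.

    Each track is a list of (x, y) points over the observed timesteps.
    Hypothetical helper: the name and input format are not from the paper.
    """
    ex, ey = ego_track[-1]   # ego position at the last observed step t_h
    nx, ny = nbr_track[-1]   # neighbor position at t_h
    sx, sy = nbr_track[0]    # neighbor position at the first observed step (assumed p_0)
    r_dis = math.hypot(ex - nx, ey - ny)    # Equation (2)
    r_vel = math.hypot(nx - sx, ny - sy)    # Equation (3): displacement over the window
    r_theta = math.atan2(ey - ny, ex - nx)  # Equation (4)
    return r_dis, r_vel, r_theta
```

A stationary neighbor yields `r_vel == 0`, which is one of the cases the text flags as potentially abnormal.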

We project $R_{meta}$ into a multivariate Gaussian space $\{\mu_{meta}, \Sigma_{meta}\}$ to reflect its uncertainty, where

$\left\{ \begin{matrix} \mu_{meta}^{i \leftarrow j} = [\mu_{dis}^{i \leftarrow j}, \mu_{vel}^{i \leftarrow j}, \mu_{\theta}^{i \leftarrow j}]; \\ \Sigma_{meta} = \begin{pmatrix} \sigma_{dis}^{2} & \rho_{dis,vel} \sigma_{dis} \sigma_{vel} & \rho_{dis,\theta} \sigma_{dis} \sigma_{\theta} \\ \rho_{dis,vel} \sigma_{dis} \sigma_{vel} & \sigma_{vel}^{2} & \rho_{vel,\theta} \sigma_{vel} \sigma_{\theta} \\ \rho_{dis,\theta} \sigma_{dis} \sigma_{\theta} & \rho_{vel,\theta} \sigma_{vel} \sigma_{\theta} & \sigma_{\theta}^{2} \end{pmatrix} . \end{matrix} \right.$

(5)

Abnormal Social Meta-Components. Abnormal social behaviors at the scene level have certain commonalities (e.g., in congested environments, agents may change direction to accelerate overtaking, while high-velocity agents will prioritize avoiding static neighbors on their paths forward). However, these interactions in DVA space may be long-tail and influenced by the combined effects of multiple independent variables (e.g., in the previous example, ‘high-velocity’ corresponds to absolute velocity $r_{vel}$ and ‘on the forward path’ corresponds to relative angle $r_{\theta}$; the two factors interact to create the aforementioned long-tail scenarios). To deal with these long-tail social behaviors, a GMM $\Theta_{abn} = \sum_{n = 1}^{N_{abn}} \lambda_{n} N(\mu_{n}, \Sigma_{n})$ with $N_{abn}$ components is constructed to extract abnormal interactions. Inspired by [31], our abnormal interactions represent another spatial interactive context, so we can handle the abnormal interaction sequence along with the object trajectory ($N_{abn} = t_{h}$) for easy alignment and concatenation. In detail, for all social meta-components in the training set of $N_{tr}$ pedestrians, $R_{tr} = \{R_{meta}^{1}, R_{meta}^{2}, \dots, R_{meta}^{i}, \dots, R_{meta}^{N_{tr}}\}$, we first fit $\Theta_{abn}$ to them; the optimization goal is to maximize the log-likelihood $\log L$ in Equation (6) through an EM algorithm.

$\log L(\Theta_{abn} \mid R_{tr}) = \sum_{i = 1}^{N_{tr}} \log \left( \sum_{n = 1}^{N_{abn}} \lambda_{n} N(R_{meta}^{i}; \{\mu_{n}, \Sigma_{n}\}) \right),$

(6)

where $\sum_{n = 1}^{N_{abn}} \lambda_{n} = 1$ and $\forall \lambda_{n} \geq 0$. $N(\mu_{n}, \Sigma_{n})$ is the probability density function of a single trivariate Gaussian component representing the distribution of relative distance, absolute velocity, and relative angle.

We compute log-likelihoods $\log P(R_{meta}^{i} \mid \Theta_{abn}) = \log \sum_{n = 1}^{N_{abn}} \lambda_{n} \cdot N(R_{meta}^{i}; \{\mu_{n}, \Sigma_{n}\})$ for each social meta-component in $R_{tr}$, then extract abnormal interactions $R_{abn} = \{R_{meta}^{i} \mid \log P(R_{meta}^{i} \mid \Theta_{abn}) < \epsilon_{abn}\}$ by filtering out normal ones, where $\epsilon_{abn}$ is the threshold that defines abnormal social interactions. After that, $\Theta_{abn}$ is fitted to the $R_{abn}$ set a second time, initialized with the centers of the normal social meta-components from the first fitting stage, via Equation (7):

$\log L(\Theta_{abn} \mid R_{abn}) = \sum_{i = 1}^{|R_{abn}|} \log \left( \sum_{n = 1}^{N_{abn}} \lambda_{n} N(R_{abn}^{i}; \{\mu_{n}, \Sigma_{n}\}) \right),$

(7)

where $\{\mu_{n}, \Sigma_{n}\}$ is named the n-th abnormal interaction base after the second fit. The workflow diagram in Figure 3 describes the above two-stage GMM fitting procedure, which can be conducted in an offline manner.

When training or performing inference online, $I_{abn}$ is the indicator function used to judge whether pedestrian $j$ is an abnormal neighbor to which the ego pedestrian $i$ should pay attention. $\epsilon_{abn2}$ is the filtering threshold for abnormal social interactions during training and evaluation.

$I_{abn}^{i \leftarrow j} = \left\{ \begin{matrix} 0 & \log P(R_{meta}^{i \leftarrow j} \mid \Theta_{abn}) \leq \epsilon_{abn2}; \\ 1 & \text{otherwise} . \end{matrix} \right.$

(8)
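The indicator in Equation (8) can be sketched as a thresholded mixture log-likelihood. This is only an illustration under stated assumptions: covariances are taken to be diagonal for brevity (the paper uses full trivariate covariances), mixture weights are assumed strictly positive, and the names `gmm_log_likelihood` and `is_abnormal` are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (per-dimension variances)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log sum_n lambda_n N(x; mu_n, Sigma_n), evaluated with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)  # subtract the largest term for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def is_abnormal(x, weights, means, variances, eps_abn2):
    """Indicator of Equation (8): 0 if the log-likelihood is <= eps_abn2, else 1."""
    return 0 if gmm_log_likelihood(x, weights, means, variances) <= eps_abn2 else 1
```

Because $\Theta_{abn}$ is refitted on abnormal interactions in the second stage, a high likelihood under it marks a neighbor as abnormal, which is why the indicator returns 1 above the threshold.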

Denote $A^{i} = \{R_{meta}^{i \leftarrow j} \mid I_{abn}^{i \leftarrow j} = 1, j \in \mathrm{neighbor}(i)\}$ as the abnormal neighbor set for $i$. For an abnormal neighbor $j$ of $i$, $R_{meta}^{i \leftarrow j}$ is decomposed and projected onto the abnormal social component bases generated during training via Equation (9):

$P(R_{meta}^{i \leftarrow j}) = \sum_{n = 1}^{N_{abn}} \lambda_{n}^{i \leftarrow j} \cdot N(R_{meta}^{i \leftarrow j}; \{\mu_{n}, \Sigma_{n}\}),$

(9)

where $\sum_{n = 1}^{N_{abn}} \lambda_{n}^{i \leftarrow j} = 1$ and $\lambda_{n}^{i \leftarrow j} \geq 0$.

Under the assumption of mutual independence among abnormal interaction bases, we aggregate the abnormal neighbors into $s^{i}$ according to their projection lengths $\lambda_{n}^{i \leftarrow j}$ onto these bases as in Equation (10), where n represents the n-th abnormal social interaction base. Note that we adopt the reparameterization trick [50] to generate the n-th component of $s^{i}$ here.

$s_{n}^{i} \sim N \left( \frac{1}{|A^{i}|} \sum_{j} \lambda_{n}^{i \leftarrow j} \mu_{n}, \frac{1}{{|A^{i}|}^{2}} {\left( \sum_{j} \lambda_{n}^{i \leftarrow j} \right)}^{2} \Sigma_{n} \right) .$

(10)
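A small sketch of how one component $s_{n}^{i}$ in Equation (10) might be sampled with the reparameterization trick is given below. It assumes a diagonal $\Sigma_{n}$ so that its square root is elementwise; the function name `aggregate_base` and the injectable `rng` argument are illustrative, not part of the method as published.

```python
import random

def aggregate_base(lams, mu_n, sigma_diag_n, rng=random):
    """Sample s_n^i per Equation (10) for one abnormal interaction base n.

    lams: projection weights lambda_n^{i<-j}, one per abnormal neighbor j in A^i.
    mu_n: mean of base n; sigma_diag_n: diagonal of Sigma_n (diagonal is an assumption).
    """
    scale = sum(lams) / len(lams)                        # (1/|A^i|) * sum_j lambda_n^{i<-j}
    mean = [scale * m for m in mu_n]                     # mean term of Equation (10)
    std = [abs(scale) * v ** 0.5 for v in sigma_diag_n]  # square root of scale^2 * Sigma_n
    eps = [rng.gauss(0.0, 1.0) for _ in mu_n]            # noise is independent of parameters
    return [m + s * e for m, s, e in zip(mean, std, eps)]
```

Writing the sample as mean plus scaled noise keeps the draw differentiable with respect to the mean and scale when the same pattern is used inside an autodiff framework.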

Our serialized abnormal social uncertainty feature $z_{abn\_soc}$ is defined as Equation (11), where $g_{embed}$ denotes an MLP layer with the tanh activation function. Note that if an ego agent has no abnormal interaction component, we pad its social abnormal feature with zero.

$z_{abn\_soc} = \left\{ \begin{matrix} g_{embed}([s_{1}, s_{2}, \dots, s_{n}]) & |A^{i}| > 0; \\ g_{embed}([0, 0, \dots, 0]) & \text{otherwise} . \end{matrix} \right.$

(11)

The abnormal social interaction feature $z_{abn\_soc} \in \mathbb{R}^{d_{s}}$ will be concatenated ($\parallel$) with the past trajectory feature of the ego agent $f_{beh} \in \mathbb{R}^{d_{i}}$ produced by the backbone encoder. Then, a temporal attention module $g_{fuse}$ combined with a list of MLP modules is set to form our final social interaction feature $z_{soc}$:

$z_{soc} = g_{fuse}([f_{beh} \parallel z_{abn\_soc}]) .$

(12)

3.3. Rare Intention Modeling

In this part, we focus on modeling the intention feature $z_{int}$, which is used to simulate goal formation through a novel prototypical contrastive learning (PCL) method to address rare intentions. We initialize a learnable set of waypoint slots $S_{int}$ whose length equals the number of modalities $M$. Another multi-head attention layer $S_{attn}$ uses $S_{int}$ as query $Q$ and $z_{soc}$ as both key $K$ and value $V$ to encode the PCL input $z_{int}$, forming $P(z_{int} \mid z_{soc})$ via Equation (13):

$z_{int} = S_{attn}(Q = S_{int}, K = z_{soc}, V = z_{soc}) .$

(13)

The number of intention bases is $N_{int}$. We adopt a two-stage GMM fitting paradigm in which the first stage captures global goal distributions and the second stage focuses on long-tail intentions. In detail, we regard the last points of future trajectories in the training set, $e = [e^{1}, e^{2}, \dots, e^{N_{tr}}]$, as goal intentions and first fit them with a GMM $\Theta_{e1} = \{\mu_{e1}, \Sigma_{e1}\}$ with $\frac{N_{int}}{2}$ components. The goals with the lowest log-likelihood scores $\log P(e \mid \Theta_{e1})$, selected according to the ratio $R_{int}$ and denoted as $e_{r}$, are filtered out and defined as rare intentions. Then, we use another GMM $\Theta_{e2}$ with $\frac{N_{int}}{2}$ components to fit $e_{r}$. We then combine the two GMMs $\Theta_{e1}$ and $\Theta_{e2}$ to form the new, larger $\frac{N_{int}}{2} \times 2 = N_{int}$-component GMM $\Theta_{int}$ of Equation (14), whose component indices $\{1, 2, \dots, n, \dots, N_{int}\}$ correspond to the Gaussian components $\{(\mu_{1}^{e}, \Sigma_{1}^{e}), \dots, (\mu_{\frac{N_{int}}{2}}^{e}, \Sigma_{\frac{N_{int}}{2}}^{e}), (\mu_{\frac{N_{int}}{2} + 1}^{e_{r}}, \Sigma_{\frac{N_{int}}{2} + 1}^{e_{r}}), \dots, (\mu_{N_{int}}^{e_{r}}, \Sigma_{N_{int}}^{e_{r}})\}$ obtained through the optimization goal in Equation (15) and used for contrastive supervision later. That is, the first $\frac{N_{int}}{2}$ Gaussian components of $\Theta_{int}$ ($\Theta_{e1}$) are generated by all endpoints $e$, while the second $\frac{N_{int}}{2}$ Gaussian components of $\Theta_{int}$ ($\Theta_{e2}$) are generated by the $10\%$ long-tail endpoints $e_{r}$.

$\left\{ \begin{matrix} \Theta_{int} = \{\Theta_{e1}, \Theta_{e2}\}; \\ \Theta_{e1} = \{(\mu_{1}^{e}, \Sigma_{1}^{e}), \dots, (\mu_{\frac{N_{int}}{2}}^{e}, \Sigma_{\frac{N_{int}}{2}}^{e})\}; \\ \Theta_{e2} = \{(\mu_{\frac{N_{int}}{2} + 1}^{e_{r}}, \Sigma_{\frac{N_{int}}{2} + 1}^{e_{r}}), \dots, (\mu_{N_{int}}^{e_{r}}, \Sigma_{N_{int}}^{e_{r}})\} . \end{matrix} \right.$

(14)

$\left\{ \begin{matrix} \log L(\Theta_{e1} \mid e) = \sum_{i = 1}^{N_{tr}} \log \left( \sum_{n = 1}^{\frac{N_{int}}{2}} \lambda_{n}^{e} N(e^{i}; (\mu_{n}^{e}, \Sigma_{n}^{e})) \right); \\ \log L(\Theta_{e2} \mid e_{r}) = \sum_{i = 1}^{\frac{N_{tr}}{10}} \log \left( \sum_{n = \frac{N_{int}}{2} + 1}^{N_{int}} \lambda_{n}^{e_{r}} N(e_{r}^{i}; (\mu_{n}^{e_{r}}, \Sigma_{n}^{e_{r}})) \right) . \end{matrix} \right.$

(15)
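The rare-goal filtering step between the two GMM fits can be sketched as follows, assuming the first-stage log-likelihood scores are already available. The function name `select_rare_goals`, the default `ratio=0.1` (matching the 10% mentioned in the text), and the choice to keep at least one goal are illustrative assumptions.

```python
def select_rare_goals(goals, log_likelihoods, ratio=0.1):
    """Return the `ratio` fraction of goals with the lowest log-likelihood.

    goals: list of (x, y) endpoints; log_likelihoods: one first-stage score per goal.
    Keeping at least one goal is an assumption made for this sketch.
    """
    n_rare = max(1, int(len(goals) * ratio))
    # Sort goal indices from least to most likely under the first-stage GMM.
    order = sorted(range(len(goals)), key=lambda i: log_likelihoods[i])
    return [goals[i] for i in order[:n_rare]]
```

The returned subset plays the role of $e_{r}$ and would then be passed to the second-stage fit.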

In the prediction of the intention-clustered label $P_{int}$ for sample $i$, $\Theta_{int}$ is used to find the intention Gaussian component $P_{0}$ with maximum posterior probability via Equation (16):

$P_{int} = \underset{P_{0}}{\arg\max} \log P(e^{i} \mid \Theta_{int}^{P_{0}}) = \underset{P_{0}}{\arg\max} N(e^{i}; \Theta_{int}^{P_{0}}), \quad P_{0} \in \{1, \dots, N_{int}\} .$

(16)

Having obtained the clustered labels $P_{int}$, we proceed to leverage them in our contrastive learning framework to supervise the learning of intention features $z_{int}$ defined in Equation (13). Notably, a final MLP $S_{proj}$ prevents conflicts between the motion forecasting loss and the contrastive learning loss: $z_{int}$ is passed through $S_{proj}$ as the last layer for the prototypical contrastive learning but not for intention prediction. We denote the PCL encoder module list as $f_{\theta} = [S_{int}, S_{attn}, S_{proj}]$.

Traditional unsupervised contrastive learning methods like MoCo [39] take feature inputs without gradients. To align with them, we pretrain the model with only the Winner-Takes-All (WTA) ADE loss for trajectory points, $L_{pred}$, which is computed as in Equation (17), where ${\hat{p}}_{t}^{(M)}$ is the predicted trajectory point of the $M$-th modality at timestep $t$ output by our frequency-based decoder $D$ (detailed in Section 3.4) and $p_{t}$ is the ground truth.

$\left\{ \begin{matrix} {\hat{p}}_{t} = D(z_{int}, z_{soc}); \\ L_{pred}({\hat{p}}_{t}, p_{t}) = \min_{M} \sum_{t = t_{h} + 1}^{t_{h} + t_{f}} {\| {\hat{p}}_{t}^{(M)} - p_{t} \|}_{2} . \end{matrix} \right.$

(17)
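To make the winner-takes-all selection in Equation (17) concrete, the sketch below computes the summed displacement error of each predicted mode and returns the smallest one together with its index. The function name `wta_loss` and the plain-list trajectory format are illustrative assumptions; no gradient handling is shown.

```python
import math

def wta_loss(pred_modes, gt):
    """Return (loss, best_mode_index) following Equation (17).

    pred_modes: M predicted trajectories, each a list of (x, y) future points.
    gt: ground-truth future trajectory with the same number of points.
    """
    losses = [
        sum(math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(mode, gt))
        for mode in pred_modes
    ]
    best = min(range(len(losses)), key=losses.__getitem__)  # index of the winning mode
    return losses[best], best
```

In a training loop, only the winning mode would contribute to the loss, so the other slots are left free to cover different futures.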

The WTA strategy chooses the one slot among the $M$ slots with the best ADE, and gradients backpropagate exclusively through that slot to preserve diversity. Then, we freeze (*) all parameters of the pretrained backbone prediction encoder $B_{enc}^{*}$ shown in Figure 2a. This eliminates the need to design a dual momentum encoder $f_{\theta}^{'}$ containing the dual parameters of the previous abnormal social interaction modeling when we train the PCL, because the input of PCL, $z_{soc}$ (which is also the output of $B_{enc}^{*}$), receives no gradient.

Instead of applying the contrastive loss directly to the past trajectory context, we conduct our prototypical contrastive learning on multi-modal slots, using the WTA strategy in advance to maintain prediction diversity. Consequently, the feature clustering and updating procedure must be performed per iteration rather than per epoch, because the specific slot with the best ADE ($L_{pred}$ in Equation (17)) to be used in loss computation remains undetermined and differs across samples. Since the clustering operates solely on the x and y dimensions of the intention, the computation is highly efficient.

We then focus on how to define our prototypical contrastive learning loss (PCL loss) $L_{ProtoNCE}$. In Equation (21), $L_{ProtoNCE}$ consists of an instance-wise term $L_{ins}$ and an instance-prototype term $L_{proto}$. The first term brings the sample features within the class closer, while the second term maximizes inter-cluster separation. Standard PCL [39,40] assigns a sample to a single discrete class label $P_{int} = [P_{0}]$, whereas our goal intention follows a continuous distribution. For each sample $i$, we use a finite set of discrete components $P_{int}^{i} = [P_{0}^{i}, P_{1}^{i}, \dots, P_{K - 1}^{i}]$ with length $K$ to mimic the continuity. As a result, our approach handles continuous intention distributions through K-nearest GMM component allocation in the PCL loss, effectively simulating distributional continuity with multiple discrete elements rather than a single component. We verify the effectiveness of the modeling in our experiments. In detail, given an endpoint intention $e^{i}$, we only find the component $P_{int}$ with the highest posterior log-likelihood via Equation (16).

For highly efficient computation, we want to directly look up the other $K - 1$ components $[P_{1}, \dots, P_{K - 1}]$ based on $P_{0}$. In detail, we calculate the pairwise Kullback–Leibler (KL) divergence $KL_{int} \in \mathbb{R}^{N_{int} \times N_{int}}$ between every pair of multivariate Gaussian components in $\Theta_{int}$.

The indexes of the K-nearest Gaussian distributions $K_{id} \in \mathbb{R}^{N_{int} \times K}$ for each intention component are calculated to look up the KL divergence of the K-nearest neighbors (including itself) $KL_{id} \in \mathbb{R}^{N_{int} \times K}$ from $KL_{int}$, which is used to construct continuous labels and calculate our continuous contrastive loss later.
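The lookup tables described above can be sketched with the closed-form KL divergence between two 2-D Gaussians. The helper names `kl_gauss_2d` and `k_nearest_tables` are hypothetical, and the direction of the divergence (each candidate component measured against the anchor $P_{0}$) is an assumption read off the notation $KL(P_{k} \mid P_{0})$.

```python
import math

def kl_gauss_2d(mu0, cov0, mu1, cov1):
    """KL(N0 || N1) for 2-D Gaussians; each cov is ((a, b), (b, c))."""
    (a0, b0), (_, c0) = cov0
    (a1, b1), (_, c1) = cov1
    det0 = a0 * c0 - b0 * b0
    det1 = a1 * c1 - b1 * b1
    # Entries of the inverse of the symmetric 2x2 matrix cov1.
    i11, i12, i22 = c1 / det1, -b1 / det1, a1 / det1
    trace = i11 * a0 + 2.0 * i12 * b0 + i22 * c0  # tr(cov1^-1 cov0)
    dx, dy = mu1[0] - mu0[0], mu1[1] - mu0[1]
    maha = dx * (i11 * dx + i12 * dy) + dy * (i12 * dx + i22 * dy)
    return 0.5 * (trace + maha - 2.0 + math.log(det1 / det0))

def k_nearest_tables(means, covs, k):
    """Build K_id (indices) and KL_id (divergences) of the k nearest components."""
    n = len(means)
    k_id, kl_id = [], []
    for p0 in range(n):
        # Divergence of every component q from the anchor component p0.
        row = [kl_gauss_2d(means[q], covs[q], means[p0], covs[p0]) for q in range(n)]
        nearest = sorted(range(n), key=row.__getitem__)[:k]
        k_id.append(nearest)
        kl_id.append([row[q] for q in nearest])
    return k_id, kl_id
```

Since the GMM is fixed after the offline fit, both tables can be computed once and indexed by $P_{0}$ at every training iteration.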

Denoting $v_{i}$ as the intention embeddings ($z_{int}$, defined in Equation (13)) of sample $i$, we construct positive sample feature pairs $(v_{i}, v_{i+})$ and negative sample feature pairs $(v_{i}, v_{j})$. Notably, to simulate continuity of endpoints in the (x,y) two-dimensional space, $v_{i+}$ adopts a hierarchical structure containing all samples belonging to the K-nearest components of $\Theta_{int}$. In detail, $v_{i+}^{k}$ means arbitrary samples belonging to the k-th nearest component of sample $i$. We look up $KL_{id}$ and denote KL divergences between $P_{int}$ and $P_{0}$ of sample $i$ as $\{KL(P_{0}^{i} \mid P_{0}^{i}), KL(P_{1}^{i} \mid P_{0}^{i}), \dots, KL(P_{K - 1}^{i} \mid P_{0}^{i})\}$. Equation (18) shows our $L_{ins}$, where $v_{j}$ denotes an arbitrary sample in the same batch as $i$ and $r$ denotes the batch size. $\sigma$ is a softmax function used to assign weights based on the KL divergence between Gaussian components, and $\tau$ is the temperature coefficient.

$L_{ins} = - \sum_{i = 1}^{r} \sum_{k = 1}^{K} \frac{1}{|N_{i}^{k}|} \sum_{i_{+} = 1}^{|N_{i}^{k}|} \sigma(- KL(P_{k}^{i} \mid P_{0}^{i})) \cdot \log \frac{\exp(v_{i} \cdot v_{i_{+}}^{k} / \tau)}{\sum_{j = 1}^{r} \exp(v_{i} \cdot v_{j} / \tau)} .$

(18)

The prototypical features $C = [c_{1}, \dots, c_{n}, \dots, c_{N_{int}}]$ are updated per batch according to the maximum-likelihood intention-clustered labels $P_{0}$ of its samples via Equation (19), where $\alpha$ is the momentum coefficient (the update for $c_{n}$ is applied only if the batch contains a sample with label $n$) and $I(P_{0}^{j} == n)$ is the indicator function that judges whether $P_{0}$ of sample $j$ belongs to the n-th intention Gaussian component.

$c_{n}^{'} = \alpha \cdot c_{n} + (1 - \alpha) \cdot \frac{\sum_{j} I(P_{0}^{j} == n) \cdot v_{j}}{|I(P_{0}^{j} == n)|} .$

(19)
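The momentum update in Equation (19) can be illustrated with plain Python lists. The function name `update_prototypes` and the default `alpha=0.9` are assumptions for the sketch; prototypes whose label does not appear in the batch are left unchanged, as the text describes.

```python
def update_prototypes(prototypes, features, labels, alpha=0.9):
    """Momentum update of Equation (19), skipping labels absent from the batch.

    prototypes: list of prototype vectors c_n; features: batch embeddings v_j;
    labels: maximum-likelihood component index P_0 for each batch sample.
    """
    updated = [list(c) for c in prototypes]
    for n, c in enumerate(prototypes):
        members = [v for v, p0 in zip(features, labels) if p0 == n]
        if not members:
            continue  # no sample with label n in this batch: keep c_n as is
        dim = len(c)
        mean = [sum(v[d] for v in members) / len(members) for d in range(dim)]
        updated[n] = [alpha * c[d] + (1.0 - alpha) * mean[d] for d in range(dim)]
    return updated
```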

Equation (20) shows our $L_{proto}$, where $c_{i}^{k}$ is the prototype of the cluster to which the k-th nearest neighbor GMM component of sample $i$ belongs, and $c_{j}$ is the prototype of an arbitrary cluster $j$. In summary, our approach refines intention modeling by specifically targeting challenging edge-case scenarios through the training approach above. Algorithm 1 shows the whole process of our rare intention prototypical contrastive learning.

$L_{proto} = - \sum_{i = 1}^{r} \sum_{k = 1}^{K} \sigma(- KL(P_{k}^{i} \mid P_{0}^{i})) \cdot \log \frac{\exp(v_{i} \cdot c_{i}^{k} / \tau)}{\sum_{j = 1}^{N_{int}} \exp(v_{i} \cdot c_{j} / \tau)} .$

(20)

$L_{ProtoNCE} = L_{ins} + L_{proto} .$

(21)
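For a single sample, the instance-prototype term in Equation (20) might be evaluated as in the sketch below. It assumes that the softmax $\sigma$ is taken over the $K$ nearest components of that sample, which the text does not state explicitly; the function names and the default `tau=0.1` are likewise illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def proto_loss_single(v, prototypes, nearest_ids, nearest_kl, tau=0.1):
    """L_proto of Equation (20) for one sample.

    v: intention embedding; prototypes: list of prototype vectors c_1..c_N;
    nearest_ids / nearest_kl: indices and KL divergences of the K nearest
    GMM components of the sample's label P_0 (P_0 itself comes first).
    """
    weights = softmax([-kl for kl in nearest_kl])  # sigma(-KL(P_k | P_0)), axis assumed
    logits = [sum(a * b for a, b in zip(v, c)) / tau for c in prototypes]
    top = max(logits)
    log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
    # Weighted negative log-probability of each of the K nearest prototypes.
    return -sum(w * (logits[k] - log_denom) for w, k in zip(weights, nearest_ids))
```

Components closer to $P_{0}$ in KL divergence receive larger weights, so the embedding is pulled most strongly toward its own prototype and more weakly toward neighboring ones.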

Algorithm 1 Intention Prototypical Contrastive Learning

Input: KL divergence of K-nearest intention GMM components $KL_{id}$, past trajectories X, predicted trajectory in advance ${\hat{p}}_{t}$, ground truth of future trajectory $p_{t}$, past timesteps $t_{h}$, future timesteps $t_{f}$, cluster centroid feature C, momentum coefficient $\alpha$.
Parameter: intention GMM $\Theta_{int}$, frozen backbone encoder $B_{enc}^{*}$, PCL encoder $f_{\theta}$, momentum PCL encoder $f_{\theta}^{'}$.

1: Let $f_{\theta}^{'} = f_{\theta}$.
2: C ← 0.
3: while not MaxEpoch do
4:   for x in DataLoader(X) do
5:     $e = p_{t_{h} + t_{f}}$. {Regard the last point of the future trajectory, e, as the intention.}
6:     Let $P = \underset{n}{\mathrm{topK}} \log P(e \mid \Theta_{int}^{n})$. {Intention pseudo-labels $P_{int}$ as in Equation (16).}
7:     $z_{soc} = B_{enc}^{*}(x)$.
8:     $z_{int} = f_{\theta}(z_{soc})$, $z_{int}^{'} = f_{\theta}^{'}(z_{soc})$.
9:     $M^{*} = \underset{M}{\arg\min} L_{pred}({\hat{p}}_{t}, p_{t})$. {Look up the min-ADE index $M^{*}$ in advance as in Equation (17).}
10:     Update prototype features C according to Equation (19).
11:     Calculate $L_{ProtoNCE}$ based on $z_{int}[M^{*}], z_{int}^{'}[M^{*}], C$ as in Equations (18), (20) and (21).
12:     $\theta = SGD(\theta, L_{ProtoNCE})$.
13:     Momentum update $\theta^{'}$ based on $\theta$.
14:   end for
15: end while

3.4. Frequency-Sensitive Decoder Combining Interactions and Intentions

In this part, we propose a novel frequency-based decoder $D$ to integrate the social interaction part $z_{soc}$ introduced in Section 3.2 and the intention part $z_{int}$ introduced in Section 3.3, corresponding to $P(F^{i} \mid z_{int}, z_{soc})$, and then obtain the final motions $F^{i} = D(z_{int}, z_{soc})$ to execute. We have $z_{soc} \in \mathbb{R}^{t_{h} \times d}$ and $z_{int} \in \mathbb{R}^{M \times d}$, where $t_{h}$ is the number of past timesteps and $M$ is the number of modalities. As in Equation (22), we regress endpoints directly from $z_{int}$ with an FC-layer intention decoder $D_{int}$ to obtain goal endpoints $e \in \mathbb{R}^{M \times 2}$.

$e = \sum_{t = t_{h} + 1}^{t_{h} + t_{f}} \mathrm{vel}_{t} = D_{int}(z_{int}) .$

(22)

The key is to adopt an endpoint-driven method: the remaining trajectory points are completed conditionally on the terminal location, in the frequency domain, through the Discrete Fourier Transform (DFT). By decomposing a signal into its constituent sinusoids, the DFT allows deep learning models to identify and leverage repetitive patterns and structures within the data that are often obscured in the original time domain [51]. To interpolate intermediate trajectory points, we introduce a gate-weighting scheme, as shown in Equation (23):

W = [W_{s}, W_{i}] = D_{σ}(\underbrace{D_{sw}((z_{int} @ z_{soc}) @ z_{soc})}_{Social\ s}, \underbrace{D_{iw}(z_{int})}_{Intention\ i}),

(23)

where $D_{sw}$ and $D_{iw}$ are MLPs for social part s and intention part i, respectively, and $D_{σ}$ is an MLP with a softmax activation function used to normalize weights between the two parts. The output social weight and intention weight are gathered as $W = [W_{s} ∥ W_{i}]$, which is used to generate the Alternate-Current Component (ACC) of the velocity in the frequency domain after passing through an FC layer $D_{ac}$. The ACC is generated via Equation (24), which includes the Fourier components except the first Direct-Current Component (DCC):

ACC = D_{ac}(W_{s} * s + W_{i} * i).

(24)
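The gating of Equations (23) and (24) can be sketched with random linear maps standing in for the trained layers. Several details below are assumptions rather than statements from the paper: the transpose in `z_int @ z_soc.T` (needed for the shapes $M \times d$ and $t_{h} \times d$ to match), feeding $D_{σ}$ the concatenation of s and i, single linear layers in place of MLPs, and the feature sizes `d` and `d_acc`.

```python
import numpy as np

rng = np.random.default_rng(0)
t_h, M, d, d_acc = 8, 20, 16, 12   # d and d_acc are illustrative sizes

def linear(in_dim, out_dim):
    """Random linear layer standing in for a trained MLP/FC layer."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    return lambda x: x @ W

D_sw, D_iw = linear(d, d), linear(d, d)
D_sigma = linear(2 * d, 2)
D_ac = linear(d, d_acc)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_acc(z_int, z_soc):
    """Return (ACC features, gate weights) for M modalities."""
    s = D_sw((z_int @ z_soc.T) @ z_soc)                     # social part, (M, d)
    i = D_iw(z_int)                                         # intention part, (M, d)
    W = softmax(D_sigma(np.concatenate([s, i], axis=-1)))   # (M, 2), rows sum to 1
    W_s, W_i = W[:, :1], W[:, 1:]
    return D_ac(W_s * s + W_i * i), W                       # Equation (24)

acc, W = gated_acc(rng.normal(size=(M, d)), rng.normal(size=(t_h, d)))
```

The softmax makes the two weights compete, so each modality trades off how much of its non-DC spectrum is driven by social context versus intention.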

Since the real part of the DCC produced by the Discrete Fourier Transform (DFT) equals the sum of all points in the time series [52], we concatenate the goal endpoints e with a zero imaginary part to form the DCC, which can be regarded as the accumulation of the instantaneous velocity $vel_{t}$ over all future steps. Finally, we concatenate the DCC with the ACC and employ an inverse Discrete Fourier Transform (iDFT) layer to reconstruct future velocity profiles whose cumulative sum is exactly the DCC, that is, the endpoint e, so that we obtain the predicted trajectory $F^{i}$, as shown in Equation (25):

F^{i} = CumSum(iDFT(\underbrace{e + 0j}_{DCC}, ACC)).

(25)
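The property behind Equation (25) can be checked numerically: whatever the non-DC coefficients are, the reconstructed velocities sum to the DC term, so the trajectory ends at e. The sketch below is illustrative only. It assumes a one-sided spectrum inverted with `np.fft.irfft` and an even horizon, whereas the paper only specifies an iDFT layer; the function name is hypothetical.

```python
import numpy as np

def endpoint_driven_trajectory(e, acc):
    """Rebuild a future trajectory from an endpoint and non-DC Fourier terms.

    e:   (2,) endpoint displacement relative to the last observed position
    acc: (n_bins - 1, 2) complex non-DC coefficients of the velocity signal
    """
    t_f = 2 * acc.shape[0]                        # even horizon for a one-sided spectrum
    dcc = e.astype(complex)[None, :]              # DCC = e + 0j
    spectrum = np.concatenate([dcc, acc], axis=0)
    vel = np.fft.irfft(spectrum, n=t_f, axis=0)   # (t_f, 2) real velocities
    return np.cumsum(vel, axis=0)                 # positions relative to the start

rng = np.random.default_rng(0)
e = np.array([3.0, -1.5])
acc = rng.normal(size=(6, 2)) + 1j * rng.normal(size=(6, 2))
traj = endpoint_driven_trajectory(e, acc)         # 12 future steps
```

The ACC therefore only shapes how the agent gets to the goal, while the goal itself is fixed by the intention decoder.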

3.5. Loss Function

$L_{ProtoNCE}$ brings little benefit to easy samples, so we adopt a gate $θ^{'}$ to disable the PCL loss on them. Rather than assigning each sample a fixed hardness [15], we determine hardness from $L_{pred}$, which is predicted in advance and can therefore be adjusted dynamically during training. With $L_{pred}$ computed in advance under the WTA strategy, we define our loss function $L$ via Equation (26), where $λ$ is an indicator value defined via Equation (27) and $θ^{'}$ is the threshold that separates hard samples from easy ones.

L = L_{pred} + λ \cdot L_{ProtoNCE}.

(26)

λ = \begin{cases} 1, & L_{pred} > θ^{'} \text{ and not pretraining}; \\ 0, & L_{pred} \leq θ^{'} \text{ or pretraining}. \end{cases}

(27)
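Equations (26) and (27) amount to a simple per-sample gate. A minimal sketch, with a hypothetical function name and scalar losses for readability:

```python
def total_loss(l_pred, l_proto_nce, threshold, pretraining):
    """L = L_pred + lambda * L_ProtoNCE, with lambda as in Equation (27)."""
    lam = 1.0 if (l_pred > threshold and not pretraining) else 0.0
    return l_pred + lam * l_proto_nce

# Hard sample after pretraining: the PCL term is active.
hard = total_loss(1.0, 0.5, threshold=0.6, pretraining=False)
# Easy sample, or any sample during pretraining: only L_pred remains.
easy = total_loss(0.5, 0.5, threshold=0.6, pretraining=False)
```

Because `l_pred` is recomputed as training progresses, a sample that was hard early on can fall below the threshold later and stop receiving the contrastive term.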

4. Experiment

4.1. Experimental Setup

Datasets. We use two pedestrian motion forecasting datasets, ETH [53]-UCY [54] and SDD [55], in this work. Recent studies on ETH-UCY primarily used cross-dataset validation, which means training on four scenarios and testing on the held out one. We retain this setting in our approach.

ETH-UCY is a dataset of pedestrian walking scenes consisting of five sub-scenarios: eth, hotel, univ, zara1, and zara2. Trajectories are sampled at an interval of 0.4 s, giving an observation length of $t_{h} =$ 3.2 s/0.4 s = 8 steps and a prediction horizon of $t_{f} =$ 4.8 s/0.4 s = 12 steps.
Stanford Drone Dataset (SDD) is a drone dataset of human behaviors on campus. A total of 60 drone videos are used to extract 290,243 trajectories (8 steps as observed steps and 12 steps as future steps to predict) partitioned into 60% to train, 20% to validate, and 20% to test.

As in previous studies [12,31,49,56], some preprocessing layers are used to transform the coordinates of the trajectories into scene-centric ones. ‘Move’ normalizes the trajectory points $p_{t}$ at timestep t relative to the current absolute position $p_{t_{h}}^{i}$ of the ego agent via Equation (28), giving the moved trajectory points $p_{t}^{'} = (x^{'}, y^{'})$. ‘Rotate’ rotates the moved historical trajectory $p_{t}^{'} = (x^{'}, y^{'})$ according to the current heading $θ$ of the target agent, giving the final preprocessed trajectory points $p_{t}^{''} = (x^{''}, y^{''})$ via Equation (29).

p_{t}^{'} = p_{t} - p_{t_{h}}^{i}.

(28)

p_{t}^{''} = \begin{pmatrix} x^{''} \\ y^{''} \end{pmatrix} = \begin{pmatrix} \cos θ & \sin θ \\ - \sin θ & \cos θ \end{pmatrix} \begin{pmatrix} x^{'} \\ y^{'} \end{pmatrix}.

(29)
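The ‘Move’ and ‘Rotate’ steps of Equations (28) and (29) can be sketched as below. How the heading $θ$ is measured is not stated in this section; estimating it from the last observed displacement is an assumption, as are the function name and the choice to transform every point of the array.

```python
import numpy as np

def move_and_rotate(traj, t_h):
    """traj: (T, 2) absolute positions; t_h: number of observed steps."""
    moved = traj - traj[t_h - 1]                        # Equation (28)
    heading_vec = traj[t_h - 1] - traj[t_h - 2]         # assumed heading estimate
    theta = np.arctan2(heading_vec[1], heading_vec[0])
    rot = np.array([[np.cos(theta), np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])   # matrix of Equation (29)
    return moved @ rot.T, theta                         # row-vector form of R p'

# An agent walking along the diagonal ends up walking along the +x axis.
traj = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
out, theta = move_and_rotate(traj, t_h=3)
```

The matrix in Equation (29) rotates by $-θ$, so after preprocessing the agent sits at the origin at the last observed step and heads along the positive x axis.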

Backbone prediction networks. Since our method can be regarded as a plugin, we briefly introduce the recent high-performing backbone prediction networks that we use:

Multi-Style Network (MSN) [10] provides multi-style predictions with a series of style channels, each of which is bound to a unique behavior.
View Vertically ($V^{2}$-Net) [12] transforms agents’ trajectories into the frequency domain to obtain potential characteristics which could not be extracted in the time domain.
E-$V^{2}$-Net [56] introduces the Haar Transform instead of the Fourier Transform and proposes a bilinear structure to model dimension interactions.
SocialCircle(Plus) (-SC/-SCP) [31,49] is an alternative structure which can be plugged into the existing SOTA methods mentioned above to improve their performance. Inspired by the echolocation of marine animals, SocialCircle aggregates social interactions according to their relative directions. SocialCirclePlus additionally considers physical interactions alongside social interactions.

These approaches predict key points followed by linear speed interpolation [10,12,31,49,56], achieving competitive ADE/FDE on ETH-UCY and the SDD.

Evaluation Protocol and Implementation Details. In order to eliminate the influence of hyperparameters on the experimental results, we fix all the official optimal hyperparameters of the backbone prediction networks to reproduce their results before our approach is activated. The epoch with the best ADE of key points is used to represent the experimental result of a single trial. As in previous work, we adopt leave-one-out cross-dataset validation for ETH-UCY to verify whether our method generalizes well to new scenarios. Specifically, we choose four of the five ETH-UCY subsets (eth, hotel, univ, zara1, zara2) as the training set and the remaining one as the validation subset. Training and evaluation are carried out on an NVIDIA 4090 GPU with 24 GB of VRAM. In our model, the 50 nearest neighbors of the ego agent are used to compute social interactions. The number of abnormal meta-components n is set equal to the number of observation steps (eight in our study) for convenient concatenation. $ϵ_{abn}$ during the offline stage is set to −2 and $ϵ_{abn2}$ is set to 0. It should be noted that we do not make any additional changes to the backbone structures other than the improvements described in the paper. A complete training run takes about 30 min on the univ dataset and about 3 h on the SDD.
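The leave-one-out protocol described above can be written down in a few lines. The helper name is hypothetical and the sketch only enumerates the folds; it does not load any data.

```python
SUBSETS = ["eth", "hotel", "univ", "zara1", "zara2"]

def leave_one_out_folds(subsets):
    """Yield (train_subsets, held_out_subset) pairs, one fold per subset."""
    for held_out in subsets:
        yield [s for s in subsets if s != held_out], held_out

folds = list(leave_one_out_folds(SUBSETS))
for train, test in folds:
    print(f"train on {train}, evaluate on {test}")
```

Each of the five folds trains on four scenes and evaluates on the unseen one, so every reported ETH-UCY number measures generalization to a new scene.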

Metrics. The pedestrian motion forecasting task measures the prediction accuracy of the M generated trajectories (M = 20) with the best average displacement error $minADE_{20}$ and the best final displacement error $minFDE_{20}$. They are computed via Equations (30) and (31), where the hat mark ‘^’ denotes predicted trajectory points.

minADE_{20} = \min_{M} \frac{1}{t_{f}} \sum_{t = t_{h} + 1}^{t_{h} + t_{f}} {∥ \hat{p}_{t}^{(M)} - p_{t} ∥}_{2}.

(30)

minFDE_{20} = \min_{M} {∥ \hat{p}_{t_{h} + t_{f}}^{(M)} - p_{t_{h} + t_{f}} ∥}_{2}.

(31)
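Both metrics can be computed in a few lines of NumPy. The sketch below uses the conventional per-step average for ADE and takes the minimum over the M candidates independently for each metric; the function name and array layout are illustrative choices.

```python
import numpy as np

def min_ade_fde(pred, gt):
    """pred: (M, t_f, 2) candidate futures; gt: (t_f, 2) ground truth."""
    dist = np.linalg.norm(pred - gt[None], axis=-1)   # (M, t_f) per-step L2 errors
    min_ade = dist.mean(axis=1).min()                 # best average error
    min_fde = dist[:, -1].min()                       # best final-step error
    return min_ade, min_fde

# Two candidates: one is off by 5 everywhere, the other by 1 everywhere.
gt = np.zeros((3, 2))
pred = np.stack([gt + np.array([3.0, 4.0]), gt + np.array([0.0, 1.0])])
ade, fde = min_ade_fde(pred, gt)
```

Note that the candidate achieving the best ADE need not be the one achieving the best FDE, since the two minima are taken separately.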

Training Details. Table 2 shows training details such as the preprocessing steps and hyperparameters for several backbone predictors and datasets. Our settings are the same as the original settings for backbone predictors without any change. Notably, the recent state-of-the-art methods shown in Table 2 adopt a strategy which only predicts key points and then utilizes linear speed interpolation to complete the whole trajectory. ‘K’ in Table 2 denotes the timesteps of key points.

4.2. Comparisons to State-of-the-Art Methods

We first design extensive experiments to verify the effectiveness of abnormal interaction modeling.

Cross-Validation Improvements with Abnormal Priors on ETH-UCY. As shown in Table 3, our learned abnormal social priors enhance the performance on ETH-UCY. $V^{2}$-Net-SCP-abn outperforms PPT by 8% in ADE and 2.5% in FDE. Although MSN-SCP lags behind $V^{2}$-Net-SCP/E-$V^{2}$-Net-SCP, abnormal interaction modeling still improves it by 2.4% in ADE and 4.0% in FDE.

SDD. As shown in Table 4, $V^{2}$-Net-SCP-abn surpasses $V^{2}$-Net-SCP by 3.2% in ADE and 3.4% in FDE. Although MSN models perform below $V^{2}$-Net, our abnormal interaction modeling still enhances MSN-SC by 2.3% in ADE and 2.8% in FDE. Notably, MSN-SC-abn matches the performance of MSN-SCP without physical input (e.g., RGB images). Even with simple backbones, our abnormal interaction plugin boosts a Transformer model by 10.7% in ADE and 7.3% in FDE.

Performance on Long-Tail Cases. Table 5 demonstrates the effectiveness of our rare intention modeling ‘-r’ through cross-dataset validation on ETH-UCY. As shown in Table 6, FEND [15] improves performance on the top 5% of cases at the cost of degraded accuracy on the 95% majority class, whereas our E-$V^{2}$-Net-SCP-abn-r achieves comparable gains in challenging cases while preserving the accuracy of the majority class, also establishing a new SOTA (6.19/9.71 ADE/FDE) on the SDD benchmark. Despite omitting frequency-domain analysis and sophisticated architectures, our rare intention extraction approach ‘-r’ attains a performance (6.88/10.43) comparable to complex models when implemented on a simple Transformer baseline. Further improvements on long-tailed samples are also observed with other backbone models.

4.3. Discussions and Ablation Studies

Discussion I: Threshold Analysis of Abnormal Interactions. Our ablation study reveals two key threshold-dependent patterns in Table 7: (1) For offline abnormal interaction detection ($ϵ_{abn}$), $V^{2}$-Net-SCP-abn achieves peak performance (1.1% ADE/1.8% FDE gain) at $ϵ_{abn}$ = −2, with degradation beyond this threshold (0.8% ADE/0.7% FDE loss in (−2.0)) due to noisy abnormal interaction meta-components. (2) The optimal inference filtering threshold ($ϵ_{abn2}$) for abnormal interaction extraction exhibits architecture dependence, requiring 7.1k samples for $V^{2}$-Net-SCP-abn versus 5.1k for E-$V^{2}$-Net-SCP-abn, while insufficient extraction consistently degrades model performance. The study establishes $ϵ_{abn2}$ = −2 as the optimal choice for SDD.

Discussion II: Settings of the Rare Intention Modeling. In Table 8, we analyze the effect of the number of GMM components with rare intention $N_{int}$ and the abnormal interaction activation threshold ($ϵ_{abn}$ and $ϵ_{abn2}$) for $L_{ProtoNCE}$ on the univ dataset. We can see that, for $V^{2}$-Net-SCP-abn, $N_{int} = 512$ and $θ^{'} = 0.6$ is the best choice. This is because, when $θ^{'}$ is too small, more easy samples disturb the generation of features of the long-tailed class. On the other hand, when $θ^{'}$ is too large, fewer long-tailed samples are taken into account. Both factors affect the accuracy of the model. Apart from that, selecting a proper number of intention GMM components is essential. Intuitively, excessive clustering dilutes prototype representativeness, while insufficient clustering overlooks minority-class prototypes.

Discussion III: Selection of K-Nearest Intention GMM Components and $R_{int}$ Rare Intention Definition. Next, we verify the effectiveness of the K-nearest intention GMM component selection, as shown in Table 9. On the SDD, discrete waypoint modeling (K = 1) greatly improves long-tail performance but significantly reduces overall precision, whereas our continuous approach (K = 5) generates smoother long-tail cluster centroids to address this limitation. We also examine the effect of the K value on the ADE/FDE metrics. In the cross-validation setting where the univ subset is held out as the test set, K = 5 yields the highest model performance for $V^{2}$-Net-SCP-abn. Additionally, $R_{int}$ = 0.05 gives the best results on the SDD.

Discussion IV: Inference Time. We test the inference time on the SDD, which has the most complex scenarios, on an NVIDIA GeForce RTX 4090 GPU. Note that all hyperparameters are the same as those in Table 2. From Table 9, it can be seen that the average inference time and fast inference time do not increase because our rare intention modeling does not introduce any extra layer.

4.4. Qualitative Analysis

Visualized predictions. Figure 4a illustrates backbone predictions with our abnormal social interaction plugin in three SDD scenes: little1, hyang6, and bookstore6. Although different backbone prediction networks generate varied predictions, all predictions with the abnormal interaction module preserve the quality and diversity of the original backbone predictions. Figure 4b shows a visual comparison of the same scene with and without our abnormal social interaction plugin. In the first row of Figure 4b, the ego biker should take full advantage of the abnormal social interaction $i_{1}$ on the left side of the scene because it contains information about walkable roads. The backbone MSN-SCP predictions in this scene deviate substantially from the ground truth because $i_{1}$ is averaged with other interactions in its SocialCircle space, so it cannot be given more attention. Agents tend to mimic the motion style of other agents in the same scene. In Figure 4b row 2, there are neighbor tracks that need to be imitated to walk around trees. The prediction of $V^{2}$-Net-SCP-abn marked by the red circle succeeds in imitating the tendency of neighbor agents and fits the ground truth better than $V^{2}$-Net-SCP. In general, places with few neighbor agents can be regarded as ‘uninhabited’ areas. People do not always like to go to an uninhabited place alone, and neither does our ego agent. In Figure 4b row 3, all neighbor agents are gathered on the road near the roof, while the rest of the scene appears as ‘uninhabited’ areas. Trans-SCP with our plugins narrows the gap to the ground truth by paying more attention to the crowded area.

Stress Testing. We verify the capability of our abnormal interaction modeling by manually adding an abnormal neighbor and observing how our model reacts. Figure 5 visualizes and compares several toy examples built on real-world SDD scenes. In Figure 5a, despite the addition of a pedestrian walking side by side with and close to the ego pedestrian, the $V^{2}$-Net-SCP predictions do not show an avoidance tendency in relation to the added neighbor, while our $V^{2}$-Net-SCP-abn predictions do. In Figure 5b,c, a manually added interaction with high velocity toward the ego pedestrian guides our predictions. $V^{2}$-Net-SCP-abn and Trans-SCP-abn appear to be highly effective in preventing possible collisions, while the backbone predictions do not. In Figure 5d, a high-speed neighbor is added to influence the ego agent’s right-turn motion. Our MSN-SCP-abn predictions exhibit a larger spacing d to the abnormal neighbor than the backbone predictions. The results indicate that our approach extracts abnormal interactions and adapts across networks.

Interpretability of Abnormal Interactions. Our abnormal interaction representation demonstrates interpretability through the semantically distinct clusters identified via GMM’s two-stage clustering (Figure 6): (1) static agents (purple cluster, normalized velocity < 0.2), (2) agents approaching ego (green cluster, relative angle ≈ $π$), and (3) high-velocity neighbors (sky-blue cluster, normalized velocity > 0.5). This method effectively isolates abnormal interactions with clear behavioral semantics.

As illustrated in Figure 7 and the pie chart below, we select three scenarios involving abnormal social interactions and analyze the contributions of different semantic components to the model’s predictions. A common trend observed across all three scenarios is that the predicted trajectories of surrounding agents consistently avoid the abnormal interactions (highlighted in blue). However, the underlying semantic reasons for this avoidance vary by scenario. In Scenario 1, the “high-velocity” component is the dominant factor, accounting for 71% of the overall contribution. This suggests that the model relies primarily on the speed of the interacting agents to predict avoidance. In Scenario 2, where the target pedestrian’s orientation is nearly perpendicular (approx. $π / 2$) to the direction of the abnormal interaction, we see a shift in the contributing factors. Compared to a baseline of normal interactions, the contribution of the “opposite direction” component increases by 34%. Simultaneously, the “high-velocity” component still maintains a high absolute contribution of 45%. This combination implies that, in such an unusual configuration, the model integrates both the high speed and the opposing direction of the agents to forecast their behavior. Scenario 3 presents a different situation: here, the speed of the abnormal interaction is notably lower than in the first two scenarios. Consequently, the model assigns greater importance to stationary agents, with the contribution of the “static agent” component rising by 34% relative to the baseline. This indicates that, when a dynamic interaction is slow, the presence of static elements in the scene becomes a more critical cue for prediction. Finally, it is worth noting that a substantial portion of the model’s decision-making remains unaccounted for by these interpretable components. In Scenario 3, over 57% of the contributory factors are classified as “other abnormal components” with no clearly visible semantic meaning. This points to a limitation in the current semantic framework and highlights an area for future investigation.

Generation of Intention GMM Components. The 10% of intentions with the worst fit in the first stage are retained as the long tail and subjected to a second GMM fit to obtain the centers of the long-tail clusters. These long-tail intention clusters are combined with the clusters obtained from the first stage of clustering into the final intention GMM model to better reflect the distribution of long-tail intentions. As shown in Figure 8, this clustering method (right) gives a larger weight to the sparsely distributed long-tail intention endpoints (i.e., the GMM cluster centers in the right figure are more dispersed).
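The retention step between the two GMM fits can be sketched as follows. The sketch assumes that "worst fit" is measured by the per-sample log-likelihood under the first-stage GMM, which the text does not state explicitly; the function name and the way the log-likelihoods are obtained are likewise hypothetical.

```python
import numpy as np

def split_long_tail(endpoints, log_lik, ratio=0.10):
    """Split endpoints into (long_tail, rest) by first-stage fit quality.

    endpoints: (N, 2) intention endpoints
    log_lik:   (N,) per-sample log-likelihood under the first-stage GMM
    ratio:     fraction of worst-fit samples kept as the long tail
    """
    n_tail = max(1, int(round(ratio * len(endpoints))))
    order = np.argsort(log_lik)                  # ascending: worst fit first
    return endpoints[order[:n_tail]], endpoints[order[n_tail:]]

# Ten endpoints; the one with the lowest log-likelihood becomes the long tail.
endpoints = np.arange(20, dtype=float).reshape(10, 2)
log_lik = np.array([0.0, -9.0, -1.0, -2.0, -3.0, -4.0, -5.0, -6.0, -7.0, -8.0])
tail, rest = split_long_tail(endpoints, log_lik)
```

A second mixture would then be fitted on `tail` only, and its components appended to the first-stage components to form the final intention GMM.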

4.5. Robustness Evaluation and Comparative Results

Figure 9 shows box plots of the ADE/FDE results of 15 parallel experiments on the univ dataset. For both ADE and FDE, the first quartile (Q1), median (Q2), and third quartile (Q3) of our method, corresponding to the lower edge, center line, and upper edge of the box, are all better than those of the baseline. Moreover, none of our results is marked with *, indicating that there are no outliers. Figure 10 shows the FDE on the test set over training epochs. Our plugin not only reduces testing errors but also makes the training results more stable. For the hotel subset, the best FDE decreases by 0.01, from 0.14 to 0.13. For the univ subset, the FDE of the red-curve experiments does not increase with the number of training epochs, which suggests better generalization.

5. Outlook

We will explore more frequency-based methods to couple abnormal interaction and rare intention modeling in our future study. This direction holds promise for developing a more holistic understanding of outlier events in multi-agent systems.

6. Conclusions

We trace the fundamental causes of hard cases in pedestrian trajectory prediction to (1) abnormal social interactions and (2) rare intentions. We propose a method to extract abnormal social interactions. Moreover, an improved PCL algorithm facilitates the learning of rare intentions under continuous pseudo-label settings. A frequency-sensitive goal-driven decoder fuses both factors. Our method outperforms SOTA methods on both the full dataset and the long-tail subset, advancing trajectory prediction.

Author Contributions

Conceptualization, C.Y. and X.D.; methodology, C.Y.; software, C.Y.; validation, C.Y., J.L., and X.D.; formal analysis, C.Y.; investigation, C.Y.; resources, X.D.; data curation, J.L.; writing—original draft preparation, C.Y.; writing—review and editing, C.Y. and X.D.; visualization, C.Y.; supervision, X.D.; project administration, X.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The publicly available datasets analyzed for this study can be found in the following repositories. The ETH dataset [53] and UCY dataset [54] are available at http://www.vision.ee.ethz.ch/datasets/ (accessed on 6 March 2026) and https://graphics.cs.ucy.ac.cy/research/downloads/crowd-data (accessed on 6 March 2026). The Stanford Drone Dataset (SDD) [55] is available at https://cvgl.stanford.edu/projects/uav_data/ (accessed on 6 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

GMM	Gaussian Mixture Model
SCP	SocialCircle-Plus
ACC	Alternate-Current Component
DCC	Direct-Current Component
PCL	Prototypical Contrastive Learning

References

Sreenu, G.; Durai, S. Intelligent video surveillance: A review through deep learning techniques for crowd analysis. J. Big Data 2019, 6, 1–27.
Pokle, A.; Martín-Martín, R.; Goebel, P.; Chow, V.; Ewald, H.M.; Yang, J.; Wang, Z.; Sadeghian, A.; Sadigh, D.; Savarese, S.; et al. Deep local trajectory replanning and control for robot navigation. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2019; pp. 5815–5822.
Samir, M.; Assi, C.; Sharafeddine, S.; Ebrahimi, D.; Ghrayeb, A. Age of information aware trajectory planning of UAVs in intelligent transportation systems: A deep learning approach. IEEE Trans. Veh. Technol. 2020, 69, 12382–12395.
Alhariqi, A.; Gu, Z.; Saberi, M. Calibration of the intelligent driver model (IDM) with adaptive parameters for mixed autonomy traffic using experimental trajectory data. Transp. B Transp. Dyn. 2022, 10, 421–440.
Abbas, M.T.; Jibran, M.A.; Afaq, M.; Song, W.C. An adaptive approach to vehicle trajectory prediction using multimodel Kalman filter. Trans. Emerg. Telecommun. Technol. 2020, 31, e3734.
Herrero, D.A.; Pedroche, D.S.; Herrero, J.G.; López, J.M.M. AIS trajectory classification based on IMM data. In Proceedings of the 2019 22th International Conference on Information Fusion (FUSION); IEEE: New York, NY, USA, 2019; pp. 1–8.
Tomar, R.S.; Verma, S.; Tomar, G.S. SVM based trajectory predictions of lane changing vehicles. In Proceedings of the 2011 International Conference on Computational Intelligence and Communication Networks; IEEE: New York, NY, USA, 2011; pp. 716–721.
Lee, D.; Ott, C.; Nakamura, Y. Mimetic communication model with compliant physical contact in human-humanoid interaction. Int. J. Robot. Res. 2010, 29, 1684–1704.
Cui, H.; Qi, H.; Zhou, J. DBN-MACTraj: Dynamic Bayesian Networks for Predicting Combinations of Long-Term Trajectories with Likelihood for Multiple Agents. Mathematics 2024, 12, 3674.
Wong, C.; Xia, B.; Peng, Q.; Yuan, W.; You, X. MSN: Multi-style network for trajectory prediction. IEEE Trans. Intell. Transp. Syst. 2023, 24, 9751–9766.
Zhou, Z.; Ye, L.; Wang, J.; Wu, K.; Lu, K. Hivt: Hierarchical vector transformer for multi-agent motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 8823–8833.
Wong, C.; Xia, B.; Hong, Z.; Peng, Q.; Yuan, W.; Cao, Q.; Yang, Y.; You, X. View vertically: A hierarchical network for trajectory prediction via fourier spectrums. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 682–700.
Girgis, R.; Golemo, F.; Codevilla, F.; Weiss, M.; D’Souza, J.A.; Kahou, S.E.; Heide, F.; Pal, C. Latent variable sequential set transformers for joint multi-agent motion prediction. arXiv 2021, arXiv:2104.00563.
Zhang, S.; Zhao, G.; Lyu, F.; Wang, S.; Zhang, Z.; Zhao, F.; Li, J.; Shan, C.; Wang, L. MambaPTP: Exploring the Potential of Mamba for Pedestrian Trajectory Prediction. IEEE Trans. Circuits Syst. Video Technol. 2025, 36, 3795–3807.
Wang, Y.; Zhang, P.; Bai, L.; Xue, J. Fend: A future enhanced distribution-aware contrastive learning framework for long-tail trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 1400–1409.
Zhang, J.; Pourkeshavarz, M.; Rasouli, A. Tract: A training dynamics aware contrastive learning framework for long-tail trajectory prediction. In Proceedings of the 2024 IEEE Intelligent Vehicles Symposium (IV); IEEE: New York, NY, USA, 2024; pp. 3282–3288.
Wu, W.; Feng, X.; Gao, Z.; Kan, Y. Smart: Scalable multi-agent real-time motion generation via next-token prediction. Adv. Neural Inf. Process. Syst. 2024, 37, 114048–114071.
Deo, N.; Wolff, E.; Beijbom, O. Multimodal Trajectory Prediction Conditioned on Lane-Graph Traversals. In Proceedings of the 5th Annual Conference on Robot Learning; PMLR: Cambridge, MA, USA, 2021.
Lan, Z.; Ren, Y.; Yu, H.; Liu, L.; Li, Z.; Wang, Y.; Cui, Z. Hi-SCL: Fighting long-tailed challenges in trajectory prediction with hierarchical wave-semantic contrastive learning. Transp. Res. Part C Emerg. Technol. 2024, 165, 104735.
Romano, F.; Cimini, D.; Di Paola, F.; Gallucci, D.; Larosa, S.; Nilo, S.T.; Ricciardelli, E.; Iisager, B.D.; Hutchison, K. The evolution of meteorological satellite cloud-detection methodologies for atmospheric parameter retrievals. Remote Sens. 2024, 16, 2578.
Zhou, Y.; Wu, H.; Cheng, H.; Qi, K.; Hu, K.; Kang, C.; Zheng, J. Social graph convolutional LSTM for pedestrian trajectory prediction. IET Intell. Transp. Syst. 2021, 15, 396–405.
Zhang, S.; Wu, J.; Dong, J.; Liu, L. Social-Interaction GAN: Pedestrian Trajectory Prediction. In Proceedings of the International Conference on Wireless Algorithms, Systems, and Applications; Springer: Berlin/Heidelberg, Germany, 2021; pp. 429–440.
Mohamed, A.; Qian, K.; Elhoseiny, M.; Claudel, C. Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 14424–14432.
He, Z.; Li, W.; Gan, X.; Chen, Z.; Wu, Y.; Zhang, Y. Decoupled Pedestrian Trajectory Prediction Network with Near-Aware Attention. Knowl.-Based Syst. 2025, 333, 114913.
Yuan, Y.; Weng, X.; Ou, Y.; Kitani, K.M. Agentformer: Agent-aware transformers for socio-temporal multi-agent forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 9813–9823.
Sun, Y.; Xiao, D.; Huang, M.; Wang, J.; Tong, C.; Luo, J.; Pu, H. Transferable Multi-Level Spatial-Temporal Graph Neural Network for Adaptive Multi-Agent Trajectory Prediction. Knowl.-Based Syst. 2026, 338, 115451.
Yang, H.; Chen, Y.; Cai, J.; Yang, Y.; Zhou, L.; Tian, J.; Li, Y.; Xun, Y.; Zhao, X. Cross-domain pedestrian trajectory prediction via behavioral pattern-aware multi-instance GCN. Knowl.-Based Syst. 2025, 329, 114266.
Zhou, Z.; Wang, J.; Li, Y.H.; Huang, Y.K. Query-centric trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 17863–17873.
Wang, R.; Lin, W.; Ren, G.; Cao, Q.; Zhang, Z.; Deng, Y. Interaction-aware vehicle trajectory prediction using spatial-temporal dynamic graph neural network. Knowl.-Based Syst. 2025, 327, 114187.
Rowe, L.; Ethier, M.; Dykhne, E.H.; Czarnecki, K. Fjmp: Factorized joint multi-agent motion prediction over learned directed acyclic interaction graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 13745–13755.
Wong, C.; Xia, B.; Zou, Z.; Wang, Y.; You, X. Socialcircle: Learning the angle-based social interaction representation for pedestrian trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 19005–19015.
Cui, Y.; Guo, D.; Han, Y. MELON: Hierarchical Multi-Agent Trajectory Prediction with Spatio-Temporal Uncertainty Adaptation. Knowl.-Based Syst. 2025, 334, 115143.
Lin, X.; Liang, T.; Lai, J.; Hu, J.F. Progressive pretext task learning for human trajectory prediction. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 197–214.
Wei, C.; Wu, G.; Barth, M.J.; Abdelraouf, A.; Gupta, R.; Han, K. KI-GAN: Knowledge-Informed Generative Adversarial Networks for Enhanced Multi-Vehicle Trajectory Forecasting at Signalized Intersections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 7115–7124.
Guo, L.; Ge, P.; Shi, Z. Multi-object trajectory prediction based on lane information and generative adversarial network. Sensors 2024, 24, 1280.
Gu, T.; Chen, G.; Li, J.; Lin, C.; Rao, Y.; Zhou, J.; Lu, J. Stochastic trajectory prediction via motion indeterminacy diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 17113–17122.
Liu, Y.; Dong, X.; Lin, Y.; Ye, M. Diftraj: Diffusion inspired by intrinsic intention and extrinsic interaction for multi-modal trajectory prediction. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence; Curran Associates, Inc.: Red Hook, NY, USA, 2024.
Mao, W.; Xu, C.; Zhu, Q.; Chen, S.; Wang, Y. Leapfrog diffusion model for stochastic trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2023; pp. 5517–5526.
He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 9729–9738.
Li, J.; Zhou, P.; Xiong, C.; Hoi, S.C. Prototypical contrastive learning of unsupervised representations. arXiv 2020, arXiv:2005.04966.
Yang, Z.; Pan, J.; Yang, Y.; Shi, X.; Zhou, H.Y.; Zhang, Z.; Bian, C. Proco: Prototype-aware contrastive learning for long-tailed medical image classification. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2022; pp. 173–182.
Lin, S.; Liu, C.; Zhou, P.; Hu, Z.Y.; Wang, S.; Zhao, R.; Zheng, Y.; Lin, L.; Xing, E.; Liang, X. Prototypical graph contrastive learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 2747–2758.
Wang, P.; Han, K.; Wei, X.S.; Zhang, L.; Wang, L. Contrastive learning based hybrid networks for long-tailed image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021; pp. 943–952.
Du, C.; Wang, Y.; Song, S.; Huang, G. Probabilistic contrastive learning for long-tailed visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5890–5904.
Yang, Y.; Zha, K.; Chen, Y.; Wang, H.; Katabi, D. Delving into deep imbalanced regression. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 11842–11851.
Ding, Z.; Xu, Y.; Xu, W.; Parmar, G.; Yang, Y.; Welling, M.; Tu, Z. Guided variational autoencoder for disentanglement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 7920–7929.
Lee, J.; Kim, E.; Lee, J.; Lee, J.; Choo, J. Learning debiased representation via disentangled feature augmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 25123–25133.
Ngweta, L.; Maity, S.; Gittens, A.; Sun, Y.; Yurochkin, M. Simple disentanglement of style and content in visual representations. In Proceedings of the International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2023; pp. 26063–26086.
Wong, C.; Xia, B.; Zou, Z.; You, X. Socialcircle+: Learning the angle-based conditioned interaction representation for pedestrian trajectory prediction. arXiv 2024, arXiv:2409.14984. [Google Scholar]
Kingma, D.P.; Salimans, T.; Welling, M. Variational dropout and the local reparameterization trick. Adv. Neural Inf. Process. Syst. 2015, 28, 2575–2583. [Google Scholar]
Xu, Y.; Hu, W.; Wang, S.; Zhang, X.; Wang, S.; Ma, S.; Guo, Z.; Gao, W. Predictive generalized graph Fourier transform for attribute compression of dynamic point clouds. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 1968–1982. [Google Scholar] [CrossRef]
Rao, K.R.; Kim, D.N.; Hwang, J.J. Fast Fourier Transform-Algorithms and Applications; Springer Science & Business Media: Dordrecht, The Netherlands, 2011. [Google Scholar]
Pellegrini, S.; Ess, A.; Schindler, K.; Van Gool, L. You’ll never walk alone: Modeling social behavior for multi-target tracking. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision; IEEE: New York, NY, USA, 2009; pp. 261–268. [Google Scholar]
Lerner, A.; Chrysanthou, Y.; Lischinski, D. Crowds by example. In Proceedings of the Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2007; Volume 26, pp. 655–664. [Google Scholar]
Robicquet, A.; Sadeghian, A.; Alahi, A.; Savarese, S. Learning social etiquette: Human trajectory understanding in crowded scenes. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part VIII 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 549–565. [Google Scholar]
Wong, C.; Xia, B.; Peng, Q.; You, X. Another vertical view: A hierarchical network for heterogeneous trajectory prediction via spectrums. arXiv 2023, arXiv:2304.05106. [Google Scholar] [CrossRef]

Figure 1. (a) Our social interaction space (DVA space), designed to reflect velocity, relative distance, and relative direction, from which we extract abnormal social interactions. (b) Long-tail intention example: most pedestrians head to flats (blue) or schools (orange), which make up the majority of map candidates, while the red pedestrian heads to a low-probability destination, the hospital, which is an outlier among the map candidates.

Figure 2. (a) Abnormal social interaction extraction module. (b) Long-tail intention contrastive learning module. (c) Details of our novel frequency goal-driven decoder to fuse outputs of (a,b).

Figure 3. The workflow of our two-stage abnormal interaction meta-component extraction.

Figure 4. Qualitative results. (a) Visualized predictions of our abnormal interaction plugin with different backbone prediction networks. (b) Visual comparison with and without abnormal interactions.

Figure 5. Stress testing, in which manually introduced abnormal neighbors significantly alter the circular trajectory modes. In scenarios (a–d), the ego agent tends to avoid the manually added abnormal neighbor.

Figure 6. Abnormal interaction semantics in cross-dataset validation on the eth subset (orange: anomalies from hotel/univ/zara1/zara2). Purple cluster: static agents. Green cluster: opposite direction. Blue cluster: high velocity.

Figure 7. Contribution weights of each abnormal interaction semantic component.

Figure 8. Comparison between the baseline intention generator and our two-stage GMM approach for rare intention modeling (all blue points represent intentions from hotel/univ/zara1/zara2 and red points represent intention GMM components).

Figure 9. Box plot of ADE/FDE results of MSN-SCP and MSN-SCP-abn on the univ subset of ETH-UCY with abnormal interaction modeling.

Figure 10. FDE on the univ and hotel test sets over the course of training. We ran N = 5 parallel experiments per group; different colors denote the parallel runs within each group. ‘-abn’ denotes models with our abnormal interaction plugin.

Table 1. Key symbol definitions.

Section	Symbol	Explanation
Section 3.2	$r_{dis}^{i \leftarrow j}$	Relative distance of pedestrian j to i.
	$r_{vel}^{j}$	Absolute velocity of pedestrian j.
	$r_{θ}^{i \leftarrow j}$	Relative angle of pedestrian j to i.
	$ρ_{dis\_vel}$	Correlation coefficient between relative distance and absolute velocity.
	$ρ_{vel\_θ}$	Correlation coefficient between absolute velocity and relative angle.
	$ρ_{dis\_θ}$	Correlation coefficient between relative distance and relative angle.
	$Θ_{abn}$	GMM used to extract abnormal social interactions.
	$R_{meta}^{i \leftarrow j}$	Abnormal social interaction meta-component (relative distance, absolute velocity, and relative angle) of pedestrian j to i.
	$A^{i}$	Abnormal neighbor set for pedestrian i.
	$s_{n}^{i}$	The n-th aggregated Gaussian component of the abnormal neighbor features for pedestrian i.
	$λ_{n}$	Weight coefficient for the n-th abnormal social interaction component; the weights sum to 1.
Section 3.3	$Θ_{int}$	GMM used to extract rare intentions.
	$Θ_{e1}$	First-stage intention GMM, which captures global goal distributions.
	$Θ_{e2}$	Second-stage intention GMM, which captures long-tail intentions.
	$N_{int}$	Number of intention components in $Θ_{int}$.
	$e^{i}$	Intention and endpoint to predict for pedestrian i.
	$P_{int}^{i}$	Intention labels for pedestrian i, clustered by $Θ_{int}$.
	$v_{i}$	Intention embedding for pedestrian i.
	$v_{j}$	Arbitrary sample in the same batch as i.
	$\| N_{i}^{k} \|$	Number of neighbors that belong to the k-th nearest intention cluster for sample i.
	$c_{i}^{k}$	Prototype of the cluster to which the k-th neighbor GMM component of pedestrian i belongs.
	$α$	Momentum coefficient in our PCL algorithm.
	$τ$	Temperature coefficient in our PCL algorithm.
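
As an illustration of the Section 3.2 symbols, the following minimal sketch computes the three DVA quantities for one neighbor and the three pairwise correlation coefficients over a set of interactions. The function names are hypothetical, and measuring the relative angle as the world-frame direction of the offset from i to j is an assumption; the paper may define the angle relative to the ego heading.

```python
import numpy as np

def dva_features(pos_i, pos_j, vel_j):
    """Return (r_dis, r_vel, r_theta) for neighbor j with respect to pedestrian i.

    Assumption: r_theta is the world-frame angle of the offset from i to j.
    """
    offset = np.asarray(pos_j, dtype=float) - np.asarray(pos_i, dtype=float)
    r_dis = float(np.linalg.norm(offset))          # relative distance
    r_vel = float(np.linalg.norm(vel_j))           # absolute speed of j
    r_theta = float(np.arctan2(offset[1], offset[0]))  # relative angle
    return r_dis, r_vel, r_theta

def pairwise_correlations(features):
    """features: (N, 3) array of [r_dis, r_vel, r_theta] rows.

    Returns the three Pearson coefficients named after the table symbols.
    """
    corr = np.corrcoef(np.asarray(features, dtype=float), rowvar=False)
    return {
        "dis_vel": float(corr[0, 1]),
        "vel_theta": float(corr[1, 2]),
        "dis_theta": float(corr[0, 2]),
    }
```

These coefficients are the off-diagonal terms of a 3 × 3 correlation matrix, which is why a full-covariance Gaussian component over the DVA space can capture them.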

Table 2. Implementation details used in this work. K denotes the key points. ✓ marks a preprocessing step applied before model training.

	Size		Preprocessor		Backbone	Hyperparameters
	$∥ N_{train} ∥$	$∥ N_{test} ∥$	Move	Rotate		bsz	lr	Epochs	$K$
eth	36,784	2614	✓	✓	$V^{2}$ -Net-SCP	1000	$3 \times 10^{- 4}$	200	4 8 11
					E- $V^{2}$ -Net-SCP	1000	$3 \times 10^{- 4}$	200	4 8 11
					MSN-SCP	1000	$3 \times 10^{- 4}$	200	11
hotel	38,323	1075	✓	✓	$V^{2}$ -Net-SCP	1000	$4 \times 10^{- 4}$	200	4 8 11
					E- $V^{2}$ -Net-SCP	1000	$3 \times 10^{- 4}$	200	4 8 11
					MSN-SCP	1000	$4 \times 10^{- 4}$	200	11
univ	15,064	24,334	✓	✓	$V^{2}$ -Net-SCP	1000	$6 \times 10^{- 4}$	300	4 8 11
					E- $V^{2}$ -Net-SCP	1000	$1 \times 10^{- 3}$	200	4 8 11
					MSN-SCP	1000	$3 \times 10^{- 4}$	200	11
zara1	37,042	2356	✓	-	$V^{2}$ -Net-SCP	1000	$3 \times 10^{- 4}$	200	4 8 11
					E- $V^{2}$ -Net-SCP	1000	$4 \times 10^{- 4}$	200	4 8 11
					MSN-SCP	1000	$3 \times 10^{- 4}$	200	11
zara2	33,488	5910	✓	-	$V^{2}$ -Net-SCP	1000	$3 \times 10^{- 4}$	200	4 8 11
					E- $V^{2}$ -Net-SCP	1000	$2 \times 10^{- 4}$	250	4 8 11
					MSN-SCP	1000	$3 \times 10^{- 4}$	200	11
SDD	251,617	38,626	✓	-	$V^{2}$ -Net-SC	1000	$3 \times 10^{- 4}$	200	4 8 11
					$V^{2}$ -Net-SCP	1000	$3 \times 10^{- 4}$	200	4 8 11
					E- $V^{2}$ -Net-SC	1000	$2 \times 10^{- 4}$	200	4 8 11
					E- $V^{2}$ -Net-SCP	1000	$2 \times 10^{- 4}$	200	4 8 11
					MSN-SC	1000	$2 \times 10^{- 4}$	200	11
					MSN-SCP	1000	$2 \times 10^{- 4}$	200	11
					Trans-SC	1000	$4 \times 10^{- 4}$	250	4 8 11
					Trans-SCP	1000	$4 \times 10^{- 4}$	250	4 8 11

Table 3. The averaged ADE/FDE results across 5 ETH-UCY subsets in cross-dataset validation. ‘-abn’ indicates our abnormal interaction module; bold shows our improvement.

Models (ETH-UCY)	Year	-abn	${ADE}_{20}$	${FDE}_{20}$
EigenTraj	2023	-	0.21	0.34
TUTR	2023	-	0.21	0.36
LED	2023	-	0.21	0.33
MSN	2023	-	0.21	0.34
EqMotion	2023	-	0.21	0.35
PPT	2024	-	0.20	0.31
MSN-SCP	2025	-	0.215	0.374
MSN-SCP	2025	✓	0.208	0.355
$V^{2}$ -Net-SCP	2025	-	0.188	0.312
$V^{2}$ -Net-SCP	2025	✓	0.185	0.302
E- $V^{2}$ -Net-SCP	2025	-	0.189	0.309
E- $V^{2}$ -Net-SCP	2025	✓	0.185	0.305
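
For readers unfamiliar with the ${ADE}_{20}$/${FDE}_{20}$ notation, the sketch below shows the common best-of-K convention for a single pedestrian: the minimum average displacement error and the minimum final displacement error over K sampled futures (K = 20 in the table). The function name is hypothetical, and taking the two minima independently is an assumption; some codebases select a single best sample for both metrics.

```python
import numpy as np

def min_ade_fde(pred, gt):
    """Best-of-K displacement errors for one pedestrian.

    pred: (K, T, 2) sampled future trajectories.
    gt:   (T, 2) ground-truth future trajectory.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) per-step errors
    ade = dist.mean(axis=1).min()   # best average error over the horizon
    fde = dist[:, -1].min()         # best error at the final step
    return float(ade), float(fde)
```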

Table 4. ADE/FDE results on the SDD with abnormal social interaction modeling ‘-abn’. Bold means improvement.

Models (SDD)	ADE/FDE	Models (SDD)	ADE/FDE
LED	8.48/11.36	LB-EBM	8.87/15.61
FlowChain	9.93/17.17	AgentFormer	10.18/16.91
UPDD	6.59/13.50	IMP	8.98/15.54
RAN	10.97/19.95	EigenTraj	7.42/12.49
LG-Traj	7.80/12.79	PPT	7.03/10.65
$V^{2}$ -Net	7.12/11.39	E- $V^{2}$ -Net	6.57/10.49
$V^{2}$ -Net-SC	6.71/10.66	E- $V^{2}$ -Net-SC	6.54/10.36
$V^{2}$ -Net-SC-abn	6.59/10.60	E- $V^{2}$ -Net-SC-abn	6.48/10.32
$V^{2}$ -Net-SCP	6.59/10.39	E- $V^{2}$ -Net-SCP	6.44/10.22
$V^{2}$ -Net-SCP-abn	6.38/10.04	E- $V^{2}$ -Net-SCP-abn	6.38/10.08
MSN	7.69/12.16	Transformer	17.44/33.36
MSN-SC	7.49/12.12	Trans-SC	16.47/32.08
MSN-SC-abn	7.32/11.78	Trans-SC-abn	15.57/30.93
MSN-SCP	7.32/11.76	Trans-SCP	16.11/31.43
MSN-SCP-abn	7.25/11.51	Trans-SCP-abn	15.70/31.16

Table 5. Cross-dataset validation on ETH-UCY for abnormal interaction modeling ‘-abn’ and rare intention modeling ‘-r’. ‘Dataset’: validation subset; baseline: $V^{2}$-Net-SCP-abn. Bold means the best ADE/FDE metric of each dataset.

Dataset	-abn	-r	Top 1%	Top 5%	Top 10%	All
eth	-	-	1.186/2.379	0.796/1.540	0.650/1.240	0.261/0.412
	✓	-	1.189/2.318	0.774/1.496	0.638/1.201	0.255/0.400
	✓	✓	1.070/2.131	0.713/1.407	0.591/1.150	0.258/0.421
hotel	-	-	0.643/1.128	0.411/0.712	0.331/0.563	0.109/0.158
	✓	-	0.644/1.217	0.391/0.724	0.321/0.565	0.108/0.152
	✓	✓	0.592/1.085	0.369/0.646	0.305/0.519	0.108/0.155
univ	-	-	1.617/3.376	0.986/2.053	0.767/1.562	0.251/0.446
	✓	-	1.432/2.938	0.897/1.828	0.708/1.429	0.250/0.444
	✓	✓	1.419/2.874	0.825/1.610	0.657/1.258	0.251/0.446
zara1	-	-	1.234/2.589	0.660/1.355	0.499/0.992	0.180/0.308
	✓	-	1.075/2.160	0.596/1.132	0.462/0.853	0.174/0.282
	✓	✓	1.054/2.070	0.584/1.106	0.453/0.843	0.174/0.289
zara2	-	-	1.267/2.684	0.723/1.478	0.548/1.075	0.137/0.232
	✓	-	1.192/2.506	0.701/1.441	0.535/1.044	0.137/0.233
	✓	✓	1.170/2.387	0.676/1.345	0.515/0.972	0.136/0.233
all	-	-	1.189/2.431	0.715/1.428	0.559/1.086	0.188/0.311
	✓	-	1.106/2.228	0.672/1.324	0.533/1.018	0.185/0.302
	✓	✓	1.061/2.109	0.633/1.223	0.504/0.948	0.185/0.309
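
The ‘all’ rows appear to be the unweighted mean of the five subset rows rather than a sample-weighted average. The short check below, using only the ‘All’ column values from the table, reproduces the reported figures under that reading; the variable and function names are illustrative.

```python
# (ADE, FDE) from the "All" column, ordered as: baseline, -abn, -abn with -r.
ALL_COLUMN = {
    "eth":   [(0.261, 0.412), (0.255, 0.400), (0.258, 0.421)],
    "hotel": [(0.109, 0.158), (0.108, 0.152), (0.108, 0.155)],
    "univ":  [(0.251, 0.446), (0.250, 0.444), (0.251, 0.446)],
    "zara1": [(0.180, 0.308), (0.174, 0.282), (0.174, 0.289)],
    "zara2": [(0.137, 0.232), (0.137, 0.233), (0.136, 0.233)],
}

def mean_over_subsets(rows, config):
    """Unweighted mean ADE/FDE across subsets for one configuration index."""
    n = len(rows)
    ade = sum(values[config][0] for values in rows.values()) / n
    fde = sum(values[config][1] for values in rows.values()) / n
    return round(ade, 3), round(fde, 3)

for config, name in enumerate(["baseline", "-abn", "-abn -r"]):
    print(name, mean_over_subsets(ALL_COLUMN, config))
```

The three results match the ‘all’ rows of the table (0.188/0.311, 0.185/0.302, and 0.185/0.309).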

Table 6. ADE/FDE (m) for top 1–10% of long-tail samples in the SDD. ‘-abn’: abnormal interaction; ‘-r’: rare intention. Bold: the best test set performance for each comparison group. Underline: the best long-tail performance for each comparison group.

Models (SDD)	-abn	-r	Top 1% ↓	Top 5% ↓	Majority (95%) ↓	Top 10% ↓	Majority (90%) ↓	All ↓
Y-Net	-	-	65.82/134.01	34.72/67.46	6.54/8.96	-	-	7.93/11.88
Y-Net + FEND	-	-	57.58/108.61	31.27/57.98	6.64/9.24	-	-	7.87/11.68
MSN-SCP	-	-	80.43/124.90	41.19/70.43	5.55/8.68	29.95/51.75	4.82/7.33	7.33/11.77
	✓	-	80.80/120.16	40.40/67.57	5.52/8.56	29.30/49.62	4.81/7.28	7.26/11.51
	✓	✓	78.01/105.87	38.14/57.97	5.31/8.04	27.65/43.28	4.65/6.90	6.95/10.54
$V^{2}$ -Net-SCP	-	-	67.00/121.30	35.76/63.99	5.05/7.55	26.27/45.90	4.40/6.42	6.59/10.37
	✓	-	63.60/115.40	33.79/60.40	4.94/7.39	25.03/43.74	4.31/6.30	6.38/10.04
	✓	✓	62.69/110.50	32.96/58.05	4.84/7.16	24.35/42.09	4.24/6.10	6.25/9.70
E- $V^{2}$ -Net-SCP	-	-	69.61/127.92	36.10/65.62	4.88/7.30	26.25/46.50	4.24/6.19	6.44/10.22
	✓	-	63.34/112.78	33.93/60.33	4.93/7.43	25.13/43.88	4.30/6.32	6.38/10.08
	✓	✓	60.47/108.30	32.13/57.22	4.82/7.21	23.89/41.74	4.22/6.15	6.19/9.71
Trans-SCP	-	-	168.55/345.48	95.17/197.17	11.94/22.71	70.12/144.96	10.10/18.82	16.10/31.43
	✓	-	167.92/340.14	93.45/194.35	11.62/22.59	68.86/143.12	9.80/18.74	15.71/31.18
	✓	✓	69.63/115.91	35.83/60.32	5.36/7.80	26.33/44.01	4.72/6.70	6.88/10.43
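
The Top-p% and Majority columns read as complementary partitions of the test set. Assuming that is the case, the ‘All’ value should be recoverable as the fraction-weighted mean of the two parts. The sketch below checks this for two baseline rows of the table; the helper name is hypothetical, and the table does not itself state how the hardest samples are ranked.

```python
def recombine(top, majority, top_fraction):
    """Weighted mean of a top-p% subset and its complementary majority."""
    return top_fraction * top + (1.0 - top_fraction) * majority

# (top 5%, majority 95%, reported All) for ADE and FDE, baseline rows.
ROWS = {
    "MSN-SCP ADE": (41.19, 5.55, 7.33),
    "MSN-SCP FDE": (70.43, 8.68, 11.77),
    "V2-Net-SCP ADE": (35.76, 5.05, 6.59),
    "V2-Net-SCP FDE": (63.99, 7.55, 10.37),
}

for name, (top, majority, reported) in ROWS.items():
    estimate = recombine(top, majority, 0.05)
    print(f"{name}: recombined {estimate:.3f}, reported {reported:.2f}")
```

For these rows the recombined values agree with the reported ‘All’ column to within rounding, which supports reading the long-tail and majority columns as a disjoint split.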

Table 7. Ablation study of the abnormal interaction extraction threshold on the SDD. ∥ $N_{abn}^{test}$ ∥ means the number of abnormal interactions extracted during inference. Bold means the best ADE/FDE metric.

Model	$ϵ_{abn}$	∥ $N_{abn}^{train}$ ∥	$ϵ_{abn 2}$	∥ $N_{abn}^{test}$ ∥	ADE/FDE
$V^{2}$ -Net-SCP-abn	−4	16.6k	−4	2.7k	6.45/10.22
	−2	53.9k	0	5.2k	6.43/10.14
	−2	53.9k	−2	7.1k	6.38/10.04
	0	309.0k	0	87.8k	6.43/10.11
E- $V^{2}$ -Net-SCP-abn	−4	16.6k	−4	2.8k	6.49/10.26
	−2	57.2k	0	5.1k	6.38/10.08
	−2	57.2k	−2	7.8k	6.41/10.14
	0	293.4k	0	35.2k	6.44/10.22
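
In the table, the number of extracted interactions grows as the threshold rises from −4 to 0. One plausible reading, assumed here rather than confirmed by the table, is that an interaction is flagged as abnormal when its log-density under the interaction GMM falls below the threshold. The sketch below implements that rule with NumPy only; the function names and mixture parameters are hypothetical.

```python
import numpy as np

def gmm_log_density(x, weights, means, covs):
    """Log-density of points x (N, D) under a full-covariance Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    d = x.shape[1]
    log_terms = []
    for w, mu, cov in zip(weights, means, covs):
        cov = np.asarray(cov, dtype=float)
        diff = x - np.asarray(mu, dtype=float)
        inv = np.linalg.inv(cov)
        maha = np.einsum("nd,de,ne->n", diff, inv, diff)  # squared Mahalanobis
        _, logdet = np.linalg.slogdet(cov)
        log_terms.append(np.log(w) - 0.5 * (maha + logdet + d * np.log(2.0 * np.pi)))
    log_terms = np.stack(log_terms)                       # (M, N)
    peak = log_terms.max(axis=0)                          # stable log-sum-exp
    return peak + np.log(np.exp(log_terms - peak).sum(axis=0))

def extract_abnormal(x, weights, means, covs, eps):
    """Boolean mask: True where the log-density is below the threshold eps.

    Assumption: low density under the mixture marks an abnormal interaction.
    """
    return gmm_log_density(x, weights, means, covs) < eps
```

Under this rule a higher threshold can only add points to the abnormal set, which is consistent with the monotone growth of the counts in the table.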

Table 8. Ablation study of rare intention modeling ‘-r’ on univ. Our baseline model is $V^{2}$-Net-SCP-abn. $N_{int}$ is the number of intention GMM components and $θ^{'}$ is the activation threshold for $L_{ProtoNCE}$. Bold means the best ADE/FDE metric.

Plugin	$N_{int}$	$θ^{'}$	Top 1%	Top 5%	Top 10%	All
-	-	-	1.62/3.38	0.99/2.05	0.77/1.56	0.25/0.44
-abn	-	-	1.43/2.94	0.90/1.83	0.71/1.43	0.25/0.44
-abn-r	512	0.4	1.57/3.26	0.92/1.87	0.71/1.41	0.25/0.43
	512	0.6	1.42/2.87	0.83/1.61	0.66/1.26	0.25/0.45
	512	0.8	1.51/3.05	0.88/1.75	0.69/1.33	0.25/0.45
	256	0.6	1.52/3.14	0.89/1.76	0.69/1.34	0.25/0.44
	1024	0.6	1.54/3.17	0.90/1.78	0.69/1.34	0.25/0.43
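
To make the roles of the momentum coefficient $α$ and the temperature $τ$ concrete, the following is a generic prototypical contrastive sketch: prototypes are updated by an exponential moving average, and the loss is a temperature-scaled cross-entropy over cosine similarities to the prototypes. It is an illustration of the general technique only. It does not reproduce the continuous waypoint slot weighting or the activation threshold $θ^{'}$ of the paper, and all function names are hypothetical.

```python
import numpy as np

def l2_normalize(v, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def momentum_update(prototypes, embeddings, labels, alpha):
    """EMA update: c_k <- alpha * c_k + (1 - alpha) * mean of embeddings in cluster k."""
    updated = prototypes.copy()
    for k in np.unique(labels):
        cluster_mean = embeddings[labels == k].mean(axis=0)
        updated[k] = alpha * prototypes[k] + (1.0 - alpha) * cluster_mean
    return l2_normalize(updated)

def proto_nce(embeddings, prototypes, labels, tau):
    """Mean cross-entropy between each embedding and its assigned prototype."""
    logits = l2_normalize(embeddings) @ l2_normalize(prototypes).T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())
```

A smaller $τ$ sharpens the softmax over prototypes, and a larger $α$ makes the prototypes change more slowly between batches.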

Table 9. Ablation study of K-nearest GMM components for rare intention modeling on the SDD and univ datasets. Baseline: $V^{2}$-Net-SCP-abn. AI: average inference time (ms). FI: fast inference time (ms). Bold means the best ADE/FDE metric of each dataset.

Dataset	K	$R_{int}$	Top 1%	Top 5%	Top 10%	All	AI	FI
SDD	-	-	63.60/115.40	33.79/60.40	25.03/43.74	6.38/10.04	43.7	40.7
	1	0.1	56.47/94.44	31.30/54.70	23.79/41.81	6.77/11.36	42.8	40.8
	5	0.05	61.95/111.65	32.93/58.85	24.41/42.62	6.29/9.88	42.6	40.9
	5	0.1	62.69/110.50	32.96/58.05	24.35/42.09	6.25/9.70	43.1	40.6
	5	0.2	62.73/112.54	33.16/59.26	24.54/42.91	6.30/9.88	43.2	40.7
univ	1	0.1	1.43/2.92	0.83/1.61	0.65/1.26	0.25/0.45	-	-
	2	0.1	1.46/2.93	0.85/1.64	0.67/1.27	0.25/0.44	-	-
	5	0.1	1.42/2.87	0.83/1.61	0.66/1.26	0.25/0.45	-	-
	6	0.1	1.50/3.02	0.87/1.69	0.68/1.30	0.25/0.44	-	-
	7	0.1	1.54/3.19	0.92/1.83	0.71/1.38	0.25/0.44	-	-
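
The K column refers to the number of nearest intention GMM components considered for each sample. A minimal sketch of such a lookup follows, assuming that "nearest" means smallest Euclidean distance between an endpoint and the component means; the paper may instead rank components by posterior responsibility, and the function name is hypothetical.

```python
import numpy as np

def k_nearest_components(endpoint, component_means, k):
    """Indices and distances of the k component means closest to an endpoint.

    Assumption: closeness is Euclidean distance to the component mean.
    """
    means = np.asarray(component_means, dtype=float)
    dists = np.linalg.norm(means - np.asarray(endpoint, dtype=float), axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]
```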

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, C.; Liu, J.; Dong, X. Disentangling Interaction and Intention for Long-Tail Pedestrian Trajectory Prediction. Computers 2026, 15, 186. https://doi.org/10.3390/computers15030186

