1. Introduction
Pedestrian trajectory prediction is a cornerstone technology for enabling safe and reliable autonomous systems, such as self-driving cars and social robots, which must navigate seamlessly within human-populated environments [1]. The core challenge lies in accurately forecasting a pedestrian’s future path based on their observed past motion and the surrounding social and physical context. While humans perform this task with intuitive ease, computational models must grapple with the inherent uncertainties of human decision-making [2], which are influenced by a complex interplay of personal goals, social norms, and environmental constraints [3,4].
Driven by the success of deep learning, significant progress has been made in this field. Early approaches [5] focused on modeling the social interactions between pedestrians using techniques like Social LSTM and graph neural networks. Subsequent work incorporated physical scene information through semantic maps or occupancy grids [6], and generative models like Generative Adversarial Networks [3] and Variational Autoencoders were introduced to produce multimodal predictions [7,8], acknowledging that a pedestrian’s future is not deterministic but a distribution of possibilities [9].
Some existing approaches [10,11,12,13] regard social environmental influences as valuable contextual cues and attempt to capture these environmental correlations, but such methods face fundamental limitations from a causal inference standpoint. The social environment frequently functions as a confounding factor that concurrently affects both the pedestrian’s historical trajectory X and future trajectory Y, establishing a back-door path X ← S → Y, as depicted in Figure 1 (green block). Consequently, even in the absence of direct causal relationships between specific historical trajectories X and future trajectories Y, the social environment S can introduce spurious correlations between them. Consider crowded intersection scenarios where pedestrians exhibit similar parallel movement patterns: this alignment often results from environmental constraints like narrow passages rather than intentional social interactions. Models dependent purely on likelihood estimation frequently misinterpret these environmentally induced coincidental patterns as social interactions between pedestrians [14], causing prediction inaccuracies when environmental conditions change. Current research [15,16] still lacks comprehensive understanding and effective mechanisms to mitigate this confounding effect.
Furthermore, as shown in Figure 1 (the orange block), although contemporary generative models [11,13] demonstrate multimodal prediction capabilities, their diversity primarily relies on stochastic sampling from latent distributions or extracting common patterns from training datasets [17]. This modeling paradigm remains essentially passive and retrospective in nature. Such models are constrained to extrapolate within patterns derived from historical observational data, lacking the capacity for active reasoning about potential future occurrences. This limitation becomes particularly evident when handling novel or counterfactual scenarios insufficiently represented in training data. For example, when a previously straight-walking pedestrian encounters an unexpected obstacle, existing models [7,8,18] struggle to proactively reason about such plausible yet previously unobserved avoidance behaviors. Therefore, a more proactive uncertainty modeling mechanism that goes beyond mere historical data extrapolation is needed.
Additionally, in traditional feature fusion methods [19,20,21,22], such as simple concatenation or attention mechanisms, historical trajectory features and social environment features are typically combined without distinction. As a result, biased information learned from the social environment, which carries the aforementioned spurious associations, directly contaminates the representation of the historical trajectory, as shown in Figure 1 (the blue block). The model ultimately learns a feature representation entangled with both genuine causal signals and spurious correlative signals, undermining its generalizability and interpretability [23]. This problematic fusion process, which yields a biased feature representation, is precisely the issue that our causal intervention module is designed to rectify.
To address the aforementioned challenges, this paper proposes a novel Causal Intervention and Counterfactual Reasoning (CICR) framework for multimodal pedestrian trajectory prediction. As illustrated in Figure 1 (the purple block), our core idea is to shift the prediction task from the traditional associative learning paradigm to a causal inference paradigm.
The proposed framework employs a hierarchical architecture comprising three specialized modules. The Multisource Encoder tackles the bias coupling issue by separately extracting trajectory features through a GRU and social context features through interaction modeling, preventing direct contamination between modalities. The Causal Intervention-based Feature Fusion Module specifically addresses spurious correlations by implementing the front-door criterion through cross-attention mechanisms, generating deconfounded trajectory representations that capture genuine causal relationships rather than environmental biases. Finally, the Counterfactual Reasoning Decoder overcomes passive uncertainty modeling by simulating hypothetical scenarios through a policy network and multi-head attention, enabling proactive reasoning about diverse future possibilities. This comprehensive framework establishes a new paradigm for trajectory prediction that not only systematically resolves the fundamental limitations of existing approaches but also achieves superior predictive performance across diverse scenarios.
Our contributions can be summarized as follows:
We propose a Multisource Encoder Module. This module separately extracts robust multi-scale spatio-temporal features from the pedestrian’s own trajectory and comprehensive interaction and semantic features from the social environment, providing a solid foundation for subsequent causal analysis and fusion.
We present a Causal Intervention Fusion Module. To eliminate spurious correlations, this module formally models the social environment as a confounder using a Structural Causal Model and implements the front-door criterion via a Cross-Attention Fusion mechanism. Its role is to perform causal intervention, yielding a deconfounded trajectory representation that captures the true causal relationship between historical and future paths.
We introduce a Counterfactual Reasoning Decoder. This decoder moves beyond passive extrapolation by proactively simulating hypothetical future scenarios. It uses a multi-head attention mechanism to enable the target pedestrian’s motion intent to interact with sampled future path nodes, thereby generating diverse and plausible trajectory predictions that account for future uncertainty.
Extensive experiments and comprehensive ablation studies are conducted on several human trajectory prediction benchmarks, including ETH [24], UCY [5], the Stanford Drone Dataset (SDD) [25], and the ActEV/VIRAT Dataset (AVD) [26], to evaluate the effectiveness of our proposed method and validate the contributions of its key components.
3. Methodology
3.1. Description of Symbols
For reference, a list of symbols used in this work is given in Table 1.
3.2. Architecture of CICR
This work presents a novel pedestrian trajectory prediction framework that effectively addresses confounding bias from social environments and overcomes limitations in modeling future uncertainties through the integrated application of causal reasoning and counterfactual reasoning. As illustrated in Figure 2, our framework employs a hierarchical architecture consisting of three core components: a Multisource Encoder, a Causal Intervention Fusion Module, and a Counterfactual Reasoning Decoder.
The framework begins with the Multisource Encoder, which extracts multi-level features from raw input data through parallel processing branches. The trajectory encoding branch captures pedestrians’ intrinsic motion patterns via sequence modeling, while the environment encoding branch integrates social interactions and scene semantics, providing comprehensive feature representations for subsequent analysis.
Following this, the Causal Intervention Fusion Module eliminates spurious correlations induced by social environments through its two components: the CIM and CAF. By constructing causal graphs and employing attention mechanisms, this module performs feature-level causal interventions to obtain deconfounded trajectory representations, ensuring the learning of genuine causal relationships.
Subsequently, the Counterfactual Reasoning Decoder generates diverse trajectory predictions by simulating multiple plausible future scenarios. Through sampling potential future paths and enabling cross-modal interaction between historical features and hypothetical situations, this module produces multimodal predictions that maintain physical realism while enhancing prediction diversity.
The complete framework operates through an end-to-end training paradigm, establishing a coherent pipeline from feature extraction and causal de-biasing to multimodal prediction. This integrated approach significantly improves result diversity and interpretability while maintaining high prediction accuracy.
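To make the data flow of this pipeline concrete, the following minimal PyTorch sketch mirrors the three-stage structure described above. All module names, layer choices, and dimensions (e.g., CICRSkeleton, d_model, the number of prediction modes) are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class CICRSkeleton(nn.Module):
    """Illustrative three-stage pipeline: encode -> causal fusion -> counterfactual decode.
    Module internals are placeholders; only the data flow mirrors the description above."""
    def __init__(self, d_model=64, pred_len=12, num_modes=20):
        super().__init__()
        self.traj_encoder = nn.GRU(input_size=2, hidden_size=d_model, batch_first=True)
        self.env_encoder = nn.Linear(2, d_model)            # stand-in for the social/scene branch
        self.causal_fusion = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.cf_decoder = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, num_modes * pred_len * 2))
        self.pred_len, self.num_modes = pred_len, num_modes

    def forward(self, hist, neighbors):
        # hist: (B, T_obs, 2) past positions; neighbors: (B, N, 2) surrounding agents
        _, h = self.traj_encoder(hist)                       # trajectory branch
        x = h[-1].unsqueeze(1)                               # (B, 1, d)
        s = self.env_encoder(neighbors)                      # environment branch (B, N, d)
        fused, _ = self.causal_fusion(x, s, s)               # cross-attention fusion
        out = self.cf_decoder(fused.squeeze(1))
        return out.view(-1, self.num_modes, self.pred_len, 2)

# usage: preds = CICRSkeleton()(torch.randn(8, 8, 2), torch.randn(8, 5, 2))
```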
3.3. Problem Formulation
Given a sequence of the historical trajectory information of $n$ pedestrians, including the historical locations $X_i = \{x_i^t \mid t = 1, \ldots, T_{obs}\}$ of the $i$-th pedestrian from time step 1 to $T_{obs}$, where $T_{obs}$ represents the observed historical trajectory length and $x_i^1$ denotes the first observed historical location of the $i$-th pedestrian, the corresponding ground truth is denoted as $Y_i = \{y_i^t \mid t = T_{obs}+1, \ldots, T_{obs}+T_{pred}\}$. The vector coordinate of the $i$-th pedestrian is represented as $x_i^t = (\rho_i^t, \theta_i^t)$, constructed by $\rho_i^t$ and $\theta_i^t$, where $\rho_i^t$ denotes the polar radius value and $\theta_i^t$ denotes the polar angle value, $i \in \{1, \ldots, n\}$. The prediction of the location of the $i$-th pedestrian at time step $t$ is denoted as $\hat{y}_i^t$, based on $X_i$. The prediction of the $i$-th pedestrian is denoted as $\hat{Y}_i = \{\hat{y}_i^t \mid t = T_{obs}+1, \ldots, T_{obs}+T_{pred}\}$, where $T_{pred}$ denotes the prediction length. A series of methods has been developed to predict future trajectories using relative position or velocity information [4]. The social interaction information of the $i$-th pedestrian can be defined as a function of the trajectories of the surrounding pedestrians, $S_i = g(\{X_j \mid j \neq i\})$, where $g(\cdot)$ is a social information aggregation function (such as the social pooling in Social-LSTM) and $j \in \{1, \ldots, n\}$. The formal definition of pedestrian trajectory prediction is as follows:

$$\hat{Y}_i = f_{\Theta}(X_i, S_i)$$

where $\hat{Y}_i$ denotes the future trajectory predicted by the model $f_{\Theta}$.
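As an illustration of the formulation, the short sketch below splits a toy trajectory array into the observed history and the ground-truth future and computes a polar (radius, angle) representation. Taking polar coordinates of per-step displacements, as well as all array shapes, are assumptions made for illustration only.

```python
import numpy as np

# Toy data: (n, T_obs + T_pred, 2) Cartesian positions for n pedestrians (hypothetical).
T_obs, T_pred = 8, 12
traj = np.random.randn(3, T_obs + T_pred, 2).cumsum(axis=1)

X = traj[:, :T_obs]          # observed history X_i, t = 1..T_obs
Y = traj[:, T_obs:]          # ground-truth future Y_i, t = T_obs+1..T_obs+T_pred

# Per-step polar representation (rho, theta) of the displacement vectors,
# loosely matching the polar radius/angle description in the formulation.
disp = np.diff(traj, axis=1)
rho = np.linalg.norm(disp, axis=-1)
theta = np.arctan2(disp[..., 1], disp[..., 0])
```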
3.4. Multisource Encoder Module
The multisource encoding module primarily consists of two components: TIE and SEE. The pedestrian trajectory prediction task presents significant challenges, primarily stemming from strong temporal dependencies and the underutilization of historical features. Therefore, extracting more discriminative features from limited data is crucial for improving prediction accuracy. To address these issues, we propose a multi-scale encoding mechanism. This encoder first processes the features of the original trajectory data. To comprehensively describe pedestrian dynamic behavior, we extract multimodal features based on the input trajectory sequence, including spatio-temporal features and keypoint features of pedestrian motion. Specifically, the spatio-temporal features aim to capture the potential periodic patterns in pedestrian movement trajectories. As indicated by Autoformer [37], time series forecasting typically involves periodic patterns, and pedestrian trajectory prediction is no exception. Empirical observations show that pedestrian trajectories often exhibit slight arcs or approximate straight-line movements, reflecting a certain degree of regularity. Consequently, we construct the spatio-temporal feature representation by recording the sequential positional information for each frame from time step 1 to $T_{obs}$. Based on the annotated pedestrian positions in the dataset, we compute the corresponding temporal features, providing a foundation for subsequent feature fusion and causal inference. Furthermore, keypoint features are extracted from the videos using the pre-trained HRNet model [50] to obtain pedestrian keypoint information, which is then fused with the spatio-temporal features.
To achieve efficient encoding while maintaining prediction performance, we select a GRU as the core module of the encoder. The GRU exhibits excellent capability in modeling temporal dependencies and offers high computational efficiency, making it suitable for resource-constrained practical scenarios. Given that pedestrian motion inherently contains significant uncertainty, and such stochasticity is naturally present in the raw trajectory data, the model should directly learn the intrinsic dynamic variations within the data, rather than artificially suppressing random information through smoothing operations. To further capture short-term dependencies in both the temporal and feature dimensions, we employ average pooling to aggregate historical trajectory information within a sliding time window, thereby extracting more robust local dynamic features. To mitigate the potential loss of critical details caused by pooling operations, we introduce a linear transformation to remap the original input, achieving multi-scale information fusion. Subsequently, the feature dimensions are adjusted to generate a sequence representation that meets the input requirements of the GRU. Finally, the GRU encodes this feature sequence to extract long-range dependencies, completing the multi-level, sequential trajectory feature modeling.
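The following sketch illustrates one plausible realization of this multi-scale trajectory branch in PyTorch: a sliding-window average pooling for local dynamics, a linear remap of the raw input to recover detail lost in pooling, and a GRU for long-range dependencies. The window size, hidden width, and module names are assumed values, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Sketch of the multi-scale trajectory branch described above."""
    def __init__(self, in_dim=2, d_model=64, window=3):
        super().__init__()
        self.local_pool = nn.AvgPool1d(kernel_size=window, stride=1, padding=window // 2)
        self.remap = nn.Linear(in_dim, d_model)     # re-inject unsmoothed detail
        self.proj = nn.Linear(in_dim, d_model)
        self.gru = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, x):
        # x: (B, T_obs, in_dim) raw trajectory features
        pooled = self.local_pool(x.transpose(1, 2)).transpose(1, 2)  # sliding-window average
        fused = self.proj(pooled) + self.remap(x)                    # multi-scale fusion
        _, h = self.gru(fused)                                       # long-range encoding
        return h[-1]                                                 # (B, d_model)
```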
Traditional methods often rely on historical behavior and social interaction information for trajectory prediction. The social environment is a key confounding variable that affects prediction performance, primarily by inducing the model to learn spurious correlations between historical and future trajectories. Specifically, the social environment (denoted as variable S) simultaneously influences both the historical trajectory X and the future trajectory Y of a pedestrian, forming a back-door path X ← S → Y. This means that even if no direct causal relationship exists between certain historical trajectories X and future trajectories Y, S can still introduce spurious statistical associations by influencing both. For example, in crowded scenarios, parallel movement between pedestrians might not stem from group behavior but rather from movement patterns constrained by the environment. If the model relies solely on likelihood estimation, it may incorrectly associate crowded environments with group behavior, thereby reducing prediction accuracy. To address this, our designed social environment encoder not only encodes the information of surrounding pedestrians but also integrates the semantic information of the scene to more comprehensively characterize the influence of the social environment.
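A rough sketch of such a social environment encoder is given below; it encodes neighbor trajectories with a GRU and concatenates a precomputed scene semantic descriptor. The interface (per-agent scene features, dimensions, module names) is an assumption made for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SocialEnvEncoder(nn.Module):
    """Sketch of the social environment branch: neighbor motion plus scene semantics."""
    def __init__(self, d_model=64, scene_dim=128):
        super().__init__()
        self.neighbor_gru = nn.GRU(2, d_model, batch_first=True)
        self.scene_proj = nn.Linear(scene_dim, d_model)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, neighbor_traj, scene_feat):
        # neighbor_traj: (B*N, T_obs, 2) histories of surrounding pedestrians
        # scene_feat:    (B*N, scene_dim) scene semantic descriptor per agent (precomputed)
        _, h = self.neighbor_gru(neighbor_traj)
        s = torch.cat([h[-1], self.scene_proj(scene_feat)], dim=-1)
        return self.out(s)       # (B*N, d_model) social environment representation
```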
3.5. Causal Intervention Fusion Module
The social environment exerts a significant confounding effect on future trajectory prediction. If not adequately addressed, this effect can mislead the model into learning spurious correlations between historical and future trajectories that do not genuinely exist, thereby severely compromising the reliability and accuracy of prediction performance. Consequently, prior to fusing features from different modalities, it is imperative to design effective mechanisms from a causal perspective to identify and eliminate these spurious statistical associations within the model.
3.5.1. Structural Causal Model
To systematically analyze and formalize the aforementioned issue, we utilize a SCM, aiming to clearly characterize the intrinsic causal mechanisms between variables. This SCM comprises three core variables (nodes): X (denoting the historical trajectory), Y (denoting the future trajectory), and S (denoting the social environment). The directed edges in the model explicitly represent the causal relationships between variables, i.e., cause → effect. We elaborate on each directed edge as follows:
X → Y: The future trajectory can be largely inferred from the rich cues embedded within the historical trajectory. The historical trajectory sequence contains dynamic information of pedestrian motion, such as instantaneous velocity, trends in acceleration change, and the potential starting position of the future trajectory, among other critical features. A typical example is that pedestrians in obstacle-free environments tend to maintain approximately straight walking paths without frequently altering their direction of movement.
S → Y: The social environment significantly influences and shapes a pedestrian’s future trajectory decisions. Essentially, the social environment determines the motion patterns a pedestrian is likely to adopt in a specific scenario. For instance, upon detecting another pedestrian approaching head-on, a pedestrian typically proactively changes direction to avoid a collision; additionally, based on social norms, pedestrians tend to adjust their walking speed and path to maintain a comfortable interpersonal distance.
S → X: For reasons analogous to S → Y, a pedestrian’s historical trajectory is also profoundly influenced by their prevailing social environment. It is particularly important to note that the social environment itself is dynamic. For example, a pedestrian might have been walking alone during the observed historical period but could merge into a group and walk in parallel with others in the future.
Based on the above analysis, we arrive at a key conclusion: in the trajectory prediction task, the social environment S acts as a typical confounding variable. As clearly illustrated in the causal diagram in the CIM of Figure 2, there exists a back-door path from X to Y via S: X ← S → Y. The presence of this back-door path indicates a serious estimation bias problem: even if certain historical trajectories X themselves have a low likelihood of producing unreasonable future trajectories Y, the social environment S, as a common cause, can still create an association between X and Y, leading the model to make inaccurate predictions. Learning features from such an intrinsically biased social environment will inevitably cause the model to incorporate and amplify spurious correlations.
Therefore, to thoroughly eliminate the interference from spurious correlations introduced by the back-door path, we employ causal intervention techniques to block the confounding effect of the social environment S. In this process, we utilize the front-door criterion, an essential tool within the causal inference paradigm. The path structure present in the graph provides the basis for our intervention. Similar to the approach in related work [51], we first decouple the spatio-temporal features of the trajectory and then represent them as integrated spatio-temporal features. Herein, temporal features and spatial features serve as mediator variables for the latent features, representing the refined temporal and spatial features extracted by specific networks, respectively. By introducing and leveraging these mediator variables, the model can learn the genuine causal relationship between the input data and the latent state more fairly and robustly.
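For reference, the textbook form of the front-door adjustment that this feature-level intervention instantiates can be written as follows, where $M$ stands for the mediator (here, the decoupled spatio-temporal features); this is the standard formula from causal inference rather than the paper’s exact notation:

$$P\bigl(Y \mid do(X)\bigr) = \sum_{m} P\bigl(m \mid X\bigr) \sum_{x'} P\bigl(Y \mid m, x'\bigr)\, P\bigl(x'\bigr)$$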
3.5.2. Feature Fusion
Following the causal inference paradigm described above, we formalize the causal intervention operation on the input historical trajectory as $do(X)$. The specific operation is implemented as follows: the trajectory features and social environment features are obtained separately through our designed multi-source encoder. These features specifically include the spatio-temporal features and keypoint features of the trajectory, as well as the scene semantic features and pedestrian interaction features of the social environment.
We employ a CAF module to effectively integrate these two types of features. The cross-attention mechanism is particularly adept at modeling complex associations between entities such as the trajectory representation $x$ and the social environment representation $s$.
We customize the application of the back-door adjustment to the trajectory prediction task through the following details: Firstly, the trajectory features and social environment variables are projected into a common latent space via linear transformations to ensure their comparability. Secondly, the similarity matrix between the projected trajectory features $W_q x$ and the projected social environment features $W_k s$ is computed using matrix multiplication. Finally, the computed similarity matrix is used as a weighting to aggregate the social environment features, and the aggregated feature map is subsequently projected back to the original space, completing the fusion process:

$$F = W_o\,\mathrm{softmax}\!\left(\frac{(W_q x)(W_k s)^{\top}}{\sqrt{d}}\right)(W_v s)$$

where $F$ represents the final output tensor of the Cross-Attention Fusion module; $W_q$, $W_k$, and $W_v$ are learnable linear transformation parameters in the CAF module, responsible for projecting the input features $x$ and $s$ into the same latent space; $W_o$ represents another linear transformation responsible for projecting the fused features to the two-dimensional coordinate space; and $\sqrt{d}$ denotes a scaling factor used to control the magnitude of the attention weights, where $d$ is the feature dimension. Since accurately obtaining the true distribution $P(s)$ for each group of social environment configurations $s$ is infeasible in practical applications, we make a fairness assumption: we assume that each social environment configuration can be fairly integrated into every scene, meaning the occurrence probability of social environments follows a uniform distribution $P(s) = 1/n$, where $n$ denotes the total number of environment types. This assumption helps prevent the model from overfitting to specific environments.
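A minimal PyTorch sketch of this cross-attention fusion step is shown below, assuming a single trajectory query attending over $n$ sampled environment configurations weighted uniformly a priori; layer sizes, tensor shapes, and the class name CrossAttentionFusion are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Sketch of the CAF step: project trajectory (query) and environment (key/value)
    features into a shared space, weight the environment features by the similarity
    matrix, and project the aggregated result back."""
    def __init__(self, d_x=64, d_s=64, d=64, d_out=64):
        super().__init__()
        self.W_q = nn.Linear(d_x, d)
        self.W_k = nn.Linear(d_s, d)
        self.W_v = nn.Linear(d_s, d)
        self.W_o = nn.Linear(d, d_out)
        self.scale = math.sqrt(d)

    def forward(self, x, s):
        # x: (B, 1, d_x) query from the trajectory branch
        # s: (B, n, d_s) n social environment configurations (uniform prior per the text)
        attn = torch.softmax(self.W_q(x) @ self.W_k(s).transpose(1, 2) / self.scale, dim=-1)
        fused = attn @ self.W_v(s)          # aggregate environment features
        return self.W_o(fused)              # project back: (B, 1, d_out)
```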
3.6. Counterfactual Reasoning Decoder
3.6.1. Counterfactual Reasoning
Traditional feature fusion methods based on historical scenes exhibit significant limitations: models can only perform feature interactions within the scope of the observed historical context and cannot effectively handle the uncertainty of future scenes. However, the essence of pedestrian trajectory prediction requires the model to consider multiple reasonable future possibilities. To address this, we introduce a counterfactual reasoning mechanism, enabling the target pedestrian to focus more on potential areas in future scenes that are relevant to the historical context, thereby generating more forward-looking trajectory predictions.
The implementation of this innovative process relies on three key components: first, a candidate set of possible future locations and their corresponding positional encodings; second, the motion state encoding of the target pedestrian; and finally, a mechanism establishing the relationship between the two. Specifically, we traverse and explore the scene semantic graph and diversely sample all navigable nodes within it, yielding a diverse set of potential path nodes that cover various future possibilities.
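A possible realization of this candidate-set construction is sketched below: a bounded breadth-first traversal of the scene semantic graph followed by random sampling of the reached navigable nodes. The graph interface (a dict from node id to navigable neighbours), the depth bound, and the sample size are assumptions for illustration.

```python
import random

def sample_candidate_nodes(semantic_graph, start, num_samples=32, max_depth=12):
    """Breadth-first traversal of a scene semantic graph, then random sampling of
    navigable nodes, approximating the candidate-set construction described above."""
    visited, frontier = {start}, [start]
    for _ in range(max_depth):
        nxt = []
        for node in frontier:
            for nbr in semantic_graph.get(node, []):
                if nbr not in visited:
                    visited.add(nbr)
                    nxt.append(nbr)
        frontier = nxt
    reachable = list(visited - {start})
    return random.sample(reachable, min(num_samples, len(reachable)))

# usage (toy chain-shaped graph):
# candidates = sample_candidate_nodes({0: [1], 1: [0, 2], 2: [1, 3]}, start=0)
```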
In the feature fusion stage, we employ a multi-head attention mechanism to perform counterfactual reasoning across multiple subspaces, which naturally enriches feature diversity and enables comprehensive coverage of plausible futures. In the specific implementation, we use 16 parallel attention heads. The motion encoding of the target pedestrian is linearly projected to obtain the Query vector $Q$, while the features of the future candidate locations are linearly projected to obtain the Key $K$ and Value $V$ vectors. Each attention head independently calculates its attention weights. The formula for each head is as follows:

$$Q_h = W_h^Q F, \quad K_h = W_h^K C, \quad V_h = W_h^V C$$

where $F$ denotes the motion encoding of the target pedestrian and $C$ denotes the features of the future candidate locations. The calculation for each attention head is as follows:

$$\mathrm{head}_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_h}}\right) V_h$$

The output of the multi-head attention layer is the fused feature $F_{cf}$:

$$F_{cf} = W_O\,\left[\mathrm{head}_1; \mathrm{head}_2; \ldots; \mathrm{head}_{16}\right]$$

Here, $W_O$ represents the output projection matrix. This process not only implements counterfactual reasoning, bridging the gap from historical observation to future possibilities, but also represents the final stage of fusing features from different modalities. After completing all the steps, the model obtains a top-level feature representation of the target pedestrian within the environment. This representation is then passed to the decoder to produce the final prediction results.
3.6.2. Multimodal Trajectory Decoder
Through the aforementioned encoding and reasoning processes, we obtain a final latent feature vector rich in information. It must be emphasized that predicting future pedestrian trajectories is inherently a probabilistic problem with uncertainty, as future trajectory points possess multiple possibilities. Traditional point estimation methods often capture only the primary mode of the conditional distribution and cannot fully reflect the uncertainty of the future.
However, the result obtained based on maximum likelihood estimation may not correspond to the true motion intention, as it might only represent a local optimum within the probability distribution. Even if this estimate achieves a high likelihood value on the training data, systematic errors may still arise due to insufficient model capacity or optimization difficulties. Therefore, we need to enhance the model’s expressive power to break through the limitations of local optima and better approximate the global optimum.
To address this issue, we employ an MLP as the decoder to directly output the multimodal trajectory distribution over future time steps. The mathematical expression for this decoder is as follows:

$$\hat{Y} = \mathrm{MLP}(F_{cf})$$

where $F_{cf}$ denotes the fused feature obtained from the aforementioned counterfactual reasoning, and $\hat{Y}$ represents the multiple future trajectories predicted by the model. This design enables the model to simultaneously generate multiple reasonable and diverse future trajectory hypotheses, each corresponding to a different future scene possibility. It thereby more comprehensively captures the uncertain nature of pedestrian motion, enhancing the practicality and reliability of the prediction results.
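A minimal sketch of such a multimodal MLP decoder is given below; the number of modes $K$, the hidden width, and the assumption of 2-D coordinate outputs are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultimodalDecoder(nn.Module):
    """Sketch of the multimodal MLP decoder: the fused counterfactual feature is mapped
    to K trajectory hypotheses over the prediction horizon."""
    def __init__(self, d_model=64, pred_len=12, num_modes=20):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 2 * d_model), nn.ReLU(),
            nn.Linear(2 * d_model, num_modes * pred_len * 2))
        self.pred_len, self.num_modes = pred_len, num_modes

    def forward(self, fused_feat):
        # fused_feat: (B, d_model) output of the counterfactual reasoning stage
        out = self.mlp(fused_feat)
        return out.view(-1, self.num_modes, self.pred_len, 2)   # (B, K, T_pred, 2)
```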
5. Conclusions
This paper proposed CICR, a novel trajectory prediction framework that addresses key limitations of existing methods through causal inference and counterfactual reasoning. Our approach explicitly tackles the confounding bias introduced by social environments via causal intervention, and it enables proactive reasoning about future uncertainties through counterfactual scenario simulation. Experimental results demonstrate that CICR achieves superior performance across multiple benchmarks, with quantitative evaluations showing an average ADE/FDE of 0.17/0.24 on ETH/UCY and 12.27/15.65 on the cross-domain AVD dataset, confirming its particular strengths in long-term prediction and cross-domain generalization.
While the proposed CICR framework demonstrates strong prediction performance, it still has certain limitations that point to valuable directions for future research. Firstly, the current model incorporates modules such as multi-source encoding, cross-attention fusion, and counterfactual reasoning. To further reduce its complexity, we plan to explore techniques like model lightweighting and knowledge distillation in future work. This will enhance inference speed and better adapt the framework to real-time application scenarios with stringent latency requirements. Secondly, the current work primarily focuses on interactions among pedestrians. A valuable future direction would be to extend the CICR framework to a broader range of agent trajectory prediction tasks, such as mixed interactive scenarios involving various traffic participants like vehicles and bicycles, to validate its generalizability.