1. Introduction
With the rapid progress of autonomous driving technologies, pedestrian trajectory prediction has become essential for ensuring road safety and reliable decision-making. According to the World Health Organization, approximately 1.27 million fatalities occur globally each year due to road traffic accidents, with pedestrians accounting for over 20% of cases [
1]. Accurate trajectory prediction thus plays a crucial role in mitigating collision risks and supporting intelligent path planning.
Most existing trajectory prediction methods are designed for fixed surveillance cameras [
2]. However, vehicle-mounted scenarios differ fundamentally: continuous ego-motion breaks static coordinate assumptions, frequent occlusions obscure pedestrian visibility, and dynamic urban backgrounds demand more context-aware modeling [
3,
4,
5]. These factors collectively reduce the effectiveness of existing static-view methods and call for solutions tailored to vehicle-centric perception. Recent advances include Generative Adversarial Network (GAN)-based trajectory generators, graph neural networks for social interaction modeling, and pseudo oracle mechanisms for top-down datasets. Yet such approaches either assume fixed viewpoints, focus on intention classification rather than continuous trajectories, or overlook ego-motion stabilization—limitations that hinder their applicability to real-world driving videos.
To overcome these challenges, we propose V-PTP-IC (Vehicle-view Pedestrian Trajectory Prediction with Interaction Consideration), an end-to-end trajectory prediction framework designed specifically for vehicle-mounted perspectives. The model incorporates ego-motion compensation, interaction-aware graph reasoning, and unified feature fusion to cope with dynamic viewpoints and occlusions.
The main contributions of this work are summarized as follows:
Propose an end-to-end vehicle-view trajectory prediction framework integrating interaction modeling and adaptive ego-motion compensation.
Introduce a SIFT-based static keypoint matching strategy to correct camera-induced motion jitter and improve stability.
Design a GCN-based relational graph to model dynamic interactions among traffic participants.
Present a unified feature fusion module that combines motion, depth, scene, and interaction cues via multi-head attention.
Achieve superior performance on JAAD and PIE datasets, reducing ADE by 25–27% and FDE by 32–34% compared with baselines.
The remainder of this paper is organized as follows:
Section 2 reviews related work;
Section 3 details the methodology of V-PTP-IC;
Section 4 presents experimental results; and
Section 5 concludes the paper.
2. Related Work
Pedestrian trajectory prediction aims to forecast future movement paths based on historical trajectory data, while accounting for interactions with surrounding elements, including other pedestrians, vehicles, and static obstacles. With the growing adoption of visual sensing technologies, existing research can be broadly categorized into two main perspectives: fixed-view (or global-view) and vehicle-mounted-view approaches.
2.1. Fixed/Global-View Methods
Under fixed-camera or global-view settings, pedestrian trajectory prediction methods are generally classified into three categories: physics-based models, traditional machine learning methods, and deep learning models.
Physics-based models, such as constant velocity (CV), constant acceleration (CA), reasoning-based models [
6], expert systems [
7], and other handcrafted methods [
8], offer computational efficiency owing to their simplified kinematic assumptions. However, these assumptions constrain their capacity to capture nonlinear pedestrian movements and complex social interactions, thereby limiting applicability in crowded or highly dynamic environments. Extensions based on the Markov Decision Process (MDP) [
9,
10] enhance adaptability by explicitly modeling state transitions and decision-making behaviors, yet they remain inadequate for highly interactive or visually complex scenes.
Traditional machine learning techniques improve prediction performance via probabilistic inference and data-driven learning, encompassing four primary paradigms: sequential modeling (e.g., Hidden Markov Models [
11]), regression-based approaches (e.g., Gaussian Processes [
12]), discriminative classifiers (e.g., Support Vector Machines [
13]), and cluster-based group recognition methods. Among these, clustering-based approaches have been particularly influential in modeling pedestrian flow patterns. For instance, hierarchical clustering combined with Dynamic Time Warping (DTW) effectively manages variable-length trajectories, with average linkage exhibiting robust performance despite considerable computational overhead [
14]. Density-based methods, such as DBSCAN, provide superior scalability and robustness to outliers, facilitating real-time crowd monitoring without predefined cluster numbers [
15]. More recently, spatio-temporal fusion clustering techniques—like QuickBundles and Hesitation Points (HP) [
16]—have enhanced discrimination of similar trajectories across different time periods. Nevertheless, these methods depend on handcrafted features and extensive parameter tuning (e.g., DTW thresholds, HP parameters, DBSCAN epsilon and minPts), which hampers generalizability and has spurred the adoption of deep learning.
Deep learning approaches have emerged as the dominant paradigm, owing to their ability to automatically learn spatio-temporal dependencies and social interaction patterns from large-scale datasets. Early deep models such as Social-LSTM [
17] and Graph Convolutional Network (GCN)-based frameworks [
18] demonstrated that pedestrian-to-pedestrian interactions can be encoded directly from trajectory histories. More recently, Transformer-based architectures have become increasingly influential. AgentFormer [
19] jointly models temporal and social dimensions with agent-aware attention, while HiVT [
20] adopts a hierarchical vectorized Transformer to efficiently handle multi-agent motion in traffic scenes. SocialCircle [
21] introduces an angle-based social interaction representation that can be plugged into existing predictors to better capture fine-grained interaction patterns. Together, these works reflect a clear trend toward unified attention-driven architectures that simultaneously model motion dynamics and social context. Additionally, the TPPO [
22] framework incorporates a pseudo oracle mechanism within a GAN-based architecture to generate socially compliant and multimodal trajectories in bird’s-eye-view scenarios. Although TPPO delivers strong performance under static global viewpoints, its reliance on viewpoint invariance renders it unsuitable for vehicle-mounted scenes, where ego-motion and perspective changes substantially distort coordinate consistency. These limitations highlight the need for adaptive strategies to handle dynamic viewpoints, thereby motivating the solutions proposed in this work.
2.2. Vehicle-Mounted-View Methods
Research on pedestrian trajectory prediction from vehicle-mounted camera perspectives remains limited, yet this setting poses substantially greater challenges than fixed-camera scenarios [
23]. Key difficulties include viewpoint variability and image distortion: the continuous motion of vehicle-mounted cameras induces non-stationary coordinate systems and frequent perspective shifts, while wide-angle lenses often cause fisheye distortion. To address these, calibration procedures and fisheye-specific camera models have been proposed [
24], although they are sensitive to noise and demand precise parameter tuning.
Another critical challenge involves occlusion and the accurate modeling of multi-pedestrian interactions. Vehicle-mounted views are more prone than fixed views to partial occlusion by moving objects and road infrastructure. Recent works, such as Social-STGCNN [
25], have shown strong capabilities in capturing interaction dependencies under occlusion. Extending this, RAIDN [
26] introduces a graph-structured real-time pedestrian crossing intention prediction model. Via a dual-branch architecture that separately encodes pedestrian actions and their interactions with surrounding traffic participants, RAIDN emphasizes binary intention classification and offers insights for improving driving safety. While its interaction graph provides a solid foundation for multi-agent relation modeling, the method is limited to discrete intention prediction, lacks support for continuous trajectory forecasting, and omits ego-motion stabilization—essential for processing video streams from moving vehicles.
To this end, we draw inspiration from RAIDN’s relational graph design and extend it within our V-PTP-IC framework to achieve comprehensive integration of motion dynamics, inter-agent interactions, and scene cues, enabling robust prediction under occlusions and viewpoint variations. Overall, neither intention-recognition models tailored for driving views nor trajectory-prediction frameworks developed for bird’s-eye perspectives can adequately address the combined challenges of ego-motion, frequent perspective shifts, and partial visibility inherent in vehicle-mounted scenarios. This underscores the necessity of a unified, vehicle-view-specific trajectory prediction solution.
In response, our proposed V-PTP-IC provides an end-to-end framework that jointly incorporates ego-motion stabilization, interaction-aware graph modeling, and multi-modal scene feature fusion, addressing these challenges within a single cohesive architecture and enabling more accurate and reliable pedestrian trajectory prediction in real-world driving environments.
3. V-PTP-IC
V-PTP-IC employs video sequences from vehicle-mounted cameras to predict future pedestrian trajectories in complex traffic scenarios. As illustrated in
Figure 1, V-PTP-IC contains the following modules: (1) a pedestrian trajectory tracking module that utilizes YOLOv10 for traffic participant detection, Simple Online and Realtime Tracking (SORT) for continuous tracking, Scale-Invariant Feature Transform (SIFT)-based keypoint matching for trajectory stabilization, and MiDaS for depth estimation; (2) a traffic participant interaction module that constructs an undirected graph structure and applies a three-layer graph convolutional network to extract global interaction features between the target pedestrian and surrounding traffic participants; (3) a unified feature processing module that employs multi-head self-attention to fuse trajectory, depth, scene, and interaction features into a unified representation; (4) a trajectory prediction module that leverages an LSTM network to decode the fused features and generate accurate future trajectories. The modules are introduced as follows:
3.1. Pedestrian Trajectory Tracking Module
3.1.1. Target Detection and Tracking
The detection and tracking module employs a detection-then-tracking framework, in which YOLOv10 [
27] is applied for pedestrian detection and SORT [
28] is utilized for continuous multi-object tracking. SORT is configured with a maximum tracking age of 30 frames, a minimum confirmation threshold of 5 frames, and an Intersection-over-Union (IoU) matching threshold, enabling robust tracking in resource-constrained vehicle-mounted camera scenarios.
For each detected pedestrian, an independent Kalman filter-based tracker is initialized using a standard seven-dimensional state vector that includes the bounding box center coordinates, area, aspect ratio, and their corresponding velocity components.
During tracking of the identified detection boxes, a two-stage matching process is employed. First, initial filtering based on the IoU threshold is applied to discard low-overlap pairs, followed by the Hungarian algorithm [
29] for optimal assignment of bounding boxes. The Hungarian algorithm finds an optimal matching scheme between existing target trajectories and current detection boxes, minimizing the total assignment cost based on IoU.
Post-processing involves three refinement strategies:
Coordinate consistency restoration: All detections are transformed back to the original image coordinate space and mapped to a fixed-resolution reference frame, eliminating inconsistencies caused by preprocessing operations such as padding or cropping. This ensures that all trajectory segments share a uniform spatial reference.
Coordinate normalization: To remove resolution-dependent variations, all bounding box coordinates are normalized to the unit interval relative to the reference frame size. This produces a dimensionless representation, making data association thresholds invariant to image resolution.
Linear interpolation for missing frames: Short-term tracking dropouts caused by occlusions or missed detections are mitigated by interpolating the center positions and scales between two high-confidence bounding boxes. This is applied only when the temporal gap is below a predefined threshold to prevent generating spurious trajectories.
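The normalization and gap-interpolation steps can be sketched as follows; the reference resolution and the maximum allowed gap are placeholder values, not the settings used in the paper:

```python
import numpy as np

def normalize_boxes(boxes, ref_w, ref_h):
    """Normalize [x1, y1, x2, y2] boxes to the unit interval of a reference frame."""
    scale = np.array([ref_w, ref_h, ref_w, ref_h], dtype=np.float32)
    return np.asarray(boxes, dtype=np.float32) / scale

def interpolate_gap(track, max_gap=5):
    """Linearly interpolate missing frames in a track.

    track: dict frame_index -> (center x, center y, w, h); gaps longer than
    max_gap are left untouched to avoid generating spurious trajectories.
    """
    frames = sorted(track)
    for a, b in zip(frames[:-1], frames[1:]):
        gap = b - a
        if 1 < gap <= max_gap:
            box_a, box_b = np.asarray(track[a]), np.asarray(track[b])
            for k in range(1, gap):
                alpha = k / gap
                track[a + k] = tuple((1 - alpha) * box_a + alpha * box_b)
    return track
```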
Furthermore, to extend 2D trajectories into 3D space and address the inherent depth ambiguity under monocular camera perspectives, we introduce the MiDaS monocular depth estimator [
30], which is pre-trained on multiple datasets including NYU-Depth V2 [
31], KITTI [
32], and MegaDepth [
33]. To reduce computational overhead while maintaining temporal consistency, dense depth maps are computed every 5 frames, with intermediate frames filled through linear interpolation. For each pedestrian bounding box, the depth value is extracted as the median of pixel-level depths within the bounding box region, mitigating the influence of outliers from noisy depth predictions. This results in an augmented trajectory representation that includes spatial coordinates, depth value, and bounding box dimensions. This depth-augmented representation enhances the model’s spatial reasoning capability and provides essential geometric constraints for subsequent trajectory prediction.
3.1.2. SIFT-Based Trajectory Stabilization
In vehicle-mounted camera systems, raw pedestrian trajectories obtained from object detection and tracking often suffer from geometric distortions and jitter caused by camera ego-motion, detection noise, and viewpoint variations. To address these issues, a global motion compensation strategy based on SIFT [
34] is employed to identify stable background keypoints across the entire image frame. Although keypoint detection is performed over the full frame, most stable keypoints typically appear in the upper image regions, where static environmental structures—such as tree canopies, lamp posts, and building edges—are more prevalent. These stationary background features serve as geometric anchors for estimating and compensating camera-induced displacements in the image plane, thereby improving trajectory stability.
Given an input frame, it is first converted to grayscale to enhance the robustness of feature detection, after which a set of SIFT keypoints and corresponding descriptors are extracted. Temporal correspondences are then established using the pyramidal Lucas–Kanade Optical Flow (LK) [35], where each SIFT keypoint is propagated to the subsequent frame by adding the displacement vector estimated between the background regions of the two frames. Keypoints whose apparent motion exceeds a predefined pixel threshold are discarded so that only static points are retained.
The remaining static keypoints are matched frame-to-frame using the Fast Library for Approximate Nearest Neighbors (FLANN) [
36], which performs
k-nearest neighbor search in the 128-dimensional SIFT descriptor space based on Euclidean distance. To reject ambiguous matches, Lowe’s ratio test [
34] is applied, ensuring that the ratio of the smallest to the second-smallest descriptor distance falls below a predefined threshold.
We use the matched static keypoints in consecutive frames to estimate the dominant inter-frame motion via a Random Sample Consensus (RANSAC)-based affine transformation [37]. The inverse of this transformation is applied to each pedestrian point, and the compensation offset is computed as the scaled difference between the transformed and original positions, with a scaling factor controlling the compensation strength. The stabilized point is then obtained by adding this offset to the original position.
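A compact OpenCV sketch of this stabilization chain is given below; the motion threshold, FLANN parameters, Lowe ratio, and the scaling factor alpha are illustrative assumptions rather than the values used in the paper:

```python
import cv2
import numpy as np

def static_keypoints(prev_gray, gray, max_motion=2.0):
    """Detect SIFT keypoints in the previous frame and keep those whose
    LK optical-flow displacement stays below max_motion pixels (assumed value)."""
    sift = cv2.SIFT_create()
    kps, des = sift.detectAndCompute(prev_gray, None)
    pts = np.float32([kp.pt for kp in kps]).reshape(-1, 1, 2)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    disp = np.linalg.norm(nxt - pts, axis=2).ravel()
    keep = (status.ravel() == 1) & (disp < max_motion)
    return pts[keep].reshape(-1, 2), des[keep]

def estimate_ego_motion(prev_gray, gray, ratio=0.75):
    """FLANN matching of static SIFT descriptors with Lowe's ratio test,
    followed by a RANSAC affine fit of the dominant inter-frame motion."""
    p0, d0 = static_keypoints(prev_gray, gray)
    kps1, d1 = cv2.SIFT_create().detectAndCompute(gray, None)
    p1 = np.float32([kp.pt for kp in kps1])
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    matches = flann.knnMatch(d0, d1, k=2)
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    src = np.float32([p0[m.queryIdx] for m in good])
    dst = np.float32([p1[m.trainIdx] for m in good])
    A, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)
    return A  # 2x3 affine approximating the camera-induced motion

def compensate(point, A, alpha=0.8):
    """Apply the inverse affine to a pedestrian point and blend the offset."""
    A_inv = cv2.invertAffineTransform(A)
    p_stab = A_inv @ np.array([point[0], point[1], 1.0])
    offset = alpha * (p_stab - np.array(point))
    return tuple(np.array(point) + offset)
```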
Finally, to further smooth each trajectory, a constant-velocity Kalman filter is applied with a four-dimensional state vector comprising the 2D position and velocity. The filter predicts the next state using a standard linear transition model parameterized by the inter-frame interval, and updates the estimate with the ego-motion-compensated observation via the Kalman gain, which minimizes the posterior covariance in the conventional manner. The resulting stabilized trajectories preserve the original spatial format while significantly reducing jitter and improving temporal coherence, thereby providing a reliable foundation for downstream social interaction modeling and long-horizon prediction.
3.1.3. Scene Feature Representation
To enable the model to comprehensively perceive the environmental context, we design a lightweight multi-modal scene feature encoder that integrates information from object-level semantics, geometric depth cues, and static texture structures.
Formally, the scene feature vector f_scene is defined as the fusion of three embeddings, f_scene = Φ(f_det, f_depth, f_sift), where f_det, f_depth, and f_sift denote the feature embeddings extracted from YOLOv10 detection, monocular depth estimation, and SIFT-based static keypoints, respectively.
The YOLOv10 detector provides bounding boxes, class probabilities, and confidence scores for all detected traffic participants. Each frame yields a semantic distribution of detections {(b_i, c_i, s_i)}, where b_i represents the normalized position and scale of object i, c_i is its categorical one-hot vector, and s_i is the corresponding confidence score. A fully connected layer with ReLU activation encodes this distribution into a compact semantic embedding f_det, capturing the spatial layout and object identity distribution of the scene.
A pre-trained MiDaS depth network estimates per-pixel depth values, from which we extract both global and object-wise depth statistics: the global mean and variance of the depth map, together with grid-level and per-object depth summaries that encode coarse spatial and instance-level depth patterns. The resulting 64-dimensional embedding f_depth mitigates depth ambiguity in monocular perception and enhances the model’s understanding of spatial geometry.
Dense SIFT keypoints extracted from static background regions provide fine-grained texture and structural anchors {(q_j, d_j)}, where q_j and d_j denote the position and 128-D descriptor of the j-th keypoint. We aggregate these features via grid pooling followed by a two-layer MLP to produce a 64-D representation f_sift emphasizing static structural cues.
The three modalities are projected into a common latent space and fused through a lightweight attention mechanism, in which learned attention weights balance the semantic, geometric, and structural importance of each modality. The final scene embedding is computed as the attention-weighted sum of the projected modality features, yielding a 128-D unified representation that captures semantic, geometric, and textural aspects of the driving environment. This lightweight integration scheme allows the network to maintain high inference efficiency while achieving strong contextual awareness in dynamic traffic scenes.
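A minimal PyTorch sketch of this fusion step, assuming 64-D modality embeddings and a simple learned softmax weighting (the exact attention form used in the paper is not specified here):

```python
import torch
import torch.nn as nn

class SceneFusion(nn.Module):
    """Fuse detection, depth, and SIFT embeddings into a 128-D scene feature."""

    def __init__(self, d_det=64, d_depth=64, d_sift=64, d_out=128):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_out) for d in (d_det, d_depth, d_sift)])
        self.score = nn.Linear(d_out, 1)  # lightweight attention scorer

    def forward(self, f_det, f_depth, f_sift):
        # Project each modality into the common 128-D latent space.
        z = torch.stack([p(f) for p, f in zip(self.proj, (f_det, f_depth, f_sift))], dim=1)
        # Learned attention weights balancing semantic / geometric / structural cues.
        alpha = torch.softmax(self.score(torch.tanh(z)), dim=1)  # (B, 3, 1)
        return (alpha * z).sum(dim=1)                            # (B, 128)

# Usage: f_scene = SceneFusion()(f_det, f_depth, f_sift) with (B, 64) inputs.
```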
3.2. Traffic Object Interaction Graph Network
In real-world traffic scenarios, pedestrian trajectories are inevitably influenced by surrounding traffic participants such as vehicles, cyclists, and traffic control facilities. The spatial positions and motion patterns of these participants impose both implicit interaction constraints and explicit physical limitations on pedestrian behavior. To explicitly model these interaction relationships, we construct a Traffic Object Interaction Graph to capture the relational dependencies between the target pedestrian and nearby traffic participants [
26]. This enables the model to learn and represent complex interaction dynamics, thereby improving the accuracy of pedestrian trajectory prediction.
Given a video frame, YOLOv10 is utilized to detect bounding boxes along with their semantic category labels. For each target pedestrian, up to M nearby objects are selected to form a graph with M + 1 nodes, designating the pedestrian as the central node. Each node is characterized by a five-dimensional spatial descriptor capturing the pixel coordinates of the bounding box center, the monocular depth estimate, and the box dimensions. Additionally, a semantic category label (e.g., person, car, bicycle) enriches each node, furnishing class-aware cues that bolster interaction modeling in cluttered scenes.
To encode both spatial and semantic information, each semantic category label is first mapped into a learnable semantic embedding through a trainable embedding layer. Meanwhile, a two-layer fully connected network is applied to each node’s spatial descriptor to generate a positional embedding. The complete node feature is then obtained by concatenating the positional and semantic embeddings, followed by a fusion layer that projects the combined representation into a 128-dimensional latent space to produce the initial node feature.
The graph topology is determined by constructing an undirected adjacency matrix that selectively connects the target pedestrian to a subset of relevant neighboring objects. We adopt a neighbor selection strategy that combines spatial proximity with historical interaction patterns. Specifically, for each candidate object i, a composite interaction score S_i is computed to balance geometric distance and temporal co-occurrence. The first term of the score is the normalized Euclidean distance between the 3D spatial position of node i and that of the target pedestrian, negated so that closer objects receive higher scores; the second term is the historical interaction frequency H_v(c_i, t) of object category c_i within video v up to time t, scaled by a weighting coefficient. The interaction history is maintained using an exponentially decaying counter that records co-occurrence statistics for each pair of object categories, with a decay factor applied per frame to emphasize recent interactions. The top M objects with the highest scores are selected, and bidirectional edges are established between the target pedestrian and these neighbors.
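The neighbor-selection rule can be sketched as follows; the score combines the negated normalized distance with a decayed co-occurrence count as described, while the weighting coefficient, decay factor, normalization constant, and M are placeholder values:

```python
import numpy as np
from collections import defaultdict

LAMBDA = 0.3   # weighting coefficient (placeholder)
DECAY = 0.9    # per-frame decay factor (placeholder)
history = defaultdict(float)  # (pedestrian_category, neighbor_category) -> decayed count

def update_history(co_occurring_pairs):
    """Exponentially decay old counts, then add this frame's co-occurrences."""
    for key in list(history):
        history[key] *= DECAY
    for pair in co_occurring_pairs:
        history[pair] += 1.0

def select_neighbors(ped_pos, candidates, top_m=5, d_norm=50.0):
    """Rank candidates by proximity and interaction history; keep the top M.

    ped_pos: 3D position (x, y, depth) of the target pedestrian.
    candidates: list of (object_id, category, 3D position).
    """
    scores = []
    for obj_id, cat, pos in candidates:
        dist = np.linalg.norm(np.asarray(pos) - np.asarray(ped_pos)) / d_norm
        score = -dist + LAMBDA * history[("person", cat)]
        scores.append((score, obj_id))
    scores.sort(reverse=True)
    return [obj_id for _, obj_id in scores[:top_m]]
```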
After constructing the graph structure, we encode interaction relationships by stacking L graph convolutional layers. Each layer performs neighborhood aggregation followed by a nonlinear transformation, incorporating residual connections and layer normalization to ensure stable training and efficient information flow. This process allows each node to integrate multi-hop neighborhood information. After L rounds of message passing, we obtain node representations that incorporate neighborhood context. These node embeddings are then aggregated into a 128-dimensional interaction feature vector, which models the temporal interaction patterns observed in the trajectory sequence. This feature encapsulates the overall influence of the surrounding traffic environment on the target pedestrian and is subsequently combined with the pedestrian’s trajectory, depth, and global scene features in the final fusion module for prediction.
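A minimal sketch of the stacked graph-convolution encoder with residual connections and layer normalization; the dense adjacency normalization and mean pooling are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: neighborhood aggregation + residual + LayerNorm."""

    def __init__(self, dim=128):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, adj_norm):
        # x: (N, dim) node features; adj_norm: (N, N) normalized adjacency with self-loops.
        h = torch.relu(self.lin(adj_norm @ x))
        return self.norm(x + h)  # residual connection stabilizes training

class InteractionGCN(nn.Module):
    """Three stacked GCN layers; node embeddings are mean-pooled into a 128-D
    interaction feature describing the influence of surrounding traffic."""

    def __init__(self, dim=128, layers=3):
        super().__init__()
        self.layers = nn.ModuleList([GCNLayer(dim) for _ in range(layers)])

    def forward(self, x, adj):
        # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}.
        a = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = a.sum(-1).clamp(min=1e-6).pow(-0.5)
        adj_norm = d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)
        for layer in self.layers:
            x = layer(x, adj_norm)
        return x.mean(dim=0)  # (128,) global interaction feature
```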
3.3. Unified Feature Processing Module
To enable the model to capture both individual motion dynamics and environmental context, we develop a Unified Feature Processing Module (UFPM) that jointly encodes multi-source features and integrates them into a consistent high-dimensional representation. This module consolidates trajectory, depth, scene, and interaction information, establishing a unified embedding space for downstream trajectory prediction.
Given a trajectory sequence representing the temporal evolution of the pedestrian’s position, depth, and bounding box attributes, the module first applies three parallel encoders:
Trajectory Encoder: A recurrent or feed-forward network that models temporal dependencies and velocity variations, yielding a dynamic behavior embedding.
Depth Encoder: A lightweight MLP that transforms monocular depth statistics into a geometric embedding, recovering 3D cues lost in the 2D image space.
Scene Encoder: Built upon the lightweight multi-modal scene feature representation, it compresses YOLO-based semantics, SIFT-based texture, and depth context into a normalized embedding.
Each branch produces a compact vector of a common dimension through linear projection and non-linear activation (ReLU + LayerNorm), ensuring cross-modality consistency.
In parallel, the Traffic Object Interaction Graph Network generates the global interaction feature which encapsulates the spatio-temporal dependencies and social constraints among surrounding entities through graph convolution and temporal aggregation. This feature complements the trajectory-level motion cues with socially aware relational context.
To integrate all four modalities, the encoded features are concatenated into a modality sequence. A multi-head self-attention (MHA) layer is then applied to model the interdependence among modalities, allowing adaptive weighting based on contextual relevance. The attention output is aggregated through mean pooling and normalized to yield the final unified feature. This compact representation captures both temporal dynamics and semantic coherence, providing a robust foundation for trajectory prediction and downstream reasoning.
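A sketch of this modality fusion using PyTorch's nn.MultiheadAttention; the number of heads is an assumption, and the 128-D common dimension follows the feature sizes used elsewhere in the paper:

```python
import torch
import torch.nn as nn

class UnifiedFusion(nn.Module):
    """Fuse trajectory, depth, scene, and interaction embeddings with self-attention."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_traj, f_depth, f_scene, f_inter):
        # Treat the four modality embeddings as a length-4 token sequence.
        tokens = torch.stack([f_traj, f_depth, f_scene, f_inter], dim=1)  # (B, 4, dim)
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.norm(attended.mean(dim=1))                            # (B, dim)

# Usage: f_u = UnifiedFusion()(f_traj, f_depth, f_scene, f_inter) with (B, 128) inputs.
```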
3.4. Trajectory Prediction Module
Building upon the unified fused representation, we introduce a multi-layer LSTM-based sequence decoder to autoregressively forecast multi-modal pedestrian trajectories. This architecture captures the temporal dynamics of motion while integrating the environmental semantics and social interaction cues encoded in the fused feature, enabling robust predictions in occluded, crowded scenes.
The decoder initializes its LSTM hidden and cell states via a linear projection that maps the fused representation to the hidden dimension. It then generates K possible trajectory modes, each consisting of relative 5D states at every timestep, where H = 12 is the prediction horizon (with each timestep corresponding to 0.4 s).
At each timestep t, the hidden state is updated autoregressively, and the next relative 5D state is predicted by a Multi-Layer Perceptron (MLP) head that maps the hidden state to the output space.
To enable diverse trajectory predictions, we employ a mixture density network (MDN) that models the uncertainty in future states by sampling from a mixture of Gaussians. The MDN parameters, namely the mixture weights, means, and standard deviations, are learned as part of the decoder.
During training, we use teacher forcing, where the true trajectory is fed into the decoder. During inference, we sample different modes by drawing from the mixture of Gaussians at each timestep. This stochastic sampling mechanism allows the model to capture multiple plausible future trajectories, providing a diverse set of predictions.
The prediction process is autoregressive, where each predicted step at timestep t is conditioned on the previous prediction. The sampled trajectory modes are then iteratively fed back into the decoder to predict future steps, ensuring temporal consistency while maintaining diverse outcomes. This multi-modal prediction framework allows us to account for the inherent uncertainty in pedestrian motion, especially in complex environments with occlusions, interactions, and high-density crowds.
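The decoding loop can be sketched as below; the number of mixture components, hidden size, and the exact MDN parameterization are assumptions, and teacher forcing is omitted. Sampling the decoder several times yields the different trajectory modes:

```python
import torch
import torch.nn as nn

class MDNDecoder(nn.Module):
    """Autoregressive LSTM decoder with a mixture-density head over relative 5D states."""

    def __init__(self, fused_dim=128, hidden=128, out_dim=5, n_mix=5, horizon=12):
        super().__init__()
        self.horizon, self.n_mix, self.out_dim = horizon, n_mix, out_dim
        self.init_h = nn.Linear(fused_dim, hidden)
        self.init_c = nn.Linear(fused_dim, hidden)
        self.cell = nn.LSTMCell(out_dim, hidden)
        self.head = nn.Linear(hidden, n_mix * (1 + 2 * out_dim))  # pi, mu, sigma

    def forward(self, fused, last_state):
        # fused: (B, fused_dim); last_state: (B, 5) last observed relative state.
        h, c = torch.tanh(self.init_h(fused)), torch.tanh(self.init_c(fused))
        x, outputs = last_state, []
        for _ in range(self.horizon):
            h, c = self.cell(x, (h, c))
            params = self.head(h)
            pi = torch.softmax(params[:, :self.n_mix], dim=-1)
            mu, log_sigma = params[:, self.n_mix:].chunk(2, dim=-1)
            mu = mu.view(-1, self.n_mix, self.out_dim)
            sigma = log_sigma.view(-1, self.n_mix, self.out_dim).exp()
            # Pick one mixture component per sample, then draw the next relative state.
            k = torch.multinomial(pi, 1).squeeze(-1)
            idx = k.view(-1, 1, 1).expand(-1, 1, self.out_dim)
            x = (mu + sigma * torch.randn_like(sigma)).gather(1, idx).squeeze(1)
            outputs.append(x)
        return torch.stack(outputs, dim=1)  # (B, horizon, 5): one sampled mode
```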
3.5. Loss Function
To address the challenges of pedestrian trajectory prediction in dynamic vehicle-mounted scenarios, we design a comprehensive loss function that jointly optimizes prediction accuracy, temporal coherence, and social compliance. The overall objective is formulated as a multi-task learning problem with the following components:
For multi-modal trajectory forecasting, we augment the displacement error with a diversity objective. The trajectory prediction loss combines a confidence-weighted displacement term with a diversity regularizer, where K denotes the number of hypothesis trajectories, w_k denotes the confidence weight for the k-th hypothesis, a regularization coefficient controls the diversity strength, and the Manhattan (L1) distance is used as the displacement measure. The first term enforces accuracy for each hypothesis; the second discourages redundant trajectories by penalizing high similarity between hypotheses.
To ensure physically plausible motion patterns and reduce high-frequency jitter in predicted trajectories, we introduce a smoothness constraint based on the second-order differences of the predicted positions.
In multi-pedestrian scenarios, we incorporate a collision penalty to encourage socially compliant trajectories, where d_safe is a safety threshold (typically set to 0.5–1.0 m), N is the number of pedestrians, and an indicator function activates the penalty. This term penalizes any case in which the minimum distance between two predicted trajectories falls below the safety threshold.
The complete loss function integrates all components through a weighted sum of the trajectory, smoothness, and collision terms, where the balancing coefficients λ_1, λ_2, and λ_3 are determined through cross-validation. This formulation enables the model to generate accurate, diverse, physically plausible, and socially compliant trajectory predictions, addressing the key requirements for autonomous driving applications in complex urban environments.
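Because the display equations are not reproduced above, the following LaTeX block gives one formulation consistent with the textual description; the symbols (w_k, lambda_div, lambda_1 to lambda_3, d_safe) and the exact normalizations are assumptions rather than the paper's definitions.

```latex
% A possible instantiation of the described loss terms (forms are assumptions):
\begin{align}
\mathcal{L}_{\mathrm{traj}}   &= \sum_{k=1}^{K} w_k \big\| \hat{Y}^{(k)} - Y \big\|_1
  \;-\; \lambda_{\mathrm{div}} \sum_{k \neq k'} \big\| \hat{Y}^{(k)} - \hat{Y}^{(k')} \big\|_1, \\
\mathcal{L}_{\mathrm{smooth}} &= \frac{1}{H-2} \sum_{t=2}^{H-1}
  \big\| \hat{p}_{t+1} - 2\hat{p}_{t} + \hat{p}_{t-1} \big\|_2^2, \\
\mathcal{L}_{\mathrm{coll}}   &= \sum_{i \neq j} \sum_{t=1}^{H}
  \mathbb{1}\!\left[ \big\| \hat{p}^{\,i}_{t} - \hat{p}^{\,j}_{t} \big\|_2 < d_{\mathrm{safe}} \right], \\
\mathcal{L} &= \lambda_{1}\,\mathcal{L}_{\mathrm{traj}}
  + \lambda_{2}\,\mathcal{L}_{\mathrm{smooth}}
  + \lambda_{3}\,\mathcal{L}_{\mathrm{coll}}.
\end{align}
```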
4. Experiments and Evaluation
4.1. Dataset and Experimental Setup
Experiments were conducted on two publicly available traffic scene datasets to evaluate the model’s generalization capability across diverse driving environments. The first dataset is JAAD (Joint Attention for Autonomous Driving) [
38], widely used in scene understanding and pedestrian behavior research. The second is the PIE (Pedestrian Intention Estimation) dataset [
39], a large-scale benchmark focusing on pedestrian intention and trajectory prediction.
The JAAD dataset contains approximately 346 short video clips captured from the driver’s perspective, documenting a variety of urban traffic scenes such as city streets and complex intersections under diverse illumination and weather conditions. Pedestrians are detected and tracked using the YOLOv10 network, yielding continuous trajectory sequences that are subsequently stabilized, coordinate-restored, and normalized to eliminate scale and alignment biases. Each input sequence includes eight observed frames followed by a twelve-frame prediction horizon (8→12 setup). The dataset is divided into 70% training, 15% validation, and 15% testing subsets. Specifically, the training, validation, and test sets comprise 242, 52, and 52 video clips, corresponding to 1750, 375, and 375 pedestrian trajectories, respectively. On average, each trajectory spans about 25 frames, and each clip contains roughly 530 frames recorded at 30 Hz. The average occlusion rate in the test subset is approximately 19.7%, reflecting frequent partial or full pedestrian occlusions caused by vehicles and surrounding infrastructure. All splits preserve similar background and pedestrian density distributions to avoid sampling bias. The model is optimized using stochastic gradient descent (SGD) with an initial learning rate of 0.001, trained for 200 epochs under fixed random seeds and identical preprocessing to ensure full reproducibility.
The PIE dataset consists of over six hours of continuous high-definition onboard video recorded at 30 Hz and segmented into 10-min clips. It contains approximately 300,000 annotated frames and 1842 pedestrian instances, together with structured contextual labels describing vehicles, road geometry, and traffic control elements such as lights, crosswalks, and signage. Following the same detection and preprocessing procedures as in JAAD, pedestrian trajectories are obtained using YOLOv10 for detection, SORT for multi-object tracking, and SIFT-based stabilization for ego-motion compensation. The dataset is divided into 70% training, 15% validation, and 15% testing subsets, corresponding to 420, 90, and 90 clips, with 1290, 275, and 277 pedestrian trajectories, respectively. Each trajectory contains on average about 31 frames, and each video clip includes roughly 18,000 frames. The average occlusion rate in the test subset is around 21.7%. The same optimization pipeline and hyperparameter settings as those used for JAAD are applied to ensure consistency and comparability across datasets.
4.2. Evaluation Metrics
To comprehensively evaluate the model’s performance in pedestrian trajectory prediction tasks, four standardized quantitative metrics are adopted: Average Displacement Error (ADE), Final Displacement Error (FDE), Trajectory Smoothness, and Collision Rate. These metrics collectively assess the model’s spatial accuracy, temporal consistency, and social feasibility, providing a multidimensional evaluation of prediction quality under dynamic vehicle-mounted perspectives.
4.2.1. Average Displacement Error (ADE)
ADE measures the mean Euclidean distance between the predicted and ground-truth trajectories over the entire prediction horizon. For each time step t, the L2 norm of the positional error is computed and averaged across all T steps:

ADE = \frac{1}{T} \sum_{t=1}^{T} \left\| \hat{p}_t - p_t \right\|_2 ,

where \hat{p}_t and p_t denote the predicted and ground-truth positions at time t, respectively, expressed in image-plane coordinates. Smaller ADE values reflect the proficiency of the model in maintaining high positional accuracy across multiple predicted steps.
4.2.2. Final Displacement Error (FDE)
FDE evaluates the positional error at the final prediction step, placing greater emphasis on long-term forecasting accuracy. It is defined as:

FDE = \left\| \hat{p}_T - p_T \right\|_2 ,

where T is the last prediction time step. Lower FDE values indicate the capability of the model to precisely anticipate the endpoint of the trajectory, an aspect particularly important for decision-making in autonomous driving scenarios.
4.2.3. Trajectory Smoothness
Trajectory smoothness measures the dynamic continuity and physical plausibility of the predicted trajectories. It is defined based on the second-order difference of the predicted sequence, which quantifies the mean squared change in acceleration:

Smoothness = \frac{1}{T-2} \sum_{t=2}^{T-1} \left\| \hat{p}_{t+1} - 2\hat{p}_t + \hat{p}_{t-1} \right\|_2^2 .
A lower Smoothness value indicates smoother motion transitions and trajectories that are more consistent with realistic physical dynamics, effectively preventing unnatural jitter or discontinuities in the predicted sequence.
4.2.4. Collision Rate
The Collision Rate evaluates the social plausibility of predicted trajectories in multi-pedestrian or multi-object scenarios. A potential collision is identified when the distance between any two predicted trajectories \hat{p}^{\,i}_t and \hat{p}^{\,j}_t at time step t falls below a predefined safety threshold d_{safe}:

CollisionRate = \frac{1}{N(N-1)} \sum_{i \neq j} \mathbb{1}\left[ \min_{t} \left\| \hat{p}^{\,i}_t - \hat{p}^{\,j}_t \right\|_2 < d_{safe} \right],

where \mathbb{1}[\cdot] denotes the indicator function, and d_{safe} is the safety distance threshold. A lower Collision Rate indicates that the predicted trajectories better adhere to social behavior constraints, avoiding unrealistic overlaps or collisions.
ADE and FDE jointly assess spatial accuracy, Smoothness evaluates motion continuity, and Collision Rate verifies social feasibility. Together, these four metrics quantify prediction quality from three complementary perspectives—geometric precision, dynamic smoothness, and social compliance—providing a unified evaluation standard for model comparison and improvement.
4.3. Ablation Studies
To quantitatively evaluate the contribution of each component in the proposed V-PTP-IC framework, systematic ablation studies were conducted on the JAAD and PIE datasets. Each variant was constructed by selectively removing or replacing one key module—such as SIFT-based stabilization, depth estimation, or scene feature extraction—to isolate its impact on prediction performance. All models were evaluated using Average Displacement Error (ADE), Final Displacement Error (FDE), and trajectory Smoothness, where lower values indicate better accuracy, long-term consistency, and motion plausibility, respectively.
Table 1 summarizes the ablation results on the JAAD dataset. The benchmark model is a baseline LSTM predictor without auxiliary components. The full V-PTP-IC model achieves the best performance across all metrics, with an ADE of 0.0596 m and an FDE of 0.1017 m.
The incremental integration of components reveals consistent performance gains. The introduction of SIFT-based stabilization reduces ADE by 34.4% compared to the SORT-only configuration, which is attributed to its effectiveness in compensating for ego-motion jitter and maintaining temporal coherence in pedestrian trajectories. The subsequent addition of depth estimation further lowers ADE by 27.2%, indicating that explicit depth cues aid in resolving scale ambiguity and improving localization under occlusion or at varying distances. The full model, incorporating all components, achieves the lowest error, underscoring the complementary roles of motion stabilization, 3D spatial reasoning, and scene-aware interaction modeling.
Component removal experiments further validate the importance of each module. Excluding SIFT stabilization leads to an 80.5% relative increase in ADE, confirming its critical role in handling camera-induced motion artifacts. Removing depth estimation raises ADE to 0.0744 m, illustrating that depth information supports more accurate spatial reasoning, particularly in discerning relative distances among traffic participants. Omitting scene features results in moderate degradation in both accuracy and smoothness, suggesting that semantic context aids in producing trajectories that are consistent with environmental structure and social norms.
In summary, the ablation study demonstrates that:
SIFT-based stabilization is essential for robust trajectory estimation under vehicle motion.
Depth estimation enhances 3D spatial awareness, reducing positional ambiguity.
Scene features contribute to contextual and socially compliant predictions.
The integration of all components enables spatially accurate, temporally stable, and contextually realistic trajectory forecasting in dynamic urban settings.
4.4. Comparative Experiments
To evaluate the effectiveness of the proposed V-PTP-IC framework, we conduct comparative experiments against several representative pedestrian trajectory prediction models. All models are trained and evaluated under identical experimental settings, using the same trajectory and scene feature preprocessing pipeline, dataset splits, and optimization parameters to ensure fairness in comparison.
The comparison models are as follows:
Vanilla LSTM: A simplified variant of the Social-LSTM model that removes the social pooling layer and treats all trajectories as independent of each other [
17].
LSTM Encoder–Decoder: An architecture employing an LSTM-based encoder to capture historical trajectory features and an LSTM-based decoder to predict future positions. This model improves upon the vanilla LSTM in sequence handling but still lacks explicit interaction or scene modeling [
40].
Social-LSTM: An extension of LSTM that incorporates a social pooling mechanism to model inter-pedestrian influences, thereby improving accuracy in multi-agent scenarios. This serves as a direct reference point for evaluating the impact of integrating dynamic scene features into social interaction modeling [
17].
Transformer: A self-attention-based sequence model capable of capturing global temporal dependencies across trajectory sequences. Unlike recurrent architectures, it efficiently models relationships between any two time steps, showing strong performance in long-range motion forecasting tasks [
41].
Social-LSTM + Dynamic Scene Fusion: An improved variant of the Social-LSTM architecture that integrates visual scene features extracted from the VGG19 network to provide environmental context. The model retains the standard social pooling mechanism to capture inter-pedestrian interactions, while concatenating the VGG19-based scene embeddings with trajectory features to achieve joint reasoning over social interactions and environmental semantics. Compared with the conventional Social-LSTM, this hybrid design enhances scene awareness and prediction accuracy, demonstrating the benefit of incorporating scene-level cues into social behavior modeling.
GCN + Dynamic Scene Fusion: A modified version of the Social-LSTM + Dynamic Scene Fusion model, in which the conventional social pooling mechanism is replaced by a GCN–based interaction module. This configuration explicitly models the relational dependencies among pedestrians and surrounding traffic participants through a graph structure, enabling structured message passing and context aggregation. The model preserves the use of dynamic scene features extracted from the VGG19 network, which are concatenated with trajectory embeddings to provide comprehensive environmental context. Although this model achieves higher prediction accuracy than Social-LSTM + Dynamic Scene Fusion due to its improved representation of inter-agent relationships, the use of VGG19-based panoramic scene feature extraction introduces considerable computational overhead, leading to longer training times. These results motivate the design of the proposed V-PTP-IC framework, which replaces heavy CNN-based scene encoding with a lightweight representation to achieve a more efficient balance between accuracy and computational cost.
As shown in
Table 2 and
Table 3, the proposed V-PTP-IC framework consistently outperforms all baseline models across both datasets, though the absolute performance on PIE is slightly lower than that on JAAD due to the inherent differences in dataset characteristics. Specifically, PIE focuses more on pedestrian intent and interactions with ego-vehicles rather than clear long-term trajectory annotation, resulting in fewer and less continuous trajectory samples. All experiments were conducted on a workstation equipped with an NVIDIA GeForce RTX 3050 GPU, ensuring a fair comparison of both predictive accuracy and computational efficiency. The inclusion of training time analysis further demonstrates the superior balance between performance and efficiency achieved by the proposed framework.
On the JAAD dataset, compared with the Vanilla LSTM, LSTM Encoder–Decoder, and Social-LSTM models, the proposed V-PTP-IC achieves relative improvements of 27.3%, 24.8%, and 20.0% in ADE, and 33.9%, 32.3%, and 28.5% in FDE, respectively. These results underscore the effectiveness of integrating dynamic scene features with the traffic object interaction graph network to enhance the accuracy of pedestrian trajectory prediction. Furthermore, compared with the Social-LSTM + Dynamic Scene Fusion model, V-PTP-IC attains an additional 6.4% and 10.7% improvement in ADE and FDE, respectively, indicating that the graph-based relational reasoning and lightweight feature extraction in V-PTP-IC provide a more comprehensive understanding of pedestrian motion within the surrounding traffic environment. Although the prediction accuracy improves significantly, V-PTP-IC exhibits a slight increase in training time (a 7.4% increase compared with the baseline), which is attributed to the additional computational overhead introduced by the extraction of traffic scene information.
On the PIE dataset, all models yield higher error values than those on JAAD due to increased motion complexity and the limited clarity of pedestrian trajectories. Nevertheless, V-PTP-IC consistently maintains superior performance, achieving 6.1% and 10.3% relative improvements in ADE and FDE, respectively, over the Social-LSTM + Dynamic Scene Fusion model. These consistent improvements across datasets confirm that the proposed framework generalizes well under varying levels of scene variability and spatio-temporal complexity.
As shown in
Table 4, V-PTP-IC achieves the lowest ADE (0.0596) and FDE (0.1017) on the JAAD dataset with only 235.8K parameters and 1.38M FLOPs per trajectory. Compared with Social-LSTM, our model reduces ADE by 20.0% and FDE by 28.4% while incurring merely 58% more parameters and 23% additional FLOPs. Against the tiny Transformer that uses 28% more parameters (302.1K) and 67% higher computational cost (2.31M FLOPs), V-PTP-IC delivers 19.2% lower ADE and 27.3% lower FDE, demonstrating significantly superior accuracy-efficiency tradeoff. These results indicate that V-PTP-IC leverages moderate parameter and computation budgets to push the Pareto frontier of pedestrian trajectory prediction, achieving state-of-the-art accuracy without excessive complexity overhead.
4.5. Qualitative Evaluation
4.5.1. Trajectory Stabilization Visualization
Figure 2 illustrates the qualitative comparison between raw and stabilized pedestrian trajectories. The original trajectory exhibits noticeable high-frequency oscillations and discontinuous jumps caused by ego-motion, detection noise, and temporary tracking failures. After applying SIFT-based stabilization, these distortions are substantially reduced, resulting in smoother and more temporally coherent trajectories. The static keypoints extracted from background structures serve as reliable geometric anchors, compensating for frame-to-frame motion drift induced by the moving camera. Consequently, the stabilized trajectories better align with the physical motion continuity of pedestrians and provide a more consistent representation for downstream modules. This improvement not only enhances visual interpretability but also leads to measurable gains in prediction stability, as smoother and spatially consistent inputs facilitate more accurate learning of spatio-temporal dependencies in subsequent interaction and trajectory prediction stages.
Figure 3 presents the visualization of static background keypoint correspondences between consecutive video frames. The left and right images show matched SIFT keypoints connected by colored lines, representing stable background features such as trees, buildings, and road signs. These correspondences form the geometric foundation for estimating global camera motion via affine transformation. By leveraging the motion of these static anchors, the system effectively separates ego-motion from object motion, compensating for frame-to-frame displacement caused by vehicle movement. As a result, the global motion field derived from these keypoints directly enables accurate trajectory stabilization, suppressing translation and rotational artifacts that would otherwise distort pedestrian motion paths. This process ensures that the trajectory prediction modules receive spatially coherent inputs, thereby improving both trajectory smoothness and physical realism in dynamic driving scenes.
4.5.2. Visualization of Trajectory Prediction
To further demonstrate the predictive performance and interpretability of the proposed framework, qualitative trajectory visualization experiments were conducted on both the JAAD and PIE datasets, as shown in
Figure 4. The blue curves represent the observed historical trajectories, the green lines denote the ground-truth future paths, the red curves indicate the predicted most probable mode (Mode 1), and the lighter-colored curves (Modes 2–5) correspond to alternative plausible predictions generated by the multimodal decoder.
The visualization results demonstrate two advantages of our model. First, it successfully captures the multimodal characteristics of pedestrian motion, generating diverse yet physically reasonable future trajectories that cover different potential walking directions and speed profiles. Second, the best prediction mode (Mode 1, red) shows strong consistency with the ground-truth trajectory, which validates the effectiveness of our temporal smoothness constraint. However, the visualization also reveals some limitations: in some complex crossing scenarios, the alternative modes (Modes 2–5) occasionally produce trajectories that deviate significantly from plausible walking behavior, indicating that there is still room for improvement in integrating stronger contextual understanding of the scene.
5. Conclusions
Our research proposes a novel end-to-end vehicle-centered framework, V-PTP-IC, for pedestrian trajectory prediction from a vehicle-mounted perspective, which jointly models stable pedestrian trajectories, scene semantics, and social interactions in driving environments. Pedestrian trajectory prediction from this perspective faces significant challenges, including trajectory jitter induced by camera ego-motion; frequent occlusions under dynamic viewpoints; and the lack of unified modeling for motion stabilization, multi-modal scene cues, and fine-grained multi-agent dependencies in uncertain real-world scenarios. To address these issues, V-PTP-IC integrates SORT-based tracking for robust tracklet generation, a SIFT-based static keypoint matching strategy to compensate for ego-motion inconsistencies, lightweight scene encoding to capture spatial regularities in dynamic scenes, and a graph-based interaction module to encode behavioral dependencies among traffic participants, thereby achieving robust trajectory estimation and generating geometrically accurate and socially compliant future paths. Furthermore, we mitigate the computational overhead and performance gaps of existing static- or global-view methods by leveraging lightweight scene features and attention-based fusion, enabling efficient processing of interaction dependencies without sacrificing prediction fidelity. Experimental results on the JAAD and PIE datasets indicate that V-PTP-IC outperforms baselines, with ADE reduced by 27.23% and 25.73% and FDE reduced by 33.88% and 32.85%, respectively, while maintaining low-latency inference suitable for real-time deployment.
From the perspective of vehicle-mounted observation, V-PTP-IC exhibits strong applicability for pedestrian trajectory prediction on traffic roads. However, precise trajectory prediction from a vehicle-mounted perspective remains challenging under adverse conditions such as low-light scenes or weather-induced occlusion, as well as for long-horizon forecasting in highly dynamic environments with limited temporal context. Future work will investigate multi-sensor fusion and Transformer-based sequential modeling to enhance environmental robustness and extend the temporal receptive field, alongside architecture optimizations for edge deployment to enable simultaneous multi-pedestrian prediction. These enhancements will further facilitate the integration of V-PTP-IC into safety-critical autonomous driving technologies.