OPTP-System: A Lightweight Pedestrian Trajectory Prediction System for Complex Occlusion Environments

Lin, Zijian; Huang, Hong; Zhang, Yirui; Zhao, Wenfeng

doi:10.3390/electronics15112247


by Zijian Lin, Hong Huang, Yirui Zhang and Wenfeng Zhao *

College of Artificial Intelligence and Low-Altitude Technology, South China Agricultural University, Guangzhou 510630, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(11), 2247; https://doi.org/10.3390/electronics15112247

Submission received: 15 March 2026 / Revised: 16 April 2026 / Accepted: 16 May 2026 / Published: 22 May 2026


Abstract

Pedestrian trajectory prediction in complex occlusion environments remains a critical challenge for autonomous driving systems. Although high-precision prediction models have achieved notable success, they often entail substantial computational overhead and struggle to maintain both accuracy and physical plausibility under real-world occluded conditions. To address these limitations, this paper proposes OPTP-System, a lightweight prediction framework that integrates YOLOv11 with DeepSORT for robust multi-pedestrian tracking in occluded scenes. An extended Kalman filter (EKF)-based motion prediction module is employed to generate trajectory forecasts, while the EKF-derived prior knowledge guides detection re-searching in occluded regions. Furthermore, feedback from trajectory smoothing refines detection confidence, substantially enhancing the model’s capability for continuous tracking and prediction under severe occlusion. Experimental results under challenging occlusion settings (exceeding 50% occlusion) show that the proposed model reduces ADE and FDE by 30.0% and 29.3%, respectively, compared to state-of-the-art methods. These findings demonstrate that OPTP-System achieves superior prediction accuracy while maintaining computational efficiency, offering a practical solution for reliable pedestrian trajectory prediction in complex traffic environments.

Keywords: pedestrian trajectory prediction; complex occlusion environment; Extended Kalman Filter; multi-target tracking

1. Introduction

Pedestrian trajectory prediction is a fundamental technology for autonomous driving and smart surveillance systems. Its aim is to predict the future intentions and movement trajectories of pedestrians [1]. The accuracy of these predictive models largely determines the safety, smooth operation, and intelligence of human–robot interaction systems [2]. One of the main challenges in real-world urban environments, however, is pedestrian occlusion. Specifically, mutual occlusion among pedestrians interrupts trajectory continuity and significantly degrades the performance of existing prediction methods. Occlusion has therefore become a major bottleneck hindering the practical deployment of this technology.

Current multi-target trajectory prediction techniques fall broadly into two categories. The first consists of physics-based models, of which the Kalman filter (KF) is a prime example. For instance, Wang et al. [3] used an interactive multi-modal KF for ship trajectory prediction, taking the interaction among agents into account. Similarly, the trajectories of large objects have been modeled by incorporating rigorous kinematic equations. Dai et al. [4] extended the Kalman filter to pedestrian trajectory prediction and proposed a zero-velocity update in the process. Under stable or linear motion assumptions, the method showed high computational efficiency and predictive accuracy in short-term prediction. Chen et al. [5] proposed an adaptive Kalman filter for pedestrian trajectory inference to improve prediction performance in complicated scenes; adjusting the filter parameters according to the pedestrian detections improves adaptability to different gaits and extends the effective prediction horizon. Because physics-based methods rely on kinematic equations for state recursion, they are computationally efficient for short-term prediction and preserve the smoothness and physical plausibility of the trajectories. However, without effective observations, the error of a conventional Kalman filter grows rapidly during prolonged occlusion. These methods also struggle to model nonlinear, intent-driven pedestrian behavior such as obstacle avoidance and turning.

The second category encompasses data-driven approaches, with deep learning models at their core, which learn complex motion patterns and social interactions directly from historical trajectory data, thereby exhibiting superior capability in modeling nonlinear motion. For instance, Wang and Du [6,7] proposed a graph neural network-based prediction method specifically designed to model social interactions among pedestrians. In a seminal work, Li et al. [8] introduced the TF-MI network, which marked the first application of the Transformer encoder to pedestrian trajectory prediction. By fusing temporal-frequency features of human pose with multi-scale interaction information, their work revealed the critical role of pose dynamics in inferring motion intent. Beyond these, a wide array of other architectures, including recurrent neural networks (RNNs), graph neural networks (GNNs), and Transformers, have been extensively explored in this domain [9,10,11,12,13]. However, these data-driven methods are often heavily reliant on the statistical distributions present in the training data and lack explicit constraints based on physical laws of motion. This can lead to the generation of trajectories that violate basic motion common sense, such as discontinuous displacements or abrupt velocity jumps. Furthermore, in scenarios with severe occlusion, these methods struggle to extract reliable pose and interaction information, which is essential for accurate tracking and intent inference, consequently hindering their prediction precision.

Relationship with existing YOLOv11 + DeepSORT works. Recently, Liu et al. [14] explored the combination of YOLOv11 and DeepSORT, proposing a lightweight adaptive object tracking algorithm that achieves good tracking performance on the MOT16/MOT20 datasets. Different from Liu et al. [14], this paper focuses on pedestrian trajectory prediction under extreme occlusion, rather than generic object tracking. The core contributions of this paper—the bidirectional tracking–prediction feedback mechanism and the stratified evaluation across occlusion levels—are not present in Liu et al. [14]. To the best of our knowledge, this work is the first to introduce a bidirectional feedback mechanism into the YOLOv11 + DeepSORT framework specifically for trajectory prediction under extreme occlusion.

To address the aforementioned challenges, this paper proposes a lightweight pedestrian trajectory prediction framework specifically designed for crowded urban scenes. Diverging from the prevalent trend of pursuing model complexity, we revisit foundational models and construct an efficient and robust prediction system through deep collaborative optimization of YOLOv11, DeepSORT, and the Extended Kalman Filter (EKF). The core novelty of this work lies in its systematic design to tackle occlusion: we not only enhance the performance of individual modules under occlusion but, more importantly, establish a bidirectional feedback mechanism between them. This enables the system to maintain reliable trajectory prediction, even in complex occlusion scenarios. The main contributions of this paper are summarized as follows:

(1) Deeply fused detection-tracking-prediction pipeline. We propose an integrated pipeline where the detection stability of YOLOv11 is improved, the identity preservation capability of DeepSORT under occlusion is enhanced, and a short-term memory-based motion continuation mode is designed for EKF. This synergistic approach significantly boosts the overall system robustness in occlusion-heavy environments.

(2) Bidirectional information flow between modules. We develop a novel bidirectional feedback loop: the EKF prediction results guide the re-detection of occluded targets, and the smoothness of the tracked trajectories is used to refine the confidence scores of detection boxes. This cooperation limits the error accumulation caused by occlusion.

(3) Under severe occlusion exceeding 50%, the proposed model achieves 30.0% and 29.3% reductions in ADE and FDE, respectively, compared to state-of-the-art methods, while requiring only 6.6 GFLOPs. This lightweight architecture substantially improves deployment efficiency on automotive embedded platforms. Experimental results demonstrate that the method maintains high prediction accuracy while delivering superior real-time tracking performance and system stability.

2. Materials and Methods

2.1. Pedestrian Detection Framework Based on YOLOv11

YOLOv11, the most recent version of the framework, was released by Ultralytics on 30 September 2024 [15]. It is a unified framework for five computer vision tasks: object detection, instance segmentation, image classification, pose estimation, and oriented object detection. Through architectural optimizations of core components, along with better design choices and training methods, this version achieves substantial improvements in feature recognition accuracy and inference efficiency. Notably, YOLOv11 attains a higher mean Average Precision (mAP) on COCO at a lower computational cost. Beyond these performance gains, its improved usability eases integration and deployment in downstream applications [15].

YOLOv11 retains the classic “Backbone–Neck–Head” three-stage architecture characteristic of the YOLO series, while introducing key upgrades in each component. In the backbone network, YOLOv11 builds upon the foundation of YOLOv8 [1] with further optimizations. Notably, it extensively adopts the CSPNet (Cross Stage Partial Networks) design philosophy [15], which is manifested in the model as the C3k2 module. This module achieves richer gradient flow information by partitioning feature maps and performing partial fusion across different stages, thereby enhancing its feature representation capability while reducing computational cost. It employs the SiLU activation function, which generally provides smoother optimization landscapes and better performance than traditional ReLU in deep models.

In the neck, YOLOv11 utilizes an enhanced version of PAN-FPN (Path Aggregation Network–Feature Pyramid Network), which implements both top-down and bottom-up feature fusion pathways. Through this structure, the neck outputs three strong semantic feature maps at different scales to the detection head, enabling the detection of small, medium, and large targets, respectively [16,17,18,19,20]. The overall architecture of YOLOv11 is illustrated in Figure 1.

In order to achieve accurate prediction of pedestrian trajectories, target detection and positioning are of great significance. This study uses the YOLOv11 visual model for multi-target detection, which can effectively improve the accuracy of the data and lay a solid foundation for subsequent time series modeling and prediction work. The specific implementation method is as follows.

For each frame of the input video, YOLOv11 detects pedestrians and outputs a set of bounding boxes, denoted as $D_t = \{d_t^1, d_t^2, \dots, d_t^n\}$. Each bounding box $d_t^i$ contains the spatial coordinates $(x_t^i, y_t^i, \omega_t^i, h_t^i)$ of the $i$-th object in the image, along with its confidence score $s_t^i$ and class label $c_t^i$. To eliminate unreliable detections, we apply a predefined confidence threshold $s_m$ and retain only those high-confidence bounding boxes that are classified as “pedestrians”. These retained boxes serve as the candidate targets for the current frame.
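The thresholding step can be illustrated with a minimal sketch. The `Detection` container, the function name `filter_pedestrians`, and the default threshold `s_m = 0.5` are assumptions made for illustration; the paper does not report the value of $s_m$ or the data structures used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    x: float      # box centre x (pixels)
    y: float      # box centre y (pixels)
    w: float      # box width (pixels)
    h: float      # box height (pixels)
    score: float  # confidence score s_t^i
    label: str    # class label c_t^i


def filter_pedestrians(detections: List[Detection], s_m: float = 0.5,
                       target_label: str = "pedestrian") -> List[Detection]:
    """Keep detections of the target class whose confidence is at least s_m."""
    return [d for d in detections
            if d.label == target_label and d.score >= s_m]


# Hypothetical detections for one frame.
frame = [
    Detection(320, 240, 40, 110, 0.91, "pedestrian"),
    Detection(100, 200, 35, 90, 0.32, "pedestrian"),  # below the threshold
    Detection(500, 260, 120, 80, 0.88, "car"),         # wrong class
]
candidates = filter_pedestrians(frame)
```

Only the first box survives the filter and becomes a candidate target for the tracker.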

Pedestrian detection in complex scenes is often challenged by adverse factors such as varying illumination and the presence of small pedestrian targets, which significantly complicate feature extraction. To address these challenges, we employ feature fusion techniques to enhance the model’s capability in perceiving fine-grained and multi-scale targets [16].

2.2. Improved YOLOv11-DeepSORT Tracker for Occlusion Robustness

Under complicated conditions, YOLO-based detectors alone have difficulty maintaining consistent target identities across frames and producing coherent historical trajectories [17,20]. To enable accurate multi-target localization and stable tracking, we adopt an algorithm fusion strategy that combines the detection ability of YOLOv11 with the association tracking ability of DeepSORT. YOLOv11 performs high-accuracy per-frame target localization, while DeepSORT associates targets across frames and generates continuous trajectories with unique ID labels. The combination exploits the strengths of both algorithms, yielding a multi-pedestrian detection and tracking system with strong real-time performance and high robustness.

The integration of YOLOv11 and DeepSORT essentially forms a cascade of detection and tracking modules. As the detector, YOLOv11 independently identifies all pedestrian targets in each video frame and outputs a set of detection bounding boxes (refer to (1)). Each detection box contains the pixel coordinates of its center $(x_t^i, y_t^i)$, its width and height $(\omega_t^i, h_t^i)$, and a detection confidence score $s_t^i$.

Since frame-independent detection results alone cannot provide temporal continuity for targets, we employ DeepSORT as the tracker to establish correspondences between detected targets across adjacent frames and maintain an independent tracking trajectory for each target (refer to (2)). To achieve this, DeepSORT relies on a two-stage association mechanism: motion-based association and appearance-based association.

First, for each existing trajectory $T^k$, DeepSORT maintains a Kalman filter based on a constant-velocity linear motion model. This filter predicts the target’s position at time $t$ (denoted as a predicted bounding box $d_t^k$) based on the trajectory’s state at time $t-1$. Subsequently, the Mahalanobis distance between all detection boxes $D_t$ and all predicted boxes $d_t^k$ is computed to measure their compatibility in terms of motion consistency. The Mahalanobis distance is formulated as shown in (3).

$$D_t = \{d_t^1, d_t^2, \dots, d_t^n\} \tag{1}$$

$$T^k = \{d_1^k, d_2^k, \dots\} \tag{2}$$

$$d^{(1)}(i, j) = (d_j - d_i)^{T} S_i^{-1} (d_j - d_i) \tag{3}$$

where $d_i$ and $d_j$ denote the position vectors of the predicted box and the detection box, respectively, and $S_i$ is the covariance matrix of the Kalman filter’s predicted state.

Equation (3) defines the Mahalanobis distance, which is a standard metric for motion association in the DeepSORT algorithm. Unlike Euclidean distance, the Mahalanobis distance accounts for prediction uncertainty: when the prediction uncertainty is high (i.e., $S_i$ is large), the same Euclidean distance yields a smaller Mahalanobis distance, making the association more permissive. This mechanism is widely used in object tracking [21,22].
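The effect of the covariance on Equation (3) can be checked numerically. The following is a minimal sketch for 2-D position vectors; the function name `mahalanobis_sq` and the gate value (the 95% chi-square quantile for two degrees of freedom) are assumptions for illustration, since the paper does not state the threshold it uses.

```python
from typing import Sequence

# 95% quantile of the chi-square distribution with 2 degrees of freedom.
# Using it as the gate is an assumption, not a value taken from the paper.
CHI2_95_2DOF = 5.9915


def mahalanobis_sq(pred: Sequence[float], det: Sequence[float],
                   S: Sequence[Sequence[float]]) -> float:
    """Equation (3) for 2-D vectors: (d_j - d_i)^T S^{-1} (d_j - d_i)."""
    dx, dy = det[0] - pred[0], det[1] - pred[1]
    (a, b), (c, d) = S
    det_S = a * d - b * c
    if det_S <= 0.0:
        raise ValueError("S must be positive definite")
    # Closed-form 2x2 inverse: S^{-1} = [[d, -b], [-c, a]] / det_S
    return (dx * (d * dx - b * dy) + dy * (a * dy - c * dx)) / det_S


pred, det = (0.0, 0.0), (3.0, 4.0)                       # Euclidean distance 5
confident = mahalanobis_sq(pred, det, [[1.0, 0.0], [0.0, 1.0]])
uncertain = mahalanobis_sq(pred, det, [[9.0, 0.0], [0.0, 9.0]])
```

With the small covariance the pair is rejected by the gate, while the larger covariance shrinks the distance for the same pixel offset and lets the association through.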

Appearance-based association, powered by a Re-Identification (Re-ID) network, addresses the challenge of identity preservation under complex scenarios such as occlusion and pedestrian intersection, particularly when the motion model fails (e.g., during sudden turns or pauses). To this end, we introduce a second association criterion [21,22]. Specifically, a pre-trained deep Re-ID network is employed to extract a high-dimensional appearance feature vector $f_t^j$ from each detection box $d_t^j$. This feature vector is normalized such that $\|f_t^j\| = 1$. Similarly, for each existing trajectory $T^k$, we maintain a gallery of its $L$ most recent appearance feature vectors. The cosine distance between the detection feature and each feature in the trajectory gallery is then computed, and the minimum value is taken as the appearance affinity metric. Formally, the appearance-based distance measure $d^{(2)}(i, j)$ is defined as

$$d^{(2)}(i, j) = \min \{ 1 - \cos(f_t^j, f_{t-1}^k) \} \tag{4}$$
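Equation (4) can be sketched as a minimum over the trajectory gallery. The helper names and the three-dimensional toy features below are hypothetical; real Re-ID embeddings are much higher-dimensional, and the gallery length $L$ is not specified in the text.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero feature vector cannot be normalized")
    return [x / norm for x in v]


def appearance_distance(det_feature: Sequence[float],
                        gallery: Sequence[Sequence[float]]) -> float:
    """Smallest cosine distance between a detection and a track's gallery."""
    f = l2_normalize(det_feature)
    return min(
        1.0 - sum(a * b for a, b in zip(f, l2_normalize(g)))
        for g in gallery
    )


# Toy gallery holding two past appearance features of one trajectory.
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
same = appearance_distance([2.0, 0.0, 0.0], gallery)       # matches entry 0
different = appearance_distance([0.0, 0.0, 1.0], gallery)  # orthogonal to both
partial = appearance_distance([1.0, 1.0, 0.0], gallery)
```

Taking the minimum means a detection only needs to resemble one stored view of the pedestrian, which helps when appearance changes after partial occlusion.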

The fusion process of the proposed system operates as an iterative “prediction–association–update” cycle, as illustrated in Figure 2. The procedure consists of the following steps:

Detection. For each input video frame $I_t$, the YOLOv11 model performs forward inference to generate a set of high-confidence detection results $D_t$.

Prediction. For all currently active tracking trajectories $T_{t-1}$, their respective Kalman filters predict the state (position and velocity) at time $t$, yielding a set of predicted bounding boxes $D_t'$.

Association. After prediction, motion-based and appearance-based associations are performed sequentially:

Stage 1 (motion association): The Mahalanobis distance $d^{(1)}$ is computed between detection boxes $D_t$ and predicted boxes $D_t'$ for initial matching. A Mahalanobis distance threshold is applied to eliminate unlikely associations.

Stage 2 (appearance association): For detections and trajectories that remain unmatched after the first stage, the cosine distance $d^{(2)}$ is used to construct a second cost matrix for secondary association.

Optimal assignment: The Hungarian algorithm [23] is then applied to the combined cost matrix to obtain the optimal matching between detection boxes and existing trajectories.

Trajectory update. Based on the association results, trajectories are updated as follows: successfully matched trajectories are refined using their corresponding detection boxes $d_t^j$ as observations; unmatched detection boxes are initialized as new tracking trajectories; and unmatched trajectories are temporarily marked as “lost” and will be re-associated in subsequent frames, with termination occurring if the loss persists for multiple consecutive frames.
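The assignment and trajectory-update steps can be sketched on a small example. To stay dependency-free, the sketch uses an exhaustive search in place of the Hungarian algorithm (it returns the same minimum-cost matching for tiny matrices but does not scale). The gate `GATE`, the limit `MAX_MISSES`, and the bookkeeping dictionary are assumptions, not values or structures from the paper.

```python
from typing import Dict, List, Set, Tuple

GATE = 3.0       # hypothetical gate on the combined association cost
MAX_MISSES = 30  # hypothetical number of lost frames before termination


def best_matching(cost: List[List[float]],
                  gate: float = GATE) -> List[Tuple[int, int]]:
    """Minimum-cost track/detection matching by exhaustive search.

    Pairs costing more than `gate` are forbidden; an unmatched track is
    charged `gate`, so a match is preferred whenever one is admissible.
    """
    n = len(cost)
    m = len(cost[0]) if cost else 0
    best: Tuple[float, List[Tuple[int, int]]] = (float("inf"), [])

    def search(i: int, used: Set[int], total: float,
               pairs: List[Tuple[int, int]]) -> None:
        nonlocal best
        if i == n:
            if total < best[0]:
                best = (total, list(pairs))
            return
        search(i + 1, used, total + gate, pairs)  # leave track i unmatched
        for j in range(m):
            if j not in used and cost[i][j] <= gate:
                used.add(j)
                pairs.append((i, j))
                search(i + 1, used, total + cost[i][j], pairs)
                pairs.pop()
                used.remove(j)

    search(0, set(), 0.0, [])
    return best[1]


def update_tracks(misses: Dict[int, int], track_ids: List[int], n_dets: int,
                  pairs: List[Tuple[int, int]], next_id: int) -> int:
    """Reset matched tracks, age lost ones, start tracks for new detections."""
    matched_rows = {i for i, _ in pairs}
    matched_cols = {j for _, j in pairs}
    for row, tid in enumerate(track_ids):
        if row in matched_rows:
            misses[tid] = 0              # refined with its detection
        else:
            misses[tid] += 1             # marked as lost for this frame
            if misses[tid] > MAX_MISSES:
                del misses[tid]          # terminated after persistent loss
    for j in range(n_dets):
        if j not in matched_cols:
            misses[next_id] = 0          # unmatched detection starts a track
            next_id += 1
    return next_id


# Two existing tracks (IDs 7 and 8) and three detections in the new frame.
track_ids = [7, 8]
misses = {7: 0, 8: 0}
cost = [[0.4, 2.5, 8.0],
        [9.0, 7.5, 6.0]]
pairs = best_matching(cost)
next_id = update_tracks(misses, track_ids, 3, pairs, next_id=9)
```

Track 7 is matched to detection 0, track 8 is marked as lost for one frame, and the two remaining detections start new tracks 9 and 10.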

2.3. Prediction Method Based on Extended Kalman Filter

The trajectory prediction module is built on the Extended Kalman Filter (EKF). Using a state-space model of pedestrian motion, the EKF improves the accuracy of position prediction for moving targets and smooths the resulting trajectory [4]. We first construct a nonlinear motion model that accounts for time-dependent dynamics such as direction and speed. Under this model, the pedestrian state vector contains position and velocity information.

From image pixels to Kalman state vector. The output of a depth camera is an image sequence. To apply Kalman filtering for trajectory prediction, we first convert the pedestrian bounding boxes detected by YOLOv11 into observation vectors for the Kalman filter. Specifically, for each frame, YOLOv11 outputs a bounding box for each pedestrian. We take the bottom center of the bounding box as the contact point between the pedestrian and the ground, denoted as pixel coordinates $(u, v)$. Subsequently, we use our implemented “Perspective Transformer” class to convert the pixel coordinates to metric coordinates $(p_x, p_y)$ on the real-world ground plane. The conversion consists of two steps: (1) global scale factor calibration, which determines the pixel-to-meter ratio using reference points; and (2) perspective correction, which dynamically adjusts the scale factor based on the vertical position of the target in the image to eliminate perspective distortion (see Section 3.1 for details). Finally, the observation vector of the Kalman filter is defined as $z_k = [p_x, p_y]^T$, and the state vector $x_k$ contains position and velocity information, as shown in (5).

$$x_k = [p_x, p_y, v, \theta]^T \tag{5}$$

where $p_x$, $p_y$ denote the pedestrian’s 2D coordinates, $v$ represents the instantaneous velocity magnitude, and $\theta$ is the direction angle of motion.

This model is more suitable for describing pedestrian turning behaviors compared to the Constant Velocity (CV) model. The nonlinear state transition process of the system can be expressed as follows:

$$x_k = f(x_{k-1}, u_{k-1}) + w_{k-1} \tag{6}$$

where $f(\cdot)$ is the nonlinear state transition function, $u_{k-1}$ denotes the control input (set to zero in this study), and $w_{k-1}$ represents the process noise. The process noise is assumed to follow a zero-mean Gaussian distribution with covariance matrix $Q_{k-1}$, i.e., $w_{k-1} \sim \mathcal{N}(0, Q_{k-1})$.

Setting of control input and process noise. The control input $u_{k-1}$ in Equation (6) is set to zero because pedestrians have no external control signals available (e.g., acceleration commands or steering angles). Pedestrian motion is driven by intent rather than known control inputs. In the absence of such information, the control term is omitted (set to zero), and all motion variations are absorbed into the process noise $w_{k-1}$. This is standard practice in pedestrian trajectory prediction.

The process noise $w_{k-1}$ follows a zero-mean Gaussian distribution, $w_{k-1} \sim \mathcal{N}(0, Q_{k-1})$. Its covariance matrix $Q_{k-1}$ is empirically tuned via grid search on a validation set. We search over the diagonal entries of $Q$ with the objective of minimizing the Average Displacement Error (ADE) on the validation set. The optimal values are determined as follows:

$$Q = \mathrm{diag}([0.1, 0.1, 0.05, 0.0]) \tag{7}$$

corresponding to the noise variances for position $(p_x, p_y)$, velocity $v$, and heading angle $\theta$, respectively. These values remain fixed during inference. For scenarios with highly unpredictable motion, an adaptive $Q$ could be used, but we found fixed tuning sufficient for our datasets.

Taking a simplified nonlinear motion model as an example, if we assume that the velocity magnitude v and direction angle θ remain approximately constant over a short time interval Δt, the state transition function can be expressed as follows:

$$f(x_{k-1}) = \begin{bmatrix} p_x^{(k-1)} + v^{(k-1)} \cdot \Delta t \cdot \sin(\theta^{(k-1)}) \\ p_y^{(k-1)} + v^{(k-1)} \cdot \Delta t \cdot \cos(\theta^{(k-1)}) \\ v^{(k-1)} \\ \theta^{(k-1)} \end{bmatrix} \tag{8}$$
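As a minimal sketch, the transition function of Equation (8) can be applied repeatedly to extrapolate a trajectory when no detections arrive. The function names, the initial state, and the time step (1/25 s, matching the standardized frame rate mentioned in the experimental setup) are assumptions for illustration.

```python
import math
from typing import List


def transition(state: List[float], dt: float) -> List[float]:
    """One step of Equation (8): constant speed v and heading theta."""
    px, py, v, theta = state
    return [
        px + v * dt * math.sin(theta),
        py + v * dt * math.cos(theta),
        v,
        theta,
    ]


def coast(state: List[float], dt: float, n_steps: int) -> List[List[float]]:
    """Propagate the state for n_steps frames without any observation."""
    path = []
    for _ in range(n_steps):
        state = transition(state, dt)
        path.append(state)
    return path


# Hypothetical pedestrian at the origin, walking at 1.5 m/s with heading pi/2.
path = coast([0.0, 0.0, 1.5, math.pi / 2], dt=0.04, n_steps=25)
final_px, final_py = path[-1][0], path[-1][1]
```

After 25 frames (one second) the pedestrian has moved about 1.5 m along $p_x$ and essentially not at all along $p_y$. Because $p_x$ uses the sine term, $\theta$ in this formulation is measured from the $p_y$ axis.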

The system obtains observation data via the improved YOLOv11 detector, specifically the center coordinates of the detected bounding boxes. Let $z_k$ denote the observation vector at time $k$. The relationship between the observation $z_k$ and the state $x_k$ can be expressed as follows:

$$z_k = h(x_k) + v_k \tag{9}$$

where $h(\cdot)$ is the nonlinear observation function, and $v_k$ represents the observation noise. The observation noise is assumed to follow a zero-mean Gaussian distribution with covariance matrix $R_k$, i.e., $v_k \sim \mathcal{N}(0, R_k)$.

Calibration of observation noise covariance matrix. The observation noise $v_k$ follows a zero-mean Gaussian distribution $v_k \sim \mathcal{N}(0, R_k)$, where the covariance matrix $R_k$ quantifies the uncertainty of the YOLOv11 detector output. In this paper, we determine $R_k$ as a constant diagonal matrix via offline calibration. First, we run the YOLOv11 detector on a validation video sequence with manually annotated ground-truth pedestrian positions; then, for each frame, we compute the residual between the detected bounding box center $(u_k, v_k)$ and the ground-truth center $(u_k^{gt}, v_k^{gt})$; next, we calculate the empirical variance of these residuals across all frames and all pedestrians as $\sigma_u^2 = \mathrm{Var}(u_k - u_k^{gt})$ and $\sigma_v^2 = \mathrm{Var}(v_k - v_k^{gt})$; finally, we set $R_k = \mathrm{diag}([\sigma_u^2, \sigma_v^2])$. On the ETH/UCY dataset, we obtain $\sigma_u^2$ = 3.2 pixels and $\sigma_v^2$ = 2.8 pixels. It should be noted that $R_k$ is not adapted online in our current implementation. However, in the bidirectional feedback mechanism (Section 2.4), we propagate tracking uncertainty from DeepSORT to adjust the Kalman gain of the EKF, which to some extent achieves dynamic compensation for observation noise.
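The offline calibration procedure amounts to computing per-axis residual variances. The sketch below assumes the population variance (`statistics.pvariance`); the paper does not say whether the population or sample estimator was used, and the function name and toy coordinates are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def calibrate_R(detected: Sequence[Point],
                ground_truth: Sequence[Point]) -> List[List[float]]:
    """Build R = diag(var_u, var_v) from detection/ground-truth residuals."""
    if len(detected) != len(ground_truth) or len(detected) < 2:
        raise ValueError("need at least two matched detection/annotation pairs")
    res_u = [d[0] - g[0] for d, g in zip(detected, ground_truth)]
    res_v = [d[1] - g[1] for d, g in zip(detected, ground_truth)]
    return [[pvariance(res_u), 0.0],
            [0.0, pvariance(res_v)]]


# Toy data: four detections of a target annotated at (100, 50).
detected = [(101.0, 50.0), (99.0, 52.0), (102.0, 48.0), (98.0, 50.0)]
ground_truth = [(100.0, 50.0)] * 4
R = calibrate_R(detected, ground_truth)
```

Here the residuals along $u$ are 1, -1, 2, -2 and along $v$ are 0, 2, -2, 0, giving variances of 2.5 and 2.0 on the diagonal of $R$.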

During algorithm execution, the Kalman filter iteratively performs two steps: prediction and update. In the prediction phase, based on the optimal posterior estimate $x_{k-1|k-1}$ and its error covariance $P_{k-1|k-1}$ from the previous time step, the algorithm predicts the prior state estimate $x_{k|k-1}$ and its error covariance $P_{k|k-1}$ for the current time step, using the state transition function. Specifically, the prior state estimate is computed by propagating the posterior estimate through the nonlinear state transition function, while the error covariance is propagated via the Jacobian matrix. The prediction equations are given in (10) and (11), with the Jacobian matrix defined in (12):

$$x_{k|k-1} = f(x_{k-1|k-1}, u_{k-1}) \tag{10}$$

$$P_{k|k-1} = F_{k-1} P_{k-1|k-1} F_{k-1}^{T} + Q_{k-1} \tag{11}$$

$$F_{k-1} = \left. \frac{\partial f}{\partial x} \right|_{x = x_{k-1|k-1}} \tag{12}$$

For the state transition model expressed in (8), the corresponding Jacobian matrix is derived as follows:

$$F = \begin{bmatrix} 1 & 0 & \Delta t \sin\theta & v \Delta t \cos\theta \\ 0 & 1 & \Delta t \cos\theta & -v \Delta t \sin\theta \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{13}$$
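The analytic Jacobian in (13) can be sanity-checked against a finite-difference approximation of the transition function. This is an illustrative check under assumed values for the state and time step, not part of the authors' pipeline.

```python
import math
from typing import List

Matrix = List[List[float]]


def transition(state: List[float], dt: float) -> List[float]:
    """Transition model with p_x using sin(theta) and p_y using cos(theta)."""
    px, py, v, theta = state
    return [px + v * dt * math.sin(theta),
            py + v * dt * math.cos(theta),
            v,
            theta]


def jacobian(state: List[float], dt: float) -> Matrix:
    """Analytic Jacobian of `transition` with respect to (p_x, p_y, v, theta)."""
    _, _, v, theta = state
    s, c = math.sin(theta), math.cos(theta)
    return [[1.0, 0.0, dt * s, v * dt * c],
            [0.0, 1.0, dt * c, -v * dt * s],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def numeric_jacobian(state: List[float], dt: float, eps: float = 1e-6) -> Matrix:
    """Forward-difference approximation, built one state component at a time."""
    base = transition(state, dt)
    columns = []
    for j in range(4):
        bumped = list(state)
        bumped[j] += eps
        moved = transition(bumped, dt)
        columns.append([(moved[i] - base[i]) / eps for i in range(4)])
    # columns[j][i] holds d f_i / d x_j; transpose into row-major form.
    return [[columns[j][i] for j in range(4)] for i in range(4)]


state = [2.0, -1.0, 1.4, 0.6]   # hypothetical state
F_exact = jacobian(state, dt=0.04)
F_numeric = numeric_jacobian(state, dt=0.04)
max_error = max(abs(F_exact[i][j] - F_numeric[i][j])
                for i in range(4) for j in range(4))
```

The two matrices agree to within the finite-difference error, which confirms the signs of the $\theta$ column in particular.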

In the update phase, the algorithm fuses the current observation (specifically, the center coordinates of the bounding boxes generated by the YOLOv11 detector) with the predicted state. The Kalman gain is computed to balance the credibility between the prediction and the observation, yielding the optimal posterior state estimate at the current time step. This process is formulated in (14)–(17). Here, $H_k$ denotes the Jacobian matrix of the observation function $h(\cdot)$ evaluated at the prior state estimate $x_{k|k-1}$, as defined in (15). The state update and covariance update are then performed according to (16) and (17), as follows:

$$K_k = P_{k|k-1} H_k^{T} (H_k P_{k|k-1} H_k^{T} + R_k)^{-1} \tag{14}$$

$$H_k = \left. \frac{\partial h}{\partial x} \right|_{x = x_{k|k-1}} \tag{15}$$

$$x_{k|k} = x_{k|k-1} + K_k (z_k - h(x_{k|k-1})) \tag{16}$$

$$P_{k|k} = (I - K_k H_k) P_{k|k-1} \tag{17}$$

This process not only effectively compensates for missing observations during short-term occlusion by providing continuous position predictions, but also adaptively responds to sudden changes in pedestrian motion (e.g., acceleration and turning) through dynamic adjustment of the covariance matrices. As a result, it achieves a favorable balance between trajectory smoothness and real-time prediction, laying a solid foundation for subsequent data association and trajectory management. Furthermore, by incorporating process noise and observation noise covariance matrices, the Kalman Filter effectively quantifies the impact of model uncertainties and measurement errors on the prediction results, accounting for the various sources of uncertainty inherent in real-world scenarios.

Once the observation model is established, the EKF performs recursive estimation of the motion state by locally linearizing the nonlinear functions and following the standard Kalman filtering framework. The core steps of this recursive process are illustrated in Figure 3.

Recursive computation steps of EKF. Given the nonlinear state transition function $f(\cdot)$ and observation function $h(\cdot)$, the EKF performs the following recursive steps at each time step $k$. Prediction step: first, predict the current state based on the previous posterior estimate (10); then, compute the state transition Jacobian matrix (12); and finally, predict the covariance (11). Update step (when observation $z_k$ is available): first, compute the observation Jacobian matrix (15); then, compute the innovation covariance and the Kalman gain (14); next, compute the innovation; and finally, update the state (16) and the covariance (17). When no observation is available for a trajectory (e.g., due to occlusion), the update step is skipped, and the predicted state is used as the current estimate, which allows the filter to maintain a trajectory for a limited number of frames, even when the target is temporarily lost.
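The recursion can be sketched in plain Python as a single function that skips the update when no observation is available. The sketch assumes that $h(\cdot)$ simply returns the position, so that $H_k$ is the constant matrix `H` below; the value of `R`, the initial state, and the helper names are likewise assumptions. `Q` follows Equation (7).

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]

# Assumption: h(x) returns (p_x, p_y), so its Jacobian H is constant.
H: Matrix = [[1.0, 0.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A: Matrix) -> Matrix:
    return [list(col) for col in zip(*A)]


def add(A: Matrix, B: Matrix) -> Matrix:
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def inv2(M: Matrix) -> Matrix:
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def ekf_step(x: List[float], P: Matrix, z: Optional[List[float]], dt: float,
             Q: Matrix, R: Matrix) -> Tuple[List[float], Matrix]:
    # Prediction, Equations (10)-(12).
    px, py, v, th = x
    s, c = math.sin(th), math.cos(th)
    x_pred = [px + v * dt * s, py + v * dt * c, v, th]
    F = [[1.0, 0.0, dt * s, v * dt * c],
         [0.0, 1.0, dt * c, -v * dt * s],
         [0.0, 0.0, 1.0, 0.0],
         [0.0, 0.0, 0.0, 1.0]]
    P_pred = add(matmul(matmul(F, P), transpose(F)), Q)
    if z is None:
        return x_pred, P_pred           # occluded: keep the prediction
    # Update, Equations (14)-(17).
    Ht = transpose(H)
    S = add(matmul(matmul(H, P_pred), Ht), R)        # innovation covariance
    K = matmul(matmul(P_pred, Ht), inv2(S))          # Kalman gain (4x2)
    innov = [z[0] - x_pred[0], z[1] - x_pred[1]]
    x_new = [xi + K[i][0] * innov[0] + K[i][1] * innov[1]
             for i, xi in enumerate(x_pred)]
    KH = matmul(K, H)
    I_KH = [[(1.0 if i == j else 0.0) - KH[i][j] for j in range(4)]
            for i in range(4)]
    return x_new, matmul(I_KH, P_pred)


Q = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.05, 0.0], [0.0, 0.0, 0.0, 0.0]]     # Equation (7)
R = [[0.05, 0.0], [0.0, 0.05]]                         # hypothetical value
x0 = [0.0, 0.0, 1.0, math.pi / 2]
P0 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

x_coast, P_coast = ekf_step(x0, P0, None, 0.04, Q, R)        # occluded frame
x_upd, P_upd = ekf_step(x0, P0, [0.14, 0.0], 0.04, Q, R)     # visible frame
```

In the occluded frame the position uncertainty grows, whereas in the visible frame the estimate moves from the predicted position toward the observation and the uncertainty contracts.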

2.4. Tracking-Prediction Fusion Network

Traditional pedestrian trajectory prediction frameworks typically adopt a unidirectional serial processing paradigm: the detection module first extracts pedestrian locations, the tracking module then maintains identity continuity across frames, and finally the prediction module infers future paths based on the historical trajectories provided by the tracker. However, in complex scenes with severe occlusion, this linear architecture reveals inherent fragility—any identity switching error occurring in the tracking module is directly propagated to the prediction module, while the physically implausible predictions cannot, in turn, correct the tracking decisions. This one-way error propagation creates a vicious cycle of cumulative mistakes, significantly degrading overall system performance.

In standard trackers, trajectory confidence typically depends on the matching quality of appearance features and the short-term consistency of the motion model. However, this evaluation criterion neglects the long-term physical plausibility of the predicted trajectories. To overcome this limitation, we introduce a comprehensive confidence score $C_{hybrid}$ derived from the prediction module to refine the predicted trajectories and enhance their physical plausibility. This score is defined as follows:

$$C_{hybrid} = \alpha \cdot C_{base} + \beta \cdot S_{pred} + \gamma \cdot C_{history} \tag{18}$$

The weighting coefficients $\alpha$, $\beta$ and $\gamma$ are determined via grid search on a validation set. The search range is $\alpha, \beta, \gamma \in \{0.1, 0.2, \dots, 0.8\}$, subject to $\alpha + \beta + \gamma = 1$. The optimization objective is to minimize the Average Displacement Error (ADE) on a validation set consisting of 20% of the training data, randomly sampled. The optimal values are determined as $\alpha = 0.4$, $\beta = 0.25$, $\gamma = 0.35$.

Here, $C_{base}$ is the detection confidence (averaged over the last 3 frames where the target was visible), $S_{pred}$ is the prediction smoothness score (based on the negative log-likelihood of the EKF innovation), and $C_{history}$ is the trajectory historical consistency (ratio of successful associations over the past 10 frames). These weights remain fixed during inference.

When $C_{hybrid}$ falls below a dynamic threshold, the system triggers a correction mechanism: the priority of the corresponding trajectory is reduced during data association, and stronger kinematic smoothing constraints are imposed on its prediction results.
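Equation (18) can be sketched with the reported weights. The sketch assumes that $S_{pred}$ has already been mapped to the range [0, 1] (the mapping from the innovation likelihood is not given in the text) and uses a fixed, hypothetical threshold as a stand-in for the dynamic one.

```python
from typing import Sequence

ALPHA, BETA, GAMMA = 0.4, 0.25, 0.35  # weights reported in the text
THRESHOLD = 0.5                        # hypothetical stand-in for the dynamic threshold


def hybrid_confidence(recent_scores: Sequence[float], s_pred: float,
                      assoc_history: Sequence[bool]) -> float:
    """Equation (18) from detection scores, smoothness, and association history.

    recent_scores : detection confidences from frames where the target was visible
    s_pred        : prediction smoothness score, assumed to lie in [0, 1]
    assoc_history : per-frame flags, True where association succeeded
    """
    if not recent_scores or not assoc_history:
        raise ValueError("need at least one score and one association flag")
    last3 = list(recent_scores)[-3:]
    c_base = sum(last3) / len(last3)
    window = list(assoc_history)[-10:]
    c_history = sum(window) / len(window)
    return ALPHA * c_base + BETA * s_pred + GAMMA * c_history


# A steadily tracked pedestrian versus one that is mostly occluded.
stable = hybrid_confidence([0.9, 0.8, 0.7], 0.6, [True] * 8 + [False] * 2)
occluded = hybrid_confidence([0.4, 0.3, 0.2], 0.2, [True] * 3 + [False] * 7)
needs_correction = occluded < THRESHOLD
```

The stable track scores 0.75, while the occluded track scores 0.275 and would trigger the correction mechanism under the assumed threshold.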

To further exploit the rich meta-information generated during the tracking process for enhancing prediction accuracy and robustness, we propagate a key input, namely the covariance matrix from the Kalman filter embedded in the tracking module, to the prediction module. This enables the Extended Kalman Filter (EKF) in the prediction module to assign a lower weight to observations with high uncertainty during state updates, thereby effectively suppressing noise introduced by short-term occlusions or mismatches. Specifically, the observation noise covariance matrix $R_{track}$ obtained from the DeepSORT internal Kalman filter is used to adjust the observation update step of the prediction EKF:

$$K_k = P_{k|k-1} H^{T} (H P_{k|k-1} H^{T} + R_{track})^{-1} \tag{19}$$

where a larger

R_{t r a c k}

corresponds to higher tracking uncertainty, which reduces the Kalman gain

K_{k}

. This causes the system to rely more on its own predictions rather than the current observations, providing a stabilizing effect under severe occlusion.
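The effect of $R_{track}$ on the gain in Equation (19) can be checked numerically. The sketch below assumes, purely for illustration, a four-dimensional state (position and velocity) with a position-only observation; the state layout of the paper's EKF may differ, and `kalman_gain` is a hypothetical helper name.

```python
import numpy as np


def kalman_gain(P_pred: np.ndarray, H: np.ndarray, R_track: np.ndarray) -> np.ndarray:
    """Equation (19): K = P H^T (H P H^T + R_track)^{-1}."""
    S = H @ P_pred @ H.T + R_track  # innovation covariance
    # Solve S^T K^T = (P H^T)^T instead of forming an explicit inverse.
    return np.linalg.solve(S.T, (P_pred @ H.T).T).T


# Illustrative setup (assumed): state [x, y, vx, vy], observation [x, y].
P_pred = np.eye(4)
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])

K_confident = kalman_gain(P_pred, H, 0.01 * np.eye(2))  # reliable observation
K_occluded = kalman_gain(P_pred, H, 10.0 * np.eye(2))   # uncertain observation

print(K_confident[0, 0], K_occluded[0, 0])
```

With these values the position gain drops from about 0.99 to about 0.09, so the update under high tracking uncertainty stays close to the predicted state.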

Moreover, the proposed collaborative mechanism is primarily implemented through algorithmic logic and rules, avoiding heavy end-to-end neural network training, which ensures computational efficiency. Key parameters are optimized via grid search on a small validation set. During inference, the system operates with the bidirectional information flow described above. The processing of each frame follows the cycle: “detection → tracking (integrating prediction feedback) → prediction (utilizing tracking meta-information)”. The final output is a physically plausible future trajectory, jointly optimized by the tracking and prediction modules.

The fusion architecture proposed in this section removes the barrier between tracking and prediction, enabling the system to adapt holistically under severe occlusion. This design underpins the performance reported in Section 3. An overview of the integrated system architecture is presented in Figure 4.

3. Results

3.1. Experimental Setup

The hardware platform used for model training and system evaluation is equipped with an NVIDIA GeForce RTX 5090 GPU and 64 GB of RAM. The software environment consists of the Ubuntu 20.04 operating system and the PyTorch 1.10.0 deep learning framework. For real-world data acquisition, a DJI Osmo Action 4 camera was used to capture pedestrian videos in typical street scenes at 1080p resolution and 30 fps. To construct a diverse training and testing dataset, we combined publicly available benchmarks with self-collected real-world data. The ETH/UCY datasets provide well-annotated trajectory samples from campus and street scenarios. Additionally, we collected approximately 12 h of real-world video under various weather and lighting conditions from multiple pedestrian-dense areas in Guangzhou (e.g., commercial streets, campuses, and intersections). These videos were manually annotated to form a complementary dataset. In the data preprocessing stage, all video sequences were standardized to a frame rate of 25 fps and a resolution of 640 × 640 pixels. After coordinate conversion (see Section 3.1.1), pedestrian positions are represented in metric coordinates on the real-world ground plane.

3.1.1. Evaluation Indicators

Detection and tracking performance. We employ mean Average Precision (mAP), frames per second (FPS), and the number of parameters (Params) to evaluate the detection and tracking modules. mAP comprehensively assesses detection precision and recall across varying confidence thresholds, with Average Precision (AP) defined as the area under the precision–recall curve for each pedestrian category.

Computational efficiency. FPS is defined as the ratio of the number of processed frames $N_{frames}$ to the total processing time $T_{total}$ (in seconds). Model complexity is quantified by the total number of trainable parameters (Params) and the number of floating-point operations (FLOPs) required for a single forward pass.

Trajectory prediction accuracy. For trajectory prediction evaluation, we adopt three standard metrics: Average Displacement Error (ADE), Final Displacement Error (FDE), and trajectory jerk. ADE measures the average Euclidean distance between predicted and ground-truth positions over all time steps, while FDE evaluates the displacement specifically at the final prediction step. Jerk quantifies trajectory smoothness and physical plausibility.
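The three trajectory metrics can be sketched with NumPy as follows. ADE and FDE follow the definitions above. The jerk computation is an assumption, since its exact formula is not given here: it is approximated as the mean magnitude of the third finite difference of position divided by the cube of the sampling interval `dt`, which must be supplied by the caller.

```python
import numpy as np


def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Euclidean distance over all time steps; inputs have shape (T, 2)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Euclidean distance at the final prediction step."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))


def mean_jerk(traj: np.ndarray, dt: float) -> float:
    """Mean magnitude of the third time derivative of position (assumed definition).

    Requires at least four points; `dt` is the sampling interval in seconds.
    """
    jerk = np.diff(traj, n=3, axis=0) / dt ** 3
    return float(np.linalg.norm(jerk, axis=-1).mean())
```

A constant-velocity trajectory has zero jerk under this definition, while abrupt changes in acceleration raise the value, which is why the metric serves as a proxy for physical plausibility.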

Definition of coordinate system and coordinate conversion. In this paper, $(x_{t}, y_{t})$ denotes metric coordinates on the real-world ground plane (unit: meters), rather than raw image pixel coordinates. All trajectory prediction and error evaluation (ADE, FDE, Jerk) are performed in this metric coordinate system.

To obtain metric coordinates from the pixel coordinates detected by YOLOv11, we implement a “Perspective Transformer” class. The conversion consists of two steps: global scale factor calibration and perspective correction.

Step 1: Global scale factor calibration. The pixel-to-meter ratio is calibrated using reference points. Let $D_{pixel}$ be the sum of Euclidean pixel distances between reference points, and $D_{real}$ be the sum of the corresponding real-world distances. The global scale factor (in pixels per meter) is as follows:

scale\_factor = \frac{D_{pixel}}{D_{real}}

(20)

Step 2: Perspective correction. Due to perspective distortion, the same pixel displacement corresponds to a larger real-world displacement for objects farther away (smaller vertical coordinate $y$). We dynamically adjust the scale factor based on the vertical coordinate $y$ of the bottom of the target’s bounding box:

y_{ratio} = \frac{y}{H}, \quad adjusted\_scale = scale\_factor \times (1 - 0.5 \times y_{ratio})

(21)

where $H$ is the image height. The adjusted scale factor is constrained not to fall below $0.3 \times scale\_factor$ to avoid numerical instability.

The final actual distance is as follows:

d_{real} = \frac{d_{pixel}}{adjusted\_scale}

(22)

When computing ADE and FDE, for each pair of predicted and ground-truth points, we first compute the pixel Euclidean distance $d_{pixel}$, then apply the above perspective correction to obtain the actual distance $d_{real}$ in meters. If the system is not calibrated or vertical position information is missing, it falls back to a default conversion (100 pixels ≈ 1 m).
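The conversion in Equations (20)–(22), including the lower bound and the fallback, can be sketched as below. This is an illustrative reimplementation, not the authors' "Perspective Transformer" class: the method names are hypothetical, and summing distances between consecutive reference points is an assumption about how $D_{pixel}$ is formed.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


class PerspectiveTransformer:
    DEFAULT_PIXELS_PER_METER = 100.0  # fallback: 100 pixels are about 1 m

    def __init__(self, image_height: float) -> None:
        self.image_height = image_height
        self.scale_factor: Optional[float] = None  # pixels per meter

    def calibrate(self, pixel_points: Sequence[Point], real_distances: Sequence[float]) -> None:
        """Equation (20): ratio of summed pixel distances to summed real distances.

        Assumption: distances are taken between consecutive reference points.
        """
        d_pixel = sum(math.dist(p, q) for p, q in zip(pixel_points, pixel_points[1:]))
        d_real = sum(real_distances)
        if d_real <= 0:
            raise ValueError("real-world reference distances must be positive")
        self.scale_factor = d_pixel / d_real

    def adjusted_scale(self, y: float) -> float:
        """Equation (21), floored at 0.3 * scale_factor."""
        y_ratio = y / self.image_height
        adjusted = self.scale_factor * (1.0 - 0.5 * y_ratio)
        return max(adjusted, 0.3 * self.scale_factor)

    def to_meters(self, d_pixel: float, y: Optional[float] = None) -> float:
        """Equation (22), or the default conversion if uncalibrated or y is missing."""
        if self.scale_factor is None or y is None:
            return d_pixel / self.DEFAULT_PIXELS_PER_METER
        return d_pixel / self.adjusted_scale(y)
```

For $y$ within the image ($0 \le y \le H$) the adjusted scale stays between $0.5 \times scale\_factor$ and $scale\_factor$, so the 0.3 floor only takes effect for coordinates below the image, where $y > 1.4H$.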

3.1.2. Loss Function

We employ the Mean Squared Error (MSE) between the predicted and ground-truth trajectories as the training objective:

L_{trai} = \frac{1}{T_{f}} \sum_{t=1}^{T_{f}} \left[ (\hat{x}_{t} - x_{t})^{2} + (\hat{y}_{t} - y_{t})^{2} \right]

(23)

where $T_{f}$ denotes the prediction horizon, and $(x_{t}, y_{t})$ and $(\hat{x}_{t}, \hat{y}_{t})$ are the ground-truth and predicted positions at time step $t$, respectively.
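Equation (23) translates directly into a few lines of NumPy. The function name `trajectory_mse` is a hypothetical label for this sketch, and both inputs are assumed to be arrays of shape $(T_{f}, 2)$ in metric coordinates.

```python
import numpy as np


def trajectory_mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Equation (23): mean over the horizon of the squared position error."""
    squared_error = np.sum((pred - gt) ** 2, axis=-1)  # (x error)^2 + (y error)^2 per step
    return float(squared_error.mean())
```

Unlike ADE, this loss averages squared distances, so a few large deviations are penalized more heavily than many small ones.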

3.2. Comparative Experiments

To validate the prediction effectiveness of the proposed algorithm, we selected five representative state-of-the-art baseline models for comparison with our method. All models were evaluated on a mixed dataset comprising the ETH/UCY benchmark [10] and our self-collected dataset. The observation horizon was set to 9 time steps and the prediction horizon to 12 time steps. During training, the number of epochs was set to 100, the batch size to 4, the random seed to 0, and the initial learning rate to 0.0001. All experiments were conducted on an NVIDIA RTX 5090 GPU. The experimental results are presented in Table 1.

From Table 1, our method offers two main advantages while keeping prediction accuracy on par with the strongest baselines (an ADE of 0.26 m versus 0.25 m for Trajectron++, and the lowest FDE of 0.40 m). First, in terms of the physical plausibility of motion, our approach attains the lowest average trajectory jerk (0.9 m/s³), which benefits from the Kalman filter-based trajectory smoothing constraint incorporated into our loss function. Second, regarding deployment efficiency, our model achieves a real-time inference speed of 57.5 FPS with a computational complexity of only 6.6 GFLOPs and 2.6 M parameters. It is markedly more efficient than deep learning models such as Trajectron++ [24] (37.5 FPS, 15.0 GFLOPs, 4.0 M parameters), which rely on iterative computations over complex graph networks. These results indicate that our lightweight design strikes a favorable balance among accuracy, smoothness, and efficiency, offering a more practical solution for real-time trajectory prediction in resource-constrained scenarios such as vehicle-mounted embedded platforms.

To evaluate the pedestrian detection performance of the proposed model, we conducted comparative experiments against two advanced pedestrian detection models: YOLOv8-CB [1] and YOLOv11x [15]. Following the same experimental protocol used for trajectory prediction, all models were trained on the mixed dataset comprising the ETH/UCY benchmark [10] and our self-collected dataset, and tested on videos captured from real-world street scenes. The experimental results are presented in Table 2.

According to Table 2, our model surpasses YOLOv8-CB in mAP@0.5 (0.94 versus 0.88, a relative gain of 6.8%) and matches YOLOv11x (0.94). This suggests that adding the DeepSORT module enhances pedestrian detection accuracy, providing more reliable tracking information for subsequent trajectory prediction. Moreover, compared to YOLOv11x, our model greatly reduces the number of parameters (2.6 M versus 57 M) and the computational cost (6.6 versus 196 GFLOPs), reaching the same detection accuracy at a much lower cost. This makes it better suited to embedded devices. Representative detection results during training are visualized in Figure 5, demonstrating the model’s ability to accurately localize pedestrians under varying crowd densities and partial occlusion conditions, with the corresponding precision–recall curve presented alongside.

3.3. Comparison Under Occlusion Scenarios

To validate the core advantage of the proposed OPTP-System in real-world complex scenes—namely, its robustness against occlusion—we designed a systematic comparative experiment under occlusion scenarios, selecting three state-of-the-art (SOTA) models for benchmarking. This section aims to quantitatively evaluate the performance degradation of each trajectory prediction method when the observation quality deteriorates due to mutual occlusion among pedestrians.

Given that occlusion scenarios in standard datasets are limited and uncontrollable, we constructed a test set using a synthetic method based on real data. Specifically, we extracted continuous video segments with dense pedestrian interactions from the ETH/UCY test set and computed the average occlusion rate for pedestrians in each segment. The definition of the average pedestrian occlusion rate is formulated as follows:

O_{t}^{i} = 1 - \left| \frac{V_{t}}{B_{t}} \right|

(24)

O^{i} = \frac{1}{T} \sum_{t=1}^{T} O_{t}^{i}

(25)

OR = \frac{1}{N} \sum_{i=1}^{N} O^{i}

(26)

where $O_{t}^{i}$ denotes the occlusion rate of pedestrian $i$ at frame $t$, $V_{t}$ is the visible area, $B_{t}$ is the full body area (estimated by the detection bounding box of the pedestrian in that frame), $O^{i}$ represents the average occlusion rate of pedestrian $i$ over the entire video, $T$ is the number of frames, and $N$ is the number of pedestrians.

Based on the OR value, each test video is categorized into one of four occlusion levels: light occlusion (0–10%), moderate occlusion (10–25%), severe occlusion (25–50%), and extreme occlusion (50–100%). The observation horizon was set to nine time steps and the prediction horizon to 12 time steps. During training, the number of epochs was set to 100, batch size to four, random seed to 0, and initial learning rate to 0.0001. All experiments were conducted on an NVIDIA RTX 5090 GPU.
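The occlusion rate in Equations (24)–(26) and the four-level categorization can be sketched as follows. The function names are hypothetical, each pedestrian is represented as a list of (visible area, bounding-box area) pairs, and the handling of the shared interval endpoints (here half-open intervals, so that exactly 10% counts as moderate) is an assumption.

```python
from typing import Sequence, Tuple

Frame = Tuple[float, float]  # (visible area V_t, bounding-box area B_t)


def frame_occlusion(visible_area: float, box_area: float) -> float:
    """Equation (24): occlusion rate of one pedestrian in one frame."""
    return 1.0 - abs(visible_area / box_area)


def pedestrian_occlusion(frames: Sequence[Frame]) -> float:
    """Equation (25): mean occlusion of one pedestrian over its T frames."""
    return sum(frame_occlusion(v, b) for v, b in frames) / len(frames)


def occlusion_rate(pedestrians: Sequence[Sequence[Frame]]) -> float:
    """Equation (26): mean over the N pedestrians in the video."""
    return sum(pedestrian_occlusion(p) for p in pedestrians) / len(pedestrians)


def occlusion_level(or_value: float) -> str:
    """Map OR to the four levels used in the occlusion experiments."""
    if or_value < 0.10:
        return "light"
    if or_value < 0.25:
        return "moderate"
    if or_value < 0.50:
        return "severe"
    return "extreme"
```

For instance, a video with one pedestrian averaging 25% occlusion and another averaging 75% has OR = 0.5 and falls into the extreme category under this convention.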

The experimental results of the occlusion robustness evaluation are presented in Table 3. All values in the table represent ADE/FDE, with units in meters.

From Table 3, our method achieves the lowest FDE at every occlusion level and the lowest ADE at all levels except light occlusion, where Trajectron++ is marginally better (0.25 versus 0.26). Notably, as the occlusion level increases from light (0–10%) to extreme (50–100%), the ADE of our method only increases from 0.26 to 0.28, a relative increase of 7.7%, while the FDE rises from 0.40 to 0.53, a 32.5% increase. This degradation is substantially smaller than that of the compared methods; for example, the ADE of Trajectron++ rises from 0.25 to 0.40 (60.0%) and its FDE from 0.43 to 0.75 (74.4%). These results highlight the occlusion robustness of the proposed method for pedestrian trajectory prediction in complex real-world occlusion environments.

3.4. Ablation Study

To verify the effectiveness of each component of the proposed bidirectional feedback mechanism under extreme occlusion, we conducted systematic ablation experiments. Using the EKF prediction model without any bidirectional feedback as the baseline, we progressively added each feedback module and observed the impact on trajectory prediction performance. All experiments were conducted on the extreme occlusion test subset with occlusion rate OR > 50%, observation horizon = 9 and prediction horizon = 12. The results are presented in Table 4.

From Table 4, adding tracking-to-prediction uncertainty propagation reduces ADE from 1.45 to 0.58 (a 60.0% reduction). This shows that using the tracker’s uncertainty information to dynamically adjust the prediction module’s observation weights is effective, especially when the target is severely occluded and observations are unreliable. Adding prediction-to-tracking confidence feedback alone reduces ADE to 0.56 (a 61.4% reduction), showing that using the smoothness of predicted trajectories to refine the tracker’s confidence evaluation is also effective and helps reduce identity switches caused by occlusion. The full bidirectional feedback model achieves the best performance across all metrics, with ADE reduced to 0.28 (an 80.7% reduction relative to the baseline), FDE reduced to 0.53, and jerk reduced to 0.9. The two feedback directions work synergistically: tracking conveys uncertainty to prediction, while prediction imposes confidence constraints on tracking, forming a closed optimization loop whose combined effect is substantially greater than that of either direction alone. These results indicate that the proposed bidirectional feedback mechanism is effective under extreme occlusion, that each component contributes substantially to the final performance, and that the two feedback directions are complementary.
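The reduction percentages quoted above follow from the ADE values in Table 4. The short check below recomputes them; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline * 100.0


BASELINE_ADE = 1.45  # variant (1), EKF without feedback
for name, ade_value in [("tracking->prediction", 0.58),
                        ("prediction->tracking", 0.56),
                        ("full bidirectional", 0.28)]:
    print(f"{name}: {relative_reduction(BASELINE_ADE, ade_value):.1f}%")
# prints 60.0%, 61.4% and 80.7%
```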

To verify the robustness of the hybrid confidence weight selection in Section 2.4, we vary each coefficient by ±20% from its optimal value and observe the change in ADE. Experiments are conducted on the extreme occlusion (OR > 50%) test subset. It should be noted that to maintain the mathematical constraint of weighted averaging (α + β + γ = 1), when one coefficient varies, the other two are proportionally scaled. For example, when α is reduced by 20%, β and γ are proportionally increased to keep the sum equal to one. The ADE values in the table are averaged over five independent runs, and the relative change is calculated as (Current ADE − Optimal ADE)/Optimal ADE × 100%. From Table 5, we observe that when each weight varies within ±20% of the optimal value, the ADE change does not exceed 3.5%, with the largest change occurring at β + 20% (+3.2%) and the smallest at γ + 20% (+0.7%). This result demonstrates that our method is insensitive to weight selection, maintaining stable performance even when weights fluctuate within a considerable range, confirming the robustness of the hybrid confidence weight design.
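The proportional rescaling used in this sensitivity analysis can be sketched as follows. The function name `perturb_weight` and the dictionary representation of the weights are illustrative assumptions, not the authors' code.

```python
from typing import Dict


def perturb_weight(weights: Dict[str, float], name: str, rel_change: float) -> Dict[str, float]:
    """Scale one weight by (1 + rel_change) and rescale the others to keep the sum at 1."""
    new_value = weights[name] * (1.0 + rel_change)
    others = {k: v for k, v in weights.items() if k != name}
    # The remaining mass (1 - new_value) is shared in the original proportions.
    scale = (1.0 - new_value) / sum(others.values())
    perturbed = {k: v * scale for k, v in others.items()}
    perturbed[name] = new_value
    return perturbed


optimal = {"alpha": 0.40, "beta": 0.25, "gamma": 0.35}
plus_beta = perturb_weight(optimal, "beta", +0.20)
print({k: round(v, 2) for k, v in plus_beta.items()})
# {'alpha': 0.37, 'gamma': 0.33, 'beta': 0.3}
```

The relative ADE change for each configuration is then (Current ADE − Optimal ADE)/Optimal ADE × 100%, as defined above.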

3.5. Scenario Case Display

We evaluated our model on multiple videos using a high-performance NVIDIA GeForce RTX 5090 GPU with 64 GB of RAM, running the Ubuntu 20.04 operating system. Representative qualitative results are illustrated in Figure 6, where the concentric circles outside each pedestrian bounding box represent the predicted trajectories at 0.5 s, 1 s, 1.5 s and 2 s, respectively.

It can be seen that our model still maintains high-precision trajectory prediction in complex intersection, occlusion and rapid direction change scenarios. The dot distribution closely matches the real motion trend without obvious drift or divergence.

4. Discussion

This study proposes a lightweight, occlusion-aware framework for pedestrian trajectory prediction. Our method adopts a modular design philosophy, departing from the complex visual encoding networks and end-to-end deep learning paradigms that currently dominate the field. The YOLOv11 object detector, an improved DeepSORT multi-target tracker, and an EKF prediction module are integrated into a collaborative detection–tracking–prediction system. To improve prediction accuracy under difficult occlusion, we design an occlusion-aware detection–tracking collaboration mechanism. It provides state priors during short target occlusions to guide the detector’s re-searching of targets in occluded areas, and it uses smoothness feedback from the tracked trajectories to suppress false positives and missed detections. This limits error accumulation and identity switches during occlusion. In addition, the lightweight three-stage processing architecture keeps the parameter count and computational cost below those of heavier baselines such as Trajectron++ and the Transformer model, without sacrificing prediction accuracy or trajectory smoothness. As a result, computational overhead and memory usage are reduced while comparable performance is maintained. As the experiments show, our model achieves competitive prediction accuracy and stronger occlusion robustness than the compared methods.

Nonetheless, this study has several limitations. First, although the framework is robust to short-term occlusions (typically around 2 s), identity re-identification can still fail when pedestrians reappear after long periods of complete occlusion (e.g., after walking into a building), which is limited by the generalization capacity of the appearance features of the current Re-ID model. Second, the system assumes that pedestrians follow a constant-turn motion model over short time intervals, which cannot effectively handle abrupt motion changes in complex social environments (e.g., sudden acceleration, sudden stops, or sudden turns); improving the dynamics model may require incorporating deep learning models. Finally, the practical benefits of the framework’s streamlined design, especially power consumption and latency on edge devices, still need to be systematically assessed and validated under real conditions.

Author Contributions

Z.L. was responsible for the primary writing, data preparation, experimental reasoning, and revision of the paper. H.H., Y.Z. and W.Z. provided guidance on technical matters and writing as instructors. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (2023YFD1400700).

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, Q.; Ye, H.; Wang, S.; Xu, Z. YOLOv8-CB: Dense Pedestrian Detection Algorithm Based on In-Vehicle Camera. Electronics 2024, 13, 236.
2. Qiu, J.; Wang, S.; Xu, Y.; Wu, X.; Liu, B. Improved YOLOv8-based obstacle detection algorithm for unmanned vehicles. J. Xiamen Univ. Technol. 2025; in press.
3. Wang, Q.; Song, W.; Hua, Z.; Lu, P. Ship trajectory prediction research based on IMM Kalman filter. Transducer Microsyst. Technol. 2025, 44, 48–51.
4. Dai, H.; Zhang, X.; Zheng, B.; Dai, S.; Zheng, W. Zero-velocity update algorithm for inertial pedestrian navigation with heading error nonlinear prediction. J. Beijing Univ. Aeronaut. Astronaut. 2022, 48, 1135–1144.
5. Chen, Y. Research on Pedestrian Track Inference Technology Based on Adaptive Kalman Filter. Master’s Thesis, Beijing University of Posts and Telecommunications, Beijing, China, 2021.
6. Wang, H.; Liu, Y.; Qian, K. Pedestrian trajectory prediction based on KDE data density pruning graph neural network. Acta Metrol. Sin. 2025, 46, 1513–1519.
7. Du, T.; Zhuang, X.; Wang, Y.; Li, Z.; Lv, J.; Zhi, Y.; Zhao, Y. Survey on pedestrian trajectory prediction based on graph neural networks. J. Comput. Sci. Explor. 2025, 20, 923–942.
8. Ma, L.; Liu, D.; Dai, X. TF-MI: A Pedestrian Trajectory Prediction Network Based on Human Pose Temporal-Frequency Feature and Multi-Scale Interaction. Digit. Signal Process. 2026, 168, 105680.
9. Zhou, H.; Lv, J.; Ding, P.; Chen, M. UAV trajectory prediction model integrating channel-temporal and cross-attention. J. Beijing Univ. Aeronaut. Astronaut. 2025, 1–13.
10. Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Las Vegas, NV, USA, 2016; pp. 961–971.
11. Mohamed, A.; Qian, K.; Elhoseiny, M.; Claudel, C. Social-STGCNN: A Social Spatio-Temporal Graph Convolutional Neural Network for Human Trajectory Prediction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Seattle, WA, USA, 2020; pp. 14412–14420.
12. Giuliari, F.; Hasan, I.; Cristani, M.; Galasso, F. Transformer Networks for Trajectory Forecasting. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Virtually, 10–15 January 2021; pp. 10335–10342.
13. Gao, F.; Huang, W.; Weng, L.; Zhang, Y. SIF-TF: A Scene-Interaction Fusion Transformer for Trajectory Prediction. Knowl. Based Syst. 2024, 294, 111744.
14. Liu, M.; Deng, S.; Qi, Y.; Shi, X. Research on Lightweight Adaptive Object Tracking Algorithm Based on Improved YOLOv11n and DeepSORT. Int. Core J. Eng. 2025, 11, 33–47.
15. Liu, W.; Shi, W.; Yang, M.; Wang, J.; Huang, M.; Yang, L. An Improved YOLOv11 Pedestrian Detection Model in Complex Scenes. J. Hebei Univ. Sci. Technol. 2026, 47, 60–72.
16. Sazak, H.; Kotan, M. Automated Blood Cell Detection and Classification in Microscopic Images Using YOLOv11 and Optimized Weights. Diagnostics 2024, 15, 22.
17. Cheng, S.; Han, Y.; Wang, Z.; Liu, S.; Yang, B.; Li, J. An Underwater Object Recognition System Based on Improved YOLOv11. Electronics 2025, 14, 201.
18. Alkhammash, E.H. Multi-Classification Using YOLOv11 and Hybrid YOLO11n-MobileNet Models: A Fire Classes Case Study. Fire 2025, 8, 17.
19. Zou, C.; Yu, S.; Yu, Y.; Gu, H.; Xu, X. Side-Scan Sonar Small Objects Detection Based on Improved YOLOv11. J. Mar. Sci. Eng. 2025, 13, 162.
20. Wang, C.; Han, Y.; Yang, C.; Wu, M.; Chen, Z.; Yun, L.; Jin, X. CF-YOLO for Small Target Detection in Drone Imagery Based on YOLOv11 Algorithm. Sci. Rep. 2025, 15, 16741.
21. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. StrongSORT: Make DeepSORT Great Again. IEEE Trans. Multimed. 2023, 25, 8725–8737.
22. Alkandary, K.; Yildiz, A.S.; Meng, H. A Comparative Study of YOLO Series (v3–v10) with DeepSORT and StrongSORT: A Real-Time Tracking Performance Study. Electronics 2025, 14, 876.
23. Zhu, Y.; Yu, Q.; Xu, Z. Optimized Watermelon Scion Leaf Segmentation Model Based on Hungarian Algorithm and Information Theory. Electronics 2025, 14, 620.
24. Qian, T.-W.; Wang, Y.; Xu, Y.-J.; Zhang, Z.; Wu, L.; Qiu, Q.; Wang, F. A Model-Agnostic Hierarchical Framework Towards Trajectory Prediction. J. Comput. Sci. Technol. 2025, 40, 322–339.

Figure 1. The overall architecture of YOLOv11.

Figure 2. YOLO-DeepSORT working framework, “prediction–correlation–update” diagram.

Figure 3. Kalman filtering workflow.

Figure 4. Overall architecture of the proposed OPTP-System.

Figure 5. Detection examples during training and precision–recall curve.

Figure 6. Qualitative results in real-world scenarios.

Table 1. Comparison with state-of-the-art baseline models.

Methods	ADE (m)	FDE (m)	Avg. Jerk (m/s³)	FPS	FLOPs (G)	Params (M)
Social-LSTM [10]	0.27	0.46	5.5	67.5	0.8	1.5
Social-STGCNN [11]	0.26	0.45	4.1	56.3	0.3	0.007
Trajectron++ [24]	0.25	0.43	1.2	37.5	15.0	4.0
Transformer [12]	0.31	0.65	4.2	32.1	8.0	10.0
TF-MI [8]	0.30	0.80	2.3	42.5	2.9	3.3
ours	0.26	0.40	0.9	57.5	6.6	2.6

Table 2. Pedestrian detection performance comparison with YOLOv8-CB and YOLOv11x.

Methods	mAP@0.5	FPS	FLOPs (G)	Params (M)
YOLOv8-CB [1]	0.88	92.5	7.5	2.7
YOLOv11x [15]	0.94	31.6	196	57
ours	0.94	57.5	6.6	2.6

Table 3. Experimental results (ADE/FDE, in meters) under different test video occlusion levels.

Methods	0–10%	10–25%	25–50%	50–100%
Social-LSTM [10]	0.27/0.46	0.31/0.55	0.39/0.72	0.52/0.98
Social-STGCNN [11]	0.26/0.45	0.29/0.52	0.35/0.65	0.45/0.86
Trajectron++ [24]	0.25/0.43	0.27/0.48	0.32/0.58	0.40/0.75
ours	0.26/0.40	0.26/0.43	0.27/0.47	0.28/0.53

Table 4. Ablation study under extreme occlusion (OR > 50%).

Model Variant	Configuration	ADE (m)	FDE (m)	Jerk (m/s³)	ADE Reduction vs. Baseline
(1) EKF (no feedback, baseline)	EKF prediction only, no bidirectional feedback	1.45	1.85	2.3	-
(2) + Tracking → Prediction	(1) + Uncertainty propagation (Equation (19))	0.58	0.85	1.3	60.0%
(3) + Prediction → Tracking	(1) + Confidence feedback (Equation (18))	0.56	0.80	1.2	61.4%
(4) Full bidirectional (ours)	(1) + (2) + (3)	0.28	0.53	0.9	80.7%

Table 5. Sensitivity analysis of hybrid confidence weights (extreme occlusion OR > 50%).

Configuration	$α$	$β$	$γ$	ADE	Relative Change
Optimal	0.4	0.25	0.35	0.280	-
$α$ − 20%	0.32	0.27	0.41	0.287	+2.5%
$α$ + 20%	0.48	0.22	0.30	0.284	+2.4%
$β$ − 20%	0.43	0.20	0.37	0.283	+1.1%
$β$ + 20%	0.37	0.30	0.33	0.289	+3.2%
$γ$ − 20%	0.42	0.27	0.28	0.285	+1.8%
$γ$ + 20%	0.38	0.24	0.38	0.282	+0.7%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

