2.1. Pedestrian Detection Framework Based on YOLOv11
YOLOv11, the most recent version of the framework, was made available by Ultralytics on 30 September 2024 [
15]. It is a comprehensive unified framework for five computer vision tasks. The various tasks are as follows: object detection, instance segmentation, image classification, pose estimation, and oriented object detection. Through architectural optimizations of core components, along with better design choices and training methods, this version achieves substantial improvements in feature recognition accuracy and inference efficiency. Interestingly, YOLOv11 obtains a greater mean Average Precision (mAP) on COCO while enjoying lower computational burden. Aside from these performance enhancements, its improved usability allows for integration and deployment in downstream applications [
15], expanding its practical utility.
YOLOv11 retains the classic “Backbone–Neck–Head” three-stage architecture characteristic of the YOLO series, while introducing key upgrades in each component. In the backbone network, YOLOv11 builds upon the foundation of YOLOv8 [
1] with further optimizations. Notably, it extensively adopts the CSPNet (Cross Stage Partial Networks) design philosophy [
15], which is manifested in the model as the C3k2 module. This module achieves richer gradient flow information by partitioning feature maps and performing partial fusion across different stages, thereby enhancing its feature representation capability while reducing computational cost. It employs the SiLU activation function, which generally provides smoother optimization landscapes and better performance than traditional ReLU in deep models.
In the neck, YOLOv11 utilizes an enhanced version of PAN-FPN (Path Aggregation Network–Feature Pyramid Network), which implements both top-down and bottom-up feature fusion pathways. Through this structure, the neck outputs three strong semantic feature maps at different scales to the detection head, enabling the detection of small, medium, and large targets, respectively [
16,
17,
18,
19,
20]. The overall architecture of YOLOv11 is illustrated in
Figure 1.
In order to achieve accurate prediction of pedestrian trajectories, target detection and positioning are of great significance. This study uses the YOLOv11 visual model for multi-target detection, which can effectively improve the accuracy of the data and lay a solid foundation for subsequent time series modeling and prediction work. The specific implementation method is as follows.
For each frame of the input video, YOLOv11 detects pedestrians and outputs a set of bounding boxes, donated as . Each bounding box contains the spatial coordinates of the i-th object in the image, along with its confidence score and class label . To eliminate unreliable detections, we apply a predefined confidence threshold and retain only those high-confidence bounding boxes that are classified as “pedestrians”. These retained boxes serve as the candidate targets for the current frame.
Pedestrian detection in complex scenes is often challenged by adverse factors such as varying illumination and the presence of small pedestrian targets, which significantly complicate feature extraction. To address these challenges, we employ feature fusion techniques to enhance the model’s capability in perceiving fine-grained and multi-scale targets [
16].
2.2. Improved YOLOv11-DeepSORT Tracker for Occlusion Robustness
In complicated conditions, YOLO-based detectors have difficulty consistently maintaining target identities through frames and output consistent historical trajectories [
17,
20]. To enable accurate multi-target localization and stable tracking, we use an efficient algorithm fusion strategy that merges YOLOv11’s powerful detection ability and DeepSORT’s robust association tracking ability. The gist of this method is to use YOLOv11 to perform high accuracy instant target localization and allow DeepSORT to perform association across frames and generate continuous trajectories with unique ID labels. This joint combination makes great use of the merits of both algorithms, enabling a multi-pedestrian detection and tracking system with strong real-time performance and high robustness.
The integration of YOLOv11 and DeepSORT essentially forms a cascade of detection and tracking modules. As the detector, YOLOv11 independently identifies all pedestrian targets in each video frame and outputs a set of detection bounding boxes (refer to (1)). Each detection box contains the pixel coordinates of its center , its width and height , and a detection confidence score .
Since frame-independent detection results alone cannot provide temporal continuity for targets, we employ DeepSORT as the tracker to establish correspondences between detected targets across adjacent frames and maintain an independent tracking trajectory for each target (refer to (2)). To achieve this, DeepSORT relies on a two-stage association mechanism: motion-based association and appearance-based association.
First, for each existing trajectory
, DeepSORT maintains a Kalman filter based on a constant velocity linear motion model. This filter predicts the target’s position at time t (denoted as a predicted bounding box
) based on the trajectory’s state at time t − 1. Subsequently, the Mahalanobis distance between all detection boxes D
t and all predicted boxes
is computed to measure their compatibility in terms of motion consistency. The Mahalanobis distance is formulated as shown in (3).
where
and
denote the position vectors of the predicted box and the detection box, respectively, and
is the covariance matrix of the Kalman filter’s predicted state.
Equation (3) defines the Mahalanobis distance, which is a standard metric for motion association in the DeepSORT algorithm. Unlike Euclidean distance, the Mahalanobis distance accounts for prediction uncertainty: when the prediction uncertainty is high (i.e.,
is large), the same Euclidean distance yields a smaller Mahalanobis distance, making the association more permissive. This mechanism is widely used in object tracking [
21,
22].
Appearance-based association, powered by a Re-Identification (Re-ID) network, addresses the challenge of identity preservation under complex scenarios such as occlusion and pedestrian intersection, particularly when the motion model fails (e.g., during sudden turns or pauses). To this end, we introduce a second association criterion [
21,
22]. Specifically, a pre-trained deep Re-ID network is employed to extract a high-dimensional appearance feature vector
from each detection box
. This feature vector is normalized such that
1. Similarly, for each existing trajectory
, we maintain a gallery of its L most recent appearance feature vectors. The cosine distance between the detection feature and the trajectory feature set is then computed, and the minimum value is taken as the appearance affinity metric. Formally, the appearance-based similarity measure
is defined as
The fusion process of the proposed system operates as an iterative “prediction–association–update” cycle, as illustrated in
Figure 2. The procedure consists of the following steps:
Detection. For each input video frame , the YOLOv11 model performs forward inference to generate a set of high-confidence detection results .
Prediction. For all currently active tracking trajectories , their respective Kalman filters predict the state (position and velocity) at time t, yielding a set of predicted bounding boxes .
Association. After prediction, motion-based and appearance-based associations are performed sequentially:
Stage 1 (motion association): The Mahalanobis distance is computed between detection boxes and predicted boxes for initial matching. A Mahalanobis distance threshold is applied to eliminate unlikely associations.
Stage 2 (appearance association): For detections and trajectories that remain unmatched after the first stage, the cosine distance is used to construct a second cost matrix for secondary association.
Optimal assignment: The Hungarian algorithm [
23] is then applied to the combined cost matrix to obtain the optimal matching between detection boxes and existing trajectories.
Trajectory update. Based on the association results, trajectories are updated as follows.
Based on the association results, trajectories are updated accordingly: successfully matched trajectories are refined using their corresponding detection boxes as observations; unmatched detection boxes are initialized as new tracking trajectories; and unmatched trajectories are temporarily marked as “lost” and will be re-associated in subsequent frames, with termination occurring if the loss persists for multiple consecutive frames.
2.3. Prediction Method Based on Extended Kalman Filter
Based on the Extended Kalman Filter (EKF), the module for the trajectory prediction system is proposed to predict the trajectories of pedestrians. Through the state-space model of pedestrian motion, EKF improves the accuracy of predicting moving target location and smoothing the trajectory [
4]. In particular, we first build on a nonlinear motion model that takes into account time-dependent dynamics such as direction and speed. According to this model, the state vector of the pedestrian is defined to contain the position and velocity information.
From image pixels to Kalman state vector. The output of a depth camera is an image sequence. To apply Kalman filtering for trajectory prediction, we first convert the pedestrian bounding boxes detected by YOLOv11 into observation vectors for the Kalman filter. Specifically, for each frame, YOLOv11 outputs a bounding box for each pedestrian. We take the bottom center of the bounding box as the contact point between the pedestrian and the ground, denoted as pixel coordinates (u,v). Subsequently, we use our implemented “Perspective Transformer” class to convert the pixel coordinates to metric coordinates (
,
) on the real-world ground plane. The conversion consists of two steps: (1) global scale factor calibration, which determines the pixel-to-meter ratio using reference points; and (2) perspective correction, which dynamically adjusts the scale factor based on the vertical position of the target in the image to eliminate perspective distortion (see
Section 3.1 for details). Finally, the observation vector of the Kalman filter is defined as
, and the state vector
contains position and velocity information, as shown in (5).
where
,
denote the pedestrian’s 2D coordinates, v represents the instantaneous velocity magnitude, and θ is the direction angle of motion.
This model is more suitable for describing pedestrian turning behaviors compared to the Constant Velocity (CV) model. The nonlinear state transition process of the system can be expressed as follows:
where f(.) is the nonlinear state transition function, u
k−1 denotes the control input (which is typically zero in this study), and w
k−1 represents the process noise. The process noise is assumed to follow a zero-mean Gaussian distribution with covariance matrix Q
k,w
k~N(0,Q
k).
Setting of control input and process noise. The control input in Equation (6) is set to zero because pedestrians have no external control signals available (e.g., acceleration commands or steering angles). Pedestrian motion is driven by intent rather than known control inputs. In the absence of such information, the control term is omitted (set to zero), and all motion variations are absorbed into the process noise . This is standard practice in pedestrian trajectory prediction.
The process noise
follows a zero-mean Gaussian distribution w
k~N(0,Q
k). Its covariance matrix Q
k−1 is empirically tuned via grid search on a validation set. We search over the diagonal entries of Q with the objective of minimizing the Average Displacement Error (ADE) on the validation set. The optimal values are determined as follows:
corresponding to the noise variances for position (
), velocity
and heading angle
, respectively. These values remain fixed during inference. For scenarios with highly unpredictable motion, an adaptive Q could be used, but we found fixed tuning sufficient for our datasets.
Taking a simplified nonlinear motion model as an example, if we assume that the velocity magnitude v and direction angle θ remain approximately constant over a short time interval Δt, the state transition function can be expressed as follows:
The system obtains observation data via the improved YOLOv11 detector, specifically the center coordinates of the detected bounding boxes. Let z
k denote the observation vector at time k. The relationship between the observation z
k, and the state
can be expressed as follows:
where h(.) is the nonlinear observation function, and
represents the observation noise. The observation noise is assumed to follow a zero-mean Gaussian distribution with covariance matrix R
k,
.
Calibration of observation noise covariance matrix. The observation noise
follows a zero-mean Gaussian distribution
, where the covariance matrix R
k quantifies the uncertainty of the YOLOv11 detector output. In this paper, we determine R
k as a constant diagonal matrix via offline calibration. First, run the YOLOv11 detector on a validation video sequence with manually annotated ground-truth pedestrian positions; then, for each frame, compute the residual between the detected bounding box center
and the ground-truth center
; next, calculate the empirical variance of these residuals across all frames and all pedestrians as
and
; finally, set
. On the ETH/UCY data set, we obtain
= 3.2 pixels and
= 2.8 pixels. It should noted that
is not adapted online in our current implementation. However, in the bidirectional feedback mechanism (
Section 2.4), we propagate tracking uncertainty from DeepSORT to adjust the Kalman gain of the EKF, which to some extent achieves dynamic compensation for observation noise.
During algorithm execution, the Kalman filter iteratively performs two steps: prediction and update. In the prediction phase, based on the optimal posterior estimate
, and its error covariance
, from the previous time step, the algorithm predicts the prior state estimate
and its error covariance
for the current time step, using the state transition function. Specifically, the prior state estimate is computed by propagating the posterior estimate through the nonlinear state transition function, while the error covariance is propagated via the Jacobian matrix. The prediction equations are given in (10) and (11), with the Jacobian matrix defined in (12):
For the state transition model expressed in (7), the corresponding Jacobian matrix is derived as follows:
In the update phase, the algorithm fuses the current observation—specifically, the center coordinates of the bounding boxes generated by the YOLOv11 detector—with the predicted state. The Kalman gain is computed to balance the credibility between the prediction and the observation, yielding the optimal posterior state estimate at the current time step. This process is formulated in (14)–(17). Here, H
k denotes the Jacobian matrix of the observation function
(.) evaluated at the prior state estimate
, as defined in (15). The state update and covariance update are then performed according to (16) and (17), as follows:
This process not only effectively compensates for missing observations during short-term occlusion by providing continuous position predictions, but also adaptively responds to sudden changes in pedestrian motion (e.g., acceleration and turning) through dynamic adjustment of the covariance matrices. As a result, it achieves a favorable balance between trajectory smoothness and real-time prediction, laying a solid foundation for subsequent data association and trajectory management. Furthermore, by incorporating process noise and observation noise covariance matrices, the Kalman Filter effectively quantifies the impact of model uncertainties and measurement errors on the prediction results, accounting for the various sources of uncertainty inherent in real-world scenarios.
Once the observation model is established, the EKF performs recursive estimation of the motion state by locally linearizing the nonlinear functions and following the standard Kalman filtering framework. The core steps of this recursive process are illustrated in
Figure 3.
Recursive computation steps of EKF. Given the nonlinear state transition function f(.) and observation function (.), the EKF performs the following recursive steps at each time step . Prediction step: first, predict the current state based on the previous posterior estimate (10); then, compute the state transition Jacobian matrix (12); and finally, predict the covariance (11). Update step (when observation is available): first, compute the observation Jacobian matrix (15); then, compute the innovation covariance and the Kalman gain (14); next, compute the innovation; and finally, update the state (16) and the covariance (17). When no observation is available for a trajectory (e.g., due to occlusion), the update step is skipped, and the predicted state is used as the current estimate, which allows the filter to maintain a trajectory for a limited number of frames, even when the target is temporarily lost.
2.4. Tracking-Prediction Fusion Network
Traditional pedestrian trajectory prediction frameworks typically adopt a unidirectional serial processing paradigm: the detection module first extracts pedestrian locations, the tracking module then maintains identity continuity across frames, and finally the prediction module infers future paths based on the historical trajectories provided by the tracker. However, in complex scenes with severe occlusion, this linear architecture reveals inherent fragility—any identity switching error occurring in the tracking module is directly propagated to the prediction module, while the physically implausible predictions cannot, in turn, correct the tracking decisions. This one-way error propagation creates a vicious cycle of cumulative mistakes, significantly degrading overall system performance.
In standard trackers, trajectory confidence typically depends on the matching quality of appearance features and the short-term consistency of the motion model. However, this evaluation criterion neglects the long-term physical plausibility of the predicted trajectories. To overcome this limitation, we introduce a comprehensive confidence score
derived from the prediction module to refine the predicted trajectories and enhance their physical plausibility. This score is defined as follows:
The weighting coefficients , and are determined via grid search on a validation set. The search range is , subject to . The optimization objective is to minimize the Average Displacement Error (ADE) on a validation set consisting of 20% of the training data, randomly sampled. The optimal values are determined as: .
Here, is the detection confidence (averaged over the last 3 frames where the target was visible), is the prediction smoothness score (based on the negative log-likelihood of the EKF innovation), and is the trajectory historical consistency (ratio of successful associations over the past 10 frames). These weights remain fixed during inference.
When falls below a dynamic threshold, the system triggers a correction mechanism: the priority of the corresponding trajectory is reduced during data association, and stronger kinematic smoothing constraints are imposed on its prediction results.
To further exploit the rich meta-information generated during the tracking process for enhancing prediction accuracy and robustness, we propagate a key input—specifically, the covariance matrix from the Kalman filter embedded in the tracking module—to the prediction module. This enables the Extended Kalman Filter (EKF) in the prediction module to assign a lower weight to observations with high uncertainty during state updates, thereby effectively suppressing noise introduced by short-term occlusions or mismatches. Specifically, the observation noise covariance matrix
obtained from the DeepSORT internal Kalman filter is used to adjust the observation update step of the prediction EKF:
where a larger
corresponds to higher tracking uncertainty, which reduces the Kalman gain
. This causes the system to rely more on its own predictions rather than the current observations, providing a stabilizing effect under severe occlusion.
Moreover, the proposed collaborative mechanism is primarily implemented through algorithmic logic and rules, avoiding heavy end-to-end neural network training, which ensures computational efficiency. Key parameters are optimized via grid search on a small validation set. During inference, the system operates with the bidirectional information flow described above. The processing of each frame follows the cycle: “detection → tracking (integrating prediction feedback) → prediction (utilizing tracking meta-information)”. The final output is a physically plausible future trajectory, jointly optimized by the tracking and prediction modules.
The fusion architecture proposed in this chapter breaks the barrier between tracking and prediction, enabling the system to adapt holistically when confronting severe occlusion challenges. This design lays the core foundation for the superior performance demonstrated. An overview of the integrated system architecture is presented in
Figure 4.