Highlights
What are the main findings?
- Lightweight stereo matching using only bounding box coordinates achieves robust multi-object tracking with a MOTA of 0.932 and an IDF1 of 0.823, outperforming state-of-the-art monocular trackers.
- A dual-tracker design with a re-identification mechanism maintains consistent object identities during occlusions and truncations by leveraging stereo redundancy.
What are the implications of the main findings?
- Resource-efficient 2.5D tracking enables real-time deployment (70 FPS) on standard hardware without expensive 3D reconstruction or dense stereo matching.
- Stereo vision’s inherent redundancy provides a practical solution for robust tracking in challenging real-world scenarios like retail monitoring and autonomous systems.
Abstract
Multi-object tracking faces persistent challenges from occlusions and truncations in monocular vision systems. While stereo vision provides depth information, existing approaches require computationally expensive dense matching or 3D reconstruction. This paper presents a real-time 2.5D stereo multi-object tracking framework combining lightweight stereo matching with resilient tracker management. The stereo matching module employs Direct Linear Transform-based triangulation using only bounding box coordinates, eliminating costly feature extraction while maintaining robust correspondence through geometric constraints. A dual-tracker architecture maintains independent trackers in both views, enabling re-identification when objects become occluded in one view but remain visible in the other. Experimental validation on a refrigerator monitoring dataset demonstrates that StereoSORT achieves a multiple object tracking accuracy (MOTA) of 0.932 and an identification F1 score (IDF1) of 0.823, substantially outperforming monocular trackers, including OC-SORT (IDF1: 0.765) and ByteTrack (IDF1: 0.609). The system achieves a 50.1 mm median depth error, comparable to commercial sensors, while maintaining 70 FPS on standard hardware. These results validate that geometric constraints alone enable robust stereo tracking without appearance features, offering a practical solution for resource-constrained environments where computational efficiency and tracking reliability are equally critical.
1. Introduction
Multi-object tracking (MOT) is a fundamental technique in tracking and perception that aims to estimate the states of multiple objects over time while preserving their unique identities (IDs) []. MOT has been extensively studied in the broader tracking community through classical approaches, such as Multiple Hypothesis Tracking (MHT) [] and Joint Probabilistic Data Association (JPDA) [], which address data association under uncertainty. With the development of deep learning-based object detectors, MOT has evolved within computer vision under the tracking-by-detection paradigm [], where objects are detected in each frame and temporally associated. Within this paradigm, three-dimensional (3D) MOT [,,] has become increasingly important in diverse real-world domains, ranging from autonomous driving and robotic navigation to surveillance and augmented reality [,,,]. While monocular vision systems [,] have shown remarkable progress with the advancement of deep learning techniques, they inherently lack depth information, limiting their ability to provide accurate 3D localization []. Compared to LiDAR-based systems [,], which provide highly accurate depth measurements but require expensive hardware and significant power consumption, stereo vision systems offer a cost-effective alternative for acquiring depth information via triangulation, making them particularly attractive for real-time 3D tracking where cost and power efficiency are critical [,].
The integration of deep learning-based object detection with stereo vision has enabled true 3D tracking capabilities beyond the limitations of image-based 2D tracking. Recent state-of-the-art detectors [] achieve high accuracy in detecting and classifying objects with 2D bounding box localization, and stereo vision can transform these 2D detections into 3D spatial information through depth estimation. The primary challenge in stereo vision-based tracking lies in establishing reliable correspondences between detected objects across stereo image pairs. This stereo matching process is critical for accurate depth estimation and subsequent 3D localization through triangulation [,]. Beyond accurate depth recovery, stereo vision also offers an inherent advantage over monocular systems: the disparity between viewpoints can mitigate occlusions or truncations in one view by leveraging better visibility in the other view.
However, despite recent advances in deep learning-based object detection, these systems remain susceptible to misdetection, spurious detections, misclassification, occlusion, and truncation, particularly in complex real-world scenarios []. These detection errors can significantly degrade stereo matching accuracy, leading to unreliable depth estimation. When objects are partially occluded or truncated at image boundaries, traditional tracking systems struggle to maintain consistent object IDs, often leading to tracking failures and ID switches [,]. While stereo vision’s viewpoint redundancy offers theoretical advantages, conventional stereo matching algorithms fail to fully exploit this potential when at least one view experiences detection errors. Traditional approaches attempt to mitigate these issues through additional shape similarity measures or feature descriptors [,], but such methods significantly increase computational overhead, making them impractical for resource-constrained environments.
The gap between the theoretical advantages of stereo vision and its practical limitations motivates the need for a more efficient approach. Rather than pursuing computationally expensive full 3D reconstruction [], we recognize that many practical applications primarily require accurate depth-aware localization [] rather than complete geometric modeling []. This insight leads us to adopt a 2.5D representation that combines 2D bounding boxes with depth information, eliminating the computational burden of 3D bounding box regression while maintaining essential spatial awareness for real-world applications.
In this paper, we propose a real-time 2.5D multi-object tracking system that achieves robust performance against occlusions and truncations through a novel stereo matching algorithm. Our system consists of two core components: a lightweight stereo matching module and a robust stereo tracking module. The stereo matching module evaluates all possible object correspondences between stereo image pairs by computing depth information through triangulation based on bounding box coordinates. For each potential match, we apply disparity-based image warping to align the bounding boxes and calculate their Intersection over Union (IoU) scores. The Hungarian algorithm then determines the optimal multi-object correspondence configuration, ensuring globally consistent matching without requiring computationally expensive feature extraction. The stereo tracking module creates independent trackers for each matched object in both stereo views, operating in parallel to maintain tracking continuity. This dual-tracker approach ensures that object tracking persists as long as at least one tracker remains active, effectively handling scenarios where objects become occluded or truncated in one view but remain visible in the other.
The main contributions of this work are threefold. First, we introduce a resource-aware stereo matching algorithm that achieves high matching accuracy using only detected bounding box coordinates without computing shape-based similarity or extracting appearance features. This minimalistic approach dramatically reduces computational overhead while maintaining robust correspondence-matching performance through the synergistic use of geometric constraints and disparity-driven spatial alignment, proving that these fundamental stereo vision principles are sufficient for reliable object matching. Second, we develop a resilient stereo tracking management system that can be seamlessly integrated with most existing object tracking algorithms. Our dual-tracker design provides exceptional robustness: when one tracker is lost due to detection errors, the system continues tracking with the remaining tracker, and once the detection issue is resolved, the lost tracker can be automatically recovered and synchronized with the existing one. This recovery mechanism ensures tracking continuity even under frequent detection failures. Third, we demonstrate that our 2.5D representation, combining the simplicity of 2D bounding boxes with essential depth information, provides practical advantages over full 3D MOT systems. Extensive experiments validate that our approach achieves comparable or superior tracking performance to state-of-the-art methods while requiring significantly fewer computational resources, enabling real-time operation on standard hardware platforms.
The remainder of this paper is organized as follows. Section 2 reviews related work on stereo tracking, object detection, stereo matching, and visibility handling. Section 3 presents our proposed methodology, including the detailed design of the lightweight stereo matching module and the resilient stereo tracking module. Section 4 describes the implementation details (experimental setup and datasets) and presents comprehensive experimental results and analysis, including comparisons with state-of-the-art methods and computational efficiency evaluations. Section 5 discusses the implications of our findings and limitations. Finally, Section 6 concludes the paper and outlines future research directions.
2. Related Work
2.1. Stereo Vision-Based Object Tracking
Stereo vision-based object tracking, as defined by Geiger et al. [] in the KITTI benchmark, employs synchronized stereo image pairs to continuously localize and identify objects over time. This approach exploits binocular disparity for depth extraction through triangulation between corresponding features, enabling 3D spatial understanding without active sensors.
Li et al. [] proposed joint spatial–temporal optimization for stereo 3D object tracking, combining deep neural network detection with geometric bundle adjustment. Their method regresses initial 3D bounding boxes and predicts dense object cues, including local depth and coordinates. While achieving state-of-the-art performance on KITTI, the dense cue prediction and iterative optimization limit real-time capability.
Karaev et al. [] introduced DynamicStereo for consistent dynamic depth estimation from stereo videos, focusing on temporal consistency through transformer architectures. Their framework handles dynamic scenes effectively but requires computationally expensive temporal aggregation networks. Additionally, the reliance on learned temporal patterns fails when objects undergo rapid motion changes or severe occlusions.
Zhang et al. [] developed TemporalStereo, an efficient spatial–temporal stereo matching network for video sequences. Their coarse-to-fine approach leverages sparse cost volume and past geometry information to enhance matching accuracy. However, the method requires extensive training data and struggles with domain adaptation to new environments. The end-to-end learning paradigm also lacks interpretability when failures occur.
2.2. Deep Learning-Based Object Detection
Faster R-CNN (Region-based Convolutional Neural Network) by Ren et al. [] established the two-stage detection paradigm, with Region Proposal Networks (RPNs) sharing convolutional features for end-to-end training. While it achieves the high accuracy crucial for tracking systems, the two-stage pipeline limits real-time performance on resource-constrained platforms.
The YOLO series by Redmon et al. [] revolutionized object detection by reformulating it as single-stage regression, eliminating the time-consuming region proposal step of two-stage detectors. This unified architecture processes entire images in a single forward pass, achieving real-time performance. YOLO has evolved rapidly through numerous versions, with recent implementations becoming the de facto standard for real-time detection applications. Despite its widespread adoption and continuous improvements, localization accuracy limitations and small object detection challenges persist, particularly in cluttered scenes.
2.3. Depth Estimation and Stereo Matching Algorithms
Scharstein and Szeliski [] defined the four-stage stereo correspondence pipeline: matching cost computation, aggregation, disparity optimization, and refinement. They formalized triangulation-based depth recovery from disparity and established the Middlebury benchmark for algorithm evaluation.
Hirschmüller’s Semi-Global Matching (SGM) [] remains the standard for dense disparity estimation, approximating global energy minimization through multiple one-dimensional path optimizations. While achieving accurate depth in textured regions with balanced computational efficiency, SGM fails in textureless areas and requires substantial resources for high-resolution processing.
Pyramid Stereo Matching Network (PSMNet) by Chang and Chen [] employs spatial pyramid pooling and stacked hourglass architecture for multi-scale context aggregation. Despite state-of-the-art accuracy on benchmarks, the significant processing time per stereo pair prohibits real-time applications, and global context assumptions fail for dynamic scenes.
2.4. Multi-Object Tracking Under Occlusion and Truncation
The MOT16 benchmark by Milan et al. [] established standardized evaluation protocols for occlusion and truncation handling in crowded scenes. The benchmark reveals occlusion as the primary performance bottleneck, with traditional motion and appearance-based methods failing in dense scenarios with similar objects.
DeepSORT by Wojke et al. [] extends Simple Online and Realtime Tracking (SORT) with deep appearance features from person re-identification networks, combining Kalman filtering with learned visual representations for data association. While reducing ID switches in occlusion scenarios, the per-object feature extraction overhead limits scalability, and appearance features degrade severely at truncated image boundaries.
ByteTrack by Zhang et al. [] recovers low-confidence detections typically discarded by other trackers, recognizing that occluded objects often produce weak detection responses. This strategy improves trajectory completeness but increases false positives in cluttered scenes and fails during extended complete occlusions due to temporal dependency.
Observation-Centric SORT (OC-SORT) by Cao et al. [] introduces Observation-Centric Momentum (OCM) to maintain velocity consistency from historical observations rather than noisy motion predictions, achieving real-time performance and reducing ID switches. Despite improved robustness to detection noise, it remains vulnerable to extended occlusions and lacks appearance features for disambiguating similar objects.
Weng et al. [] demonstrated with AB3DMOT (A Baseline for 3D Multi-Object Tracking) that 3D MOT using 3D Kalman filtering and IoU association inherently handles occlusions through physical constraints. However, the dependency on accurate 3D detectors or expensive 3D bounding box regression limits practical deployment, as these detection requirements impose substantial computational overhead independent of the tracking algorithm itself.
Beyond traditional tracking-by-detection frameworks, Random Finite Set (RFS) methods [] offer a principled probabilistic framework for MOT by handling both detection uncertainty and data association simultaneously. In particular, the Labeled RFS (LRFS) framework [] extends RFS theory to maintain object identities over time and has demonstrated competitive performance through Gibbs sampling-based data association under occlusion and re-identification scenarios. Similar to our proposed method, LRFS also follows the tracking-by-detection paradigm by utilizing detection outputs as measurement inputs for state estimation. However, RFS-based approaches generally require iterative probabilistic inference, which increases computational complexity and limits real-time deployment. In contrast, this work focuses on achieving robust identity preservation in real-time by leveraging stereo geometric consistency without heavy probabilistic modeling.
3. Materials and Methods
3.1. Overview of the Proposed Framework
This study aims to develop a robust 2.5D multi-object tracking technology that is resilient to the object detection and tracking errors commonly encountered in real-world stereo vision applications. While achieving perfect accuracy in object detection and tracking remains practically impossible due to occlusions, illumination variations, motion blur, and appearance ambiguities, our framework addresses these challenges by leveraging the redundancy of stereo vision systems: when detection or tracking errors occur in one view, the complementary information from the other view can maintain tracking continuity. The framework employs geometric consistency verification through stereo triangulation and disparity-based validation to handle uncertainties in both detection and stereo matching while utilizing temporal information to predict object locations and recover from temporary failures [,].
The proposed framework consists of three sequential stages: (1) object detection from stereo image pairs, using deep learning models to identify objects in both left and right images, (2) stereo matching for optimal object pairing across the stereo image pair, computing 2.5D positions through a combination of geometric triangulation and image disparity, and (3) stereo tracking with temporal data association, maintaining object IDs across frames and handling temporary occlusions through re-identification mechanisms. These stages are integrated through a feedback loop where tracking predictions enhance subsequent detection and matching processes []. Figure 1 illustrates the overall system architecture, showing the data flow through the three stages and the feedback mechanisms that enable robust tracking performance. Detailed descriptions of each stage are provided in the following subsections.
Figure 1.
Overall architecture of the proposed 2.5D multi-object tracking framework. The system consists of three sequential stages: (1) Object Detection from Stereo Image Pairs, where independent detection is performed on synchronized left (L) and right (R) images; (2) Stereo Matching for Optimal Object Pairing, which establishes correspondences between detected objects using Direct Linear Transform (DLT)-based triangulation and IoU validation (highlighted in yellow); and (3) Stereo Tracking with Temporal Data Association, which maintains object IDs across frames through tracker management (highlighted in yellow), motion estimation, and stereo track association map updates. The feedback loop from tracking predictions enhances detection and matching in subsequent frames.
3.2. Stage 1: Object Detection from Stereo Image Pairs
The first stage performs object detection on synchronized stereo image pairs captured from a calibrated stereo camera system. Here, the term ‘stereo image pair’ refers only to the left and right images acquired simultaneously at each frame; object detection is applied independently to each image without using stereo relationships. As shown in Figure 1, an object detector is applied independently to each image, enabling parallel processing that preserves objects visible in only one view due to occlusions or limited field overlap. This independent detection strategy provides redundancy that enhances system robustness against single-view detection failures.
The detection model outputs, for each identified object, a class label, bounding box coordinates (x, y, width, and height), and a confidence score. These confidence scores are subsequently utilized in the stereo matching stage to weight matching decisions and resolve ambiguous correspondences. The state-of-the-art detection architecture ensures robust performance across varying environmental conditions. The object detection process is described in Algorithm 1 using pseudocode notation.
| Algorithm 1: Stereo Object Detection (Pseudocode) |
| Input: Stereo image pair (IL, IR) at time t Output: Detection sets DL and DR 1: procedure STEREO_OBJECT_DETECTION(IL, IR) 2: // Synchronize and preprocess images 3: IL_sync, IR_sync ← SYNCHRONIZE(IL, IR) 4: 5: // Initialize the detection sets 6: DL ← ∅, DR ← ∅ 7: 8: // Detection on left image 9: DL ← DETECT_OBJECTS(IL_sync) 10: for each detection dL in DL do 11: dL.class ← CLASSIFY(dL) 12: dL.bbox ← (xL, yL, wL, hL) 13: dL.conf ← CONFIDENCE_SCORE(dL) 14: end for 15: 16: // Detection on right image 17: DR ← DETECT_OBJECTS(IR_sync) 18: for each detection dR in DR do 19: dR.class ← CLASSIFY(dR) 20: dR.bbox ← (xR, yR, wR, hR) 21: dR.conf ← CONFIDENCE_SCORE(dR) 22: end for 23: 24: return DL, DR 25: end procedure |
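For concreteness, the sketch below shows how this stage might be realized in Python. It assumes an Ultralytics-style YOLO interface for batched inference on the synchronized pair (any bounding-box detector would do), and the `Detection` container simply mirrors the fields listed above; it is an illustrative sketch under these assumptions, not the exact implementation.

```python
from dataclasses import dataclass

import numpy as np
from ultralytics import YOLO  # assumed detector backend; any bbox detector works


@dataclass
class Detection:
    cls: int          # class label
    bbox: np.ndarray  # (x, y, w, h) in pixels
    conf: float       # detection confidence score


def detect_stereo_pair(model: YOLO, img_left: np.ndarray, img_right: np.ndarray):
    """Run the detector independently on a synchronized left/right image pair."""
    results = model([img_left, img_right], verbose=False)  # batched inference
    per_view = []
    for res in results:  # one result object per image
        dets = []
        for box in res.boxes:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            dets.append(Detection(cls=int(box.cls),
                                  bbox=np.array([x1, y1, x2 - x1, y2 - y1]),
                                  conf=float(box.conf)))
        per_view.append(dets)
    return per_view[0], per_view[1]  # D_L, D_R
```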
3.3. Stage 2: Stereo Matching for Optimal Object Pairing
The second stage implements a stereo matching module for optimal object pairing through Direct Linear Transform (DLT)-based triangulation and stereo consistency verification, as illustrated in Figure 2. Unlike traditional appearance-based matching methods that rely on visual features, our approach leverages stereo geometry to validate object correspondences, making it more robust to appearance variations and illumination changes.
Figure 2.
Stereo matching process using DLT-based triangulation and IoU validation. The method evaluates all possible object combinations between left (L) and right (R) detections. For each candidate pair, DLT-based triangulation estimates the 3D position assuming correspondence, from which the expected disparity is computed. Bounding boxes are then warped using the calculated disparity, and stereo consistency is verified through IoU evaluation. The cost matrix C with elements Cij = −IoUij is constructed for all combinations, and the Hungarian algorithm determines optimal assignments. High IoU values indicate correct matches (paired), while low IoU values suggest incorrect correspondences (unpaired).
Given all possible object combinations between left and right detections (DL and DR from Stage 1), the algorithm performs DLT-based triangulation [] for each candidate pair under the assumption that they represent the same physical object. The DLT method estimates the 3D position by solving an overdetermined system derived from the projection matrices of both cameras. From the triangulated 3D position (X, Y, Z), the expected disparity d (shown in Figure 2) is computed using the fundamental stereo geometry relationship d = b·f/Z, where b is the baseline and f is the focal length [].
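To make the triangulation step concrete, the following minimal sketch (assuming the 3×4 projection matrices P_L and P_R obtained from calibration and rectification) solves the DLT system for one candidate pair of bounding-box centers and converts the recovered depth into the expected disparity.

```python
import numpy as np


def triangulate_dlt(P_L, P_R, pt_L, pt_R):
    """DLT triangulation of one point pair from two 3x4 projection matrices."""
    xL, yL = pt_L
    xR, yR = pt_R
    # Each image point contributes two rows to the homogeneous system A X = 0.
    A = np.stack([
        xL * P_L[2] - P_L[0],
        yL * P_L[2] - P_L[1],
        xR * P_R[2] - P_R[0],
        yR * P_R[2] - P_R[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # (X, Y, Z) in the reference camera frame


def expected_disparity(Z, baseline, focal_length):
    """d = b * f / Z for a rectified stereo pair."""
    return baseline * focal_length / Z
```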
Using the computed disparity, each bounding box is warped to the opposite view through disparity-based image warping []. The stereo consistency is then evaluated by computing the IoU between the warped bounding box and the actual detection in the target image. This IoU metric quantifies the geometric consistency of the hypothesized correspondence []. Figure 2 demonstrates how high IoU values indicate correct matches (paired objects), while low IoU values suggest incorrect correspondences (unpaired objects).
A cost matrix C is constructed in which each element Cij = −IoUij represents the negative IoU score between the i-th detection in the left image and the j-th detection in the right image. The negative sign transforms the maximization problem into a minimization problem suitable for the Hungarian algorithm []. The optimal assignment A* = Hungarian(C) yields the one-to-one correspondence that maximizes the total IoU score. Pairs with high IoU values indicate correct matches, forming stereo object pairs (matched bounding boxes from the left and right images that correspond to the same physical object). In contrast, low IoU values suggest incorrect correspondences. The stereo matching process is described in Algorithm 2 using pseudocode notation.
| Algorithm 2: Stereo Object Matching (Pseudocode) |
| Input: Detection sets DL and DR, camera parameters (b, f) Output: Stereo object pairs P and their depths Z 1: procedure STEREO_MATCHING(DL, DR, b, f) 2: // Initialize cost matrix 3: C ← zeros(|DL|, |DR|) 4: Z_temp ← zeros(|DL|, |DR|) 5: 6: // Compute matching costs for all pairs 7: for i = 1 to |DL| do 8: for j = 1 to |DR| do 9: // Assume correspondence and estimate depth 10: Z_temp[i,j] ← TRIANGULATE(DL[i].center, DR[j].center, b, f) 11: 12: // Calculate expected disparity 13: d ← b × f/Z_temp[i,j] 14: 15: // Warp bounding box and compute IoU 16: bbox_warped ← WARP_BBOX(DL[i].bbox, d) 17: C[i,j] ← −IoU(bbox_warped, DR[j].bbox) 18: end for 19: end for 20: 21: // Find optimal assignment using Hungarian algorithm 22: P ← HUNGARIAN_ASSIGNMENT(C) 23: 24: // Filter matches by threshold and assign depths 25: for each match (i,j) in P do 26: if C[i,j] > −τ then //Since C contains negative IoU values 27: Remove (i,j) from P 28: else 29: Z[i,j] ← Z_temp[i,j] 30: end if 31: end for 32: 33: return P, Z 34: end procedure |
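Below is a compact Python sketch corresponding to Algorithm 2. It assumes rectified images (so warping a left box to the right view reduces to a horizontal shift by the expected disparity), detections carrying a `bbox` attribute as in the earlier sketch, and reuses `triangulate_dlt` and `expected_disparity` from the previous sketch; `scipy.optimize.linear_sum_assignment` plays the role of the Hungarian solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def warp_bbox(bbox, disparity):
    """Shift a left-view (x, y, w, h) box into the right view along the epipolar line."""
    x, y, w, h = bbox
    return np.array([x - disparity, y, w, h])


def iou(a, b):
    """IoU of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def bbox_center(bbox):
    x, y, w, h = bbox
    return (x + w / 2.0, y + h / 2.0)


def stereo_match(dets_L, dets_R, P_L, P_R, baseline, focal, iou_thresh=0.01):
    """Return matched (i, j) index pairs and their triangulated depths."""
    cost = np.zeros((len(dets_L), len(dets_R)))
    depth = np.zeros_like(cost)
    for i, dL in enumerate(dets_L):
        for j, dR in enumerate(dets_R):
            # Hypothesize correspondence, triangulate, and derive expected disparity.
            X = triangulate_dlt(P_L, P_R, bbox_center(dL.bbox), bbox_center(dR.bbox))
            depth[i, j] = X[2]
            d = expected_disparity(X[2], baseline, focal)
            cost[i, j] = -iou(warp_bbox(dL.bbox, d), dR.bbox)  # C_ij = -IoU_ij
    rows, cols = linear_sum_assignment(cost)                   # Hungarian assignment
    pairs = [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_thresh]
    return pairs, {(i, j): depth[i, j] for i, j in pairs}
```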
3.4. Stage 3: Stereo Tracking with Temporal Data Association
The third stage performs stereo tracking with temporal data association to maintain consistent object IDs across frames while leveraging stereo correspondence information. As input, this stage receives the detected bounding boxes (DL and DR) and stereo object pairs (P) from the previous stages.
The tracking process begins with predicting the positions of tracked objects in the current frame using motion estimators that incorporate historical trajectory information []. Temporal data association then establishes correspondences between predicted tracker positions and detected objects using cost functions that incorporate IoU-based matching [], the Hungarian algorithm [], Kalman filter predictions [], and feature-based similarity measures [], among others. Notably, in the proposed framework, motion estimation and temporal data association are performed independently for each camera; that is, separate tracker instances operate on the left and right image streams, respectively. These trackers are then linked through the stereo track association map (denoted as T in Figure 3 and Algorithm 3), which maintains cross-view correspondences for the same physical object.
Figure 3.
Flowchart of stereo tracker management in temporal data association. The process handles three scenarios based on association results: (1) Associated Tracker–Object Pairs, where trackers are updated with actual detection positions and added to both tracker update and stereo tracker pair lists; (2) Unassociated Objects, where new trackers are created with either re-identified IDs (if corresponding tracker exists in T) or new IDs, then added to appropriate lists; and (3) Unassociated Trackers, where age-based decisions determine whether to increment age, maintain without updating, or delete trackers based on their presence in the stereo track association map (T) and paired tracker status. P denotes stereo object pairs from the matching stage, and T represents the stereo track association map. The outputs include an updated stereo tracker pair list and tracker update list, which feed into the stereo track association map update and motion estimator update modules.
Figure 3 presents a detailed flowchart of the tracker management process, which follows three primary scenarios based on association results:
- Unassociated Trackers: When existing trackers fail to associate with any detected object, the system increments their age counter if the age is below the maximum threshold. As shown in the right branch of Figure 3, for trackers whose age equals or exceeds the maximum threshold, the system checks their status in the stereo track association map (denoted as T in Figure 3 and Algorithm 3). If the tracker exists in the map and its paired tracker also has an age equal to or exceeding the maximum threshold, both trackers are deleted. If the paired tracker’s age is below the threshold, the current tracker is maintained without updates to preserve stereo consistency. Trackers not included in the map are immediately deleted when their age reaches the maximum threshold [].
- Unassociated Objects: The middle branch of Figure 3 illustrates the handling of detected objects that lack tracker associations. The system first verifies their stereo pairing status from P (denoted in Algorithm 2 and Figure 3). Unpaired objects generate new independent trackers. For stereo object pairs, if their corresponding object has an existing tracker in the stereo track association map, a new tracker is created for the object but inherits the ID from the previously tracked object, implementing re-identification that maintains consistent ID after temporary occlusions or truncations while starting fresh motion estimation from the current position []. Otherwise, a new tracker for the object is initialized with a new ID.
- Associated Tracker–Object Pairs: The left branch of Figure 3 shows that successfully associated pairs replace the predicted tracker position with the actual detected object position in the current frame. Specifically, while the motion estimator provides predicted positions based on historical trajectory information, the association confirms the actual object location, allowing the tracker state to be updated with this observed position rather than relying on the prediction [].
After the tracker management phase, as indicated in the bottom section of Figure 3, the system performs two distinct update operations. First, motion estimators are updated for all trackers that have associated objects, incorporating the current frame’s detection positions into their trajectory models for improved future predictions. Second, the stereo track association map is updated based on the current stereo object pairs from P, establishing new tracker correspondences or modifying existing ones to maintain consistency between left and right view trackers that follow the same physical object.
The modular design in tracker management allows integration with existing tracking algorithms for temporal data association and motion estimation components, providing flexibility in implementation while maintaining the benefits of stereo-aware tracking []. The complete stereo tracking process is described in Algorithm 3 using pseudocode notation.
| Algorithm 3: Stereo Tracking (Pseudocode) |
| Input: Detection sets DL and DR, Stereo object pairs P, Tracker sets TL and TR, Association Map T (list) Output: Updated trackers and association map 1: procedure STEREO_TRACKING(DL, DR, P, TL, TR, T) 2: // Predict tracker positions for all trackers 3: for each tracker t in TL ∪ TR do 4: t.predicted ← PREDICT(t.estimator) 5: end for 6: 7: // Temporal data association (returns tracker-object pairs) 8: AL ← ASSOCIATE(TL, DL) //Associated tracker-object pairs for left 9: AR ← ASSOCIATE(TR, DR) //Associated tracker-object pairs for right 10: 11: // Process unassociated trackers 12: for each tracker t in TL ∪ TR do 13: if t not in TRACKERS(AL ∪ AR) then //t is not in associated trackers 14: if t.age ≥ max_age then 15: if EXISTS_IN_MAP(t, T) then 16: paired_t ← GET_PAIRED_TRACKER(t, T) 17: if paired_t.age ≥ max_age then 18: DELETE(t) 19: DELETE(paired_t) 20: end if 21: //Otherwise, maintain tracker without update 22: else 23: DELETE(t) 24: end if 25: else 26: t.age ← t.age + 1 27: end if 28: end if 29: end for 30: 31: // Handle unassociated objects 32: for each object o not in OBJECTS(AL ∪ AR) do //o is not in associated objects 33: if HAS_STEREO_PAIR(o, P) then 34: if PAIRED_TRACKER_EXISTS(o, P, T) then 35: paired_tracker ← GET_PAIRED_TRACKER(o, P, T) 36: // Create new tracker with re-identified ID 37: t_new ← CREATE_TRACKER(o) 38: t_new.id ← paired_tracker.id //Re-identification 39: t_new.age ← 0 40: if o ∈ DL then ADD_TO_TL(t_new) else ADD_TO_TR(t_new) 41: else 42: t_new ← CREATE_TRACKER(o) 43: t_new.id ← GENERATE_NEW_ID() 44: t_new.age ← 0 45: if o ∈ DL then ADD_TO_TL(t_new) else ADD_TO_TR(t_new) 46: end if 47: else 48: t_new ← CREATE_TRACKER(o) 49: t_new.id ← GENERATE_NEW_ID() 50: t_new.age ← 0 51: if o ∈ DL then ADD_TO_TL(t_new) else ADD_TO_TR(t_new) 52: end if 53: end for 54: 55: // Update associated trackers with actual detected positions 56: for each (tracker t, object o) in AL ∪ AR do 57: // Replace predicted position with actual detection 58: t.position ← o.position 59: t.age ← 0 //Reset age for associated trackers 60: UPDATE_TRACKER_STATE(t, o) 61: end for 62: 63: // Update motion estimators for associated and new trackers only 64: for each tracker t in TL ∪ TR do 65: if (t.age = 0) then //Associated or newly created trackers 66: UPDATE_MOTION_ESTIMATOR(t) 67: end if 68: end for 69: 70: // Update stereo track association map (list structure) 71: T ← UPDATE_STEREO_ASSOCIATIONS(P, TL, TR, T) 72: 73: return TL, TR, T 74: end procedure |
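The re-identification rule in the middle branch of Figure 3 reduces to a small amount of code. The sketch below is a simplified illustration (the `Tracker` container and ID counter are illustrative constructs, not the exact data structures used in the implementation): a tracker created for an unassociated but stereo-paired object inherits the ID of the surviving tracker in the opposite view.

```python
import itertools
from dataclasses import dataclass, field

_new_id = itertools.count()


@dataclass
class Tracker:
    bbox: tuple                                   # last observed (x, y, w, h)
    id: int = field(default_factory=lambda: next(_new_id))
    age: int = 0                                  # frames since last association


def spawn_tracker_for_unassociated(obj_bbox, paired_tracker=None):
    """Create a tracker for a detection that no existing tracker claimed.

    `paired_tracker` is the still-alive tracker in the opposite view that the
    stereo track association map links to this object, or None if no such
    tracker exists (unpaired object or empty map entry).
    """
    t = Tracker(bbox=obj_bbox)
    if paired_tracker is not None:
        t.id = paired_tracker.id                  # re-identification: inherit the ID
    return t


# Example: a hand reappears in the left view while its right-view tracker survived.
right_tracker = Tracker(bbox=(320, 180, 60, 80))
left_tracker = spawn_tracker_for_unassociated((300, 175, 58, 82), right_tracker)
assert left_tracker.id == right_tracker.id        # consistent ID across views
```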
4. Results
4.1. Experimental Setup
The experimental platform for validating the proposed 2.5D multi-object tracking framework consists of a stereo vision system, computational hardware, and software environment configured for real-time processing.
4.1.1. Stereo Vision System Configuration
The stereo vision system was constructed using two identical RGB cameras, each equipped with a HU205 image sensor (Huentek Co., Ltd., Gwangju, Republic of Korea) and HULE51300 lens (Huentek Co., Ltd.). As illustrated in Figure 4, the stereo cameras (marked as L for left and R for right) were mounted at the top of a commercial refrigerator, oriented downward to capture the interior workspace. The cameras were mounted on a rigid baseline with parallel optical axes to ensure consistent stereo geometry. The stereo camera system was calibrated using the standard checkerboard calibration method [] to obtain intrinsic parameters (focal length, principal point, distortion coefficients) and extrinsic parameters (rotation and translation between cameras). The baseline distance was set to 280 mm to provide an optimal balance between depth resolution and field of view overlap for indoor environments. The yellow triangular regions outlined by red and cyan dashed lines in Figure 4 represent the individual fields of view of the left and right cameras, respectively, with their overlapping area defining the effective stereo vision workspace.
Figure 4.
Experimental setup and data processing pipeline for 2.5D multi-object tracking validation. The left panel shows the physical configuration with stereo RGB cameras (L: left, R: right) and a depth camera (M: middle) mounted at the top of a commercial refrigerator. The yellow triangular regions outlined by red and cyan dashed lines indicate the fields of view for the left and right cameras, respectively. The right panel illustrates the image allocation and processing workflow: stereo RGB camera images are processed through the tracking system to generate 3D trajectories, which are then compared with ground truth from the depth camera for both tracking performance evaluation and 3D trajectory validation.
Additionally, an Intel RealSense depth camera D435f (Intel Corp., Santa Clara, CA, USA), marked as M in Figure 4, was installed between the stereo cameras for validation purposes, as detailed in Section 4.4. This configuration enables direct comparison between the proposed stereo-based depth estimation and commercial depth sensing technology within the same workspace.
4.1.2. Computational Platform
All experiments were conducted on a workstation with an Intel Core i9-12900K CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU. The framework was implemented using Python 3.8, CUDA 11.8, and PyTorch 2.0. The system was designed to process stereo video streams at 30 frames per second (FPS), with the GPU handling object detection inference while the CPU managed stereo matching, temporal association, and tracker management tasks in parallel. This heterogeneous computing approach maximized throughput and minimized latency in the tracking pipeline.
4.2. Dataset Preparation
4.2.1. Custom Dataset Construction
To comprehensively validate the proposed 2.5D stereo tracking framework, a custom stereo dataset tailored to commercial refrigerator environments was constructed. The dataset was designed with three primary considerations in mind. First, public datasets are collected under varying camera setups and calibration conditions and therefore exhibit geometric inconsistencies across sequences. Such variability makes it difficult to isolate algorithmic performance from dataset-induced bias. In contrast, our custom dataset ensures consistent geometric properties by employing an identical stereo configuration and unified calibration and rectification across all sequences.
Second, quantitative evaluation of 3D trajectory accuracy requires depth ground truth that is spatially and temporally aligned with stereo image pairs. However, existing public datasets either provide only 2D annotations or rely on LiDAR depth that is not synchronized with stereo inputs. To address this limitation, our dataset includes ground-truth depth obtained by synchronizing a stereo camera system with a calibrated depth sensor, enabling accurate 3D validation.
Third, the dataset was designed to reflect realistic interaction scenarios with frequent occlusions and truncations, which commonly occur in hand–object interaction tasks but are underrepresented in public benchmarks designed primarily for autonomous driving (e.g., KITTI). By incorporating near-field interactions and dense object clutter, the dataset enables rigorous evaluation of the robustness of the stereo-based identity preservation mechanism.
The dataset comprises three subsets for different validation purposes: (1) dataset for object detection model training and validation (Section 4.2.2), (2) dataset for 3D position accuracy evaluation (Section 4.4), and (3) dataset for tracking performance evaluation (Section 4.5).
4.2.2. Data Collection and Annotation
Data were collected using both cameras of the stereo vision system, and hand instances in each view were annotated with bounding boxes. The annotation process was conducted using Roboflow [], a web-based annotation platform that provides efficient tools for bounding box labeling and dataset management. The collected dataset comprises 26,226 images with 52,934 annotated hand instances, divided into training (20,991 images, 80%), validation (4207 images, 16%), and test (1028 images, 4%) sets.
All images were captured at 640 × 480 resolution. The dataset captures various hand gestures, including reaching, grasping, and retrieving products at different shelf locations within the 300–2000 mm working range, exhibiting significant scale variations. The dataset includes challenging detection scenarios such as partial occlusions caused by refrigerator products and shelves during hand–product interactions, as well as truncations at image boundaries when hands enter or exit the field of view.
4.3. Implementation Details
4.3.1. Object Detection Model Selection and Training
For the object detection component of the 2.5D multi-object tracking framework, YOLOv11-Large [] was selected as the deep learning-based detection model. While YOLOv12 with attention mechanisms represents the latest advancement in the YOLO series, YOLOv11 was chosen for its faster inference speed. Comparative benchmarks validate this selection []: YOLOv11-Large achieves accuracy (53.3% mAP@0.5:0.95) comparable to YOLOv12-Large (53.7% mAP@0.5:0.95, +0.4 percentage points) while providing approximately 9% faster inference (6.2 ms vs. 6.77 ms), confirming that YOLOv11-Large offers an optimal balance between accuracy and speed.
The training process achieved exceptional detection performance, with the validation set yielding a mean Average Precision of 0.993 at IoU threshold 0.5 (mAP@0.5) and 0.794 for the stricter mAP@0.5:0.95 metric. These results were consistently reproduced on the test set, achieving mAP@0.5 of 0.993 and mAP@0.5:0.95 of 0.789, demonstrating robust generalization without overfitting. The high mAP@0.5 indicates excellent detection reliability, which is crucial for tracking initialization, while the mAP@0.5:0.95 score confirms precise bounding box localization, which is necessary for accurate stereo matching.
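For reference, training such a model with the Ultralytics toolkit would look roughly like the snippet below; the dataset file name and hyperparameters shown are placeholders for illustration, not the exact settings used in this work.

```python
from ultralytics import YOLO  # assumed training interface

# Placeholder dataset and hyperparameters, shown only to illustrate the workflow.
model = YOLO("yolo11l.pt")                      # YOLOv11-Large pretrained weights
model.train(data="refrigerator_hands.yaml",     # custom hand dataset in YOLO format
            imgsz=640, epochs=100, batch=16)
metrics = model.val()                           # reports mAP@0.5 and mAP@0.5:0.95
```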
4.3.2. Tracking Algorithm Implementation
The temporal data association and motion estimation components were implemented by adapting OC-SORT [], leveraging its Observation-Centric Momentum (OCM) and Observation-Centric Reassociation (OCR) modules. The algorithm significantly reduces association errors through its OCM module, which maintains object momentum during nonlinear motion by computing velocity consistency from historical trajectories. This observation-centric approach provides better noise robustness than traditional state-space models, as it directly leverages detection observations rather than relying solely on motion predictions. In addition, OCR effectively reduces ID switches during temporary occlusions by re-associating unmatched detections with each track's most recent observation. Furthermore, OC-SORT's modular design enables seamless integration with various detection backends without extensive parameter tuning, making it well suited to the proposed stereo tracking framework.
In our implementation, additional observation consistency terms were incorporated into the association cost function of OCM to enhance tracking stability: center distance, orientation consistency, and velocity discrepancy penalty. These terms penalize associations that violate motion continuity constraints, suppressing physically implausible 2D movements and improving inter-frame tracking stability.
The data association process implementation employed a two-stage matching strategy. The association threshold was set to 0.3 IoU for high-confidence matches, with a secondary threshold of 0.15 for recovery association. Furthermore, to exploit stereo geometry constraints, the disparity difference between consecutive frames was limited to not exceed 40 pixels, eliminating geometrically inconsistent matches that violate depth continuity.
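The disparity-continuity gate amounts to a simple pre-association check, sketched below with the 40-pixel limit quoted above.

```python
MAX_DISPARITY_JUMP = 40  # pixels, per Section 4.3.2


def disparity_consistent(prev_disparity, new_disparity,
                         max_jump=MAX_DISPARITY_JUMP):
    """Reject candidate associations whose disparity (hence depth) jumps implausibly
    between consecutive frames, violating depth continuity."""
    return abs(new_disparity - prev_disparity) <= max_jump
```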
4.3.3. 2.5D Multi-Object Tracking Framework Integration
The complete framework was implemented as a modular Python package designed for real-time stereo vision processing. Prior to the main processing pipeline, comprehensive image preprocessing was performed to compensate for hardware imperfections inherent in the stereo vision system. The preprocessing step involved camera calibration, lens distortion correction, and stereo rectification for the image pairs. Following preprocessing, the stereo processing pipeline began with hardware-triggered camera synchronization, achieving a sub-ms temporal offset between left and right captures. Both rectified images were then processed through the YOLOv11 detector in parallel using batch inference to maximize GPU utilization.
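A minimal OpenCV sketch of this preprocessing step is shown below; the intrinsic and extrinsic parameters are assumed to come from the checkerboard calibration described in Section 4.1.1, and the resulting rectified projection matrices also feed the DLT triangulation of Stage 2.

```python
import cv2


def build_rectification_maps(K_L, dist_L, K_R, dist_R, R, T, image_size):
    """Precompute undistortion + rectification maps for a calibrated stereo pair."""
    R_L, R_R, P_L, P_R, Q, _, _ = cv2.stereoRectify(
        K_L, dist_L, K_R, dist_R, image_size, R, T, alpha=0)
    map_L = cv2.initUndistortRectifyMap(K_L, dist_L, R_L, P_L, image_size, cv2.CV_32FC1)
    map_R = cv2.initUndistortRectifyMap(K_R, dist_R, R_R, P_R, image_size, cv2.CV_32FC1)
    return map_L, map_R, P_L, P_R  # P_L, P_R feed the DLT triangulation


def rectify_pair(img_L, img_R, map_L, map_R):
    """Apply the precomputed maps to one synchronized image pair."""
    rect_L = cv2.remap(img_L, map_L[0], map_L[1], cv2.INTER_LINEAR)
    rect_R = cv2.remap(img_R, map_R[0], map_R[1], cv2.INTER_LINEAR)
    return rect_L, rect_R
```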
The stereo matching implementation employed DLT-based triangulation, computing 3D positions for all possible object pair combinations between the two views. Disparity-based validation with adaptive thresholds ensured geometric consistency; the expected disparity derived from triangulated depth was compared against actual bounding box positions. The Hungarian algorithm determined optimal one-to-one correspondences with an IoU threshold of 0.01 for accepting valid stereo matches. This deliberately low threshold was selected to ensure that small objects, particularly hands at far distances or partially visible at frame edges, were not missed during the stereo matching process, as even minimal overlap can indicate valid correspondence given accurate triangulation.
The tracker management system maintained dual tracker pools for left and right views, with a stereo track association map implemented as a bidirectional dictionary structure enabling efficient lookup of corresponding trackers. Trackers were allowed a maximum age of 30 frames (equivalent to 1 s at 30 FPS) before deletion, providing sufficient time for re-identification after temporary occlusions. A re-identification buffer preserved tracker information indefinitely until successful re-identification occurred, enabling the system to restore consistent IDs when objects reappeared even after extended disappearances.
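The stereo track association map can be kept as two dictionaries indexed by tracker ID, giving constant-time lookup in either direction; the class below is a minimal sketch with illustrative method names.

```python
class StereoTrackMap:
    """Bidirectional lookup between left-view and right-view tracker IDs."""

    def __init__(self):
        self.l2r = {}   # left tracker ID  -> right tracker ID
        self.r2l = {}   # right tracker ID -> left tracker ID

    def link(self, left_id, right_id):
        # Remove any stale links before inserting the new correspondence.
        self.unlink_left(left_id)
        self.unlink_right(right_id)
        self.l2r[left_id] = right_id
        self.r2l[right_id] = left_id

    def paired_right(self, left_id):
        return self.l2r.get(left_id)

    def paired_left(self, right_id):
        return self.r2l.get(right_id)

    def unlink_left(self, left_id):
        right_id = self.l2r.pop(left_id, None)
        if right_id is not None:
            self.r2l.pop(right_id, None)

    def unlink_right(self, right_id):
        left_id = self.r2l.pop(right_id, None)
        if left_id is not None:
            self.l2r.pop(left_id, None)
```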
Real-time performance was achieved through an optimized architecture in which detection, matching, and tracking operations were executed in a single process. The GPU-CPU pipeline implemented asynchronous data transfers to minimize latency, while frame buffering accommodated occasional processing delays without dropping frames. This implementation achieved processing rates of up to 70 FPS on the specified hardware configuration, providing substantial headroom for additional processing tasks or more complex scenes. The framework exposed a streamlined API for integration with downstream applications, returning tracked objects with 2.5D positions, velocities, and unique IDs maintained consistently throughout the tracking session.
4.4. Depth Estimation Accuracy Evaluation
4.4.1. Validation Methodology
To evaluate the 3D position estimation accuracy of the proposed 2.5D multi-object tracking framework, a concurrent validation method was employed, using a commercial depth camera as the ground-truth reference. The depth camera introduced in Section 4.1 was selected for validation due to its high depth accuracy (<2% at 2000 mm) and active stereo technology with an infrared projector that provides reliable depth measurements independent of ambient lighting conditions. It features a depth resolution of 1280 × 720 at 30 FPS with a depth range of 300 to 3000 mm, well suited for the refrigerator monitoring application. Using a separate depth camera ensured that the evaluation of stereo triangulation accuracy was independent of the stereo vision system itself, avoiding circular validation.
The validation setup involved a depth camera adjacent to the stereo vision system with overlapping fields of view covering most of the monitored volume, as illustrated in Figure 4. To obtain an independent depth reference for evaluating stereo triangulation accuracy, we generated depth-assisted ground truth using the RGB-D measurements from the depth camera. Both systems were temporally synchronized to acquire data simultaneously. The depth camera provided RGB images with spatially aligned dense depth maps, while the stereo vision system estimated 3D object positions through the proposed stereo matching and DLT-based triangulation procedures.
During ground-truth generation, hand regions were manually annotated in the RGB images of the depth camera to ensure independence from the stereo-based detection results. For each annotation, a small region of interest (ROI) was defined around the box center and mapped onto the corresponding aligned depth map. Valid pixels within the ROI of the aligned depth map were back-projected into 3D space using the intrinsic parameters, and their mean coordinates were used as the reference position. ROI averaging reduces sensor noise and mitigates missing-depth artifacts near object boundaries, resulting in more stable ground-truth estimation. These depth-camera-based reference values are regarded as pseudo ground truth for validating the stereo-derived 3D positions, as both systems operate synchronously but differ in sensing modalities.
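The ROI averaging described above reduces, per annotation, to back-projecting the valid depth pixels through the pinhole intrinsics and taking their mean; a minimal sketch (assuming an aligned depth map in millimetres and intrinsics fx, fy, cx, cy) is given below.

```python
import numpy as np


def roi_reference_point(depth_map, roi, fx, fy, cx, cy):
    """Mean 3D point of valid depth pixels inside a small ROI (u0, v0, u1, v1)."""
    u0, v0, u1, v1 = roi
    patch = depth_map[v0:v1, u0:u1].astype(np.float64)
    vs, us = np.nonzero(patch > 0)                   # ignore missing-depth pixels
    if vs.size == 0:
        return None                                  # no valid depth in this ROI
    z = patch[vs, us]
    u = us + u0
    v = vs + v0
    x = (u - cx) * z / fx                            # pinhole back-projection
    y = (v - cy) * z / fy
    return np.array([x.mean(), y.mean(), z.mean()])  # reference 3D position
```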
4.4.2. Experimental Results
The validation experiment was conducted using four distinct video sequences capturing various hand movement patterns within the refrigerator environment, in which the hand objects were successfully tracked. Each sequence was approximately 20 s in duration, captured at 30 FPS. To ensure reliable one-to-one matching between depth camera and stereo camera observations, each sequence captured a single hand performing various motions across the entire working range (300–2000 mm depth), including reaching, grasping, and retrieving movements at different distances and positions. The 3D positions of tracked hand objects in each sequence were quantitatively compared with the depth camera-based ground-truth coordinates generated using the method described in Section 4.4.1.
Figure 5 presents a representative trajectory comparison from one validation sequence, demonstrating the close alignment between ground-truth (depth camera) and stereo tracker estimates across the full sequence and three detailed segments. The visualization reveals that the stereo tracking system accurately captures the overall motion pattern and maintains consistent depth estimation throughout the trajectory, with minor deviations primarily occurring during rapid movements or at the workspace boundaries.
Figure 5.
Comparison of 3D hand trajectories between ground truth (depth camera, orange dots) and the proposed stereo tracker (green dots) with correspondence lines (dotted gray) connecting matched points. The figure shows the full sequence and three detailed segments, demonstrating close alignment between the two systems across different motion patterns. The axes represent the 3D workspace in meters, with the origin at the stereo camera position.
Quantitative analysis across all four sequences yielded the following error metrics for the 3D Euclidean distance e = ‖P_est − P_gt‖ between the estimated position P_est and the ground-truth position P_gt: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), median error, and 95th percentile error (P95). The RMSE of 74.2 mm indicates the overall accuracy, including outliers, while the MAE of 60.5 mm provides a more robust measure less sensitive to occasional large errors. The median error of 50.1 mm demonstrates that half of all measurements achieve accuracy better than 50 mm, suitable for hand tracking in retail applications. The P95 of 136.4 mm indicates that 95% of estimates fall within 136.4 mm of ground truth, with larger errors typically occurring at maximum reach distances where stereo baseline limitations become more pronounced.
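Given matched pairs of estimated and reference 3D positions, these statistics follow directly, as in the short sketch below.

```python
import numpy as np


def trajectory_errors(est, gt):
    """est, gt: (N, 3) arrays of matched 3D positions in millimetres."""
    e = np.linalg.norm(est - gt, axis=1)          # per-sample Euclidean error
    return {
        "RMSE": float(np.sqrt(np.mean(e ** 2))),
        "MAE": float(np.mean(e)),
        "median": float(np.median(e)),
        "P95": float(np.percentile(e, 95)),
    }
```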
These results validate that the proposed stereo matching module achieves depth estimation accuracy comparable to commercial depth sensors while operating solely on passive RGB imagery, eliminating the need for active infrared projection that may interfere with other sensors or violate privacy regulations in retail environments. The consistent sub-100 mm accuracy for the majority of measurements confirms the framework’s suitability for practical deployment in commercial refrigerator monitoring applications.
4.5. Multi-Object Tracking Performance
4.5.1. Evaluation Dataset and Metrics
While Section 4.4 validated the 3D coordinate accuracy of the stereo matching module, a comprehensive evaluation of the stereo tracker’s robustness to occlusions and truncations is essential for practical deployment. To this end, a challenging test dataset was created consisting of 10 video sequences, each 20–25 s in duration, specifically designed to include frequent hand occlusions and truncations typical in refrigerator interaction scenarios. The sequences were captured at 640 × 480 resolution at 30 FPS using the stereo camera system with a 280 mm baseline. All hand objects in these sequences were manually annotated to establish ground truth for tracking evaluation.
The tracking performance was evaluated using standard multi-object tracking metrics that comprehensively assess different aspects of tracking quality. MOTA (Multiple Object Tracking Accuracy) combines false positives, false negatives, and ID switches into a single accuracy measure []. IDF1 (Identification F1-Score) evaluates the tracker’s ability to maintain correct IDs over time []. HOTA (Higher Order Tracking Accuracy) balances detection and association performance at multiple IoU thresholds []. DetA (Detection Accuracy) and LocA (Localization Accuracy) measure the spatial accuracy of detections, while AssA (Association Accuracy) and AssR (Association Recall) evaluate the temporal consistency of ID assignments [].
4.5.2. Comparative Analysis
Table 1 presents the tracking performance comparison between the proposed StereoSORT and five state-of-the-art monocular tracking algorithms: SORT, DeepSORT, ByteTrack, BoT-SORT (Bag of Tricks for SORT), and OC-SORT. The proposed StereoSORT achieves superior performance across all evaluation metrics, demonstrating the effectiveness of leveraging stereo information for robust tracking.
Table 1.
Quantitative comparison of multi-object tracking performance between the proposed StereoSORT and state-of-the-art monocular trackers on the occlusion/truncation test dataset. All metrics range from 0 to 1, with higher values (↑) indicating better performance. The proposed method achieves superior performance across all evaluation metrics, demonstrating the effectiveness of stereo-based tracking.
StereoSORT attains the highest MOTA score of 0.932, representing a 0.6–6.0 percentage point improvement over monocular methods, with the most significant gains over ByteTrack (0.874). The IDF1 score of 0.823 substantially exceeds all baselines, with improvements ranging from 5.8 percentage points over OC-SORT to 22.2 percentage points over DeepSORT, indicating superior ID preservation during occlusions. The HOTA score of 0.844 confirms balanced performance in both detection and association tasks.
The association metrics reveal the key advantage of stereo tracking, with StereoSORT achieving AssA of 0.775 and AssR of 0.787, significantly outperforming the best monocular method (OC-SORT) by 7.2 and 7.6 percentage points, respectively. This improvement directly results from the stereo tracker’s ability to maintain object ID when one view experiences occlusion while the other maintains visibility. The near-perfect LocA score of 0.981 demonstrates that stereo triangulation provides more accurate spatial localization compared to monocular depth estimation.
These results validate that the proposed stereo tracking module effectively leverages redundant visual information to achieve robust performance in challenging scenarios with frequent occlusions and truncations, making it particularly suitable for commercial refrigerator monitoring applications where hands frequently disappear behind products or shelves.
Beyond quantitative metrics, qualitative analysis provides crucial insights into the practical advantages of stereo tracking in handling challenging scenarios. Figure 6 illustrates representative examples comparing the proposed StereoSORT with OC-SORT (the best-performing monocular baseline) under occlusion and truncation conditions.
Figure 6.
Qualitative comparison of tracking performance between the monocular tracker (OC-SORT) and the proposed stereo tracker (StereoSORT) under challenging scenarios. The left panel demonstrates occlusion handling where a hand temporarily disappears (Occ.) behind products, showing ID switching in the mono tracker (ID 6 → ID 7) while the stereo tracker maintains consistent IDs (0 and 5). The right panel illustrates truncation cases (Trunc.) at image boundaries, where the mono tracker exhibits ID inconsistency (ID 5 → ID 6) while the stereo tracker preserves stable IDs (1 and 3) through stereo correspondence. Purple arrows indicate tracker associations between views. The yellow, red, and blue bounding boxes represent different tracked objects with their respective IDs.
To quantify the specific contribution of leveraging stereo redundancy, the proposed system was evaluated in both monocular and stereo configurations. The monocular configuration utilizes only the left camera, thus excluding the Stereo Matching and Stereo Tracker Management modules. The stereo configuration demonstrated improved performance across all metrics. Notably, IDF1 increased from 0.772 to 0.823 (5.1 percentage points), with corresponding improvements in AssA (5.8 percentage points) and AssR (5.8 percentage points). In contrast, MOTA (0.932 vs. 0.931) and LocA (0.981 vs. 0.981) remained nearly identical, indicating that stereo matching and tracker management enhance ID consistency by reducing ID switches and fragmentation without degrading detection performance. Additionally, StereoSORT in monocular configuration showed improved performance over baseline OC-SORT [] in key metrics. Specifically, AssA improved from 0.703 to 0.720 (1.7 percentage points) and HOTA from 0.802 to 0.814 (1.2 percentage points), demonstrating that the observation-consistency term (Section 4.3.2) contributes to tracking robustness.
In the occlusion scenario (left panel), a hand temporarily disappears behind products between frames t1 and t3. The monocular tracker exhibits ID switching, assigning new IDs when the hand reappears (ID 6 → ID 7), disrupting trajectory continuity. In contrast, StereoSORT maintains consistent IDs throughout the sequence by leveraging visibility in the alternate view when one camera’s view is occluded. The tracker association arrows demonstrate how stereo correspondence enables ID preservation even during complete occlusion in one view.
The truncation scenario (right panel) presents hands partially visible at image boundaries, a common occurrence when customers reach into the refrigerator from different angles. The monocular tracker struggles with ID consistency, switching IDs (ID 5 → ID 6), particularly for the partially visible hand at the frame edge. StereoSORT successfully maintains stable IDs (1 and 3) by utilizing the complementary view, where the truncated object may be more completely visible, or by maintaining stereo correspondence even with partial observations.
These visual examples confirm that the stereo tracking module’s re-identification mechanism effectively exploits redundant visual information to maintain ID consistency. The ability to preserve tracker IDs during temporary occlusions and truncations is critical for applications requiring accurate trajectory analysis, such as customer behavior monitoring or interaction pattern recognition in retail environments. This qualitative evidence, combined with the superior quantitative metrics, demonstrates that StereoSORT provides robust and reliable tracking performance suitable for deployment in challenging real-world scenarios.
5. Discussion
The experimental results demonstrate that the proposed 2.5D stereo multi-object tracking framework achieves robust performance in challenging real-world scenarios with frequent occlusions and truncations. The superior tracking metrics, particularly the IDF1 score of 0.823 compared to 0.765 for the best monocular baseline (OC-SORT), validate our hypothesis that stereo vision’s redundant viewpoints can effectively mitigate single-view detection failures.
5.1. Analysis of Stereo Matching Performance
The depth estimation accuracy evaluation reveals that our lightweight DLT-based stereo matching approach achieves a median error of 50.1 mm, comparable to commercial depth sensors while operating solely on passive RGB imagery. This performance is particularly noteworthy considering the computational efficiency gained by avoiding dense stereo matching algorithms. The slightly higher RMSE of 74.2 mm indicates occasional outliers, primarily occurring at maximum reach distances where the stereo baseline becomes a limiting factor. For a refrigerator monitoring application with typical interaction distances of 300–1500 mm, this accuracy level is more than sufficient for reliable hand tracking.
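For reference, this behavior follows from the standard rectified-stereo depth relation (a textbook identity rather than a result of this work), where $f$ is the focal length in pixels, $B$ the baseline, $d$ the disparity, and $\Delta d$ the disparity or localization error:

\[ Z = \frac{f\,B}{d}, \qquad \Delta Z \approx \left|\frac{\partial Z}{\partial d}\right|\,\Delta d = \frac{Z^{2}}{f\,B}\,\Delta d , \]

so a fixed pixel-level localization error translates into a depth error that grows roughly quadratically with distance, consistent with the outliers observed at maximum reach.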
The deliberate choice of a low IoU threshold (0.01) for stereo matching acceptance deserves further discussion. While counterintuitive, this threshold ensures that small or partially visible objects at frame boundaries maintain stereo correspondence. Our experiments showed that higher thresholds (0.1–0.3) resulted in frequent loss of stereo pairs for truncated objects, defeating the purpose of leveraging stereo redundancy. However, we acknowledge that using bounding box centers for DLT-based triangulation assumes these points correspond to the same 3D location, which may not hold perfectly for non-rigid objects like hands or for large objects at close distances. This assumption has minimal impact during the stereo object matching stage, as matching is based on the entire bounding box region rather than a single point. However, when computing actual 3D positions through triangulation, depth errors can occur if the two center points do not represent the same physical location. In our application domain (hands at 300–2000 mm distances), this assumption provides sufficient accuracy as evidenced by the median depth error of 50.1 mm (Section 4.4.2). For applications involving larger non-rigid objects or closer distances, incorporating multiple correspondence points (e.g., corner points or skeleton-based keypoints) could improve accuracy.
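To make the mechanism concrete, the following is a minimal sketch of box-level stereo matching with a low IoU gate and DLT triangulation of the box centers. It assumes rectified views with known 3×4 projection matrices obtained from offline calibration; the function names and box format (x1, y1, x2, y2) are illustrative and not the released implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) in pixels."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)


def match_stereo_boxes(left_boxes, right_boxes, iou_thresh=0.01):
    """Pair left/right detections by IoU with a deliberately low threshold,
    so that truncated or strongly disparity-shifted boxes still match."""
    cost = np.array([[-iou(lb, rb) for rb in right_boxes] for lb in left_boxes])
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] > iou_thresh]


def triangulate_dlt(P_left, P_right, pt_left, pt_right):
    """DLT triangulation of a single correspondence (here, the box centers)."""
    (uL, vL), (uR, vR) = pt_left, pt_right
    A = np.vstack([uL * P_left[2] - P_left[0],
                   vL * P_left[2] - P_left[1],
                   uR * P_right[2] - P_right[0],
                   vR * P_right[2] - P_right[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # 3D point in the calibration's metric frame
```

In this sketch, the centers of each accepted pair are fed to triangulate_dlt to obtain the depth value used downstream, which mirrors the trade-off discussed above: the matching decision uses the whole box region, while only the triangulation step relies on the center-point assumption.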
5.2. Robustness to Occlusions and Truncations
The significant improvement in association metrics (AssA: 0.775 vs. 0.703 for OC-SORT) directly demonstrates the effectiveness of our dual-tracker approach. When one view experiences occlusion, the corresponding tracker in the alternate view maintains object ID, enabling seamless re-identification when visibility is restored. This mechanism proves particularly valuable in retail environments where products and shelving create frequent occlusions.
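As an illustration of this idea only (simplified relative to the full Stereo Tracker Management module described earlier; the class and method names are hypothetical), the core bookkeeping reduces to a bidirectional map between left- and right-view track IDs:

```python
class StereoPairRegistry:
    """Illustrative bookkeeping for dual-tracker re-identification
    (simplified; not the paper's exact data structure or API)."""

    def __init__(self):
        self.left_to_right = {}   # left track ID -> right track ID
        self.right_to_left = {}   # right track ID -> left track ID

    def link(self, left_id, right_id):
        """Record a correspondence established by box-level stereo matching."""
        self.left_to_right[left_id] = right_id
        self.right_to_left[right_id] = left_id

    def recover_left_id(self, right_id):
        """When a left-view track is lost to occlusion but its right-view
        partner survives, return the previously assigned left ID so a newly
        spawned left-view track can inherit it instead of a fresh ID."""
        return self.right_to_left.get(right_id)
```

The symmetric case (recovering a right-view ID from a surviving left-view track) follows the same pattern.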
The qualitative analysis in Figure 6 reveals an interesting pattern: monocular trackers tend to exhibit ID switches not only during occlusions but also immediately after recovery. This suggests that appearance-based re-identification alone is insufficient when objects undergo partial visibility changes. In contrast, our stereo correspondence provides a geometric constraint that maintains ID consistency independent of appearance variations.
5.3. Computational Efficiency Considerations
The framework achieves a processing rate of 70 FPS, demonstrating the practical advantage of our 2.5D representation over full 3D MOT systems. By avoiding computationally expensive 3D bounding box regression and dense stereo matching, the system maintains real-time performance while providing essential depth information. The per-frame latency is 13.6 ms in total, comprising detector inference (12.3 ms on GPU) and tracking (1.25 ms on CPU, including 0.23 ms for stereo matching). Notably, excluding detection, the tracking module runs entirely on the CPU within 1.25 ms without any appearance feature extraction or deep Re-ID computation, demonstrating the lightweight design of the proposed method. The GPU was used only to accelerate object detection during evaluation and is not required by the tracking pipeline itself. For future optimization, lightweight backbones may be explored to enable fully CPU-based embedded deployment.
5.4. Limitations and Practical Considerations
Despite the strong performance, several limitations warrant discussion.
First, the stereo baseline of 280 mm, while suitable for indoor environments, may be insufficient for larger spaces requiring an extended depth range. The system’s depth estimation accuracy degrades beyond 2000 mm due to the fixed baseline limitation.
Second, the system’s performance degrades in scenarios with significant lighting variations between stereo views, as this violates the photometric consistency assumption underlying stereo matching. In our experiments, we observed matching failures when one camera faced direct illumination while the other was shadowed, resulting in incorrect correspondences and depth estimation errors exceeding 200 mm.
Third, the method struggles when objects are simultaneously occluded or truncated in both stereo views, as the redundancy advantage is lost. In such cases, the system reverts to monocular tracking behavior until visibility is restored in at least one view.
Fourth, rapid hand movements can cause motion blur in the captured images, leading to inaccurate bounding box detections and subsequent stereo matching failures. The DLT-based triangulation relies on precise center point localization, which degrades under motion blur conditions.
Fifth, due to parallax or partial occlusion, the same object can be perceived differently across the left and right views. For example, two objects may be distinctly separated and detected in one view but merged into a single bounding box in the other. If this merged box is incorrectly matched to the wrong object in the opposite view, erroneous correspondences can produce a mismatched tracker pair. When the objects later separate, ID reuse may fail, resulting in track fragmentation. Because the current system updates the stereo tracker association map on a per-frame basis, it is not sufficiently robust for such transient merging and splitting events. Future work could incorporate temporal association history or continuity constraints to maintain a more stable correspondence between stereo trackers.
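One possible direction, sketched here purely as an assumption about future work rather than an implemented component, is to require a stereo pair to be observed consistently for several frames before it is (re)bound, so that transient merge/split events do not immediately overwrite an established correspondence:

```python
from collections import defaultdict


class HysteresisAssociationMap:
    """Hypothetical temporal smoothing of the per-frame stereo association map."""

    def __init__(self, confirm_frames=3):
        self.confirm_frames = confirm_frames
        self.votes = defaultdict(int)   # (left_id, right_id) -> consecutive hits
        self.stable = {}                # left_id -> right_id (confirmed pairs)

    def update(self, frame_pairs):
        """frame_pairs: list of (left_id, right_id) matched in the current frame."""
        seen = set(frame_pairs)
        for pair in frame_pairs:
            self.votes[pair] += 1
            if self.votes[pair] >= self.confirm_frames:
                self.stable[pair[0]] = pair[1]   # promote to a confirmed pair
        for pair in list(self.votes):
            if pair not in seen:
                self.votes[pair] = 0             # reset streak when unobserved
        return self.stable                       # (removal policy omitted for brevity)
```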
6. Conclusions
This paper presented a real-time robust 2.5D stereo multi-object tracking system that effectively addresses the challenges of occlusions and truncations. The key contribution lies in demonstrating that lightweight stereo matching based solely on bounding box coordinates, combined with intelligent tracker management, can achieve superior tracking performance compared to sophisticated monocular methods.
The proposed framework makes three significant contributions to the field. First, the resource-aware stereo matching algorithm eliminates the computational overhead of appearance feature extraction while maintaining robust correspondence through geometric constraints. Second, the resilient dual-tracker design with re-identification capability ensures tracking continuity even under frequent detection failures. Third, the 2.5D representation provides a practical balance between computational efficiency and spatial awareness, making the system suitable for real-time deployment.
Experimental validation on a custom refrigerator monitoring dataset demonstrated that StereoSORT achieves MOTA of 0.932 and IDF1 of 0.823, substantially outperforming state-of-the-art monocular trackers. The depth estimation accuracy (median error: 50.1 mm) proves sufficient for practical applications while maintaining real-time performance at 70 FPS.
Several important directions for future research emerge from this work. First, comprehensive benchmarking on public datasets such as KITTI is essential to validate the framework’s generalizability beyond the current application domain. Although this study focused on a single-class setting, this does not limit the method’s applicability, since the approach is geometry-based and does not rely on class-dependent appearance cues. Future work will evaluate the framework on large-scale stereo benchmarks to further assess its scalability across diverse stereo configurations, object types, and operating environments.
Second, extending the framework to multi-class scenarios introduces additional challenges beyond dataset-level generalization. In particular, stereo matching must remain reliable even when detections from the two views belong to different semantic classes due to detector uncertainty or classification noise. To address such ambiguity, two promising strategies can be explored: (1) adopting the class with the highest confidence score from either view, or (2) selecting the class with the highest mean confidence across both views for balanced fusion. This extension is critical for deploying the framework in complex environments involving diverse object categories. Moreover, incorporating explainable AI techniques could enhance decision reliability by providing interpretable evidence for class assignment (e.g., saliency and Grad-CAM heatmaps), which would improve transparency in real-world applications [,,,].
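As a concrete illustration of these two rules only (the current system is single-class, so the per-class score dictionary and the function name below are assumptions for a future extension, not part of the implementation):

```python
def fuse_stereo_class(det_left, det_right):
    """Two illustrative class-fusion rules for a future multi-class extension.

    Each detection is assumed to carry per-class confidence scores,
    e.g. det = {"classes": {"bottle": 0.81, "can": 0.12}}.
    """
    classes = set(det_left["classes"]) | set(det_right["classes"])

    # Strategy 1: take the single most confident class from either view.
    best_view_max = max(
        classes,
        key=lambda c: max(det_left["classes"].get(c, 0.0),
                          det_right["classes"].get(c, 0.0)),
    )

    # Strategy 2: take the class with the highest mean confidence across views.
    best_mean = max(
        classes,
        key=lambda c: 0.5 * (det_left["classes"].get(c, 0.0)
                             + det_right["classes"].get(c, 0.0)),
    )
    return best_view_max, best_mean
```

Strategy 1 favors the more confident view, whereas strategy 2 penalizes classes supported by only one view, which is the intended "balanced fusion" behavior.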
Third, the current motion estimation operates in the 2D image space, which may lead to physically inconsistent predictions for objects moving along the depth axis. This design choice was made to prioritize real-time efficiency and maintain temporal stability during the mechanism-level validation stage of this study. However, a more principled extension would incorporate a 3D motion model that performs prediction in world coordinates while leveraging stereo-based depth information, followed by projection back to the image plane for association. Such a depth-aware motion model would provide physically consistent trajectory estimation and improve accuracy for objects exhibiting significant depth variation. In future work, the extension to 3D MOT can be achieved by integrating existing stereo-based 3D position estimates with a 3D Kalman filter or velocity-aware motion models.
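A minimal sketch of such a depth-aware motion model, assuming triangulated 3D positions as measurements and a constant-velocity state (the noise values, frame rate, and projection interface below are illustrative assumptions, not tuned components of the current system):

```python
import numpy as np


class ConstantVelocity3D:
    """Constant-velocity Kalman filter in world coordinates.
    State x = [X, Y, Z, Vx, Vy, Vz]^T; noise values are illustrative only."""

    def __init__(self, x0, dt=1 / 30):
        self.x = np.asarray(x0, dtype=float)
        self.P = np.eye(6) * 10.0                            # state covariance
        self.F = np.eye(6); self.F[:3, 3:] = np.eye(3) * dt  # transition model
        self.Q = np.eye(6) * 1e-2                            # process noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]                                    # predicted 3D position

    def update(self, z):
        """Standard KF correction with a triangulated 3D position z = [X, Y, Z]."""
        H = np.hstack([np.eye(3), np.zeros((3, 3))])         # observe position only
        R = np.eye(3) * 25.0                                 # measurement noise (illustrative)
        y = np.asarray(z, dtype=float) - H @ self.x
        S = H @ self.P @ H.T + R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ H) @ self.P


def project(P, X):
    """Project a 3D world point back to pixel coordinates with a 3x4 matrix P."""
    u, v, w = P @ np.append(X, 1.0)
    return np.array([u / w, v / w])
```

In such an extension, the predicted 3D position would be projected into each view with its calibration matrix and associated with detections using the existing IoU and observation-consistency costs.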
Additionally, investigating adaptive stereo baseline configurations could optimize the depth-resolution trade-off for different operational ranges. The integration of temporal depth consistency constraints could further improve tracking robustness during extended occlusions. From a deployment perspective, developing hardware-accelerated implementations for edge devices would enable broader adoption in resource-constrained environments.
The framework’s principles demonstrate potential for adaptation to other domains requiring robust tracking, such as industrial automation, sports analytics, and assisted living environments. As stereo vision systems become increasingly prevalent in robotics and autonomous systems, the proposed lightweight yet robust tracking approach offers a practical solution for real-world deployment where computational resources and reliability are equally critical.
Author Contributions
Conceptualization, J.L., J.S., and D.K.; methodology, J.L. and J.S.; software, J.L. and J.S.; validation, J.L., J.S., and E.P.; formal analysis, J.L.; investigation, J.L. and J.S.; resources, D.K.; data curation, J.L. and J.S.; writing—original draft preparation, J.L., J.S., E.P., and D.K.; writing—review and editing, J.L., J.S., E.P., and D.K.; visualization, J.S. and E.P.; supervision, E.P. and D.K.; project administration, D.K.; funding acquisition, D.K. All authors have read and agreed to the published version of the manuscript.
Funding
The present research was supported by the research fund of Dankook University in 2023.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The sample data and codes presented in the study are available in GitHub: https://github.com/JJhyeongg/stereo_tracker_repo_2025 (accessed on 18 September 2025). The training datasets are available from the corresponding author upon request, for research purposes.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AssA | Association Accuracy |
| AssR | Association Recall |
| BoT-SORT | Bag of Tricks for Simple Online and Realtime Tracking |
| CPU | Central Processing Unit |
| DeepSORT | Deep Simple Online and Realtime Tracking |
| DetA | Detection Accuracy |
| DLT | Direct Linear Transform |
| FPS | Frames per second |
| GPU | Graphics Processing Unit |
| HOTA | Higher Order Tracking Accuracy |
| ID | Identity |
| IDF1 | Identification F1 Score |
| IoU | Intersection over Union |
| LocA | Localization Accuracy |
| MAE | Mean Absolute Error |
| MOT | Multi-Object Tracking |
| MOTA | Multiple Object Tracking Accuracy |
| OCM | Observation-Centric Momentum |
| OCR | Observation-Centric Reassociation |
| OC-SORT | Observation-Centric SORT |
| P95 | 95th Percentile Error |
| RMSE | Root Mean Square Error |
| SORT | Simple Online and Realtime Tracking |
| YOLO | You Only Look Once |
References
- Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Kim, T.-K. Multiple object tracking: A literature review. Artif. Intell. 2021, 293, 103448. [Google Scholar] [CrossRef]
- Reid, D. An algorithm for tracking multiple targets. IEEE Trans. Autom. Control. 2003, 24, 843–854. [Google Scholar] [CrossRef]
- Bar-Shalom, Y.; Fortmann, T.E.; Cable, P.G. Tracking and Data Association; Academic Press Professional, Inc.: San Diego, CA, USA, 1990. [Google Scholar]
- Breitenstein, M.D.; Reichlin, F.; Leibe, B.; Koller-Meier, E.; Van Gool, L. Robust tracking-by-detection using a detector confidence particle filter. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009; pp. 1515–1522. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. Hota: A higher order metric for evaluating multi-object tracking. Int. J. Comput. Vis. 2021, 129, 548–578. [Google Scholar] [CrossRef] [PubMed]
- Weng, X.; Wang, J.; Held, D.; Kitani, K. 3d multi-object tracking: A baseline and new evaluation metrics. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 10359–10366. [Google Scholar]
- Yin, T.; Zhou, X.; Krahenbuhl, P. Center-based 3d object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11784–11793. [Google Scholar]
- Kim, A.; Ošep, A.; Leal-Taixé, L. Eagermot: 3d multi-object tracking via sensor fusion. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 11315–11321. [Google Scholar]
- Zhu, Y.; Wang, T.; Zhu, S. Adaptive multi-pedestrian tracking by multi-sensor: Track-to-track fusion using monocular 3D detection and MMW radar. Remote Sens. 2022, 14, 1837. [Google Scholar] [CrossRef]
- Ahmadyan, A.; Hou, T.; Wei, J.; Zhang, L.; Ablavatski, A.; Grundmann, M. Instant 3D object tracking with applications in augmented reality. arXiv 2020, arXiv:2006.13194. [Google Scholar] [CrossRef]
- Hu, H.-N.; Cai, Q.-Z.; Wang, D.; Lin, J.; Sun, M.; Krahenbuhl, P.; Darrell, T.; Yu, F. Joint monocular 3D vehicle detection and tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5390–5399. [Google Scholar]
- Hu, H.-N.; Yang, Y.-H.; Fischer, T.; Darrell, T.; Yu, F.; Sun, M. Monocular quasi-dense 3d object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1992–2008. [Google Scholar] [CrossRef] [PubMed]
- Tosi, F.; Bartolomei, L.; Poggi, M. A survey on deep stereo matching in the twenties. Int. J. Comput. Vis. 2025, 133, 4245–4276. [Google Scholar] [CrossRef]
- Li, Y.; Ibanez-Guzman, J. Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems. IEEE Signal Process. Mag. 2020, 37, 50–61. [Google Scholar] [CrossRef]
- Roriz, R.; Cabral, J.; Gomes, T. Automotive LiDAR technology: A survey. IEEE Trans. Intell. Transp. Syst. 2021, 23, 6282–6297. [Google Scholar] [CrossRef]
- Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42. [Google Scholar] [CrossRef]
- Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341. [Google Scholar] [CrossRef] [PubMed]
- Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO, version 8.3.182; Ultralytics: Online, 2023. [Google Scholar]
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
- Pollefeys, M.; Koch, R.; Van Gool, L. A simple and efficient rectification method for general motion. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; pp. 496–501. [Google Scholar]
- Bolya, D.; Foley, S.; Hays, J.; Hoffman, J. Tide: A general toolbox for identifying object detection errors. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 558–573. [Google Scholar]
- Bergmann, P.; Meinhardt, T.; Leal-Taixe, L. Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951. [Google Scholar]
- Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–21. [Google Scholar]
- Zhang, F.; Prisacariu, V.; Yang, R.; Torr, P.H. Ga-net: Guided aggregation net for end-to-end stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 185–194. [Google Scholar]
- Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P.; Kennedy, R.; Bachrach, A.; Bry, A. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 66–75. [Google Scholar]
- Zhang, S.; Zheng, L.; Tao, W. Survey and evaluation of RGB-D SLAM. IEEE Access 2021, 9, 21367–21387. [Google Scholar] [CrossRef]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A Multimodal Dataset for Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
- Li, P.; Shi, J.; Shen, S. Joint spatial-temporal optimization for stereo 3D object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6877–6886. [Google Scholar]
- Karaev, N.; Rocco, I.; Graham, B.; Neverova, N.; Vedaldi, A.; Rupprecht, C. Dynamicstereo: Consistent dynamic depth from stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–20 June 2023; pp. 13229–13239. [Google Scholar]
- Zhang, Y.; Poggi, M.; Mattoccia, S. Temporalstereo: Efficient spatial-temporal stereo matching network. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 9528–9535. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
- Chang, J.-R.; Chen, Y.-S. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2018; pp. 5410–5418. [Google Scholar]
- Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar] [CrossRef]
- Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
- Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–20 June 2023; pp. 9686–9696. [Google Scholar]
- Mahler, R.P. Advances in Statistical Multisource-Multitarget Information Fusion; Artech House: London, UK, 2014. [Google Scholar]
- Vo, B.-N.; Vo, B.-T.; Nguyen, T.T.D.; Shim, C. An overview of multi-object estimation via labeled random finite set. IEEE Trans. Signal Process. 2024, 72, 4888–4917. [Google Scholar] [CrossRef]
- Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3061–3070. [Google Scholar]
- Wang, Z.; Wu, Y.; Niu, Q. Multi-sensor fusion in automated driving: A survey. IEEE Access 2019, 8, 2847–2868. [Google Scholar] [CrossRef]
- Szeliski, R. Computer Vision: Algorithms and Applications; Springer Nature: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 658–666. [Google Scholar]
- Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
- Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468. [Google Scholar]
- Yu, F.; Li, W.; Li, Q.; Liu, Y.; Shi, X.; Yan, J. Poi: Multiple object tracking with high performance detection and appearance feature. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 36–42. [Google Scholar]
- Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
- Kalman, R.E. A new approach to linear filtering and prediction problems. ASME J. Basic Eng. 1960, 82, 35–45. [Google Scholar] [CrossRef]
- Chen, L.; Ai, H.; Zhuang, Z.; Shang, C. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In Proceedings of the 2018 IEEE international conference on multimedia and expo (ICME), San Diego, CA, USA, 23–27 July 2018; pp. 1–6. [Google Scholar]
- Ciaparrone, G.; Sánchez, F.L.; Tabik, S.; Troiano, L.; Tagliaferri, R.; Herrera, F. Deep learning in video multi-object tracking: A survey. Neurocomputing 2020, 381, 61–88. [Google Scholar] [CrossRef]
- Dwyer, B.; Nelson, J.; Hansen, T. Roboflow, version 1.0; Roboflow, Inc.: Des Moines, IA, USA, 2025. [Google Scholar]
- Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
- Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 17–35. [Google Scholar]
- Aharon, N.; Orfaig, R.; Bobrovsky, B.-Z. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
- Mokdad, S.; Khalid, A.; Nasr, D.; Talib, M.A. Interpretable deep learning: Evaluating YOLO models and XAI techniques for video annotation. In Proceedings of the IET Conference Proceedings CP870, Patna, India, 14–15 July 2023; pp. 487–496. [Google Scholar]
- Yoon, C.; Park, E.; Misra, S.; Kim, J.Y.; Baik, J.W.; Kim, K.G.; Jung, C.K.; Kim, C. Deep learning-based virtual staining, segmentation, and classification in label-free photoacoustic histology of human specimens. Light Sci. Appl. 2024, 13, 226. [Google Scholar] [CrossRef] [PubMed]
- Park, E.; Misra, S.; Hwang, D.G.; Yoon, C.; Ahn, J.; Kim, D.; Jang, J.; Kim, C. Unsupervised inter-domain transformation for virtually stained high-resolution mid-infrared photoacoustic microscopy using explainable deep learning. Nat. Commun. 2024, 15, 10892. [Google Scholar] [CrossRef] [PubMed]
- Beemelmanns, T.; Zahr, W.; Eckstein, L. Explainable Multi-Camera 3D Object Detection with Transformer-Based Saliency Maps. arXiv 2023, arXiv:2312.14606. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).