Real-Time 6D Pose Estimation and Multi-Target Tracking for Low-Cost Multi-Robot System
Abstract
1. Introduction
- We present a low-cost semi-physical verification platform for multi-robot systems, integrating an affordable RGB-D camera (Intel RealSense D435i), a standard GPU workstation, and multiple Mecanum-wheeled robots. This setup supports closed-loop validation from real-world perception to simulation-based control, lowering the barrier for algorithm development and system testing.
- We propose the YolPnP-FT perception pipeline, which introduces a keypoint confidence filtering strategy (PnP-FT) after the YOLOv8 detection head. This strategy effectively enhances the robustness of PnP-based pose estimation under partial occlusion while reducing unnecessary computation.
- We adapt the DeepSORT tracker by employing a linearly weighted combination of Mahalanobis distance and cosine distance with cascaded matching. This strategy enables the system to achieve stable ID assignment in visually similar multi-robot scenarios, effectively mitigating tracking instability and identity confusion caused by occlusion or appearance homogeneity.
- We establish a perception-driven validation paradigm: real-time perception outputs (6D pose + ID) are directly injected into a CoppeliaSim simulation environment to verify their direct utility in downstream tasks such as formation control and path planning, offering a reproducible deployment-validation pathway for low-cost systems.
2. Related Works
2.1. Six-Dimensional Pose Estimation
2.2. Multi-Object Tracking (MOT)
2.3. Research Gap
3. Method
3.1. System Overview
- (1)
- A physical hardware platform for real-time 6D pose estimation and tracking, consisting of an Intel RealSense D435i RGB-D camera (Santa Clara, CA, USA), a commercial GPU workstation, and multiple heterogeneous Mecanum-wheeled robots;
- (2)
- A simulation validation environment based on CoppeliaSim, which receives the 6D pose and identity (ID) outputs from the physical perception system to drive virtual robot models.
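The pose handoff in (2) hinges on a format conversion: CoppeliaSim's remote API (e.g., `sim.setObjectPose`) expects a pose as `[x, y, z, qx, qy, qz, qw]`, while a PnP solver returns a rotation matrix R and translation vector t. The following is a minimal numpy sketch of that repacking; the function name is ours, not the paper's, and the downstream `sim.setObjectPose` call is only indicated in prose.

```python
import numpy as np

def matrix_to_pose(R, t):
    """Pack a rotation matrix R (3x3) and translation t (3,) into the
    [x, y, z, qx, qy, qz, qw] pose list that CoppeliaSim's
    sim.setObjectPose expects."""
    R = np.asarray(R, dtype=float)
    tr = np.trace(R)
    if tr > 0.0:
        # Trace-based branch: numerically stable away from 180-degree turns.
        s = np.sqrt(tr + 1.0) * 2.0          # s = 4 * qw
        qw = 0.25 * s
        qx = (R[2, 1] - R[1, 2]) / s
        qy = (R[0, 2] - R[2, 0]) / s
        qz = (R[1, 0] - R[0, 1]) / s
    else:
        # Largest diagonal element picks the dominant quaternion component.
        i = int(np.argmax(np.diag(R)))
        j, k = (i + 1) % 3, (i + 2) % 3
        s = np.sqrt(R[i, i] - R[j, j] - R[k, k] + 1.0) * 2.0
        q = [0.0, 0.0, 0.0]
        q[i] = 0.25 * s
        q[j] = (R[j, i] + R[i, j]) / s
        q[k] = (R[k, i] + R[i, k]) / s
        qw = (R[k, j] - R[j, k]) / s
        qx, qy, qz = q
    return [float(t[0]), float(t[1]), float(t[2]), qx, qy, qz, qw]
```

The resulting list can then be injected per frame, e.g. via the ZeroMQ remote API with `sim.setObjectPose(robot_handle, pose, sim.handle_world)`, where `robot_handle` is the virtual robot selected by the tracker's ID output.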
3.2. A 6D Pose Estimation Method Based on the YolPnP-FT
3.2.1. Dataset Construction
3.2.2. Six-Dimensional Pose Estimation Based on the YolPnP-FT Perception Pipeline
| Algorithm 1: PnP-FT 6D Pose Estimation Filtering Threshold Strategy | |
|---|---|
| | Input: image data I, 3D keypoint model K3D, confidence threshold θ |
| | Output: rotation matrix R, translation vector t |
| 1 | K2D ← Detector(I); // detect 2D keypoints from the image |
| 2 | if confidence(k) < θ for any k ∈ K2D then |
| 3 | Discard the low-confidence keypoints; |
| 4 | K2D′ ← {k ∈ K2D ∣ confidence(k) ≥ θ}; |
| 5 | Assign IDs to the keypoints in K2D′; |
| 6 | K3D′ ← GetMatching3DKeypoints(K3D, K2D′); // select the corresponding 3D keypoints |
| 7 | Match keypoints between K2D′ and K3D′; |
| 8 | [R, t] ← PnP(K2D′, K3D′); // solve for pose with the PnP algorithm |
| 9 | return R, t; |
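The filtering stage of Algorithm 1 (keep only confident keypoints and the matching rows of the 3D model) can be sketched in a few lines of numpy. This is our illustrative reading, not the paper's implementation: the function name, array layout, and default threshold are assumptions, and the pose solve itself (e.g., OpenCV's `cv2.solvePnP`) is only indicated in a comment.

```python
import numpy as np

def pnp_ft_filter(kpts_2d, conf, model_3d, theta=0.5):
    """Keypoint confidence filtering (Algorithm 1, lines 2-6).

    kpts_2d : (N, 2) detected 2D keypoints, in model-defined ID order
    conf    : (N,)   per-keypoint confidences from the detection head
    model_3d: (N, 3) 3D keypoint model K3D, in the same ID order
    Returns the filtered (2D, 3D) correspondence arrays, or None when
    too few keypoints survive the threshold.
    """
    conf = np.asarray(conf, dtype=float)
    keep = conf >= theta                       # boolean mask over keypoint IDs
    pts_2d = np.asarray(kpts_2d)[keep]         # surviving 2D detections
    pts_3d = np.asarray(model_3d)[keep]        # matching 3D model points
    # PnP needs at least 4 correspondences; signal failure otherwise so the
    # caller can skip the frame instead of solving an ill-posed problem.
    if pts_2d.shape[0] < 4:
        return None
    # Downstream (not shown): cv2.solvePnP(pts_3d, pts_2d, K, dist) -> R, t
    return pts_2d, pts_3d
```

Because the mask is applied identically to the 2D detections and the 3D model, the ID-based correspondence is preserved without an explicit matching pass.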
3.2.3. Adapted DeepSORT for Multi-Object Tracking in Visually Homogeneous Multi-Robot Scenarios
- (1)
- Motion Model Information Feature Matching: Mahalanobis Distance
- (2)
- Appearance Model Information Feature Matching: Cosine Distance
- (3)
- Comprehensive Matching Degree
- (4)
- Cascade Matching Strategy
| Algorithm 2: DeepSORT Algorithm | |
|---|---|
| | Input: frame Ft, detection set D = (d1, d2, …, dM), track set T, max age Amax, IOU threshold λ |
| | Output: updated track set T |
| 1 | for each track τ ∈ T do |
| 2 | τ.predict(); // Kalman prediction |
| | // Stage 1: cascade matching (appearance + motion) |
| 3 | Tconf ← {τ ∈ T ∣ τ.is_confirmed()}; |
| 4 | M1, UT, UD ← ∅, Tconf, D; |
| 5 | for n = 1 to Amax do |
| 6 | Tn ← {τ ∈ UT ∣ τ.age = n}; |
| 7 | if Tn ≠ ∅ then |
| 8 | Compute cost matrix C using appearance + Mahalanobis distances; |
| 9 | [xij] ← Hungarian(C); |
| 10 | Mn ← {(i, j) ∣ xij = 1 and Cij < gate}; |
| 11 | M1 ← M1 ∪ Mn; |
| 12 | UT ← UT \ {τi ∣ (i, j) ∈ Mn}; |
| 13 | UD ← UD \ {dj ∣ (i, j) ∈ Mn}; |
| | // Stage 2: IOU matching for the remaining confirmed tracks |
| 14 | M2, UT, UD ← IOUMatch(UT, UD, λ); |
| | // Update matched tracks |
| 15 | for each (τ, d) ∈ M1 ∪ M2 do |
| 16 | τ.update(d); |
| 17 | τ.age ← 0; |
| | // Handle unmatched confirmed tracks |
| 18 | for each τ ∈ UT where τ.is_confirmed() do |
| 19 | τ.age ← τ.age + 1; |
| 20 | if τ.age > Amax then |
| 21 | T ← T \ {τ}; |
| | // Create new unconfirmed tracks |
| 22 | for each d ∈ UD do |
| 23 | τ ← newTrack(d); |
| 24 | τ.confirmed ← false; |
| 25 | τ.age ← 1; |
| 26 | T ← T ∪ {τ}; |
| | // Manage unconfirmed tracks |
| 27 | for each τ ∈ T where ¬τ.is_confirmed() do |
| 28 | if τ.age > 1 then |
| 29 | T ← T \ {τ}; // delete if not confirmed within 2 frames |
| 30 | return T; |
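Stage 1 of Algorithm 2 iterates over age groups so that recently seen tracks get first claim on detections. The following is an illustrative Python sketch of that cascade, using SciPy's Hungarian solver; the data layout (`tracks` as dicts, `cost_fn` returning a sub-matrix of the blended cost) is our assumption, not the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cascade_match(cost_fn, tracks, detections, max_age, gate):
    """Stage-1 cascade matching (Algorithm 2, lines 3-13).

    tracks     : list of dicts with 'age' and 'confirmed' keys
    cost_fn    : cost_fn(track_idxs, det_idxs) -> blended cost sub-matrix
    Returns (matches, unmatched_track_idxs, unmatched_det_idxs).
    """
    matches = []
    unmatched_t = [i for i, t in enumerate(tracks) if t['confirmed']]
    unmatched_d = list(range(len(detections)))
    for n in range(1, max_age + 1):
        # Only tracks of this age compete in this round of the cascade.
        t_idx = [i for i in unmatched_t if tracks[i]['age'] == n]
        if not t_idx or not unmatched_d:
            continue
        C = cost_fn(t_idx, unmatched_d)        # rows: tracks, cols: detections
        rows, cols = linear_sum_assignment(C)  # Hungarian assignment
        for r, c in zip(rows, cols):
            if C[r, c] < gate:                 # reject gated (infeasible) pairs
                matches.append((t_idx[r], unmatched_d[c]))
        matched_t = {m[0] for m in matches}
        matched_d = {m[1] for m in matches}
        unmatched_t = [i for i in unmatched_t if i not in matched_t]
        unmatched_d = [j for j in unmatched_d if j not in matched_d]
    return matches, unmatched_t, unmatched_d
```

The leftovers (`unmatched_t`, `unmatched_d`) then proceed to the Stage-2 IOU matching of Algorithm 2, line 14.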
3.3. Downstream Task Validation of Perception Outputs
4. Experiment and Analysis
4.1. Real-Time 6D Pose Estimation and Tracking Experiments for Multi-Robot Formations Based on the YolPnP-FT Perception Pipeline
4.1.1. Experimental Environment
4.1.2. 6D Pose Estimation Experiments Based on the YolPnP-FT Perception Pipeline
- (1)
- Small-object effect: As camera height increases, the robot occupies fewer pixels in the image, so minor 2D keypoint detection errors are significantly amplified when back-projected to 3D through the PnP solution;
- (2)
- Feature degradation: At high (near-top-down) viewpoints, the robot’s upper surface often lacks texture or exhibits symmetry (e.g., planar surfaces), leading to ambiguous keypoint localization. In contrast, lower viewpoints, though more prone to occlusion, provide richer lateral geometric features that enhance matching reliability.
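A back-of-envelope pinhole-camera calculation makes the small-object effect in (1) concrete: a lateral keypoint error of e pixels at depth Z maps to roughly e·Z/f metres, so the same detector noise grows linearly with camera height. The focal length below is an illustrative value, not a calibrated parameter of the authors' setup.

```python
def lateral_error_m(pixel_err, depth_m, focal_px):
    """Pinhole back-projection: a 2D keypoint error of `pixel_err` pixels
    at depth `depth_m` corresponds to roughly pixel_err * depth_m / focal_px
    metres of lateral position error."""
    return pixel_err * depth_m / focal_px

# Illustrative: assuming a ~600 px focal length, one pixel of keypoint
# error corresponds to ~0.8 mm at 0.5 m but ~4.2 mm at 2.5 m -- a 5x
# growth consistent with the error trends in the tables below.
```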
4.1.3. Comprehensive Real-Time Multi-Robot Formation Tracking Experiments
4.2. Perception-Driven Simulation Validation for Multi-Robot Systems
4.2.1. Multi-Robot Formation Validation
4.2.2. Path Planning Validation for Multi-Robot Formations
4.2.3. Collaborative Task Validation for Multi-Robot Formations
Limitations and Future Work
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Yu, Y.G.; Miao, Z.Q.; Wang, X.K.; Shen, L.C. Cooperative circumnavigation control of multiple unicycle-type robots with non-identical input constraints. IET Control. Theory A 2022, 16, 889–901. [Google Scholar] [CrossRef]
- Huang, G. Visual-inertial navigation: A concise review. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: New York, NY, USA, 2019; pp. 9572–9582. [Google Scholar]
- Wu, L.; Guo, S.L.; Han, L.; Baris, C.A. Indoor positioning method for pedestrian dead reckoning based on multi-source sensors. Measurement 2024, 229, 114416. [Google Scholar] [CrossRef]
- Herath, S.; Yan, H.; Furukawa, Y. Ronin: Robust neural inertial navigation in the wild: Benchmark, evaluations, & new methods. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; IEEE: New York, NY, USA, 2020; pp. 3146–3152. [Google Scholar]
- Yu, D.; Li, C.G. An Accurate WiFi Indoor Positioning Algorithm for Complex Pedestrian Environments. IEEE Sens. J. 2021, 21, 24440–24452. [Google Scholar] [CrossRef]
- Cui, Y.E.; Chen, X.Y.L.; Zhang, Y.L.; Dong, J.H.; Wu, Q.X.; Zhu, F. BoW3D: Bag of Words for Real-Time Loop Closing in 3D LiDAR SLAM. IEEE Robot. Autom. Lett. 2023, 8, 2828–2835. [Google Scholar] [CrossRef]
- Zhang, T.; Zhao, D.; Yang, J.; Wang, S.; Liu, H. A smart home based on multi-heterogeneous robots and sensor networks for elderly care. In International Conference on Intelligent Robotics and Applications, Harbin, China, 1–3 August 2022; Springer: Cham, Switzerland, 2022; pp. 98–104. [Google Scholar]
- Xia, X.; Bhatt, N.P.; Khajepour, A.; Hashemi, E. Integrated Inertial-LiDAR-Based Map Matching Localization for Varying Environments. IEEE Trans. Intell. Veh. 2023, 8, 4307–4318. [Google Scholar] [CrossRef]
- Wang, G.; Manhardt, F.; Tombari, F.; Ji, X. Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16611–16621. [Google Scholar]
- Wang, C.; Xu, D.; Zhu, Y.; Martín-Martín, R.; Lu, C.; Fei-Fei, L.; Savarese, S. Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3343–3352. [Google Scholar]
- Zhang, Y.F.; Wang, C.Y.; Wang, X.G.; Zeng, W.J.; Liu, W.Y. FairMOT: On the Fairness of Detection and Re-identification in Multiple Object Tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
- Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: New York, NY, USA, 2018; pp. 3645–3649. [Google Scholar]
- Altillawi, M.; Li, S.L.; Prakhya, S.M.; Liu, Z.Y.; Serrat, J. Implicit Learning of Scene Geometry From Poses for Global Localization. IEEE Robot. Autom. Lett. 2024, 9, 955–962. [Google Scholar] [CrossRef]
- Nguyen, A.; Do, T.-T.; Caldwell, D.G.; Tsagarakis, N.G. Real-time 6DOF pose relocalization for event cameras with stacked spatial LSTM networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Hoque, S.; Xu, S.X.; Maiti, A.; Wei, Y.C.; Arafat, M.Y. Deep learning for 6D pose estimation of objects-A case study for autonomous driving. Expert Syst. Appl. 2023, 223, 119838. [Google Scholar] [CrossRef]
- Tekin, B.; Sinha, S.N.; Fua, P. Real-time seamless single shot 6d object pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 292–301. [Google Scholar]
- Su, Y.; Saleh, M.; Fetzer, T.; Rambach, J.; Navab, N.; Busam, B.; Stricker, D.; Tombari, F. Zebrapose: Coarse to fine surface encoding for 6dof object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6738–6748. [Google Scholar]
- Bukschat, Y.; Vetter, M. EfficientPose: An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach. arXiv 2020, arXiv:2011.04307. [Google Scholar]
- Zakharov, S.; Shugurov, I.; Ilic, S. Dpod: 6d pose object detector and refiner. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1941–1950. [Google Scholar]
- Xu, Y.; Lin, K.-Y.; Zhang, G.; Wang, X.; Li, H. Rnnpose: Recurrent 6-dof object pose refinement with robust correspondence field estimation and pose optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 14880–14890. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems; Curran Associates Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
- Zhao, D.H.; Yang, C.H.; Zhang, T.Q.; Yang, J.Y.; Hiroshi, Y. A Task Allocation Approach of Multi-Heterogeneous Robot System for Elderly Care. Machines 2022, 10, 622. [Google Scholar] [CrossRef]
- Xin, P.; Dames, P. Comparing stochastic optimization methods for multi-robot, multi-target tracking. In International Symposium on Distributed Autonomous Robotic Systems; Springer: Berlin/Heidelberg, Germany, 2022; pp. 378–393. [Google Scholar]
- Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards real-time multi-object tracking. In European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 107–122. [Google Scholar]
- Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 474–490. [Google Scholar]
- Peng, J.; Wang, C.; Wan, F.; Wu, Y.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Fu, Y. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 145–161. [Google Scholar]
- Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: New York, NY, USA, 2016; pp. 3464–3468. [Google Scholar] [CrossRef]
- Jawad Alzubairi, S.M.; Petunin, A.; Humaidi, A.J. Multi-robot task allocation based on an automatic clustering strategy employing an enhanced dynamic distributed PSO. Int. Rev. Appl. Sci. Eng. 2025, 16, 347–359. [Google Scholar] [CrossRef]
- Nie, Z.; Zhang, Q.; Wang, X.; Wang, F.; Hu, T. Triangular lattice formation in robot swarms with minimal local sensing. IET Cyber-Syst. Robot. 2023, 5, e12087. [Google Scholar] [CrossRef]
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS--improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
- Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO, Version 8.0.0; 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 16 November 2025). [Google Scholar]
| Method | 6D Pose Est. | 6D Pose Track. | Multi Target | Real Time | Hardware Cost | Camera Setup | Env. Constraints |
|---|---|---|---|---|---|---|---|
| Vicon [1] | Yes | Yes | Yes | Yes | Very High | Multi-camera, fixed | Controlled (no occlusion, stable light) |
| LOAM [5] | No | No | Limited | Yes | High | LiDAR only | Outdoor/indoor, but needs geometry |
| GDR-Net [9] | Yes | No | No | No | Medium | Single RGB | Controlled lab |
| PVN3D [10] | Yes | No | No | Yes | Medium | Single RGB-D | Moderate occlusion OK |
| FairMOT [11] | No | No | Yes | Yes | Medium | Single RGB | General |
| DeepSORT [12] | No | No | Yes | Yes | Low | Single RGB | General |
| Ours | Yes | Yes | Yes | Yes | Low | Single RGB | Indoor, partial occlusion OK |
| Camera Height (m) | Mean X Position Error (m) | Maximum X Position Error (m) | Mean Y Position Error (m) | Maximum Y Position Error (m) | Mean Z Position Error (m) | Maximum Z Position Error (m) |
|---|---|---|---|---|---|---|
| 0.5 | 0.0039 | 0.005 | 0.0025 | 0.011 | 0.0025 | 0.008 |
| 1.5 | 0.008 | 0.016 | 0.005 | 0.013 | 0.009 | 0.021 |
| 2.5 | 0.017 | 0.057 | 0.017 | 0.034 | 0.043 | 0.082 |
| Camera Height (m) | Mean X Angle Error (°) | Maximum X Angle Error (°) | Mean Y Angle Error (°) | Maximum Y Angle Error (°) | Mean Z Angle Error (°) | Maximum Z Angle Error (°) |
|---|---|---|---|---|---|---|
| 0.5 | 2.5 | 4.7 | 1.2 | 3.3 | 0.5 | 1.0 |
| 1.5 | 2.2 | 7.0 | 4.2 | 6.9 | 0.9 | 2.1 |
| 2.5 | 6.6 | 18.2 | 6.5 | 12.6 | 3.9 | 8.8 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Shan, B.; Zhao, D.; Zhao, R.; Hiroshi, Y. Real-Time 6D Pose Estimation and Multi-Target Tracking for Low-Cost Multi-Robot System. Sensors 2025, 25, 7130. https://doi.org/10.3390/s25237130