6DoF Object Pose and Focal Length Estimation from Single RGB Images in Uncontrolled Environments
Abstract
1. Introduction
2. Related Works
2.1. Classical Approaches
2.1.1. Template Matching
2.1.2. Descriptor-Based Techniques
2.1.3. Feature-Based Methods
2.2. Deep Learning Based Approaches
2.2.1. RGB-D Image-Based Approaches
2.2.2. RGB Image-Based Approaches
3. Methodology
3.1. Motivation
3.1.1. Projection Scale Ambiguity in Perspective Projection of Pinhole Camera Model
3.1.2. Decoupling Ambiguity in Projection Scale by Fixing One Correlated Parameter to an Arbitrary Constant
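The ambiguity named in Sections 3.1.1 and 3.1.2 can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes an ideal pinhole camera and an object that is small relative to its distance from the camera; the function `project`, the sample points, and all numeric values are hypothetical. Scaling the focal length and the depth component of the translation by the same factor leaves the projected pixels almost unchanged, whereas scaling the focal length alone does not. This is the sense in which one of the two correlated parameters can be fixed to a constant while the other absorbs the scale.

```python
def project(points, focal, translation):
    """Ideal pinhole projection of object-frame points shifted by translation."""
    tx, ty, tz = translation
    return [(focal * (x + tx) / (z + tz), focal * (y + ty) / (z + tz))
            for x, y, z in points]


def max_pixel_difference(pixels_a, pixels_b):
    """Largest absolute per-coordinate difference between two pixel lists."""
    return max(max(abs(ua - ub), abs(va - vb))
               for (ua, va), (ub, vb) in zip(pixels_a, pixels_b))


# Hypothetical object points (metres), small compared with the camera distance.
points = [(-0.1, -0.1, -0.05), (0.1, -0.1, 0.05),
          (0.1, 0.1, -0.05), (-0.1, 0.1, 0.05)]

focal = 600.0                  # hypothetical focal length in pixels
translation = (0.2, -0.1, 2.0) # hypothetical object translation in metres
scale = 1.5                    # common factor applied to focal length and depth

reference = project(points, focal, translation)

# Scale focal length and depth together: projections stay nearly identical.
scaled_translation = (translation[0], translation[1], translation[2] * scale)
joint = project(points, focal * scale, scaled_translation)
joint_diff = max_pixel_difference(reference, joint)

# Scale the focal length only: projections change substantially.
focal_only = project(points, focal * scale, translation)
focal_only_diff = max_pixel_difference(reference, focal_only)

print("joint scaling, max pixel difference:", joint_diff)
print("focal-only scaling, max pixel difference:", focal_only_diff)
```

The residual difference under joint scaling comes from the depth extent of the object itself; it shrinks as the object becomes thinner or moves farther from the camera.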
3.2. Two-Stage Approach for 6DoF Pose and Focal Length Prediction
3.2.1. Stage I: Arbitrary Estimator Network
- The 3D rotation update rule: As in [13], rotations are updated using Gram–Schmidt orthogonalization. The update in Equation (22) composes the current rotation of the object with a rotation matrix obtained by Gram–Schmidt orthogonalization of two three-dimensional vectors, which the alignment network F predicts as part of its output, to give the updated rotation.
- Focal length update rule: During training of the first stage, the focal length is rescaled to compensate for fixing the correlated parameter to an arbitrary constant. The focal length update rule itself remains the same as in [13], because it involves no correlated parameters.
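The rotation update described in the list above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the orthonormalized vectors form the columns of the update matrix and that the update is applied by left-multiplication, neither of which is stated in this section. All function names and input values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rotation_from_two_vectors(v1, v2):
    """Gram-Schmidt: build a 3x3 rotation matrix from two 3D vectors."""
    e1 = normalize(v1)
    # Remove the component of v2 along e1, then normalize the remainder.
    along = dot(e1, v2)
    e2 = normalize([b - along * a for a, b in zip(e1, v2)])
    # The third axis completes a right-handed orthonormal basis.
    e3 = cross(e1, e2)
    # Assumption: e1, e2, e3 are the columns of the rotation matrix.
    return [[e1[i], e2[i], e3[i]] for i in range(3)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def update_rotation(current, v1, v2):
    """Assumed update: left-multiply the current rotation by the predicted one."""
    return matmul(rotation_from_two_vectors(v1, v2), current)


# Hypothetical usage: start from the identity and apply one predicted update.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
updated = update_rotation(identity, [1.0, 0.2, -0.1], [0.1, 1.0, 0.3])
for row in updated:
    print(row)
```

Because the two input vectors are orthonormalized before use, the network output does not need to be a valid rotation; any pair of non-parallel, non-zero vectors yields an orthogonal matrix with determinant 1.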
3.2.2. Stage II: Depth Estimator Network
4. Results
4.1. Quantitative Results
4.2. Qualitative Results
4.3. Ablation Study
4.3.1. Effect of Using a Refiner in Stage II
4.3.2. Effect of Loss Functions
4.3.3. Effect of the Selected Constant Value in Stage I
- Small (0.2 m): This value resulted in relatively high translation and focal length estimation errors.
- Optimal (2 m): This value produced the best balance, with lower median errors in translation and focal length and higher projection accuracy. This supports the choice of 2 m as a good approximation for initialization.
- Large (20 m): This value degraded the performance, with comparatively higher errors in focal length estimation and lower projection accuracy.
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Balntas, V.; Doumanoglou, A.; Sahin, C.; Sock, J.; Kouskouridas, R.; Kim, T.K. Pose Guided RGBD Feature Learning for 3D Object Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
2. Tian, M.; Pan, L.; Ang, M.H.; Lee, G.H. Robust 6D Object Pose Estimation by Learning RGB-D Features. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 6218–6224.
3. Wang, C.; Xu, D.; Zhu, Y.; Martín-Martín, R.; Lu, C.; Fei-Fei, L.; Savarese, S. DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3338–3347.
4. Li, Y.; Wang, G.; Ji, X.; Xiang, Y.; Fox, D. DeepIM: Deep Iterative Matching for 6D Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
5. Labbé, Y.; Carpentier, J.; Aubry, M.; Sivic, J. CosyPose: Consistent multi-view multi-object 6D pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
6. Brachmann, E.; Michel, F.; Krull, A.; Yang, M.Y.; Gumhold, S.; Rother, C. Uncertainty-Driven 6D Pose Estimation of Objects and Scenes From a Single RGB Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
7. Zhang, X.; Jiang, Z.; Zhang, H. Real-time 6D pose estimation from a single RGB image. Image Vis. Comput. 2019, 89, 1–11.
8. Do, T.T.; Cai, M.; Pham, T.T.; Reid, I.D. Deep-6DPose: Recovering 6D Object Pose from a Single RGB Image. arXiv 2018, arXiv:1802.10367.
9. Park, S.Y.; Son, C.M.; Jeong, W.J.; Park, S. Relative Pose Estimation between Image Object and ShapeNet CAD Model for Automatic 4-DoF Annotation. Appl. Sci. 2023, 13, 693.
10. Nguyen, D.M.H.; Henschel, R.; Rosenhahn, B.; Sonntag, D.; Swoboda, P. LMGP: Lifted Multicut Meets Geometry Projections for Multi-Camera Multi-Object Tracking. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8856–8865.
11. Van Ma, L.; Nguyen, T.T.D.; Vo, B.N.; Jang, H.; Jeon, M. Track initialization and re-identification for 3D multi-view multi-object tracking. Inf. Fusion 2024, 98, 102496.
12. Han, Y.; Di, H.; Zheng, H.; Qi, J.; Gong, J. GCVNet: Geometry Constrained Voting Network to Estimate 3D Pose for Fine-Grained Object Categories. In Proceedings of the Pattern Recognition and Computer Vision: Third Chinese Conference, PRCV 2020, Nanjing, China, 16–18 October 2020; Proceedings, Part I. Springer: Cham, Switzerland, 2020; pp. 180–192.
13. Ponimatkin, G.; Labbé, Y.; Russell, B.; Aubry, M.; Sivic, J. Focal length and object pose estimation via render and compare. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3825–3834.
14. Cífka, M.; Ponimatkin, G.; Labbé, Y.; Russell, B.; Aubry, M.; Petrik, V.; Sivic, J. FocalPose++: Focal Length and Object Pose Estimation via Render and Compare. arXiv 2023, arXiv:2312.02985.
15. He, Z.; Feng, W.; Zhao, X.; Lv, Y. 6D Pose Estimation of Objects: Recent Technologies and Challenges. Appl. Sci. 2021, 11, 228.
16. Gorschlüter, F.; Rojtberg, P.; Pöllabauer, T. A Survey of 6D Object Detection Based on 3D Models for Industrial Applications. J. Imaging 2022, 8, 53.
17. Mueggler, E.; Rebecq, H.; Gallego, G.; Delbruck, T.; Scaramuzza, D. The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM. Int. J. Robot. Res. 2017, 36, 142–149.
18. Gallego, G.; Forster, C.; Mueggler, E.; Scaramuzza, D. Event-based camera pose tracking using a generative event model. arXiv 2015, arXiv:1510.01972.
19. Dufour, R.; Miller, E.; Galatsanos, N. Template matching based object recognition with unknown geometric parameters. IEEE Trans. Image Process. 2002, 11, 1385–1396.
20. Besl, P.J.; McKay, N.D. Method for registration of 3-D shapes. In Proceedings of the Sensor Fusion IV: Control Paradigms and Data Structures; SPIE: Bellingham, WA, USA, 1992; Volume 1611, pp. 586–606.
21. Cyr, C.; Kimia, B. 3D object recognition using shape similiarity-based aspect graph. In Proceedings of the Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume 1, pp. 254–261.
22. Costa, M.S.; Shapiro, L.G. 3D Object Recognition and Pose with Relational Indexing. Comput. Vis. Image Underst. 2000, 79, 364–407.
23. Byne, J.; Anderson, J. A CAD-based computer vision system. Image Vis. Comput. 1998, 16, 533–539.
24. Vock, R.; Dieckmann, A.; Ochmann, S.; Klein, R. Fast template matching and pose estimation in 3D point clouds. Comput. Graph. 2019, 79, 36–45.
25. Reinbacher, C.; Rüther, M.; Bischof, H. Pose Estimation of Known Objects by Efficient Silhouette Matching. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 1080–1083.
26. Rusu, R.B.; Blodow, N.; Marton, Z.C.; Beetz, M. Aligning point cloud views using persistent feature histograms. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, Nice, France, 22–26 September 2008; pp. 3384–3391.
27. Rusu, R.B.; Marton, Z.C.; Blodow, N.; Beetz, M. Learning informative point classes for the acquisition of object model maps. In Proceedings of the 2008 10th International Conference on Control, Automation, Robotics and Vision, Hanoi, Vietnam, 17–20 December 2008; pp. 643–650.
28. Rusu, R.B.; Blodow, N.; Beetz, M. Fast Point Feature Histograms (FPFH) for 3D registration. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; pp. 3212–3217.
29. Salti, S.; Tombari, F.; Di Stefano, L. SHOT: Unique signatures of histograms for surface and texture description. Comput. Vis. Image Underst. 2014, 125, 251–264.
30. Johnson, A.; Hebert, M. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 433–449.
31. Zhou, J.; Liu, Y.; Liu, J.; Xie, Q.; Zhang, Y.; Zhu, X.; Ding, X. BOLD3D: A 3D BOLD descriptor for 6Dof pose estimation. Comput. Graph. 2020, 89, 94–104.
32. Yoon, Y.; DeSouza, G.; Kak, A. Real-time tracking and pose estimation for industrial objects using geometric features. In Proceedings of the 2003 IEEE International Conference on Robotics and Automation (Cat. No.03CH37422), Taipei, Taiwan, 14–19 September 2003; Volume 3, pp. 3473–3478.
33. Seppälä, T.; Saukkoriipi, J.; Lohi, T.; Soutukorva, S.; Heikkilä, T.; Koskinen, J. Feature-Based Object Detection and Pose Estimation Based on 3D Cameras and CAD Models for Industrial Robot Applications. In Proceedings of the 2022 18th IEEE/ASME International Conference on Mechatronic and Embedded Systems and Applications (MESA), Taipei, Taiwan, 28–30 November 2022; pp. 1–5.
34. Teney, D.; Piater, J. Multiview feature distributions for object detection and continuous pose estimation. Comput. Vis. Image Underst. 2014, 125, 265–282.
35. Gedik, O.S.; Alatan, A.A. RGBD data based pose estimation: Why sensor fusion? In Proceedings of the 2015 18th International Conference on Information Fusion (Fusion), Washington, DC, USA, 6–9 July 2015; pp. 2129–2136.
36. da Silva Neto, J.G.; da Lima Silva, P.J.; Figueredo, F.; Teixeira, J.M.X.N.; Teichrieb, V. Comparison of RGB-D sensors for 3D reconstruction. In Proceedings of the 2020 22nd Symposium on Virtual and Augmented Reality (SVR), Porto de Galinhas, Brazil, 7–10 November 2020; pp. 252–261.
37. Xiao, Y.; Du, Y.; Marlet, R. PoseContrast: Class-Agnostic Object Viewpoint Estimation in the Wild with Pose-Aware Contrastive Learning. arXiv 2021, arXiv:2105.05643.
38. Grabner, A.; Roth, P.M.; Lepetit, V. GP2C: Geometric Projection Parameter Consensus for Joint 3D Pose and Focal Length Estimation in the Wild. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2222–2231.
39. Manawadu, M.; Park, S.Y. Enhancing 6DoF Pose and Focal Length Estimation from Uncontrolled RGB Images for Robotics Vision. In Proceedings of the ICRA 2024 Workshop on 3D Visual Representations for Robot Manipulation, Yokohama, Japan, 17 May 2024.
40. Shimshoni, I.; Basri, R.; Rivlin, E. A geometric interpretation of weak-perspective motion. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 252–257.
41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
42. Sun, X.; Wu, J.; Zhang, X.; Zhang, Z.; Zhang, C.; Xue, T.; Tenenbaum, J.B.; Freeman, W.T. Pix3d: Dataset and methods for single-image 3d shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2974–2983.
43. Everingham, M.; Eslami, S.M.A.; Gool, L.V.; Williams, C.K.I.; Winn, J.M.; Zisserman, A. The Pascal Visual Object Classes Challenge: A Retrospective. Int. J. Comput. Vis. 2014, 111, 98–136.
44. Böttcher, A.; Wenzel, D. The Frobenius norm and the commutator. Linear Algebra Its Appl. 2008, 429, 1864–1885.








| Dataset | Method | DoF | Rotation ↓ | Rotation Acc 30° ↑ | Rotation Acc 15° ↑ | Rotation Acc 5° ↑ | Translation ↓ | Pose ↓ | Focal Length ↓ | Projection ↓ | Projection ↑ | Projection ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pix3D Bed | FocalPose [13] | 7 | 0.436 | 53.68% | 32.11% | 3.16% | 0.251 | 0.202 | 0.222 | 0.132 | 41.05% | 13.16% |
| Pix3D Bed | FocalPose++ [14] | 7 | 0.450 | 53.68% | 37.37% | 7.37% | 0.204 | 0.176 | 0.204 | 0.135 | 40.53% | 18.95% |
| Pix3D Bed | Proposed (Stage I) | 6 | 0.389 | 62.11% | 37.89% | 6.32% | 0.019 | 0.044 | 0.064 | 0.104 | 47.37% | 20.53% |
| Pix3D Bed | Proposed (Stage II-) | 7 | 0.382 | 60.00% | 36.32% | 7.89% | 0.200 | 0.179 | 0.208 | 0.119 | 45.26% | 18.42% |
| Pix3D Bed | Proposed (Stage II-) | 7 | 0.387 | 57.37% | 39.47% | 6.84% | 0.187 | 0.174 | 0.199 | 0.129 | 44.21% | 17.89% |
| Pix3D Sofa | FocalPose [13] | 7 | 0.236 | 79.78% | 56.77% | 10.39% | 0.230 | 0.153 | 0.208 | 0.057 | 74.77% | 43.04% |
| Pix3D Sofa | FocalPose++ [14] | 7 | 0.193 | 90.74% | 69.26% | 11.48% | 0.203 | 0.137 | 0.195 | 0.048 | 81.85% | 53.89% |
| Pix3D Sofa | Proposed (Stage I) | 6 | 0.134 | 94.07% | 80.37% | 30.56% | 0.012 | 0.017 | 0.038 | 0.038 | 87.04% | 65.37% |
| Pix3D Sofa | Proposed (Stage II-) | 7 | 0.169 | 92.02% | 74.21% | 20.04% | 0.200 | 0.132 | 0.194 | 0.056 | 81.45% | 41.19% |
| Pix3D Sofa | Proposed (Stage II-) | 7 | 0.172 | 91.47% | 73.10% | 20.59% | 0.192 | 0.124 | 0.197 | 0.055 | 81.82% | 43.97% |
| Pix3D Table | FocalPose [13] | 7 | 0.762 | 36.75% | 17.38% | 1.71% | 0.503 | 0.312 | 0.323 | 0.204 | 19.09% | 3.70% |
| Pix3D Table | FocalPose++ [14] | 7 | 0.617 | 42.17% | 21.08% | 2.28% | 0.391 | 0.277 | 0.363 | 0.202 | 23.36% | 6.84% |
| Pix3D Table | Proposed (Stage I) | 6 | 0.500 | 51.28% | 27.07% | 3.70% | 0.021 | 0.053 | 0.075 | 0.136 | 38.46% | 15.38% |
| Pix3D Table | Proposed (Stage II-) | 7 | 0.587 | 47.29% | 26.50% | 4.56% | 0.279 | 0.213 | 0.315 | 0.180 | 27.07% | 7.41% |
| Pix3D Table | Proposed (Stage II-) | 7 | 0.611 | 46.44% | 24.50% | 5.13% | 0.272 | 0.211 | 0.320 | 0.182 | 26.21% | 5.70% |
| Pix3D Chair | FocalPose [13] | 7 | 0.964 | 24.08% | 7.47% | 0.44% | 0.553 | 0.376 | 0.210 | 0.182 | 16.17% | 1.45% |
| Pix3D Chair | FocalPose++ [14] | 7 | 0.594 | 45.35% | 20.12% | 1.66% | 0.348 | 0.229 | 0.242 | 0.137 | 35.11% | 9.88% |
| Pix3D Chair | Proposed (Stage I) | 6 | 0.278 | 66.69% | 47.95% | 7.86% | 0.020 | 0.026 | 0.061 | 0.068 | 62.44% | 35.26% |
| Pix3D Chair | Proposed (Stage II-) | 7 | 0.288 | 66.35% | 44.96% | 7.40% | 0.216 | 0.146 | 0.210 | 0.096 | 51.56% | 20.96% |
| Pix3D Chair | Proposed (Stage II-) | 7 | 0.286 | 66.28% | 46.41% | 7.54% | 0.220 | 0.147 | 0.211 | 0.098 | 50.69% | 21.25% |

| Parameter | Metric | = | = + |
|---|---|---|---|
| Rotation | ↓ | 0.3821 | 0.4305 |
| Rotation | Acc 30° ↑ | 0.6000 | 0.5737 |
| Rotation | Acc 15° ↑ | 0.3632 | 0.3421 |
| Rotation | Acc 5° ↑ | 0.0789 | 0.0474 |
| Translation | ↓ | 0.1997 | 0.2451 |
| Focal | ↓ | 0.2084 | 0.2961 |
| Pose | ↓ | 0.1788 | 0.1954 |
| Projection | ↓ | 0.1189 | 0.1239 |
| Projection | ↑ | 0.4526 | 0.4211 |
| Projection | ↑ | 0.1842 | 0.1684 |

| Category | Mean (m) | Median (m) | 
|---|---|---|
| Bed | 1.53 | 1.27 | 
| Chair | 1.77 | 1.35 | 
| Table | 2.35 | 1.93 | 
| Sofa | 1.67 | 1.39 | 

| Parameter | Metric | 0.2 m | 2 m | 20 m |
|---|---|---|---|---|
| Rotation | ↓ | 0.3286 | 0.3893 | 1.1300 |
| Rotation | Acc 30° ↑ | 0.5947 | 0.6211 | 0.1053 |
| Rotation | Acc 15° ↑ | 0.4158 | 0.3789 | 0.0158 |
| Rotation | Acc 5° ↑ | 0.0632 | 0.0632 | 0.0053 |
| Translation | ↓ | 0.1554 | 0.0185 | 0.0217 |
| Focal | ↓ | 0.1325 | 0.0641 | 0.0985 |
| Pose | ↓ | 0.3445 | 0.0440 | 0.0116 |
| Projection | ↓ | 0.2102 | 0.1040 | 0.2416 |
| Projection | ↑ | 0.2053 | 0.4737 | 0.1053 |
| Projection | ↑ | 0.0368 | 0.2053 | 0.0158 |

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Manawadu, M.; Park, S.-Y. 6DoF Object Pose and Focal Length Estimation from Single RGB Images in Uncontrolled Environments. Sensors 2024, 24, 5474. https://doi.org/10.3390/s24175474