MuTr: Multi-Stage Transformer for Hand Pose Estimation from Full-Scene Depth Image
Abstract
1. Introduction
- We propose DePOTR, a novel transformer-based method for hand pose estimation that outperforms other state-of-the-art transformer-based methods and achieves results comparable to non-transformer-based methods in the standard setup.
- We introduce MuTr, a multi-stage approach that achieves competitive results while predicting the 3D hand pose directly from the full-scene depth image with a single model, replacing several separate sub-tasks of the usual hand pose estimation pipeline and thereby avoiding tedious data pre-processing (a minimal sketch of the staged inference idea follows this list).
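To make the multi-stage idea concrete, below is a minimal, hypothetical PyTorch sketch. It is our own illustration under several assumptions: a query-per-joint decoder with a standard (non-deformable) transformer standing in for DePOTR's deformable attention, an illustrative joint count, and invented names (`DePOTRStage`, `crop_around`) that are not the authors' API.

```python
# Hypothetical sketch of the staged idea: every stage runs the same kind of
# pose transformer, and each later stage sees a tighter crop of the full-scene
# depth image centred on the previous stage's prediction.
import torch
import torch.nn as nn

NUM_JOINTS = 14  # NYU-style joint set; dataset dependent


class DePOTRStage(nn.Module):
    """One estimation stage: depth input -> per-joint 3D coordinates."""

    def __init__(self, backbone: nn.Module, d_model: int = 256):
        super().__init__()
        self.backbone = backbone  # assumed to return features of shape (B, HW, d_model)
        self.queries = nn.Embedding(NUM_JOINTS, d_model)  # one learned query per joint
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, 3)  # (u, v, depth) per joint

    def forward(self, depth: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(depth)                                   # (B, HW, d_model)
        q = self.queries.weight.unsqueeze(0).expand(depth.size(0), -1, -1)
        return self.head(self.decoder(q, feats))                       # (B, NUM_JOINTS, 3)


def crop_around(full_depth: torch.Tensor, pose: torch.Tensor, size: int = 128):
    """Crude square re-crop around the mean predicted joint position.

    A stand-in for the re-cropping/re-scaling between stages; pose holds
    (u, v, depth) image-space coordinates, and the image is assumed to be
    larger than `size` in both dimensions.
    """
    crops = []
    for img, joints in zip(full_depth, pose):  # img: (1, H, W)
        h, w = img.shape[-2:]
        u = int(joints[:, 0].mean().clamp(size // 2, w - size // 2))
        v = int(joints[:, 1].mean().clamp(size // 2, h - size // 2))
        crops.append(img[..., v - size // 2:v + size // 2,
                              u - size // 2:u + size // 2])
    return torch.stack(crops)


def multi_stage_inference(stages, full_scene_depth):
    """Stage 1 sees the full scene; later stages refine on tighter crops.

    A real system would also map crop-space predictions back to full-image
    coordinates between stages; that bookkeeping is omitted here.
    """
    crop, pose = full_scene_depth, None
    for stage in stages:
        pose = stage(crop)
        crop = crop_around(full_scene_depth, pose)
    return pose
```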
2. Related Work
3. Methods
3.1. Training Data Modalities
3.2. Deformable Pose Estimation Transformer
3.3. Multi-Stage Transformer
4. Experiments
4.1. Datasets and Evaluation
4.2. Implementation Details
4.3. Ablation Study
4.4. Comparison with State-of-the-Art
5. Attention Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Romero, J.; Kjellstrom, H.; Kragic, D. Monocular real-time 3D articulated hand pose estimation. In Proceedings of the 9th IEEE RAS International Conference on Humanoid Robots, Paris, France, 7–10 December 2009; pp. 87–92.
- Feix, T.; Romero, J.; Ek, C.H.; Schmiedmayer, H.B.; Kragic, D. A Metric for Comparing the Anthropomorphic Motion Capability of Artificial Hands. IEEE Trans. Robot. 2013, 29, 82–93.
- Zimmermann, C.; Brox, T. Learning to Estimate 3D Hand Pose From Single RGB Images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Garcia-Hernando, G.; Yuan, S.; Baek, S.; Kim, T.K. First-Person Hand Action Benchmark With RGB-D Videos and 3D Hand Pose Annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
- Tekin, B.; Bogo, F.; Pollefeys, M. H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
- Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
- Oberweger, M.; Lepetit, V. DeepPrior++: Improving Fast and Accurate 3D Hand Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017.
- Kolesnikov, A.; Dosovitskiy, A.; Weissenborn, D.; Heigold, G.; Uszkoreit, J.; Beyer, L.; Minderer, M.; Dehghani, M.; Houlsby, N.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021. Available online: https://openreview.net/forum?id=YicbFdNTTy (accessed on 11 June 2023).
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 10347–10357.
- Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing convolutions to vision transformers. arXiv 2021, arXiv:2103.15808.
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv 2021, arXiv:2102.12122.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv 2021, arXiv:2103.14030.
- Yang, J.; Li, C.; Zhang, P.; Dai, X.; Xiao, B.; Yuan, L.; Gao, J. Focal self-attention for local-global interactions in vision transformers. arXiv 2021, arXiv:2107.00641.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
- Zheng, M.; Gao, P.; Wang, X.; Li, H.; Dong, H. End-to-end object detection with adaptive clustering transformer. arXiv 2020, arXiv:2011.09315.
- Dai, Z.; Cai, B.; Lin, Y.; Chen, J. UP-DETR: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1601–1610.
- Wang, H.; Zhu, Y.; Adam, H.; Yuille, A.; Chen, L.C. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5463–5474.
- Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng, B.; Shen, H.; Xia, H. End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8741–8750.
- Ge, L.; Liang, H.; Yuan, J.; Thalmann, D. Real-Time 3D Hand Pose Estimation with 3D Convolutional Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 956–970.
- Oberweger, M.; Wohlhart, P.; Lepetit, V. Generalized Feedback Loop for Joint Hand-Object Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1898–1912.
- Moon, G.; Yong Chang, J.; Mu Lee, K. V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation From a Single Depth Map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
- Huang, F.; Zeng, A.; Liu, M.; Qin, J.; Xu, Q. Structure-Aware 3D Hourglass Network for Hand Pose Estimation from Single Depth Image. In Proceedings of the British Machine Vision Conference, BMVC, Newcastle, UK, 3–6 September 2018; BMVA Press: Durham, UK, 2018; p. 289.
- Ting, P.W.; Chou, E.T.; Tang, Y.H.; Fu, L.C. Hand Pose Estimation Based on 3D Residual Network with Data Padding and Skeleton Steadying. In Proceedings of the Asian Conference on Computer Vision (ACCV), Perth, Australia, 2–6 December 2018; Jawahar, C., Li, H., Mori, G., Schindler, K., Eds.; Springer: Cham, Switzerland, 2019; pp. 293–307.
- Guo, F.; He, Z.; Zhang, S.; Zhao, X.; Tan, J. Attention-Based Pose Sequence Machine for 3D Hand Pose Estimation. IEEE Access 2020, 8, 18258–18269.
- Xiong, F.; Zhang, B.; Xiao, Y.; Cao, Z.; Yu, T.; Zhou, J.T.; Yuan, J. A2J: Anchor-to-Joint Regression Network for 3D Articulated Pose Estimation From a Single Depth Image. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Ren, P.; Sun, H.; Qi, Q.; Wang, J.; Huang, W. SRN: Stacked Regression Network for Real-time 3D Hand Pose Estimation. In Proceedings of the British Machine Vision Conference, BMVC, Cardiff, UK, 9–12 September 2019.
- Ren, P.; Sun, H.; Huang, W.; Hao, J.; Cheng, D.; Qi, Q.; Wang, J.; Liao, J. Spatial-aware stacked regression network for real-time 3D hand pose estimation. Neurocomputing 2021, 437, 42–57.
- Ge, L.; Ren, Z.; Yuan, J. Point-to-Point Regression PointNet for 3D Hand Pose Estimation. In Proceedings of the European Conference on Computer Vision, ECCV, Munich, Germany, 8–14 September 2018.
- Li, S.; Lee, D. Point-To-Pose Voting Based Hand Pose Estimation Using Residual Permutation Equivariant Layer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
- Chen, X.; Wang, G.; Zhang, C.; Kim, T.; Ji, X. SHPR-Net: Deep Semantic Hand Pose Regression From Point Clouds. IEEE Access 2018, 6, 43425–43439.
- Huang, L.; Tan, J.; Liu, J.; Yuan, J. Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 17–33.
- Li, K.; Wang, S.; Zhang, X.; Xu, Y.; Xu, W.; Tu, Z. Pose Recognition With Cascade Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 1944–1953.
- Hampali, S.; Sarkar, S.D.; Rad, M.; Lepetit, V. Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11090–11100.
- Chen, T.; Wu, M.; Hsieh, Y.; Fu, L. Deep learning for integrated hand detection and pose estimation. In Proceedings of the International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 615–620.
- Choi, C.; Kim, S.; Ramani, K. Learning Hand Articulations by Hallucinating Heat Distribution. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Che, Y.; Song, Y.; Qi, Y. A Novel Framework of Hand Localization and Hand Pose Estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2222–2226.
- Tompson, J.; Stein, M.; Lecun, Y.; Perlin, K. Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks. ACM Trans. Graph. 2014, 33, 1–10.
- Oberweger, M.; Wohlhart, P.; Lepetit, V. Hands Deep in Deep Learning for Hand Pose Estimation. In Proceedings of the Computer Vision Winter Workshop (CVWW), 2015.
- Ge, L.; Liang, H.; Yuan, J.; Thalmann, D. Robust 3D Hand Pose Estimation in Single Depth Images: From Single-View CNN to Multi-View CNNs. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3593–3601.
- Tang, D.; Jin Chang, H.; Tejani, A.; Kim, T.K. Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014.
- Yuan, S.; Garcia-Hernando, G.; Stenger, B.; Moon, G.; Chang, J.Y.; Lee, K.M.; Molchanov, P.; Kautz, J.; Honari, S.; Ge, L.; et al. Depth-Based 3D Hand Pose Estimation: From Current Achievements to Future Goals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
- Armagan, A.; Garcia-Hernando, G.; Baek, S.; Hampali, S.; Rad, M.; Zhang, Z.; Xie, S.; Chen, M.; Zhang, B.; Xiong, F.; et al. Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
- Yuan, S.; Ye, Q.; Stenger, B.; Jain, S.; Kim, T. BigHand2.2M Benchmark: Hand Pose Dataset and State of the Art Analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2605–2613.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
- Tan, M.; Le, Q. EfficientNetV2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 18–24 July 2021; pp. 10096–10106.
- Supancic, J.S., III; Rogez, G.; Yang, Y.; Shotton, J.; Ramanan, D. Depth-Based Hand Pose Estimation: Data, Methods and Challenges. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
Ablation study of DePOTR on the NYU dataset (average joint error in mm; lower is better):

| Method | Precision (mm) |
|---|---|
| Baseline | 11.11 |
| with 2.5Dproj | 10.37 |
| with 3Dproj | 10.25 |
| +prep. + E&D aug. + re-scal. | 9.91 |
| +enhance + crop-out aug. | 8.62 |
| +enhance2 | 7.85 |
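The "Precision (mm)" reported in these tables is the average per-joint 3D error commonly used on depth-based hand pose benchmarks (lower is better). A minimal NumPy sketch of such a metric, as our own illustration rather than the authors' evaluation code:

```python
import numpy as np

def mean_joint_error_mm(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints.

    pred, gt: (N, J, 3) arrays of 3D joint positions in millimeters,
    for N frames with J joints each.
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Example: two frames of a 14-joint hand with synthetic noisy predictions.
rng = np.random.default_rng(0)
gt = rng.uniform(-100, 100, size=(2, 14, 3))
pred = gt + rng.normal(scale=5.0, size=gt.shape)
print(f"mean joint error: {mean_joint_error_mm(pred, gt):.2f} mm")
```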
Ablation study of the multi-stage MuTr approach on full-scene NYU data (average joint error in mm); S1–S4 denote the successive stages, and "S1 init"/"S2 init" mark initialization from the previous stage's weights:

| Method | Precision (mm) |
|---|---|
| S1 baseline | 31.92 |
| S1 enhance | 24.43 |
| S1 enhance2 | 21.57 |
| S1 enhance2 | 20.24 |
| S2 baseline | 15.69 |
| S2 S1 init | 15.92 |
| S2 enhance | 11.17 |
| S2 enhance2 | 10.44 |
| S2 enhance2 | 9.11 |
| S3 baseline | 15.49 |
| S3 S2 init | 14.41 |
| S3 enhance | 10.66 |
| S3 enhance2 | 9.84 |
| S3 enhance2 | 8.71 |
| S4 enhance2 | 9.86 |
| Ours DePOTR | 10.83 |
| Ours DePOTR | 7.85 |
Comparison with the state of the art (average joint error in mm on NYU and ICVL; FPS = frames per second). The lower part of the table covers methods operating on the full-scene image:

| Method | NYU (mm) | ICVL (mm) | FPS |
|---|---|---|---|
| DeepPrior++ [9] | 12.24 | 8.10 | 30.0 |
| A2J [28] | 8.61 | 6.46 | 105.1 |
| V2V-PoseNet [24] | 8.42 | 6.28 | 3.5 |
| SRN [29] | 7.79 | 6.27 | 263.4 |
| SSRN [30] | 7.37 | 6.01 | 295.6 |
| H-trans [34] | 9.80 | 6.47 | 43.2 |
| Ours DePOTR | 7.85 | 5.98 | 106.9 |
| **Full-Scene Image** | | | |
| WR-OCNN [39] | 15.62 | — | — |
| SRN-FI | 39.97 | — | — |
| Ours MuTr (only S1) | 20.24 | — | — |
| Ours MuTr | 8.71 | — | — |
Comparison on seen and unseen evaluation data (average joint error in mm):

| Method | Avg. (mm) | Seen (mm) | Unseen (mm) |
|---|---|---|---|
| V2V-PoseNet | 9.95 | 6.98 | 12.43 |
| A2J | 8.57 | 6.92 | 9.95 |
| SRN | 8.39 | 6.06 | 10.33 |
| SSRN | 7.88 | 5.65 | 9.75 |
| Ours DePOTR | 8.88 | 6.36 | 10.98 |
Stage-wise MuTr results (average joint error in mm):

| Method | Precision (mm) |
|---|---|
| S1 enhance2 | 23.97 |
| S2 enhance2 | 15.64 |
| S3 enhance2 | 15.66 |