Pointless Pose: Part Affinity Field-Based 3D Pose Estimation without Detecting Keypoints
Abstract
1. Introduction
2. Related Work
2.1. 2D Keypoint Estimation Based Methods
2.2. Part Affinity Fields Based Methods
2.3. Our Approach
3. Method
3.1. Consistent 1D/2D/3D PAFs Representations
3.2. Simultaneous 1D/2D/3D PAFs Learning
3.2.1. Semi-Supervised 3D PAFs Training Strategy
3.2.2. The Loss Functions
3.3. Differentiable Post-Processing
3.3.1. 3D PAF Refinement
3.3.2. 3D Orientation Injection
3.4. End-to-End Training with 3D Pose Loss
4. Experiments
4.1. Datasets and Protocols
4.2. Implementation Details
4.3. Quantitative Results on Human3.6M
4.4. Quantitative Results on MPI-INF-3DHP
4.5. Qualitative Results on MPII
4.6. Robustness Analysis: A Case Study
4.7. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
2. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016.
3. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
4. Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
5. Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral Human Pose Regression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
6. Rhodin, H.; Spörri, J.; Katircioglu, I.; Constantin, V.; Meyer, F.; Müller, E.; Salzmann, M.; Fua, P. Learning Monocular 3D Human Pose Estimation From Multi-View Images. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 18–22 June 2018.
7. Sárándi, I.; Linder, T.; Arras, K.O.; Leibe, B. How robust is 3D human pose estimation to occlusion? arXiv, 2018; arXiv:1808.09316.
8. Sárándi, I.; Linder, T.; Arras, K.O.; Leibe, B. Synthetic occlusion augmentation with volumetric heatmaps for the 2018 ECCV PoseTrack challenge on 3D human pose estimation. arXiv, 2018; arXiv:1809.04987.
9. Chen, X.; Lin, K.Y.; Liu, W.; Qian, C.; Lin, L. Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–21 June 2019.
10. Qiu, H.; Wang, C.; Wang, J.; Wang, N.; Zeng, W. Cross View Fusion for 3D Human Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019.
11. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
12. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339.
13. Zhou, X.; Zhu, M.; Leonardos, S.; Derpanis, K.G.; Daniilidis, K. Sparseness meets deepness: 3D human pose estimation from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
14. Zhou, X.; Zhu, M.; Leonardos, S.; Daniilidis, K. Sparse representation for 3D shape estimation: A convex relaxation approach. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1648–1661.
15. Zhou, X.; Zhu, M.; Pavlakos, G.; Leonardos, S.; Derpanis, K.G.; Daniilidis, K. MonoCap: Monocular human motion capture using a CNN coupled with a geometric prior. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 901–914.
16. Chen, C.H.; Ramanan, D. 3D Human Pose Estimation = 2D Pose Estimation + Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
17. Bogo, F.; Kanazawa, A.; Lassner, C.; Gehler, P.; Romero, J.; Black, M.J. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016.
18. Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A Simple yet Effective Baseline for 3D Human Pose Estimation. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
19. Sun, X.; Shang, J.; Liang, S.; Wei, Y. Compositional Human Pose Regression. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
20. Tome, D.; Russell, C.; Agapito, L. Lifting From the Deep: Convolutional 3D Pose Estimation From a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
21. Moreno-Noguer, F. 3D Human Pose Estimation from a Single Image via Distance Matrix Regression. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
22. Nie, B.X.; Wei, P.; Zhu, S.C. Monocular 3D human pose estimation by predicting depth on joints. In Proceedings of the International Conference on Computer Vision, 22–29 October 2017.
23. Fang, H.; Xu, Y.; Wang, W.; Liu, X.; Zhu, S.C. Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
24. Lee, K.; Lee, I.; Lee, S. Propagating LSTM: 3D Pose Estimation based on Joint Interdependency. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
25. Kocabas, M.; Karagoz, S.; Akbas, E. Self-Supervised Learning of 3D Human Pose Using Multi-View Geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–21 June 2019.
26. Arnab, A.; Doersch, C.; Zisserman, A. Exploiting Temporal Context for 3D Human Pose Estimation in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–21 June 2019.
27. Chen, W.; Wang, H.; Li, Y.; Su, H.; Wang, Z.; Tu, C.; Lischinski, D.; Cohen-Or, D.; Chen, B. Synthesizing Training Images for Boosting Human 3D Pose Estimation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 479–488.
28. Bagiwa, M.A.; Wahab, A.W.A.; Idris, M.Y.I.; Khan, S.; Choo, K.K.R. Chroma key background detection for digital video using statistical correlation of blurring artifact. Digit. Investig. 2016, 19, 29–43.
29. Aminu, M.; Wahid, A.; Idris, M.; Khan, S. Digital Video Inpainting Detection Using Correlation Of Hessian Matrix. Malays. J. Comput. Sci. 2016, 29, 179–195.
30. Hossain, M.R.I.; Little, J.J. Exploiting temporal information for 3D human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
31. Pons-Moll, G.; Fleet, D.J.; Rosenhahn, B. Posebits for monocular human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2337–2344.
32. Zhou, X.; Huang, Q.; Sun, X.; Xue, X.; Wei, Y. Towards 3D human pose estimation in the wild: A weakly-supervised approach. In Proceedings of the International Conference on Computer Vision, 22–29 October 2017.
33. Pavlakos, G.; Zhou, X.; Daniilidis, K. Ordinal Depth Supervision for 3D Human Pose Estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, 18–22 June 2018.
34. Wang, J.; Huang, S.; Wang, X.; Tao, D. Not All Parts Are Created Equal: 3D Pose Estimation by Modeling Bi-Directional Dependencies of Body Parts. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019.
35. Luo, C.; Chu, X.; Yuille, A. OriNet: A fully convolutional network for 3D human pose estimation. arXiv, 2018; arXiv:1811.04989.
36. Xiang, D.; Joo, H.; Sheikh, Y. Monocular Total Capture: Posing Face, Body, and Hands in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–21 June 2019.
37. Liu, D.; Zhao, Z.; Wang, X.; Hu, Y.; Zhang, L.; Huang, T. Improving 3D Human Pose Estimation Via 3D Part Affinity Fields. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1004–1013.
38. Yang, W.; Ouyang, W.; Wang, X.; Ren, J.; Li, H.; Wang, X. 3D Human Pose Estimation in the Wild by Adversarial Learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
39. Dabral, R.; Mundhada, A.; Kusupati, U.; Afaque, S.; Sharma, A.; Jain, A. Learning 3D Human Pose from Structure and Motion. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
40. Zhou, X.; Sun, X.; Zhang, W.; Liang, S.; Wei, Y. Deep kinematic pose regression. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016.
41. Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3D human pose estimation in the wild using improved CNN supervision. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 506–516.
42. Chen, X.; Lin, K.Y.; Liu, W.; Qian, C.; Lin, L. Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation. arXiv, 2019; arXiv:1903.08839.
43. Tekin, B.; Marquez Neila, P.; Salzmann, M.; Fua, P. Learning to fuse 2D and 3D image cues for monocular body pose estimation. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
44. Habibie, I.; Xu, W.; Mehta, D.; Pons-Moll, G.; Theobalt, C. In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations. arXiv, 2019; arXiv:1904.03289.
45. Li, C.; Lee, G.H. Generating Multiple Hypotheses for 3D Human Pose Estimation with Mixture Density Network. arXiv, 2019; arXiv:1904.05547.
MPJPE | Direct. | Discuss | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tekin et al. [43] | 54.2 | 61.4 | 60.2 | 61.2 | 79.4 | 78.3 | 63.1 | 81.6 | 70.1 | 107.3 | 69.3 | 70.3 | 74.3 | 51.8 | 74.3 | 69.7 |
Zhou et al. [32] | 54.8 | 60.7 | 58.2 | 71.4 | 62.0 | 65.5 | 53.8 | 55.6 | 75.2 | 111.6 | 64.2 | 66.1 | 51.4 | 63.2 | 55.3 | 64.9 |
Martinez et al. [18] | 51.8 | 56.2 | 58.1 | 59.0 | 69.5 | 78.4 | 55.2 | 58.1 | 74.0 | 94.6 | 62.3 | 59.1 | 65.1 | 49.5 | 52.4 | 62.9 |
Sun et al. [19] | 52.8 | 54.8 | 54.2 | 54.3 | 61.8 | 67.2 | 53.1 | 53.6 | 71.7 | 86.7 | 61.5 | 53.4 | 61.6 | 47.1 | 53.4 | 59.1 |
Fang et al. [23] | 50.1 | 54.3 | 57.0 | 57.1 | 66.6 | 73.3 | 53.4 | 55.7 | 72.8 | 88.6 | 60.3 | 57.7 | 62.7 | 47.5 | 50.6 | 60.4 |
Yang et al. [38] | 51.5 | 58.9 | 50.4 | 57.0 | 62.1 | 65.4 | 49.8 | 52.7 | 69.2 | 85.2 | 57.4 | 58.4 | 43.6 | 60.1 | 47.7 | 58.6 |
Pavlakos et al. [33] | 48.5 | 54.4 | 54.4 | 52.0 | 59.4 | 65.3 | 49.9 | 52.9 | 65.8 | 71.1 | 56.6 | 52.9 | 60.9 | 44.7 | 47.8 | 56.2 |
Lee et al. [24] | 43.8 | 51.7 | 48.8 | 53.1 | 52.2 | 74.9 | 52.7 | 44.6 | 56.9 | 74.3 | 56.7 | 66.4 | 68.4 | 47.5 | 45.6 | 55.8 |
Dabral et al. [39] | 46.9 | 53.8 | 47.0 | 52.8 | 56.9 | 63.6 | 45.2 | 48.2 | 68.0 | 94.0 | 55.7 | 51.6 | 55.4 | 40.3 | 44.3 | 55.5 |
Chen et al. [42] | 45.9 | 53.5 | 50.1 | 53.2 | 61.5 | 72.8 | 50.7 | 49.4 | 68.4 | 82.1 | 58.6 | 53.9 | 57.6 | 41.1 | 46.0 | 56.9 |
Sun et al. [5] | 46.5 | 48.1 | 49.9 | 51.1 | 47.3 | 43.2 | 45.9 | 57.0 | 77.6 | 47.9 | 54.9 | 46.9 | 37.1 | 49.8 | 41.2 | 49.8 |
Chen et al. [42] | 41.1 | 44.2 | 44.9 | 45.9 | 46.5 | 39.3 | 41.6 | 54.8 | 73.2 | 46.2 | 48.7 | 42.1 | 35.8 | 46.6 | 38.5 | 46.3 |
Ours () | 51.2 | 56.5 | 54.0 | 57.1 | 59.4 | 63.3 | 51.1 | 53.3 | 65.2 | 74.5 | 57.4 | 54.6 | 59.8 | 52.7 | 47.9 | 57.2 |
Ours () | 48.6 | 54.5 | 53.1 | 55.0 | 57.2 | 60.8 | 47.9 | 53.0 | 64.2 | 74.9 | 56.8 | 51.1 | 56.4 | 49.1 | 45.2 | 55.2 |
Ours (GT Length) | 43.6 | 50.3 | 50.2 | 50.7 | 54.1 | 58.8 | 43.4 | 49.5 | 61.8 | 72.9 | 54.2 | 47.5 | 53.9 | 45.3 | 41.9 | 51.9 |

PA-MPJPE | Direct. | Discuss | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Moreno-Noguer [21] | 66.1 | 61.7 | 84.5 | 73.7 | 65.2 | 67.2 | 60.9 | 67.3 | 103.5 | 74.6 | 92.6 | 69.6 | 71.5 | 78.0 | 73.2 | 74.0 |
Martinez et al. [18] | 39.5 | 43.2 | 46.4 | 47.0 | 51.0 | 56.0 | 41.4 | 40.6 | 56.5 | 59.4 | 49.2 | 45.0 | 49.5 | 38.0 | 43.1 | 47.7 |
Fang et al. [23] | 38.2 | 41.7 | 43.7 | 44.9 | 48.5 | 55.3 | 40.2 | 38.2 | 54.5 | 64.4 | 47.2 | 44.3 | 47.3 | 36.7 | 41.7 | 45.7 |
Pavlakos et al. [33] | 34.7 | 39.8 | 41.8 | 38.6 | 42.5 | 47.5 | 38.0 | 36.6 | 50.7 | 56.8 | 42.6 | 39.6 | 43.9 | 32.1 | 36.5 | 41.8 |
Lee et al. [24] | 38.0 | 39.1 | 46.3 | 44.4 | 49.0 | 55.1 | 40.2 | 41.1 | 53.2 | 68.9 | 51.0 | 39.1 | 56.4 | 33.9 | 38.5 | 46.2 |
Dabral et al. [39] | 32.8 | 36.8 | 42.5 | 38.5 | 42.4 | 49.0 | 35.4 | 34.3 | 53.6 | 66.2 | 46.5 | 34.1 | 42.3 | 30.0 | 39.7 | 42.2 |
Ours () | 37.3 | 40.3 | 39.9 | 41.2 | 43.4 | 43.7 | 37.3 | 38.7 | 50.7 | 56.9 | 42.9 | 37.9 | 43.7 | 38.7 | 35.1 | 41.8 |
Ours () | 36.2 | 39.4 | 39.1 | 40.1 | 43.1 | 43.6 | 35.1 | 38.6 | 50.7 | 57.2 | 43.5 | 36.5 | 41.8 | 36.3 | 33.9 | 40.9 |
Ours (GT Length) | 33.7 | 37.5 | 38.0 | 37.7 | 42.0 | 42.4 | 32.7 | 37.1 | 50.4 | 56.4 | 42.5 | 34.6 | 40.6 | 33.7 | 31.1 | 39.4 |
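
The MPJPE and PA-MPJPE columns above can be reproduced from predicted and ground-truth joint positions. The following is a minimal NumPy sketch of the two metrics as they are commonly defined (mean per-joint Euclidean error, and the same error after a rigid similarity alignment); it is not the authors' evaluation code. The function names `mpjpe`, `procrustes_align`, and `pa_mpjpe` are hypothetical, and the sketch assumes a single pose given as a `(J, 3)` array in the same unit as the ground truth (conventionally millimetres on Human3.6M).

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Similarity-align pred (J, 3) to gt (J, 3) with scale, rotation, translation."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Orthogonal Procrustes: SVD of the 3x3 cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last singular direction if needed so the result is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    diag = np.array([1.0, 1.0, d])
    rot = u @ np.diag(diag) @ vt
    # Least-squares optimal isotropic scale for the chosen rotation.
    scale = (s * diag).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

Because the alignment removes global scale, rotation, and translation, PA-MPJPE is never larger than the best achievable MPJPE under a similarity transform, which is consistent with the PA-MPJPE rows being lower than the corresponding MPJPE rows.
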
MPLORE () | Direct. | Discuss | Eat | Greet | Phone | Photo | Pose | Purch. | Sit | SitD. | Smoke | Wait | WalkD. | Walk | WalkT | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Zhou et al. [32] | 11.72 | 13.19 | 11.56 | 12.73 | 13.99 | 14.45 | 11.36 | 12.05 | 16.05 | 15.86 | 13.89 | 12.21 | 13.44 | 10.71 | 10.63 | 12.92 |
Martinez et al. [18] | 8.73 | 9.29 | 8.96 | 9.20 | 10.85 | 11.86 | 9.27 | 8.70 | 12.10 | 15.14 | 10.20 | 9.98 | 10.82 | 8.10 | 8.97 | 10.14 |
Wang et al. [34] | 8.64 | 8.94 | 8.88 | 9.08 | 10.38 | 10.90 | 9.01 | 8.31 | 11.43 | 11.84 | 10.04 | 9.28 | 10.14 | 7.72 | 8.74 | 9.56 |
Ours | 8.27 | 8.84 | 8.65 | 8.68 | 9.69 | 9.98 | 8.13 | 8.03 | 10.66 | 10.34 | 9.44 | 8.59 | 9.63 | 7.76 | 8.33 | 9.00 |
Metric | [41] | [32] | [33] | [38] | [44] | [42] | [35] | [45] | Ours | Ours | Ours |
---|---|---|---|---|---|---|---|---|---|---|---|
3DPCK | 64.7 | 69.2 | 71.9 | 69.0 | 69.6 | 68.7 | 64.6 | 67.9 | 69.4 | 70.5 | 71.1 |
AUC | 31.7 | 32.5 | 35.3 | 32.0 | 35.5 | 34.6 | 32.1 | - | 37.3 | 37.4 | 38.3 |
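
For readers unfamiliar with the 3DPCK and AUC rows, here is a short NumPy sketch of how these two numbers are usually computed on MPI-INF-3DHP. It is an illustration under stated assumptions rather than the evaluation script behind the table: the 150 mm threshold and the 31 evenly spaced AUC thresholds between 0 and 150 mm are the commonly used defaults and are assumed here, and the names `joint_errors`, `pck3d`, and `auc3d` are hypothetical.

```python
import numpy as np


def joint_errors(pred, gt):
    """Per-joint Euclidean distances between predicted and ground-truth joints."""
    return np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)


def pck3d(pred, gt, threshold=150.0):
    """Percentage of joints whose error is within the threshold (assumed 150 mm)."""
    return float(100.0 * (joint_errors(pred, gt) <= threshold).mean())


def auc3d(pred, gt, max_threshold=150.0, steps=31):
    """Mean PCK over evenly spaced thresholds in [0, max_threshold], in percent."""
    errs = joint_errors(pred, gt)
    thresholds = np.linspace(0.0, max_threshold, steps)
    return float(100.0 * np.mean([(errs <= t).mean() for t in thresholds]))
```

3DPCK only reports how many joints fall under one cut-off, while AUC averages over the whole threshold range, so it also rewards predictions that are accurate well below the cut-off.
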
Methods | MPJPE |
---|---|
Baseline | 59.3 |
Baseline + Denoising | 58.9 |
Baseline + Denoising + | 56.7 |
Baseline + Denoising + + Flip | 55.2 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Wang, J.; Luo, Z. Pointless Pose: Part Affinity Field-Based 3D Pose Estimation without Detecting Keypoints. Electronics 2021, 10, 929. https://doi.org/10.3390/electronics10080929
Chicago/Turabian Style: Wang, Jue, and Zhigang Luo. 2021. "Pointless Pose: Part Affinity Field-Based 3D Pose Estimation without Detecting Keypoints" Electronics 10, no. 8: 929. https://doi.org/10.3390/electronics10080929
APA Style: Wang, J., & Luo, Z. (2021). Pointless Pose: Part Affinity Field-Based 3D Pose Estimation without Detecting Keypoints. Electronics, 10(8), 929. https://doi.org/10.3390/electronics10080929