A New Multi-Person Pose Estimation Method Using the Partitioned CenterPose Network
Abstract
1. Introduction
- (1) We propose a novel partition pose representation method that relates body joints to the body center while preserving correlations between adjacent body joints.
- (2) We propose a new bottom-up model with an improved loss that efficiently and robustly predicts body joints and assigns them to individual people.
- (3) In experiments on the MS COCO and CrowdPose datasets, our PCP Network is competitive with state-of-the-art methods while achieving a higher inference speed.
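
Contribution (1) is only summarized in the list above. As a rough illustration, and not the authors' implementation, the sketch below encodes one person as a body center plus two levels of offsets: each part (a group of adjacent joints) stores an anchor offset from the center, and each joint stores an offset from its part anchor. The part grouping, the joint names, the choice of the joint mean as the center and anchor, and the function names `encode` and `decode` are all assumptions made for this example.

```python
# Hypothetical grouping of joints into parts of adjacent joints.
# The grouping and joint names are assumptions for this sketch only.
PARTS = {
    "head": ["nose", "neck"],
    "left_arm": ["left_shoulder", "left_elbow", "left_wrist"],
    "right_arm": ["right_shoulder", "right_elbow", "right_wrist"],
}


def mean_point(points):
    """Return the arithmetic mean of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def encode(joints, parts=PARTS):
    """Encode {joint: (x, y)} as a center, center-to-part and part-to-joint offsets."""
    center = mean_point(list(joints.values()))
    part_offsets, joint_offsets = {}, {}
    for part, names in parts.items():
        # Assumed anchor: the mean of the joints that belong to the part.
        anchor = mean_point([joints[n] for n in names])
        part_offsets[part] = (anchor[0] - center[0], anchor[1] - center[1])
        for n in names:
            joint_offsets[n] = (joints[n][0] - anchor[0], joints[n][1] - anchor[1])
    return center, part_offsets, joint_offsets


def decode(center, part_offsets, joint_offsets, parts=PARTS):
    """Invert encode(): rebuild absolute joint positions from the offsets."""
    joints = {}
    for part, names in parts.items():
        ax = center[0] + part_offsets[part][0]
        ay = center[1] + part_offsets[part][1]
        for n in names:
            joints[n] = (ax + joint_offsets[n][0], ay + joint_offsets[n][1])
    return joints


# Toy pose in pixel coordinates (made-up values for illustration).
pose = {
    "nose": (50.0, 10.0), "neck": (50.0, 20.0),
    "left_shoulder": (40.0, 22.0), "left_elbow": (36.0, 34.0), "left_wrist": (34.0, 46.0),
    "right_shoulder": (60.0, 22.0), "right_elbow": (64.0, 34.0), "right_wrist": (66.0, 46.0),
}
center, part_offsets, joint_offsets = encode(pose)
restored = decode(center, part_offsets, joint_offsets)
```

In this toy encoding every joint is reachable from the center through its part anchor, which ties joints to a person, while joints in the same part share an anchor, which keeps neighbouring joints coupled. How the paper actually defines its partitions and offsets may differ.
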
2. Related Work
2.1. Multi-Person Pose Estimation
2.2. Backbone Network
3. Partition Pose Representation
4. Partitioned CenterPose Network
4.1. Network Architecture
4.2. Training and Inference
5. Experiments
5.1. Dataset
5.2. Experimental Setup
5.3. Experimental Results
5.4. Ablation Analysis
5.5. CrowdPose
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- [1] Dong, J.; Gao, Y.; Lee, H.J.; Zhou, H.; Yao, Y.; Fang, Z.; Huang, B. Action Recognition Based on the Fusion of Graph Convolutional Networks with High Order Features. Appl. Sci. 2020, 10, 1482.
- [2] Insafutdinov, E.; Andriluka, M.; Pishchulin, L.; Tang, S.; Levinkov, E.; Andres, B.; Schiele, B. ArtTrack: Articulated multi-person tracking in the wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Institute of Electrical and Electronics Engineers (IEEE): Piscataway, NJ, USA, 2017; pp. 6457–6465.
- [3] Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897.
- [4] Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499.
- [5] He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- [6] Papandreou, G.; Zhu, T.; Kanazawa, N.; Toshev, A.; Tompson, J.; Bregler, C.; Murphy, K. Towards Accurate Multi-Person Pose Estimation in the Wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3711–3719.
- [7] Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the European Conference on Computer Vision, GASTEIG Cultural Center, Munich, Germany, 10–13 September 2018; pp. 472–487.
- [8] Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5686–5696.
- [9] Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-person Pose Estimation. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7103–7112.
- [10] Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral Human Pose Regression. In Proceedings of the European Conference on Computer Vision, GASTEIG Cultural Center, Munich, Germany, 10–13 September 2018; pp. 536–553.
- [11] Fang, H.-S.; Xie, S.; Tai, Y.-W.; Lu, C. RMPE: Regional Multi-person Pose Estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2353–2362.
- [12] Cao, Z.; Šimon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1302–1310.
- [13] Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Red Hook, NY, USA, 4–9 December 2017; pp. 2274–2284.
- [14] Papandreou, G.; Zhu, T.; Chen, L.-C.; Gidaris, S.; Tompson, J.; Murphy, K. PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up, Part-Based, Geometric Embedding Model. In Proceedings of the European Conference on Computer Vision, GASTEIG Cultural Center, Munich, Germany, 10–13 September 2018; pp. 282–299.
- [15] Kreiss, S.; Bertoni, L.; Alahi, A. PifPaf: Composite Fields for Human Pose Estimation. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 11969–11978.
- [16] Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–28 June 2020; pp. 5386–5395.
- [17] Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850.
- [18] Tompson, J.; Goroshin, R.; Jain, A.; LeCun, Y.; Bregler, C. Efficient object localization using Convolutional Networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 648–656.
- [19] Wei, S.-E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4724–4732.
- [20] Bulat, A.; Tzimiropoulos, G. Human Pose Estimation via Convolutional Part Heatmap Regression. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer Science and Business Media LLC: Berlin/Heidelberg, Germany, 2016; Volume 9911, pp. 717–732.
- [21] Chu, X.; Yang, W.; Ouyang, W.; Ma, C.; Yuille, A.L.; Wang, X. Multi-context Attention for Human Pose Estimation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5669–5678.
- [22] Lifshitz, I.; Fetaya, E.; Ullman, S. Human Pose Estimation Using Deep Consensus Voting. In European Conference on Computer Vision; Springer Science and Business Media LLC: Berlin/Heidelberg, Germany, 2016; Volume 9906, pp. 246–260.
- [23] Carreira, J.; Agrawal, P.; Fragkiadaki, K.; Malik, J. Human Pose Estimation with Iterative Error Feedback. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4733–4742.
- [24] Hu, P.; Ramanan, D. Bottom-Up and Top-Down Reasoning with Hierarchical Rectified Gaussians. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5600–5609.
- [25] Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep Layer Aggregation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2403–2412.
- [26] He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
- [27] Kim, S.-T.; Lee, H.J. Lightweight Stacked Hourglass Network for Human Pose Estimation. Appl. Sci. 2020, 10, 6497.
- [28] Law, H.; Deng, J. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, GASTEIG Cultural Center, Munich, Germany, 10–13 September 2018; pp. 734–750.
- [29] Lin, T.-Y.; Goyal, P.; Girshick, R.B.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
- [30] Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 1440–1448.
- [31] Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
- [32] Li, Y.; Wang, X.; Liu, W.; Feng, B. Pose Anchor: A Single-stage Hand Keypoint Detection Network. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1.
- [33] Xia, H.; Zhang, T. Self-Attention Network for Human Pose Estimation. Appl. Sci. 2021, 11, 1826.
- [34] Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6568–6577.
- [35] Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the NeurIPS Workshop, Long Beach, CA, USA, 4–9 December 2017.
- [36] Tompson, J.J.; Jain, A.; LeCun, Y.; Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014.
- [37] Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 5–8 May 2015.
- [38] Li, J.; Wang, C.; Zhu, H.; Mao, Y.; Fang, H.-S.; Lu, C. CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 10855–10864.

Method | Backbone | Input Size | AP | AP0.5 | AP0.75 | APM | APL | AR | Time [s] |
---|---|---|---|---|---|---|---|---|---|
CMU-Pose [12] | - | - | 0.618 | 0.849 | 0.675 | 0.571 | 0.682 | 0.665 | 0.5 |
Mask-RCNN [5] | ResNet-101 | - | 0.631 | 0.873 | 0.687 | 0.578 | 0.714 | - | 0.2 |
G-RMI [6] | ResNet-101 | 353 | 0.649 | 0.855 | 0.713 | 0.623 | 0.700 | 0.697 | - |
AssocEmbedding [13] | Hourglass | 512 | 0.655 | 0.868 | 0.723 | 0.606 | 0.726 | 0.710 | 0.19 |
PifPaf [15] | - | - | 0.667 | - | - | 0.624 | 0.729 | - | - |
PersonLab [14] | ResNet-152 | 1401 | 0.687 | 0.890 | 0.754 | 0.641 | 0.755 | 0.754 | 0.381 |
HigherHRNet-1 [16] | HRNet-W32 | 512 | 0.664 | 0.875 | 0.728 | 0.612 | 0.742 | - | 0.052 |
HigherHRNet-2 [16] | HRNet-W48 | 640 | 0.705 | 0.893 | 0.772 | 0.666 | 0.758 | 0.749 | 0.142 |
Ours (DLA) | DLA-34 | 512 | 0.634 | 0.864 | 0.693 | 0.575 | 0.739 | 0.698 | 0.039 |
Ours (ResNet) | ResNet-101 | 512 | 0.651 | 0.868 | 0.703 | 0.642 | 0.737 | 0.721 | 0.073 |
Ours (Hourglass) | Hourglass | 512 | 0.663 | 0.881 | 0.731 | 0.662 | 0.747 | 0.748 | 0.132 |
Ours (HRNet) | HRNet-W32 | 512 | 0.668 | 0.883 | 0.740 | 0.665 | 0.748 | 0.751 | 0.078 |
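
The Time [s] column reports seconds per image, so throughput in frames per second (FPS) is simply its reciprocal. The short sketch below converts the timings listed in the table; the dictionary and function names are illustrative and not from the paper, and methods without a reported time are left out.

```python
# Seconds per image, copied from the "Time [s]" column of the table above.
time_per_image = {
    "CMU-Pose": 0.5,
    "Mask-RCNN": 0.2,
    "AssocEmbedding": 0.19,
    "PersonLab": 0.381,
    "HigherHRNet-1": 0.052,
    "HigherHRNet-2": 0.142,
    "Ours (DLA)": 0.039,
    "Ours (ResNet)": 0.073,
    "Ours (Hourglass)": 0.132,
    "Ours (HRNet)": 0.078,
}


def frames_per_second(seconds_per_image: float) -> float:
    """Convert a per-image latency in seconds to throughput in FPS."""
    return 1.0 / seconds_per_image


fps = {method: frames_per_second(t) for method, t in time_per_image.items()}
fastest = max(fps, key=fps.get)

# Print the methods from fastest to slowest.
for method, value in sorted(fps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{method:18s} {value:6.1f} FPS")
```

Under this conversion the DLA-34 variant reaches roughly 25.6 FPS, against roughly 19.2 FPS for HigherHRNet-1, the fastest of the compared methods with a reported time. The table does not state the hardware behind each timing, so the comparison is only indicative.
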

Method | AP | AP0.5 | AP0.75 | APM | APL | AR |
---|---|---|---|---|---|---|
PCP Network (TPR) | 0.648 | 0.854 | 0.715 | 0.603 | 0.700 | 0.697 |
PCP Network (PPR) | 0.660 | 0.869 | 0.725 | 0.608 | 0.742 | 0.704 |

Method | AP | AP0.5 | AP0.75 | APM | APL | AR |
---|---|---|---|---|---|---|
PCP Network (original loss) | 0.657 | 0.867 | 0.722 | 0.607 | 0.728 | 0.701 |
PCP Network (improved loss) | 0.660 | 0.869 | 0.725 | 0.608 | 0.742 | 0.704 |
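
To make the two ablations above easier to read, the sketch below computes the per-metric difference between each pair of rows. The numbers are copied from the tables; the helper name `gains` and the variable names are illustrative only.

```python
METRICS = ["AP", "AP0.5", "AP0.75", "APM", "APL", "AR"]

# Rows copied from the two ablation tables above.
tpr = [0.648, 0.854, 0.715, 0.603, 0.700, 0.697]            # PCP Network (TPR)
ppr = [0.660, 0.869, 0.725, 0.608, 0.742, 0.704]            # PCP Network (PPR)
original_loss = [0.657, 0.867, 0.722, 0.607, 0.728, 0.701]  # original loss
improved_loss = [0.660, 0.869, 0.725, 0.608, 0.742, 0.704]  # improved loss


def gains(baseline, variant):
    """Return variant minus baseline for every metric, rounded to 3 decimals."""
    return {name: round(v - b, 3) for name, b, v in zip(METRICS, baseline, variant)}


representation_gain = gains(tpr, ppr)
loss_gain = gains(original_loss, improved_loss)

print("PPR vs. TPR:", representation_gain)
print("Improved vs. original loss:", loss_gain)
```

Both changes improve every metric. The largest difference in each table is on APL (+0.042 for PPR over TPR, +0.014 for the improved loss), while overall AP rises by 0.012 and 0.003 respectively.
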

Method | Backbone | Input Size | AP | AP0.5 | AP0.75 | APE | APM | APH |
---|---|---|---|---|---|---|---|---|
Top-down methods | | | | | | | | |
Mask-RCNN [5] | ResNet-101 | - | 0.572 | 0.835 | 0.603 | 0.694 | 0.579 | 0.458 |
AlphaPose [11] | - | - | 0.610 | 0.813 | 0.660 | 0.712 | 0.614 | 0.511 |
SPPE [38] | ResNet-101 | - | 0.660 | 0.842 | 0.715 | 0.755 | 0.663 | 0.574 |
Bottom-up methods | | | | | | | | |
CMU-Pose [12] | - | - | - | - | - | 0.627 | 0.487 | 0.323 |
HigherHRNet [16] | HRNet-W48 | 640 | 0.659 | 0.864 | 0.706 | 0.733 | 0.665 | 0.579 |
HigherHRNet * [16] | HRNet-W48 | 640 | 0.676 | 0.874 | 0.726 | 0.758 | 0.681 | 0.589 |
Ours (HRNet) | HRNet-W32 | 512 | 0.657 | 0.855 | 0.705 | 0.742 | 0.668 | 0.574 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wu, J.; Lee, H.-J. A New Multi-Person Pose Estimation Method Using the Partitioned CenterPose Network. Appl. Sci. 2021, 11, 4241. https://doi.org/10.3390/app11094241