Techniques of 2D Human Pose Estimation Based on Computer Vision: A Survey
Abstract
1. Introduction
- (1) The mainstream human pose datasets and their evaluation metrics are introduced comprehensively, with mathematical formulas, and their applicability to complex scenarios is critically assessed, helping researchers and practitioners understand the benchmarks used for model training and evaluation;
- (2) The performance of 2D human pose estimation algorithms in single-person and multi-person scenarios is analyzed in depth, providing a comparative framework that evaluates not only the strengths but also the structural limitations and computational complexity of each model;
- (3) The current state of research in 2D human pose estimation is summarized, and three main challenges, their proposed solutions, and expected development trends are identified, with the aim of directing future research and helping stakeholders understand where the technology is headed.
2. Relevant Datasets and Evaluation Metrics
2.1. Related Datasets
2.2. Evaluation Metrics
2.2.1. Percentage of Correct Parts (PCP)
2.2.2. Percentage of Correct Keypoints (PCK)
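As a rough illustration of the metric named in this heading, the sketch below computes PCK in its common form: a predicted keypoint counts as correct when its Euclidean distance to the ground truth is at most a fraction α of a per-person reference length (for example the torso size, or the head segment length in the PCKh variant). The function name `pck`, the array shapes, and the default `alpha` are assumptions made for this example, not definitions taken from the survey.

```python
import numpy as np


def pck(pred, gt, ref_length, alpha=0.5, visible=None):
    """Fraction of keypoints whose error is within alpha * reference length.

    pred, gt:    arrays of shape (N, K, 2) with (x, y) per keypoint
    ref_length:  array of shape (N,), e.g. torso or head size per person
    visible:     optional boolean array of shape (N, K); hidden joints are skipped
    (All names and shapes are illustrative assumptions.)
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    ref_length = np.asarray(ref_length, dtype=float)

    # Euclidean distance between prediction and ground truth, per keypoint.
    dist = np.linalg.norm(pred - gt, axis=-1)            # shape (N, K)
    correct = dist <= alpha * ref_length[:, None]        # shape (N, K)

    if visible is None:
        visible = np.ones_like(correct, dtype=bool)
    visible = np.asarray(visible, dtype=bool)

    if not visible.any():
        return float("nan")
    return float(correct[visible].mean())


# One person, two keypoints, reference length 10 and alpha 0.5 (threshold 5 px).
gt = np.array([[[0.0, 0.0], [10.0, 0.0]]])
pred = np.array([[[1.0, 0.0], [10.0, 8.0]]])
print(pck(pred, gt, ref_length=[10.0], alpha=0.5))
```

Because the threshold scales with the reference length, the score is comparable across people of different apparent sizes, but it still treats every joint with the same tolerance.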
2.2.3. Object Keypoint Similarity (OKS)
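To make the metric in this heading concrete, here is a minimal sketch of OKS as it is commonly defined for the COCO keypoint benchmark: each labeled keypoint contributes exp(-d² / (2 s² k²)), where d is the prediction error, s² is the object area, and k = 2σ is a per-keypoint constant, and the contributions are averaged over the labeled keypoints. The function name `oks`, its argument layout, and the example σ values are assumptions for illustration only.

```python
import numpy as np


def oks(pred, gt, visibility, area, sigmas):
    """Object Keypoint Similarity for one predicted pose against one ground truth.

    pred, gt:    arrays of shape (K, 2)
    visibility:  array of shape (K,); values > 0 mean the keypoint is labeled
    area:        object area in pixels (plays the role of s**2)
    sigmas:      array of shape (K,), per-keypoint falloff constants
    (Names and shapes are illustrative assumptions.)
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    visibility = np.asarray(visibility)
    sigmas = np.asarray(sigmas, dtype=float)

    d2 = np.sum((pred - gt) ** 2, axis=-1)        # squared error per keypoint
    k = 2.0 * sigmas                              # per-keypoint constant
    # A tiny epsilon guards against division by zero for degenerate areas.
    e = d2 / (2.0 * (area + np.spacing(1)) * k ** 2)

    labeled = visibility > 0
    if not labeled.any():
        return 0.0
    return float(np.exp(-e)[labeled].mean())


gt = np.array([[0.0, 0.0], [10.0, 10.0], [5.0, 5.0]])
vis = np.array([2, 1, 0])                 # third keypoint is unlabeled
sig = np.array([0.05, 0.05, 0.05])        # placeholder sigmas, not the COCO values
print(oks(gt + np.array([1.0, 0.0]), gt, vis, area=100.0, sigmas=sig))
```

Unlike PCK, the similarity decays smoothly with distance and uses a different tolerance per keypoint type, which is why it can be thresholded in the same way IoU is for object detection.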
2.2.4. Average Precision (AP)
2.2.5. Limitations of Distance-Based Metrics
3. Detailed Review Concerning 2D Human Pose Estimation Algorithms
3.1. 2D Single-Person Pose Estimation
3.1.1. Coordinate-Based Methods
- Multi-stage Direct Regression
- Multi-stage Stepwise Regression
3.1.2. Heatmap-Based Methods
- Adding Prior Information of Human Body Structures
- Optimizing Network Structure
- Introducing Time Constraints
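Heatmap-based methods predict one heatmap per joint and read the joint location from its peak. The sketch below shows only the simplest decoding step, an arg-max per heatmap scaled back to input resolution. The function name `decode_heatmaps`, the `(K, H, W)` layout, and the `stride` value are assumptions for this example; practical systems usually add sub-pixel refinement that is not shown here.

```python
import numpy as np


def decode_heatmaps(heatmaps, stride=4):
    """Convert per-joint heatmaps to (x, y) coordinates and confidence scores.

    heatmaps: array of shape (K, H, W), one map per joint
    stride:   ratio between input image size and heatmap size (assumed value)
    Returns (coords, scores) with shapes (K, 2) and (K,).
    """
    heatmaps = np.asarray(heatmaps, dtype=float)
    num_joints, height, width = heatmaps.shape

    flat = heatmaps.reshape(num_joints, -1)
    idx = flat.argmax(axis=1)                      # index of the peak per joint
    ys, xs = np.unravel_index(idx, (height, width))

    # Scale heatmap coordinates back to input-image pixels.
    coords = np.stack([xs, ys], axis=1).astype(float) * stride
    scores = flat.max(axis=1)                      # peak value as confidence
    return coords, scores


hm = np.zeros((2, 4, 5))
hm[0, 1, 3] = 1.0      # joint 0 peaks at x=3, y=1
hm[1, 2, 0] = 0.7      # joint 1 peaks at x=0, y=2
coords, scores = decode_heatmaps(hm, stride=4)
print(coords, scores)
```

The quantization introduced by the stride is one reason heatmap resolution matters for accuracy, and it is also the source of the extra computation compared with direct coordinate regression.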
3.2. Multi-Person Pose Estimation
3.2.1. Two-Stage Approach
- Top–down Method
- Bottom–up Method
- Top–down and Bottom–up Combination
3.2.2. One-Stage Approach
3.2.3. End-to-End Framework
3.2.4. Transformer-Based
3.3. Comparison of Test Results of Classical Algorithms on Mainstream Datasets
3.3.1. Single-Person Pose Estimation Algorithm
3.3.2. Multi-Person Pose Estimation Algorithm
4. Current Challenges and Future Directions
4.1. Challenges and Solutions
4.1.1. Complex Environmental Factors
| Single/Multi-Person | Category | Sub-Category | Methods | Key Strategies | Shortcomings |
|---|---|---|---|---|---|
| Single-person pose estimation | Coordinate Net | | DeepPose [46] | Convolutional network with multiple iterations of direct coordinate regression | Complex Environmental Factors |
| Single-person pose estimation | Coordinate Net | | IEF [61] | Multi-stage stepwise regression | Imbalance in the Number of Human Postures |
| Single-person pose estimation | Coordinate Net | | CPR [62] | Exploiting inter-joint dependencies | Lacks Robustness |
| Single-person pose estimation | Heatmap Net | Adding prior information of human body structure | Li, S. et al. [49] | Graph neural network | Lacks Robustness |
| Single-person pose estimation | Heatmap Net | Adding prior information of human body structure | Szegedy et al. [64] | Tree-structured models | Lacks Robustness |
| Single-person pose estimation | Heatmap Net | Optimize network structure | CPM [65], SHN [60] | Cascade module feature fusion | Ignoring Timeliness Requirements |
| Single-person pose estimation | Heatmap Net | Optimize network structure | Yang et al. [58] | Designs a Pyramid Residual Module | Ignoring Timeliness Requirements |
| Single-person pose estimation | Heatmap Net | Optimize network structure | MSSA [73] | Multi-scale structure-aware neural network | Ignoring Timeliness Requirements |
| Single-person pose estimation | Heatmap Net | Optimize network structure | DLCM [67] | Design of the body part hierarchy representation for intermediate supervision | Ignoring Timeliness Requirements |
| Single-person pose estimation | Heatmap Net | Optimize network structure | Jain, A. et al. [68], Adversarial Posenet [67] | Introducing generative adversarial networks | Imbalance in the Number of Human Postures |
| Single-person pose estimation | Heatmap Net | Optimize network structure | Chen et al. [69] | Structure-aware convolutional network | Ignoring Timeliness Requirements |
| Single-person pose estimation | Heatmap Net | Optimize network structure | Bulat et al. [68] | Combining the Stacked Hourglass Network and U-Net | Ignoring Timeliness Requirements |
| Single-person pose estimation | Heatmap Net | Optimize network structure | FPD [70], Simple Baselines [55] | Lightweight network structure | Ignoring Timeliness Requirements |
| Single-person pose estimation | Heatmap Net | Introducing time constraints | Jain et al. [69] | Use RGB images and motion features as input | Complex Environmental Factors |
| Single-person pose estimation | Heatmap Net | Introducing time constraints | Pfister et al. [72] | Use of optical flow as supervisory information | Complex Environmental Factors |
| Single-person pose estimation | Heatmap Net | Introducing time constraints | Luo et al. [70], UniPose [75] | Adjacent frame processing using an LSTM network | Complex Environmental Factors |
| Single-person pose estimation | Heatmap Net | Introducing time constraints | GPE [74] | Time Delay | Complex Environmental Factors |
| Multi-person pose estimation | Two-stage | Top–down | Mask R-CNN [92] | Instance detection box + mask | Imbalance in the Number of Human Postures |
| Multi-person pose estimation | Two-stage | Top–down | CPN [93] | Feature Pyramid Network for feature fusion | Ignoring Timeliness Requirements |
| Multi-person pose estimation | Two-stage | Top–down | RMPE [94] | Spatial transformation sample correction | Ignoring Timeliness Requirements |
| Multi-person pose estimation | Two-stage | Top–down | HRNet [58] | Maintaining high-resolution representations | Ignoring Timeliness Requirements |
| Multi-person pose estimation | Two-stage | Top–down | G-RMI [95] | Offset-assisted positioning | Ignoring Timeliness Requirements |
| Multi-person pose estimation | Two-stage | Top–down | CrowdPose [95] | Introduces interference joints to improve joint localization accuracy | Ignoring Timeliness Requirements |
| Multi-person pose estimation | Two-stage | Top–down | LKConvPose [99] | Combines large kernel convolution and multi-scale feature fusion mechanisms | Ignoring Timeliness Requirements |
| Multi-person pose estimation | Two-stage | Bottom–up | DeepCut [126], DeeperCut [99] | Graph optimization strategy | Complex Environmental Factors |
| Multi-person pose estimation | Two-stage | Bottom–up | OpenPose [97], PifPaf [2] | Vector field representation (part affinity and composite fields) | Complex Environmental Factors |
| Multi-person pose estimation | Two-stage | Bottom–up | Associative Embedding [83] | Associative embedding clustering | Complex Environmental Factors |
| Multi-person pose estimation | Two-stage | Bottom–up | HigherHRNet [104] | Uses multi-resolution training and heatmap aggregation | Complex Environmental Factors |
| Multi-person pose estimation | Two-stage | Bottom–up | CenterGroup [102] | Introduction of attention mechanism for keypoint grouping | Complex Environmental Factors |
| Multi-person pose estimation | Two-stage | Top–down and Bottom–up Combination | Hu and Ramanan [103] | Hierarchical rectified Gaussian model | Lacks Robustness |
| Multi-person pose estimation | Two-stage | Top–down and Bottom–up Combination | Li et al. [104] | Reuse prediction results from previous frames | Lacks Robustness |
| Multi-person pose estimation | Two-stage | Top–down and Bottom–up Combination | Tang et al. [67] | A novel network with a hierarchical compositional architecture | Lacks Robustness |
| Multi-person pose estimation | One-Stage | | SPM [85] | Hierarchical structured pose representation | Complex Environmental Factors |
| Multi-person pose estimation | One-Stage | | InsPose [108] | Adaptively adjusts the network parameters of each instance using instance-aware dynamic networks | Complex Environmental Factors |
| Multi-person pose estimation | One-Stage | | PoseDet [106] | Keypoint-aware pose embedding that represents objects by the locations of their keypoints | Complex Environmental Factors |
| Multi-person pose estimation | End-to-End Framework | | ED-Pose [105] | Reformulates the task as two explicit box detection processes with unified representation and regression supervision | Ignoring Timeliness Requirements |
| Multi-person pose estimation | End-to-End Framework | | Group Pose [109] | A simple modification of the decoder’s self-attention eliminates interaction between different queries across instance types | Ignoring Timeliness Requirements |
| Multi-person pose estimation | End-to-End Framework | | DiffusionRegPose [110] | Converts single-stage, end-to-end keypoint regression models to a diffusion-based sampling process | Ignoring Timeliness Requirements |
| Multi-person pose estimation | Transformer structure | | TransPose [111] | Combines a Transformer with CNNs for human pose estimation | Ignoring Timeliness Requirements |
| Multi-person pose estimation | Transformer structure | | TokenPose [112] | Constraints between visual cues and keypoints are learned simultaneously by representing keypoints as tokens in a Transformer structure | Ignoring Timeliness Requirements |
| Multi-person pose estimation | Transformer structure | | SDPose [113] | Introduction of the MCT module and self-distillation methods | Ignoring Timeliness Requirements |
4.1.2. Timeliness Requirements
4.1.3. Imbalance in the Number of Human Postures
4.2. Future Directions
4.2.1. Occlusion
- Occlusion Repair
- Optimization of Network Structure and Human Detector
4.2.2. Timeliness Issues
- Simplify Network Structure
4.2.3. Lack of Complex Gestures
- Using Continuity of Posture
- Data Generation
4.2.4. Other Emerging Trends
- 3D Human Pose Estimation
- Multi-task Learning
- Source-free Domain Adaptation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Souto, H.; Musse, S. Automatic detection of 2D human postures based on single images. In Proceedings of the 2011 24th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI’11), Alagoas, Brazil, 28–31 August 2011; pp. 48–55. [Google Scholar]
- Kreiss, S.; Bertoni, L.; Alahi, A. Pifpaf: Composite fields for human pose estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19), Long Beach, CA, USA, 15–20 June 2019; pp. 11977–11986. [Google Scholar]
- Insafutdinov, E.; Andrilukam, M.; Pishchulin, L.; Tang, S.; Levinkov, E.; Andres, B.; Schiele, B. ArtTrack: Articulated multi-person tracking in the wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), Honolulu, HI, USA, 21–26 July 2017; pp. 1293–1301. [Google Scholar]
- Fischler, M.A.; Elschlager, R.A. The representation and matching of pictorial structures. IEEE Trans. Comput. 1973, 22, 67–92. [Google Scholar] [CrossRef]
- Yang, Y.; Ramanan, D. Articulated pose estimation with flexible mixtures-of-parts. In Proceedings of the 2011 IEEE conference on computer vision and pattern recognition (CVPR’11), Colorado Springs, CO, USA, 20–25 June 2011; pp. 1385–1392. [Google Scholar]
- Johnson, S.; Everingham, M. Learning effective human pose estimation from inaccurate annotation. In Proceedings of the 2011 IEEE conference on computer vision and pattern recognition (CVPR’11), Colorado Springs, CO, USA, 20–25 June 2011; pp. 1465–1472. [Google Scholar]
- Yu, Z.; Li, Y.; Liu, Y.; Liu, T.; Fu, Y. Synpose: A large-scale and densely annotated synthetic dataset for human pose estimation in classroom. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’22), Singapore, 23–27 May 2022; pp. 3428–3432. [Google Scholar]
- An, W.; Yu, S.; Makihara, Y.; Wu, X.; Xu, C.; Yu, Y.; Liao, R.; Yagi, Y. Performance evaluation of model-based gait on multi-view very large population database with pose sequences. IEEE Trans. Biom. Behav. Identity Sci. 2020, 2, 421–430. [Google Scholar]
- Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the 1999 International Conference on Computer Vision (ICCV’99), Kerkyra, Greece, 20–27 September 1999; pp. 1150–1157. [Google Scholar]
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; pp. 886–893. [Google Scholar]
- Dang, Q.; Yin, J.; Wang, B.; Zheng, W. Deep learning based 2D human pose estimation: A survey. Tsinghua Sci. Technol. 2019, 24, 663–676. [Google Scholar] [CrossRef]
- Zhang, C.; Jiang, X.; Zhao, Y. Efficient instantaneous channel propagation modeling for aeronautical communications systems with compressed sensing. IEEE Trans. Antennas Propag. 2022, 70, 1211–1220. [Google Scholar]
- Zhao, Y.; Zhang, C. Orbital angular momentum beamforming for index modulation with partial arc reception. Electron. Lett. 2019, 55, 1271–1273. [Google Scholar] [CrossRef]
- Ismail, A.M.; Zhao, Y.; Wang, Z.; Guan, Y.L.; Yuen, C. Visually steered reconfigurable intelligent surface-assisted mobile communications. IEEE Antennas Wirel. Propag. Lett. 2025, 24, 4497–4501. [Google Scholar] [CrossRef]
- Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
- Huang, Z.; Liu, Y.; Fang, Y.; Horn, B.K.P. Video-based fall detection for seniors with human pose estimation. In Proceedings of the 2018 4th international conference on Universal Village (UV’18), Boston, MA, USA, 12–14 October 2018; pp. 1–4. [Google Scholar]
- Perez-Sala, X.; Escalera, S.; Angulo, C.; Gonzàlez, J. A survey on model based approaches for 2D and 3D visual human pose recovery. Sensors 2014, 14, 4189–4210. [Google Scholar] [CrossRef] [PubMed]
- Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897. [Google Scholar] [CrossRef]
- Wang, C.; Zhang, F.; Ge, S.S. A comprehensive survey on 2D multi-person pose estimation methods. Eng. Appl. Artif. Intel. 2021, 102, 104260. [Google Scholar]
- Murphy-Chutorian, E.; Trivedi, M.M. Head pose estimation in computer vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 607–626. [Google Scholar] [CrossRef]
- Saroja, M.N.; Baskaran, K.R.; Priyanka, P. Human pose estimation approaches for human activity recognition. In Proceedings of the 2021 International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA’21), Coimbatore, India, 8–9 October 2021; pp. 1–4. [Google Scholar]
- Mitra, S.; Acharya, T. Gesture recognition: A survey. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2007, 37, 311–324. [Google Scholar] [CrossRef]
- Liu, Z.; Zhu, J.; Bu, J.; Chen, C. A survey of human pose estimation: The body parts parsing based methods. J. Vis. Commun. Image R 2015, 32, 10–19. [Google Scholar] [CrossRef]
- Zhang, H.; Lei, Q.; Zhong, B.; Du, J.; Peng, J. A survey on human pose estimation. Intell. Autom. Soft Comput. 2015, 22, 483–489. [Google Scholar] [CrossRef]
- Gong, W.; Zhang, X.; Gonzàlez, J.; Sobral, A.; Bouwmans, T.; Tu, C.; Zahzah, E. Human pose estimation from monocular images: A comprehensive survey. Sensors 2016, 16, 1966. [Google Scholar] [CrossRef] [PubMed]
- Sun, J.; Chen, X.; Lu, Y.; Cao, J. 2D human pose estimation from monocular images: A survey. In Proceedings of the 2020 IEEE 3rd International Conference on Computer and Communication Engineering Technology (CCET’20), Beijing, China, 21–23 August 2020; pp. 111–121. [Google Scholar]
- Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J. Deep learning-based human pose estimation: A survey. ACM Comput. Surv. 2023, 56, 1–37. [Google Scholar] [CrossRef]
- Sapp, B.; Taskar, B. MODEC: Multimodal decomposable models for human pose estimation. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’13), Portland, OR, USA, 23–28 June 2013; pp. 3674–3681. [Google Scholar]
- Lin, T.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the 13th European on Computer Vision (ECCV’14), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
- Johnson, S.; Everingham, M. Clustered pose and nonlinear appearance models for human pose estimation. In Proceedings of the British Machine Vision Conference (BMVC’10), Aberystwyth, UK, 31 August–3 September 2010; pp. 1–11. [Google Scholar]
- Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D human pose estimation: New benchmark and state of the art analysis. In Proceedings of the 2014 the IEEE Conference on computer Vision and Pattern Recognition (CVPR’14), Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
- Wu, J.; Zheng, H.; Zhao, B.; Li, Y.; Yan, B.; Liang, R.; Wang, W.; Zhou, S.; Lin, G.; Fu, Y.; et al. AI challenger: A large-scale dataset for going deeper in image understanding. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME’19), Shanghai, China, 8–12 July 2019; pp. 1–11. [Google Scholar]
- Zhang, W.; Zhu, M.; Derpanis, K.G. From actemes to action: A strongly supervised representation for detailed action understanding. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV’13), Sydney, Australia, 1–8 December 2013; pp. 2248–2255. [Google Scholar]
- Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards understanding action recognition. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV’13), Sydney, Australia, 1–8 December 2013; pp. 3192–3199. [Google Scholar]
- Andriluka, M.; Iqbal, U.; Milan, A.; Insafutdinov, E.; Pishchulin, L.; Gall, J.; Schiele, B. Posetrack: A benchmark for human pose estimation and tracking. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18), Salt Lake City, UT, USA, 18–22 June 2018; pp. 5167–5176. [Google Scholar]
- Park, S.; Lee, S.; Lee, S.H. Enhanced prediction model for human activity using an end-to-end approach. IEEE Internet Thing J. 2023, 10, 6031–6041. [Google Scholar]
- Lin, J.; Zeng, A.; Wang, H.; Zhang, L.; Li, Y. UBody: A Million-Scale Dataset for Whole-Body Human Pose Estimation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–21 June 2023; pp. 2131–2141. [Google Scholar]
- Ju, X.; Zeng, A.; Wang, J.; Xu, Q.; Zhang, L. Human-Art: A Versatile Human-Centric Dataset for Artworks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–21 June 2023; pp. 618–629. [Google Scholar]
- Ferrari, V.; Marin-Jimenez, M.; Zisserman, A. Progressive search space reduction for human pose estimation. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
- Yang, Y.; Ramanan, D. Articulated human detection with flexible mixtures of parts. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 2878–2890. [Google Scholar] [CrossRef]
- Zhao, L.; Xu, J.; Gong, C.; Yang, J.; Zuo, W.; Gao, X. Learning to acquire the quality of human pose estimation. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 1555–1568. [Google Scholar]
- Ning, G.; Zhang, Z.; He, Z. Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Trans. Multimed. 2018, 20, 1246–1259. [Google Scholar]
- Pfister, T.; Simonyan, K.; Charles, J.; Zisserman, A. Deep convolutional neural networks for efficient pose estimation in gesture videos. In Proceedings of the 12th Asian Conference on Computer Vision (ACCV’14), Singapore, 1–5 November 2014; pp. 538–552. [Google Scholar]
- Luvizon, D.C.; Hedi, T.; David, P. Human pose regression by combining indirect part detection and contextual information. In Proceedings of the 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI)/Symposium on Virtual and Augmented Reality (SVR’19), Rio de Janeiro, Brazil, 28–31 October 2019; pp. 15–22. [Google Scholar]
- Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’14), Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
- Zhang, F.; Zhu, X.; Ye, M. Fast human pose estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19), Long Beach, CA, USA, 16–20 June 2019; pp. 3512–3521. [Google Scholar]
- Das, A.; Chakraborty, A.; Roy-Chowdhury, A.K. Consistent re-identification in a camera network. In Proceedings of the 13th European Conference on Computer Vision (ECCV’14), Zurich, Switzerland, 6–12 September 2014; pp. 330–345. [Google Scholar]
- Li, S.; Liu, Z.; Chan, A. Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW’14), Columbus, OH, USA, 23–28 June 2014; pp. 488–496. [Google Scholar]
- Lifshitz, I.; Fetaya, E.; Ullman, S. Human pose estimation using deep consensus voting. In Proceedings of the14th European Conference on Computer Vision (ECCV’16), Amsterdam, The Netherlands, 8–16 October 2016; pp. 246–260. [Google Scholar]
- Chen, X.; Yuille, A.L. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS’14), Montreal, QC, Canada, 8–13 December 2014; pp. 1736–1744. [Google Scholar]
- Varamesh, A.; Tuytelaars, T. Mixture dense regression for object detection and human pose estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20), Seattle, WA, USA, 14–19 June 2020; pp. 13086–13095. [Google Scholar]
- Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the 15th European Conference on Computer Vision (ECCV’18), Munich, Germany, 8–14 September 2018; pp. 472–487. [Google Scholar]
- Belagiannis, V.; Zisserman, A. Recurrent human pose estimation. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG’2017), Washington, DC, USA, 30 May –3 June 2017; pp. 468–475. [Google Scholar]
- Yang, W.; Li, S.; Ouyang, W.; Li, H.; Wang, X. Learning feature pyramids for human pose estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17), Venice, Italy, 22–29 October 2017; pp. 1281–1290. [Google Scholar]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19), Long Beach, CA, USA, 16–20 June 2019; pp. 5686–5696. [Google Scholar]
- Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the 14th European Conference on Computer Vision (ECCV’16), Amsterdam, The Netherlands, 8–16 October 2016; pp. 483–499. [Google Scholar]
- Carreira, J.; Agrawal, P.; Fragkiadaki, K.; Malik, J. Human pose estimation with iterative error feedback. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, USA, 27–30 June 2016; pp. 4733–4742. [Google Scholar]
- Sun, X.; Shang, J.; Liang, S.; Wei, Y. Compositional human pose regression. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17), Venice, Italy, 22–29 October 2017; pp. 2621–2630. [Google Scholar]
- Ke, L.; Chang, M.; Qi, H.; Lyu, S. Multi-scale structure aware network for human pose estimation. In Proceedings of the 15th European Conference on Computer Vision (ECCV’18), Munich, Germany, 8–14 September 2018; pp. 731–746. [Google Scholar]
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the 31st Conference on Artificial Intelligence (AAAI’17), San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284. [Google Scholar]
- Wei, S.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional pose machines. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, USA, 27–30 June 2016; pp. 4724–4732. [Google Scholar]
- Chu, X.; Yang, W.; Ouyang, W.; Ma, C.; Yuille, A.L.; Wang, X. Multi-context attention for human pose estimation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), Honolulu, HI, USA, 21–26 July 2017; pp. 1831–1840. [Google Scholar]
- Tang, W.; Yu, P.; Wu, Y. Deeply learned compositional models for human pose estimation. In Proceedings of the 15th European Conference on Computer Vision (ECCV’18), Munich, Germany, 8–14 September 2018; pp. 190–206. [Google Scholar]
- Chou, C.; Chien, J.; Chen, H. Self adversarial training for human pose estimation. In Proceedings of the 10th Asia-Pacific-Signal-and-Information-Processing-Association Annual Summit and Conference (APSIPA ASC’18), Honolulu, HI, USA, 12–15 November 2018; pp. 17–30. [Google Scholar]
- Chen, Y.; Shen, C.; Wei, X.; Liu, L.; Yang, J. Adversarial posenet: A structureaware convolutional network for human pose estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17), Venice, Italy, 22–29 October 2017; pp. 1212–1221. [Google Scholar]
- Bulat, A.; Kossaifi, J.; Tzimiropoulos, G.; Pantic, M. Toward fast and accurate human pose estimation via softgated skip connections. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG’20), Buenos Aires, Argentina, 16–20 November 2020; pp. 8–15. [Google Scholar]
- Jain, A.; Tompson, J.; LeCun, Y.; Bregler, C. Modeep: A deep learning framework using motion features for human pose estimation. In Proceedings of the 12th Asian Conference on Computer Vision (ACCV’14), Singapore, 1–5 November 2014; pp. 302–315. [Google Scholar]
- Pfister, T.; Charles, J.; Zisserman, A. Flowing convnets for human pose estimation in videos. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV’15), Santiago, Chile, 7–13 December 2015; pp. 1913–1921. [Google Scholar]
- Luo, Y.; Ren, J.; Wang, Z.; Sun, W.; Pan, J.; Liu, J.; Pang, J.; Lin, L. Lstm pose machines. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18), Salt Lake City, UT, USA, 18–22 June 2018; pp. 5207–5215. [Google Scholar]
- Vidanpathirana, M.; Sudasingha, I.; Vidanapathirana, J.; Kanchana, P.; Perera, I. Tracking and frame-rate enhancement for real-time 2D human pose estimation. Vis. Comput. 2019, 36, 1501–1519. [Google Scholar] [CrossRef]
- Artacho, B.; Savakis, A. UniPose, unified human pose estimation in single images and videos. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20), Seattle, WA, USA, 14–19 June 2020; pp. 7035–7044. [Google Scholar]
- Yang, W.; Ouyang, W.; Li, H.; Wang, X. End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, USA, 27–30 June 2016; pp. 3073–3082. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Cai, Y.; Wang, Z.; Luo, Z.; Yin, B.; Du, A.; Wang, H.; Zhang, X.; Zhou, X.; Zhou, E.; Sun, J. Learning delicate local representations for multi-person pose estimation. In Proceedings of the 16th European Conference on Computer Vision (ECCV’20), Glasgow, UK, 23–28 August 2020; pp. 455–472. [Google Scholar]
- Zhang, F.; Zhu, X.; Dai, H.; Ye, M.; Zhu, C. Distribution-aware coordinate representation for human pose estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20), Seattle, WA, USA, 14–19 June 2020; pp. 7091–7100. [Google Scholar]
- Iqbal, U.; Gall, J. Multi-person pose estimation with local joint-to-person associations. In Proceedings of the 14th European Conference on Computer Vision (ECCV’14), Amsterdam, The Netherlands, 8–16 October 2016; pp. 627–642. [Google Scholar]
- Huang, S.; Gong, M.; Tao, D. A coarse-fine network for keypoint localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17), Venice, Italy, 22–29 October 2017; pp. 3047–3056. [Google Scholar]
- Shi, D.; Wei, X.; Yu, X.; Tan, W.; Ren, Y.; Pu, S. Inspose: Instance-aware networks for single-stage multi-person pose estimation. In Proceedings of the 29th ACM International Conference on Multimedia (ACM Multimedia’21), Chengdu, China, 17–21 October 2021; pp. 3079–3087. [Google Scholar]
- Kocabas, M.; Karagoz, S.; Akbas, E. Multiposenet: Fast multi-person pose estimation using pose residual network. In Proceedings of the 15th European Conference on Computer Vision (ECCV’18), Munich, Germany, 8–14 September 2018; pp. 417–433. [Google Scholar]
- Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; pp. 2278–2288. [Google Scholar]
- Nie, X.; Feng, J.; Zhang, J.; Yan, S. Single-stage multi-person pose machines. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV’19), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6950–6959. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV’16), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’14), Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose++: Vision Transformer Foundation Model for Generic Body Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2024, 46, 1212–1230. [Google Scholar]
- Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7103–7112. [Google Scholar]
- Yang, J.; Zeng, A.; Liu, S.; Li, F.; Zhang, R.; Zhang, L. Explicit box detection unifies end-to-end multi-person pose estimation. In Proceedings of the International Conference on Learning Representations (ICLR’23), Kigali, Rwanda, 1–5 May 2023; pp. 577–594. [Google Scholar]
- Papandreou, G.; Zhu, T.; Chen, L.C.; Gidaris, S.; Tompson, J.; Murphy, K. PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proceedings of the European Conference on Computer Vision (ECCV’18), Munich, Germany, 8–14 September 2018; pp. 1–7. [Google Scholar]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Yang, Z.; Zeng, A.; Yuan, C.; Li, Y. Effective Whole-body Pose Estimation with Two-stage Distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 14850–14860. [Google Scholar]
- Fang, H.; Xie, S.; Tai, Y.; Lu, C. Rmpe: Regional multi-person pose estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17), Venice, Italy, 22–29 October 2017; pp. 2334–2343. [Google Scholar]
- Hu, P.; Ramanan, D. Bottom-up and top-down reasoning with hierarchical rectified gaussians. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas, NV, USA, 27–30 June 2016; pp. 5600–5609. [Google Scholar]
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS-Improving object detection with one line of code. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17), Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
- Osokin, D. Real-time 2D multi-person pose estimation on CPU: Lightweight openpose. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods (ICPRAM’19), Prague, Czech Republic, 19–21 February 2019; pp. 744–748. [Google Scholar]
- Papandreou, G.; Zhu, T.; Kanazawa, N.; Toshev, A.; Tompson, J.; Bregler, C.; Murphy, K. Towards accurate multi-person pose estimation in the wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), Honolulu, HI, USA, 21–26 July 2017; pp. 4903–4911. [Google Scholar]
- Brasó, G.; Kister, N.; Leal-Taixé, L. The center of attention: Center-keypoint grouping via attention for multi-person pose estimation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV’21), Montreal, QC, Canada, 11–17 October 2021; pp. 11853–11863. [Google Scholar]
- Li, J.; Wang, C.; Zhu, H.; Mao, Y.; Fang, H.S.; Lu, C. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19), Long Beach, CA, USA, 16–20 June 2019; pp. 10863–10872. [Google Scholar]
- Wang, D.; Zhang, S. Contextual Instance Decoupling for Robust Multi-Person Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22), New Orleans, LA, USA, 18–24 June 2022; pp. 13134–13143. [Google Scholar]
- Xia, F.; Wang, P.; Chen, X.; Yuille, A.L. Joint multi-person pose estimation and semantic part segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), Honolulu, HI, USA, 21–26 July 2017; pp. 6769–6778. [Google Scholar]
- Cao, Z.; Simon, T.; Wei, S.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
- Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20), Seattle, WA, USA, 14–19 June 2020; pp. 5385–5394. [Google Scholar]
- Li, M.; Zhou, Z.; Liu, X. Multi-person pose estimation using bounding box constraint and LSTM. IEEE Trans. Multimed. 2019, 21, 2653–2663. [Google Scholar] [CrossRef]
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV’15), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Tian, C.; Yu, R.; Zhao, X.; Xia, W.; Wang, H.; Yang, Y. Posedet: Fast multi-person pose estimation using pose embedding. In Proceedings of the 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition (FG’21), Jodhpur, India, 15–18 December 2021; pp. 1–8. [Google Scholar]
- Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In Proceedings of the 14th European Conference on Computer Vision (ECCV’16), Amsterdam, The Netherlands, 8–16 October 2016; pp. 34–50. [Google Scholar]
- Liu, H.; Chen, Q.; Tan, Z.; Liu, J.; Wang, J.; Su, X.; Li, X.; Yao, K.; Han, J.; Ding, E.; et al. Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’23), Paris, France, 2–6 October 2023; pp. 2146–2156. [Google Scholar]
- Tan, D.; Chen, H.; Tian, W.; Xiong, L. DiffusionRegPose: Enhancing multi-person pose estimation using a diffusion-based end-to-end regression approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’24), Seattle, WA, USA, 17–21 June 2024; pp. 2230–2239. [Google Scholar]
- Yang, S.; Quan, Z.; Nie, M.; Yang, W. TransPose: Keypoint localization via transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21), Montreal, QC, Canada, 11–17 October 2021; pp. 11802–11812. [Google Scholar]
- Li, Y.; Zhang, S.; Wang, Z.; Yang, S.; Yang, W.; Xia, S.T.; Zhou, E. Tokenpose: Learning keypoint tokens for human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21), Montreal, QC, Canada, 11–17 October 2021; pp. 11313–11322. [Google Scholar]
- Chen, S.; Zhang, Y.; Huang, S.; Yi, R.; Fan, K.; Zhang, R.; Chen, P.; Wang, J.; Ding, S.; Ma, L. SDPose: Tokenized pose estimation via circulation-guide self-distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’24), Seattle, WA, USA, 17–21 June 2024; pp. 1082–1090. [Google Scholar]
- Fan, X.; Zheng, K.; Lin, Y.; Wang, S. Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15), Boston, MA, USA, 7–12 June 2015; pp. 1347–1355. [Google Scholar]
- Nie, X.; Feng, J.; Xing, J.; Yan, S. Pose partition networks for multi-person pose estimation. In Proceedings of the 15th European Conference on Computer Vision (ECCV’18), Munich, Germany, 8–14 September 2018; pp. 684–699. [Google Scholar]
- Wei, F.; Sun, X.; Li, H.; Wang, J.; Lin, S. Point-set anchors for object detection, instance segmentation and pose estimation. In Proceedings of the 16th European Conference on Computer Vision (ECCV’20), Glasgow, UK, 23–28 August 2020; pp. 527–544. [Google Scholar]
- Jin, S.; Xu, L.; Xu, J.; Wang, C.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P. Whole-body human pose estimation in the wild. In Proceedings of the 16th European Conference on Computer Vision (ECCV’20), Glasgow, UK, 23–28 August 2020; pp. 196–214. [Google Scholar]
- Geng, Z.; Sun, K.; Xiao, B.; Zhang, Z.; Wang, J. Bottom-up human pose estimation via disentangled keypoint regression. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21), Nashville, TN, USA, 20–25 June 2021; pp. 14671–14681. [Google Scholar]
- Zhao, L.; Wen, J.; Wang, P.; Zheng, N. Context-guided adaptive network for efficient human pose estimation. In Proceedings of the 35th Conference on Artificial Intelligence (AAAI’21), Virtual Event, 2–9 February 2021; pp. 3492–3499. [Google Scholar]
- Xiao, P.; Qin, Z.; Chen, D.; Zhang, N.; Ding, Y.; Deng, F.; Qin, Z.; Pang, M. Fastnet: A lightweight convolutional neural network for tumors fast identification in mobile computer-assisted devices. IEEE Internet Things J. 2023, 10, 9878–9891. [Google Scholar]
- Wang, Y.; Li, M.; Cai, H.; Chen, W.; Han, S. Lite pose: Efficient architecture design for 2d human pose estimation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22), New Orleans, LA, USA, 18–24 June 2022; pp. 13116–13126. [Google Scholar]
- Zhang, S.; Li, R.; Dong, X.; Rosin, P.L.; Cai, Z.; Han, X.; Yang, D.; Huang, H.; Hu, S. Pose2seg: Detection free human instance segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19), Long Beach, CA, USA, 16–20 June 2019; pp. 889–898. [Google Scholar]
- Wang, J.; Long, X.; Gao, Y.; Ding, E.; Wen, S. Graph-pcnn: Two stage human pose estimation with graph pose refinement. In Proceedings of the 16th European Conference on Computer Vision (ECCV’20), Glasgow, UK, 23–28 August 2020; pp. 492–508. [Google Scholar]
- Isack, H.; Haene, C.; Keskin, C.; Bouaziz, S.; Boykov, Y.; Izadi, S.; Khamis, S. Repose: Learning deep kinematic priors for fast human pose estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20), Seattle, WA, USA, 14–19 June 2020; pp. 1–9. [Google Scholar]
- Zhang, K.; Yao, P.; Wu, R.; Yang, C.; Li, D.; Du, M.; Deng, K.; Liu, R.; Zheng, T. Learning positional priors for pretraining 2D pose estimators. In Proceedings of the 2nd International Workshop on Human-Centric Multimedia Analysis (HUMA’21), Virtual Event, China, 17 October 2021; pp. 3–11. [Google Scholar]
- Su, Z.; Xu, L.; Zheng, Z.; Yu, T.; Liu, Y.; Fang, L. Robustfusion: Human volumetric capture with data-driven visual cues using a rgbd camera. In Proceedings of the 16th European Conference on Computer Vision (ECCV’20), Glasgow, UK, 23–28 August 2020; pp. 246–264. [Google Scholar]
- Vu, H.T.; Wilkinson, R.H.; Lech, M.; Cheng, E. A hybrid neural network for graph-based human pose estimation from 2D images. IEEE Access 2020, 8, 52830–52840. [Google Scholar] [CrossRef]
- Silva, L.J.S.; Silva, D.L.S.; Raposo, A.; Velho, L.; Lopes, H. Tensorpose: Real-time pose estimation for interactive applications. Comput. Graph. 2019, 85, 1–14. [Google Scholar] [CrossRef]
- Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-HRNet: A lightweight high-resolution network. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21), Nashville, TN, USA, 20–25 June 2021; pp. 10435–10445. [Google Scholar]
- Nie, X.; Li, Y.; Luo, L.; Zhang, N.; Feng, J. Dynamic kernel distillation for efficient pose estimation in videos. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV’19), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6941–6949. [Google Scholar]
- Zhou, Y. Role of human body posture recognition method based on wireless network kinect in line dance aerobics and gymnastics training. Wirel. Commun. Mob. Comput. 2021, 2021, 9208891. [Google Scholar] [CrossRef]
- Romero, J.; Loper, M.; Black, M.J. Flowcap: 2d human pose from optical flow. In Proceedings of the 37th German Conference on Pattern Recognition (GCPR’15), Aachen, Germany, 7–10 October 2015; pp. 412–423. [Google Scholar]
- Ghosh, A.; Kulharia, V.; Namboodiri, V.P.; Torr, P.H.S.; Dokania, P.K. Multi-agent diverse generative adversarial networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8513–8521. [Google Scholar]
- Lin, Z.; Khetan, A.; Fanti, G.; Oh, S. PacGAN: The power of two samples in generative adversarial networks. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS’18), Montreal, QC, Canada, 3–8 December 2018; pp. 1498–1507. [Google Scholar]
- Wang, K.; Lin, L.; Jiang, C.; Zheng, W.S. 3D human pose machines with self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1069–1082. [Google Scholar] [PubMed]
- Fries, J.A.; Steinberg, E.; Khattar, S.; Fleming, S.L.; Posada, J.; Callahan, A.; Shah, N.H. Ontology-driven weak supervision for clinical entity classification in electronic health records. Nat. Commun. 2021, 12, 2017. [Google Scholar] [CrossRef] [PubMed]
- Raychaudhuri, D.S.; Ta, C.K.; Dutta, A.; Lal, R.; Roy-Chowdhury, A.K. Prior-guided source-free domain adaptation for human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’23), Paris, France, 2–6 October 2023; pp. 14996–15006. [Google Scholar]

| Reference | Characteristics | Advantages | Disadvantages |
|---|---|---|---|
| Zhang, H et al. [25] | Identity-invariant head pose estimation | A comprehensive review of head pose estimation methods was presented and compared | Only head pose estimation has been studied, and the study is limited in scope |
| Gong, W et al. [26] | Human posture estimation based on cameras and sensors | These two methods of human posture estimation were described in detail and compared | No comparisons of the accuracy of the estimation methods were presented |
| Sun, J et al. [27] | Gesture recognition | Discussed in detail the application of various techniques in gesture recognition | The different gesture recognition methods were not compared, and their performance was not specifically analyzed |
| Zheng, C et al. [28] | Focus on body part parsing methods | Discussed in detail the applications of human pose estimation and the limitations of existing methods | The surveyed methods are very limited for irregular postures, and solutions were not specifically investigated |
| Sapp, B et al. [29] | Methods using depth and RGB image data | Focused on human pose estimation methods that use depth and RGB image data | The focus on depth and RGB image data did not cover other emerging technologies or sensors being developed for human posture estimation |
| Lin, T et al. [30] | Human posture estimation from monocular images | The two methods of estimating posture from images were described and compared in detail | No specific comparisons with other methods were made, and the scope of the investigation was not comprehensive enough |
| Johnson et al. [31] | Human pose estimation from 2D monocular images | Comprehensive survey on human pose estimation from 2D monocular images was conducted | Did not systematically analyze and compare the advantages, disadvantages, and applicability of different computational methods |
| Andriluka et al. [32] | Deep learning-based 2D and 3D posture estimation | 2D and 3D human posture estimation were described in detail | Research trends and solutions corresponding to the three main challenges were not indicated in detail |

| Dataset | Year | Single/Multiple | Number of Joints | Number of Samples (×10³) | Description |
|---|---|---|---|---|---|
| LSP [31] | 2010 | Single | 14 | 2 | Full-body pose images downloaded from Flickr |
| FLIC [32] | 2013 | Single | 10 | 20 | Video frames captured from Hollywood movies |
| MPII [35] | 2014 | Single, Multiple | 16 | 25 | Frames manually selected from videos downloaded from YouTube |
| MSCOCO [36] | 2014 | Multiple | 17 | 300 | Images downloaded from Google, Bing, and Flickr |
| HKD [37] | 2017 | Multiple | 14 | 300 | Daily-life images from the internet |
| Penn Action [39] | 2013 | Single | 13 | 2 | Videos downloaded from YouTube |
| PoseTrack [40] | 2018 | Multiple | 15 | 0.5 | Extends MPII and can be used for pose tracking |
| HiEve [41] | 2020 | Multiple | 14 | 50 | Nine real-world scenes such as airports, restaurants, and schools |

| Evaluation Metrics | Characteristics |
|---|---|
| PCP [43] | A key metric in early pose estimation work; primarily used to assess the localization accuracy of limbs |
| PCK [44] | More widely used; model performance is quantified by the distance between detected and ground-truth joints |
| OKS [45] | Introduced for evaluating multi-person pose estimation; model performance is assessed by computing the weighted Euclidean distance between detected joints and labeled data |
| AP [34] | Metric associated with the COCO dataset; provides a unified, standardized assessment for single- and multi-person pose estimation |
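
The OKS entry above can be made concrete with a short sketch. The Python function below assumes the commonly used COCO-style form, in which each labeled keypoint contributes exp(-d_i^2 / (2 s^2 k_i^2)) and the contributions are averaged over the labeled keypoints. The function name `oks`, its argument layout, and the sample values are illustrative assumptions rather than code from any particular toolkit.

```python
import math


def oks(pred, gt, visibility, area, k):
    """COCO-style Object Keypoint Similarity (illustrative sketch).

    pred, gt   : lists of (x, y) keypoint coordinates
    visibility : list of ints, > 0 means the ground-truth keypoint is labeled
    area       : object scale squared (s^2), e.g. the annotated object area
    k          : per-keypoint constants controlling the falloff
    """
    total = 0.0
    labeled = 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visibility, k):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2  # squared Euclidean distance
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        labeled += 1
    return total / labeled if labeled else 0.0


# A perfect prediction yields a similarity of 1.0.
print(oks([(0, 0), (10, 10)], [(0, 0), (10, 10)], [2, 2], 100.0, [0.1, 0.1]))
```

Under this definition, COCO-style AP50 and AP75 treat a predicted pose as correct when its OKS with the ground truth is at least 0.5 or 0.75, respectively.
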

| Category | Sub-Category | Advantages | Disadvantages | Scope of Application | Representatives |
|---|---|---|---|---|---|
| Coordinate Net | Multi-stage direct regression | Simple model structure with high time efficiency | Poor ability to learn structural information; low accuracy | Simple single-person pose, full body | [46] |
| Coordinate Net | Multi-stage stepwise regression | Improved regression accuracy | Poor ability to learn structural information; more influenced by the initial pose | Simple single-person pose, full body | [61,62] |
| Heatmap Net | Adding prior information of human body structure | Uses structural priors to improve accuracy; clear joint relationships | Complex network structure; the graphical model structure is fixed | Single-person pose, half body or full body | [49,50,64,74] |
| Heatmap Net | Optimize network structure | Improved model prediction accuracy | Huge number of network parameters; time efficiency needs to be improved | Single- and multi-person pose, full body | [48,58,60,65,66,67,68,69,70,75,76] |
| Heatmap Net | Introducing time constraints | Extends to video pose estimation and can use preceding and following frames to handle occlusion | Computationally heavy and susceptible to vanishing gradients | Single-person pose, monocular video | [71,72,73,74,75] |

| Category | Advantages | Disadvantages |
|---|---|---|
| Top–down | Improving the human detector reduces false and redundant detections; higher joint localization accuracy | Needs to perform human detection for each person in the image, which consumes more memory and lowers time efficiency |
| Bottom–up | Less affected by the number of people; runtime does not grow linearly with the number of people | Strongly affected by occlusion; joints are easily misconnected in complex scenes, and accuracy is lower |
| Top–down and Bottom–up Combination | Combines the advantages of both approaches, with higher accuracy and less sensitivity to the number of people | Detection performance still needs improvement, and the approach remains vulnerable to changes in human scale |

| Method | MPII (PCKh@0.5) | LSP (PCK@0.2) |
|---|---|---|
| DeepPose [46] | - | 61.0 |
| CNN + MRF [33] | 82.0 | - |
| CPM [65] | 88.5 | 87.9 |
| SHN [60] | 90.9 | - |
| HRU [71] | 91.5 | 92.6 |
| PRM [58] | 92.0 | 93.9 |
| Jain, A et al. [68] | 91.8 | 94.0 |
| Adversarial Posenet [67] | 92.1 | 93.1 |
| MSSA [73] | 92.1 | - |
| DLCM [67] | 92.3 | 95.1 |
| Li, S et al. [49] | - | 93.9 |
| Jain, A et al. [69] | 91.9 | - |
| Bulat et al. [68] | 94.1 | 94.8 |
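
The PCKh@0.5 and PCK@0.2 columns above report the share of joints that fall within a distance threshold of the ground truth. The sketch below assumes the threshold is `alpha` times a reference length (the head segment length for PCKh, a torso-based length for PCK on LSP). The name `pck`, the parameter `ref_length`, and the sample points are hypothetical and serve only to illustrate the computation.

```python
import math


def pck(pred, gt, ref_length, alpha=0.5):
    """Fraction of keypoints within alpha * ref_length of the ground truth.

    pred, gt   : lists of (x, y) keypoint coordinates of equal length
    ref_length : normalizing length (assumed: head size for PCKh,
                 torso size for PCK)
    alpha      : threshold factor, e.g. 0.5 for PCKh@0.5 or 0.2 for PCK@0.2
    """
    threshold = alpha * ref_length
    correct = 0
    for (px, py), (gx, gy) in zip(pred, gt):
        # A joint counts as correct when its error is within the threshold.
        if math.hypot(px - gx, py - gy) <= threshold:
            correct += 1
    return correct / len(gt)


# Errors of 0, 5, and 10 pixels against a threshold of 0.5 * 10 = 5 pixels:
# two of the three joints are counted as correct.
score = pck([(0, 0), (3, 4), (10, 0)], [(0, 0), (0, 0), (0, 0)], ref_length=10)
print(round(score, 3))
```

Benchmark scores such as those in the table are obtained by averaging this fraction over all test images, usually after excluding joints that are not annotated.
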

| Method | AP | AP50 | AP75 | APM | APL | Time (s) |
|---|---|---|---|---|---|---|
| Mask R-CNN [92] | 62.7 | 87.0 | 68.4 | 57.4 | 71.1 | 0.2 |
| RMPE [94] | 61.8 | 83.7 | 69.8 | 58.6 | 67.6 | 0.4 |
| G-RMI [95] | 64.9 | 85.5 | 71.3 | 62.3 | 70.0 | - |
| CPN [93] | 73.0 | 91.7 | 80.9 | 69.5 | 78.1 | - |
| LKConvPose [99] | 75 | 92.6 | 82.7 | 72.0 | 79.6 | - |
| OpenPose [97] | 61.8 | 84.9 | 67.5 | 57.1 | 68.2 | 0.6 |
| Associative Embedding [83] | 65.5 | 86.8 | 72.3 | 60.6 | 72.6 | 0.25 |
| PersonLab [91] | 68.7 | 89.0 | 75.4 | 64.1 | 75.5 | 0.464 |
| PifPaf [2] | 66.7 | - | - | 62.4 | 72.9 | 0.24 |
| HigherHRNet [104] | 70.5 | 89.3 | 75.4 | 64.1 | 75.5 | - |
| CenterGroup [102] | 71.4 | 90.4 | 78.1 | 67.2 | 77.5 | - |
| ED-Pose [105] | 71.6 | 89.6 | 78.1 | 65.9 | 79.8 | - |
| Group Pose [109] | 72.0 | 89.4 | 79.1 | 66.8 | 79.7 | - |
| DiffusionRegPose [110] | 72.5 | 89.8 | 79.5 | 66.8 | 80.5 | - |
| TransPose [111] | 74.2 | 89.6 | 80.8 | 70.6 | 81 | - |
| TokenPose [112] | 73.2 | 89.5 | 80.2 | 70.1 | 79.8 | - |
| SDPose [113] | 73.7 | 89.6 | 80.4 | 70.3 | 80.5 | - |

| Challenges | Specific Difficulties | Technical Limitations | Research Trends and Solutions |
|---|---|---|---|
| Complex environmental factors | Occlusion, illumination changes, cast shadows | Limited ability of the network to extract key features; insufficient ability to filter environmental noise | Occlusion repair; precise human detection bounding boxes; combining human prior information with data-driven methods |
| Timeliness requirements | High real-time requirements, complex network structures, numerous network parameters | High-resolution output feature maps lead to complex network structures and higher time costs | Simplify the network structure; use lightweight network models while preserving accuracy |
| Imbalance in the number of human postures | Lack of complex postures such as falling and overturning | Limited to simple postures such as standing upright; poor robustness to complex postures | Focus on collecting complex human postures; study multi-frame continuous postures |

| Method Family | Representative Models | Core Strengths | Key Limitations | Occlusion Robustness | Computational Complexity | Suitable Scenarios |
|---|---|---|---|---|---|---|
| Coordinate-based (Single-person) | DeepPose, IEF, CPR | Intuitive; low memory footprint; fast inference. | Destroys spatial inductive bias; highly non-linear mapping. | Low | Low | Simple single-person tracking on edge devices. |
| Heatmap-based (Single-person) | CPM, Stacked Hourglass, HRNet | Preserves spatial context; high sub-pixel localization precision. | Bounded by quantization errors; heavier memory overhead. | Medium | Medium | High-precision single-person tasks. |
| Top–Down Two-Stage (Multi-person) | Mask R-CNN, RMPE, CPN | Scale-invariant instances via cropping; superior individual accuracy. | Highly dependent on upstream detector; latency scales with crowd. | Medium–High | Scales with person count | High-accuracy multi-person analysis (non-real-time). |
| Bottom–Up Two-Stage (Multi-person) | OpenPose, HigherHRNet | Identity-agnostic extraction; fast multi-person inference. | Relies on fragile, non-differentiable heuristic grouping. | Low–Medium | Near-constant time | Real-time tracking in moderately crowded scenes. |
| End-to-End and Transformer (Multi-person) | ED-Pose, TransPose, TokenPose | Fully differentiable global reasoning; eliminates heuristic glue. | High training cost; requires massive datasets. | High | High (Self-attention overhead) | Complex interactions with severe occlusions. |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Lin, D.; Zhang, Y.; Yu, Y.; Gao, S.; Zhou, L.; Zhao, Y. Techniques of 2D Human Pose Estimation Based on Computer Vision: A Survey. Electronics 2026, 15, 2809. https://doi.org/10.3390/electronics15132809

