A Systematic Review of the Application of Camera-Based Human Pose Estimation in the Field of Sport and Physical Exercise
Abstract
1. Introduction
2. Analysis of the System Evaluation Methods
2.1. Metrics
- Object Keypoint Similarity (OKS):
  - Commonly used in the COCO Keypoint Challenge.
  - It is formulated as Equation (1):

    $$\mathrm{OKS} = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 k_i^2}\right)\,\delta(v_i > 0)}{\sum_i \delta(v_i > 0)} \qquad (1)$$

  - where $d_i$ is the Euclidean distance between the detected keypoint and the corresponding ground truth, $v_i$ is the visibility flag of the ground truth, $\delta(v_i > 0)$ restricts the sums to those keypoints that are labeled, $s$ is the object's scale (square root of the object segment area), and $k_i$ is a per-keypoint constant that controls falloff.
  - Put simply, OKS plays the same role for keypoints that Intersection over Union (IoU) plays in object detection. It is calculated from the distance between predicted points and ground truth points, normalized by the scale of the person. Papers typically report standard average precision and recall scores: AP50 (average precision at OKS = 0.50), AP75 (average precision at OKS = 0.75), AP (the mean of the AP scores at 10 OKS thresholds: 0.50, 0.55, ..., 0.90, 0.95), APM for medium objects, APL for large objects, and average recall (AR) at OKS = 0.50, 0.55, ..., 0.90, 0.95.
- Percentage of Correct Keypoints (PCK): A detected joint is considered correct if the distance between the predicted and the true joint is within a certain threshold.
  - Some examples:
    - PCKh@0.5: the threshold is 50% of the head bone link.
    - PCK@0.2: the distance between the predicted and true joint is < 0.2 × torso diameter.
  - Sometimes a fixed 150 mm is taken as the threshold.
  - Normalizing the threshold by body size alleviates the shorter limb problem, since people with shorter limbs also have smaller torsos and head bone links.
  - PCK is used for 2D and 3D (PCK3D). As with OKS, the higher the better.
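As an illustration of the two metrics above, the following Python sketch computes OKS as in Equation (1) and a normalized PCK for a single person. The function names, argument layout, and sample values are assumptions made for this example; they are not taken from the COCO or MPII evaluation tools, and the per-keypoint constants are placeholders.

```python
import math


def oks(pred, gt, vis, area, k):
    """Object Keypoint Similarity for one person.

    pred, gt: lists of (x, y) keypoints in the same order.
    vis:      ground-truth visibility flags (0 means "not labeled").
    area:     object segment area, so that s**2 == area.
    k:        per-keypoint falloff constants (placeholder values here).
    """
    total, labeled = 0.0, 0
    for p, g, v, k_i in zip(pred, gt, vis, k):
        if v > 0:  # only labeled keypoints contribute
            d2 = math.dist(p, g) ** 2
            total += math.exp(-d2 / (2.0 * area * k_i ** 2))
            labeled += 1
    return total / labeled if labeled else 0.0


def pck(pred, gt, ref_length, alpha):
    """Fraction of joints closer than alpha * ref_length to the ground truth.

    ref_length is the normalizer (e.g. head bone link for PCKh, torso
    diameter for PCK); points may be 2D or 3D tuples.
    """
    threshold = alpha * ref_length
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) < threshold)
    return correct / len(gt)


# Hypothetical three-joint example; the third joint is not labeled.
gt = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
pred = [(0.0, 0.0), (10.0, 3.0), (0.0, 20.0)]
print(oks(pred, gt, vis=[2, 2, 0], area=100.0, k=[0.1, 0.1, 0.1]))
print(pck(pred, gt, ref_length=10.0, alpha=0.5))
```

Because `math.dist` accepts points of any dimension, the same `pck` function covers the PCK3D case when the joints are given as (x, y, z) tuples.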
2.2. Data
3. Methodology
3.1. Inclusion and Exclusion Criteria
- Date: only papers published from 2014 to 2021 have been included in the search, since 2014 is the year in which authors started to use Deep Learning for HPE tasks; from then on, performance improved and its use started to increase.
- Publication type: only papers published in journals and conferences with high impact in the field of Computer Science have been included.
- Estimation type: only HPE has been considered, understood as overall body pose estimation, as explained in the introduction. For example, eye-pose and hand-pose estimation have not been considered during the research. In any case, only general-purpose systems have been found for those two types of pose estimation, none applied specifically to SPE.
3.2. Quality Criteria
3.3. Information Sources
4. Results
5. Discussion
- More 3D data is needed.
- A greater variety of actions/sports in 3D datasets is needed.
- The amount of 2D data may be enough to develop a generic 2D HPE system for sports, but when that system is applied to a specific sport, with its own characteristics and problems, the error could be higher than the overall evaluation across sports suggests. A greater variety of sports is therefore needed, along with more data per action/activity covering the different challenges of the HPE task.
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Data
Dataset | Size & Source | N of Joints/N of People | Summary |
---|---|---|---|
LSP: Leeds Sports Pose [65] | | 14 (x,y,visibility)/1 | |
Penn Action [66] | | 13 (x,y,visibility)/1 | |
KTH Multiview Football dataset II (2D part) (extended version of the original) [19] | | 14 (x,y)/1 | |
Dataset | Size & Source | N of Joints/N of People | Summary |
---|---|---|---|
KTH Multiview Football dataset II (3D part) (extended version of the original) | | 14 (x,y,z)/1 | |
Martial Arts, Dancing and Sports (MADS) [68] | | 19 (x,y,z)/1 | |
Dataset | Size & Source | N of Joints/N of People | Summary |
---|---|---|---|
PoseTrack [69] | | 15 (x,y,visibility)/+1 | |
COCO: Common Objects in Context [70] | | Up to 17 (x,y,visibility)/+1 | |
MPII: Max Planck Institut Informatik [16] | | Up to 16 (x,y,visibility) (for the test set, 3D torso and head orientation and body part occlusions are included)/+1 | |
Paper | Topic | Dataset/Data Source |
---|---|---|
Estimation of Gait Parameters from 3D Pose for Elderly Care [20] | Analysis of gait parameters (i.e., cadence, step length and step duration) of elderly people using HPE. | |
Discriminative hierarchical part-based models for human parsing and action recognition [22] | Human body parsing and action recognition. | |
Athlete pose estimation by non-sequential key-frame propagation [12] | HPE from uncalibrated unconstrained monocular TV sports footage. | |
HPE of Diver Based on Improved Stacked Hourglass Model [24] | HPE of divers. | |
Pose Estimation of Complex Human Motion [25] | HPE of “complex human motion”, including many sports activities (occlusions and character inversion are not handled properly). | |
AI Coach: Deep HPE and Analysis for Personalized Athletic Training Assistance [27] | Development of an AI Coach using HPE to analyze the pose of the athlete and detect “bad” poses, focused on Freestyle Skiing (athlete detection and tracking, HPE, bad pose detection). | |
Real-time dance evaluation by markerless human pose estimation [30] | A framework that evaluates dance performance by markerless HPE, with a special focus on correct detection in full-body rotation and self-occlusion situations. | Own publicly available K-Pop dataset (true positions labeled using a marker-based MoCap system); the link provided (https://goo.gl/NoVDm4) was not working at the last access date, 6 September 2021. |
Human Pose Estimation-Based Real-Time Gait Analysis Using Convolutional Neural Network [13] | Approach that uses HPE to detect abnormalities in gait patterns with 5 possible outputs: normal, abnormal left toe, abnormal left foot, abnormal right toe, abnormal right foot. | |
Can Markerless Pose Estimation Algorithms Estimate 3D Mass Centre Positions and Velocities during Linear Sprinting Activities? [17] | Tests the capacity to estimate 3D mass center positions and velocities during linear sprinting activities using 3D HPE (for actions in which the skeleton is pushing, current HPE methods show quite high error for the objective of the paper, at least with the proposed method). | |
Human Posture Recognition and Estimation Method Based on 3D Multiview Basketball Sports Dataset [14] | 3D HPE using a multiview basketball sports dataset. | |
A Mobile Application for Running Form Analysis Based On Pose Estimation Technique [15] | 2D HPE applied to running form analysis using a phone. | |
HyperStackNet: A Hyper Stacked Hourglass Deep Convolutional Neural Network Architecture for Joint Player and Stick Pose Estimation in Hockey [35] | HPE in combination with stick estimation applied to hockey players. | |
Kinematic Pose Rectification for Performance Analysis and Retrieval in Sports [37] | HPE of athletes using the example of swimming, with images from a single camera that records inside and outside the water at the same time (additionally, it implements its own method of improving the estimation by entering the swimming style by hand). | |
Estimation of Center of Mass for Sports Scene Using Weighted Visual Hull [18] | Estimation of the CoM in sports using 3D HPE information as input. | |
Development of a markerless optical motion capture system for daily use of training in swimming [38] | Estimation of the pose, rotation and velocity of the joints of swimmers, and fluid force simulation. | |
Athlete pose estimation by a global-local network [39] | HPE of athletes using a global-local approach. | |
Human Body Parts Estimation and Detection for Physical Sports Movements [41] | HPE for physical sports movements. | |
Robust Estimation of Flight Parameters for SKI Jumpers [42] | HPE and flight parameter estimation for ski jumpers during the flight phase. | |
Synthetic Image Translation for Football Players Pose Estimation [43] | HPE applied to football using cameras placed far from the field. | |
FuturePose - Mixed Reality Martial Arts Training using Real-time 3D Human Pose Forecasting with an RGB Camera [46] | HPE applied to martial arts using a single 720p camera and combined with a pose forecasting method and VR technology. | |
References
- Toshev, A.; Szegedy, C. DeepPose: Human Pose Estimation via Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897. [Google Scholar] [CrossRef]
- Zheng, C.; Wu, W.; Yang, T.; Zhu, S.; Chen, C.; Liu, R.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep Learning-Based Human Pose Estimation: A Survey. arXiv 2020, arXiv:2012.13392. [Google Scholar] [CrossRef]
- Shapoval, S.; García Zapirain, B.; Mendez Zorrilla, A.; Mugueta-Aguinaga, I. Biofeedback Applied to Interactive Serious Games to Monitor Frailty in an Elderly Population. Appl. Sci. 2021, 11, 3502. [Google Scholar] [CrossRef]
- Salti, S.; Schreer, O.; Di Stefano, L. Real-time 3d arm pose estimation from monocular video for enhanced HCI. In Proceedings of the 1st ACM Workshop on Vision Networks for Behavior Analysis, Vancouver, BC, Canada, 31 October 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 1–8. [Google Scholar]
- Li, M.; Zhou, Z.; Liu, X. Cross Refinement Techniques for Markerless Human Motion Capture. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–18. [Google Scholar] [CrossRef] [Green Version]
- Liu, X.; Feng, X.; Pan, S.; Peng, J.; Zhao, X. Skeleton Tracking Based on Kinect Camera and the Application in Virtual Reality System. In Proceedings of the 4th International Conference on Virtual Reality, Hong Kong, China, 24–26 February 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 21–25. [Google Scholar]
- Ro, H.; Park, Y.J.; Byun, J.-H.; Han, T.-D. Display methods of projection augmented reality based on deep learning pose estimation. In Proceedings of the ACM SIGGRAPH 2019 Posters, Los Angeles, CA, USA, 28 July 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–2. [Google Scholar]
- Ganesan, S.; Anthony, L. Using the kinect to encourage older adults to exercise: A prototype. In Proceedings of the CHI ’12 Extended Abstracts on Human Factors in Computing Systems, Austin, TX, USA, 5–10 May 2012; Association for Computing Machinery: New York, NY, USA, 2012; pp. 2297–2302. [Google Scholar]
- Moon, G.; Lee, K.M. I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image. ECCV 2020, 752–768. [Google Scholar] [CrossRef]
- Müller, L.; Osman, A.A.A.; Tang, S.; Huang, C.-H.P.; Black, M.J. On Self-Contact and Human Pose. arXiv 2021, arXiv:2104.03176. [Google Scholar]
- Fastovets, M.; Guillemaut, J.-Y.; Hilton, A. Athlete pose estimation by non-sequential key-frame propagation. In Proceedings of the 11th European Conference on Visual Media Production, London, UK, 13–14 November 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 1–9. [Google Scholar]
- Rohan, A.; Rabah, M.; Hosny, T.; Kim, S.-H. Human Pose Estimation-Based Real-Time Gait Analysis Using Convolutional Neural Network. IEEE Access 2020, 8, 191542–191550. [Google Scholar] [CrossRef]
- Song, X.; Fan, L. Human Posture Recognition and Estimation Method Based on 3D Multiview Basketball Sports Dataset. Complexity 2021, 2021, e6697697. [Google Scholar] [CrossRef]
- Takeichi, K.; Ichikawa, M.; Shinayama, R.; Tagawa, T. A Mobile Application for Running Form Analysis Based On Pose Estimation Technique. In Proceedings of the 2018 IEEE International Conference on Multimedia Expo Workshops (ICMEW), San Diego, CA, USA, 23–27 July 2018; pp. 1–4. [Google Scholar]
- Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
- Needham, L.; Evans, M.; Cosker, D.P.; Colyer, S.L. Can Markerless Pose Estimation Algorithms Estimate 3D Mass Centre Positions and Velocities during Linear Sprinting Activities? Sensors 2021, 21, 2889. [Google Scholar] [CrossRef] [PubMed]
- Kaichi, T.; Mori, S.; Saito, H.; Takahashi, K.; Mikami, D.; Isogawa, M.; Kimata, H. Estimation of Center of Mass for Sports Scene Using Weighted Visual Hull. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1890–18906. [Google Scholar]
- Kazemi, V.; Burenius, M.; Azizpour, H.; Sullivan, J. Multi-view Body Part Recognition with Random Forests. In Proceedings of the 24th British Machine Vision Conference, Bristol, UK, 9–13 September 2013. [Google Scholar]
- Kondragunta, J.; Jaiswal, A.; Hirtz, G. Estimation of Gait Parameters from 3D Pose for Elderly Care. In Proceedings of the 2019 6th International Conference on Biomedical and Bioinformatics Engineering, Shanghai, China, 13–15 November 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 66–72. [Google Scholar]
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Wang, Y.; Tran, D.; Liao, Z.; Forsyth, D. Discriminative Hierarchical Part-Based Models for Human Parsing and Action Recognition. In Gesture Recognition; Escalera, S., Guyon, I., Athitsos, V., Eds.; The Springer Series on Challenges in Machine Learning; Springer International Publishing: Cham, Switzerland, 2017; pp. 273–301. [Google Scholar]
- Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations|IEEE Conference Publication|IEEE Xplore. Available online: https://ieeexplore.ieee.org/document/5459303 (accessed on 29 July 2021).
- Lei, F.; Yan, J.; Wang, X. Human Pose Estimation of Diver Based on Improved Stacked Hourglass Model. In Proceedings of the 3rd International Conference on Video and Image Processing, Wuhan, China, 19–21 November 2021; Association for Computing Machinery: New York, NY, USA, 2019; pp. 178–182. [Google Scholar]
- Lei, F.; An, Z.; Wang, X. Pose Estimation of Complex Human Motion. In Proceedings of the 3rd International Conference on Video and Image Processing, Wuhan, China, 19–21 November 2021; Association for Computing Machinery: New York, NY, USA, 2019; pp. 153–156. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
- Wang, J.; Qiu, K.; Peng, H.; Fu, J.; Zhu, J. AI Coach: Deep Human Pose Estimation and Analysis for Personalized Athletic Training Assistance. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 374–382. [Google Scholar]
- Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Mask R-CNN. Available online: https://ieeexplore.ieee.org/document/8237584/ (accessed on 31 July 2021).
- Kim, Y.; Kim, D. Real-time dance evaluation by markerless human pose estimation. Multimed. Tools Appl. 2018, 77, 31199–31220. [Google Scholar] [CrossRef]
- A general Approach to Connected-Component Labeling for Arbitrary Image Representations. Journal of the ACM. Available online: https://dl.acm.org/doi/10.1145/128749.128750 (accessed on 31 July 2021).
- Kim, Y.; Kim, D. Efficient body part tracking using ridge data and data pruning. In Proceedings of the 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), Seoul, Korea, 3–5 November 2015; pp. 114–120. [Google Scholar]
- Khan, S.M.; Shah, M. Tracking Multiple Occluding People by Localizing on Multiple Scene Planes. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 505–519. [Google Scholar] [CrossRef] [PubMed]
- Wei, S.-E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2016; pp. 4724–4732. [Google Scholar]
- Neher, H.; Vats, K.; Wong, A.; Clausi, D.A. HyperStackNet: A Hyper Stacked Hourglass Deep Convolutional Neural Network Architecture for Joint Player and Stick Pose Estimation in Hockey. In Proceedings of the 2018 15th Conference on Computer and Robot Vision (CRV), Toronto, ON, Canada, 8–10 May 2018; pp. 313–320. [Google Scholar]
- Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the Computer Vision – ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 483–499. [Google Scholar]
- Zecha, D.; Einfalt, M.; Eggert, C.; Lienhart, R. Kinematic Pose Rectification for Performance Analysis and Retrieval in Sports. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1872–18728. [Google Scholar]
- Ferryanto, F.; Nakashima, M. Development of a markerless optical motion capture system for daily use of training in swimming. Sports Eng. 2017, 20, 63–72. [Google Scholar] [CrossRef]
- Hwang, J.; Park, S.; Kwak, N. Athlete Pose Estimation by a Global-Local Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 114–121. [Google Scholar]
- Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 379–387. [Google Scholar]
- Jalal, A.; Nadeem, A.; Bobasu, S. Human Body Parts Estimation and Detection for Physical Sports Movements. In Proceedings of the 2019 2nd International Conference on Communication, Computing and Digital systems (C-CODE), Islamabad, Pakistan, 6–7 March 2019; pp. 104–109. [Google Scholar]
- Ludwig, K.; Einfalt, M.; Lienhart, R. Robust Estimation of Flight Parameters for SKI Jumpers. In Proceedings of the 2020 IEEE International Conference on Multimedia Expo Workshops (ICMEW), London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar]
- Sypetkowski, M.; Sarwas, G.; Trzcinski, T. Synthetic Image Translation for Football Players Pose Estimation. J. Univers. Comput. Sci. 2019, 25, 683–700. [Google Scholar]
- Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded Pyramid Network for Multi-person Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar]
- Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar]
- Wu, E.; Koike, H. FuturePose—Mixed Reality Martial Arts Training Using Real-Time 3D Human Pose Forecasting With a RGB Camera. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1384–1392. [Google Scholar]
- Mehta, D.; Sridhar, S.; Sotnychenko, O.; Rhodin, H.; Shafiei, M.; Seidel, H.-P.; Xu, W.; Casas, D.; Theobalt, C. VNect: Real-time 3D Human Pose Estimation with a Single RGB Camera. ACM Trans. Graph. 2017, 36, 1–14. [Google Scholar] [CrossRef] [Green Version]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A Simple Yet Effective Baseline for 3d Human Pose Estimation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2659–2668. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar]
- Promrit, N.; Waijanya, S. Model for Practice Badminton Basic Skills by using Motion Posture Detection from Video Posture Embedding and One-Shot Learning Technique. In Proceedings of the 2019 2nd Artificial Intelligence and Cloud Computing Conference, Kobe, Japan, 21–23 December 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 117–124. [Google Scholar]
- Suda, S.; Makino, Y.; Shinoda, H. Prediction of Volleyball Trajectory Using Skeletal Motions of Setter Player. In Proceedings of the 10th Augmented Human International Conference 2019, Reims, France, 11–12 March 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–8. [Google Scholar]
- Shimizu, T.; Hachiuma, R.; Saito, H.; Yoshikawa, T.; Lee, C. Prediction of Future Shot Direction using Pose and Position of Tennis Player. In Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, Nice, France, 25 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 59–66. [Google Scholar]
- Wu, E.; Koike, H. FuturePong: Real-time Table Tennis Trajectory Forecasting using Pose Prediction Network. In Proceedings of the Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–8. [Google Scholar]
- Einfalt, M.; Dampeyrou, C.; Zecha, D.; Lienhart, R. Frame-Level Event Detection in Athletics Videos with Pose-Based Convolutional Sequence Networks. In Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, Nice, France, 25 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 42–50. [Google Scholar]
- Tharatipyakul, A.; Choo, K.T.W.; Perrault, S.T. Pose Estimation for Facilitating Movement Learning from Online Videos. In Proceedings of the International Conference on Advanced Visual Interfaces, Ischia Island, Italy, 28 September–2 October 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–5. [Google Scholar]
- Trejo, E.W.; Yuan, P. Recognition of Yoga Poses Through An Interactive System With Kinect Based On Confidence Value; IEEE: New York, NY, USA, 2018; pp. 606–611. ISBN 978-1-5386-7066-8. [Google Scholar]
- Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Baclig, M.M.; Ergezinger, N.; Mei, Q.; Gül, M.; Adeeb, S.; Westover, L. A Deep Learning and Computer Vision Based Multi-Player Tracker for Squash. Appl. Sci. 2020, 10, 8793. [Google Scholar] [CrossRef]
- Fani, M.; Neher, H.; Clausi, D.A.; Wong, A.; Zelek, J. Hockey Action Recognition via Integrated Stacked Hourglass Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 85–93. [Google Scholar]
- Cai, Z.; Neher, H.; Vats, K.; Clausi, D.A.; Zelek, J. Temporal Hockey Action Recognition via Pose and Optical Flows. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 2543–2552. [Google Scholar]
- Becker, A.; Herrebrøden, H.; Sánchez, V.E.G.; Nymoen, K.; Freitas, C.M.D.S.; Torresen, J.; Jensenius, A.R. Functional Data Analysis of Rowing Technique Using Motion Capture Data. In Proceedings of the 6th International Conference on Movement and Computing, Tempe, AZ, USA, 10–12 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–8. [Google Scholar]
- Toyoda, K.; Kono, M.; Rekimoto, J. Post-Data Augmentation to Improve Deep Pose Estimation of Extreme and Wild Motions; IEEE: New York, NY, USA, 2019; pp. 1570–1574. ISBN 978-1-72811-377-7. [Google Scholar]
- Xu, Y.; Peng, Y. Real-Time Possessing Relationship Detection for Sports Analytics. In Proceedings of the 2020 39th Chinese Control Conference (CCC), Shenyang, China, 27–29 July 2020; pp. 7373–7378. [Google Scholar]
- Johnson, S.; Everingham, M. Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation. In Proceedings of the British Machine Vision Conference 2010, Aberystwyth, UK, 31 August–3 September 2010; British Machine Vision Association: Aberystwyth, UK, 2010; pp. 12.1–12.11. [Google Scholar]
- Zhang, W.; Zhu, M.; Derpanis, K. From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013. [Google Scholar]
- Burenius, M.; Sullivan, J.; Carlsson, S. Motion capture from dynamic orthographic cameras. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 1634–1641. [Google Scholar]
- Zhang, W.; Liu, Z.; Zhou, L.; Leung, H.; Chan, A.B. Martial Arts, Dancing and Sports dataset. Image Vis. Comput. 2017, 61, 22–39. [Google Scholar] [CrossRef]
- Andriluka, M.; Iqbal, U.; Milan, A.; Insafutdinov, E.; Pishchulin, L.; Gall, J.; Schiele, B. PoseTrack: A Benchmark for Human Pose Estimation and Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2016. [Google Scholar]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312. [Google Scholar]
- Tran, D.; Forsyth, D. Improved human parsing with a full relational model. In Proceedings of the 11th European Conference on Computer Vision: Part IV, Crete, Greece, 5–11 September 2010; Springer-Verlag: Berlin/Heidelberg, Germany, 2010; pp. 227–240. [Google Scholar]
- Wang, Y.; Tran, D.; Liao, Z. Learning hierarchical poselets for human parsing. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1705–1712. [Google Scholar]
Question | Purpose |
---|---|
Do the literature and publicly available content provide a basis for applying HPE in SPE? | Understand the needs of HPE systems applied to SPE and whether current general-purpose HPE research is sufficient for its application in this context. |
How is HPE applied in SPE? Which architectures are used? Which methods improved performance in the applied context? Is a general-purpose system enough to obtain good performance, or is some special adaptation or aggregation of methods needed? | Analyze how HPE applied in SPE differs from other applications and how it is applied in each context, understanding the specific needs and whether extra development work is necessary to improve general-purpose systems in the application context. |
Can public 2D HPE data applicable to SPE be found? | Investigate the amount of data available for training and evaluating 2D HPE systems in SPE. |
Can public 3D HPE data applicable to SPE be found? | Same purpose as the previous question, but focused on 3D systems. |
Are there more papers working on 2D or on 3D HPE applied to SPE? | Determine whether most research has focused on 2D or 3D systems, and why. |
Can we find a variety of sports in which HPE has been applied? | Check in which types of sports HPE has been applied. |
Do most of the authors fulfill the concept of replicability? | Review the authors' training and evaluation process and check whether they provide the data used, as well as other resources needed to replicate the experiments and compare their system with others. |
Metric Type | Item N | Description | Value | Weight |
---|---|---|---|---|
About the content of the paper (7 points) | 1 | Provides in the abstract an informative and balanced summary of the context of the problem, what was done and what was found | (0,1) (YES/NO) | 1 |
About the content of the paper (7 points) | 2 | Provides the details of the evaluation process of the system (data used, evaluation metric, protocol and setup) | [0–2] | 2 |
About the content of the paper (7 points) | 3 | Implements one or more methods that improve HPE for the problems faced in one or more sport or exercise types | [0–2] | 2 |
About the content of the paper (7 points) | 4 | Gives a cautious overall interpretation of results considering objectives, limitations, the multiplicity of analyses, results from similar studies, and other relevant evidence | [0–1] | 1 |
About the content of the paper (7 points) | 5 | Discusses limitations of the study, considering sources of potential bias or imprecision | (0,1) (YES/NO) | 1 |
Other Quality Metrics (4 points) | 6 | Dataset used in the research is a benchmark or has been made publicly available | (0,1) (YES/NO) | 1 |
Other Quality Metrics (4 points) | 7 | Code is publicly available | (0,0.5) (YES/NO) | 0.5 |
Other Quality Metrics (4 points) | 8 | Innovation | [0–0.5] | 0.5 |
Other Quality Metrics (4 points) | 9 | Performance of the system: accuracy and error | Based on the average and maximum results of other works on the same dataset or implementation, the work is scored according to the percentage band in which its results fall: 60–70% (0.5), 71–85% (1), and above 85% (1.5). A score of 0 is given if the results are not specified, fall below 60%, are only qualitative, or the experiment is not clear. | 1.5 |
Other Quality Metrics (4 points) | 10 | Has at least one citation other than the authors' self-references (at the time of writing this literature review) | (0,0.5) | 0.5 |
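The arithmetic behind the quality criteria table can be sketched in a few lines of Python. This is only an illustrative reading of the rubric: the function and variable names are hypothetical, and the handling of band boundaries for item 9 (for example, a value between 70% and 71%) is an assumption, since the table does not specify it.

```python
# Item number -> maximum score (weight), copied from the quality criteria table.
MAX_SCORE = {1: 1, 2: 2, 3: 2, 4: 1, 5: 1, 6: 1, 7: 0.5, 8: 0.5, 9: 1.5, 10: 0.5}


def performance_score(relative_quality):
    """Item 9: map a relative result quality (in percent) to a score.

    None stands for "not specified, only qualitative, or unclear experiment".
    Assumption: values above 70 and up to 85 fall in the middle band.
    """
    if relative_quality is None or relative_quality < 60:
        return 0.0
    if relative_quality <= 70:
        return 0.5
    if relative_quality <= 85:
        return 1.0
    return 1.5


def quality_score(item_scores):
    """Sum the per-item scores, checking each against its maximum."""
    total = 0.0
    for item, max_score in MAX_SCORE.items():
        score = item_scores.get(item, 0.0)  # a missing item counts as 0
        if not 0 <= score <= max_score:
            raise ValueError(f"item {item}: score {score} outside [0, {max_score}]")
        total += score
    return total


# Hypothetical paper: full marks on content, no public code, 80% relative quality.
example = {1: 1, 2: 2, 3: 2, 4: 1, 5: 1, 6: 1, 7: 0, 8: 0.5,
           9: performance_score(80), 10: 0.5}
print(quality_score(example))
```

With these weights the maximum attainable score is 11 points: 7 for the content of the paper and 4 for the other quality metrics.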
Name | Description | Topics | Numbers |
---|---|---|---|
ACM DL | Research, discovery, and networking platform focused on publications about computing | Computing topics: hardware, networks, applied computing, etc. | |
WOS | Website that provides access to multiple databases (online + regional) that provide comprehensive citation data for many different academic disciplines | 256 disciplines, including those related to Computer Science | |
dblp | Computer Science bibliography website | Computer Science | |
Paper | Base Architecture/System | Methodology |
---|---|---|
[20] | Openpose * [21]. | The RGB image and depth data are obtained using Kinect. Using Openpose, the 2D pose is predicted and mapped to the acquired depth data to generate the 3D pose. Then, the 3D pose is used to estimate gait parameters, as explained in Table A4. |
[22] | Hierarchical poselets, based on the concept of ‘poselet’ introduced in [23]. | For each poselet, Histogram of Oriented Gradients (HOG) features are constructed and a linear SVM classifier is used for detecting the presence of each poselet. A poselet represents a specific configuration and appearance of a body part, working in this case with 20 body parts. |
[12] | The framework could incorporate any part detector. In the example, spatio-temporally-linked Pictorial Structures are used to estimate the human pose. | Implementation of an algorithm for non-sequential propagation of keyframes to other similar frames using a Minimum Spanning Tree (MST), reducing the amount of manual interaction or pose estimations. |
[24] | 10-layer hourglass network cascade model. | To solve the problem of self-occlusions of athletes in the air, the authors used the mutual relations between the key nodes in the heatmap generated by each level network, to continuously optimize the key nodes of shielding, and to improve the prediction accuracy of all key nodes. |
[25] | 3-part CNN architecture. | The first part is formed by the first twelve layers of VGG-19 [26]. The second part takes the set of features generated by the first part and estimated the hot spot map and loss, and the third part is divided at the same time into six parts, which use the hot spot map and loss of the previous part, and the set of inputs, to estimate hot spot maps and loss, till the result. |
[27] | ResNet-50. | First, a binary human detection module is used to detect a human, similar to R-CNN serial models [28,29]. The CNN model ResNet-50 is used to extract features from each frame of a video. Sports videos usually suffer from blur due to the fast movement of athletes. To address this and, at the same time, improve the performance of the system, the authors created a structural-aware Spatial-Temporal relation convolution module. This module analyzes the spatial relation of different keypoints in each time frame, as well as the temporal relation of each keypoint among different frames. These features are concatenated to obtain the keypoints of the analyzed person. |
[30] | Processing of depth data. | The authors use a Kinect camera to obtain the depth image of a person. Then, they apply an initial process for human extraction: floor removal, followed by a 3D connected-component-labeling technique [31] to segment the objects in the original depth image, and identification of human objects among the segmented ones by assuming that only humans move. Then, ridge data is generated using a distance transform map as in [32]. Finally, the estimation is done, starting with a calibration position of the body and applying a hierarchical top-down HPE method, which makes the method invariant to rotation and occlusion, both of which are very frequent in dancing. |
[13] | The architecture is based on [21]. | Takes advantage of part affinity fields (PAFs) to preserve both location and orientation information across the region of support of the limb, which improves the estimation. |
[17] | OpenPose. | The authors make use of an approach based on occupancy maps to associate person detections between viewpoints [33]. To reconstruct the person in 3D, each joint detection is back-projected using the calibration of the relevant camera to produce a ray in space, and with a least-squares solution, the “intersection” of the 3D rays is solved. In this way, the authors obtain an accuracy similar to the one obtained by marker-based systems. |
[14] | VGG11 | A feature fusion network is constructed using a pointwise feature, a global feature, and an RGB feature. A C3D CNN model is used as the feature extractor. |
[15] | Convolutional Pose Machine (CPM) [34]. | The HPE method is used as is, in order to estimate other parameters related to the running form, such as speed, step frequency, and swing angles. |
[35] | Stacked hourglass network proposed by [36]. | The HyperStackNet architecture is divided into three parts: the original stacked hourglass network, which produces the initial heatmap of 16 joint positions; the latent pose vector, which concatenates the output of each hourglass module (there are 8 hourglass modules in the original stacked hourglass network); and finally, the modified stacked hourglass network, which uses the information provided by the previous part both to improve the prediction and to add two more keypoints for the hockey stick. |
[37] | CPM. | One fine-tuned CPM for each of the four main swimming styles (freestyle, backstroke, butterfly, and breaststroke). CPMs can perform very well in general-purpose contexts, but visually challenging footage of swimmers may still confuse the HPE systems, due to heavy splashes, water bubbles, or refractions, producing many false estimates and problems such as complete swaps of left and right body sides and single joint outliers. So, the authors implement three methods to improve the performance in this context: optimization for untangling joint swaps, a novel method for robust regression to approach the problem of filtering coordinate outliers and signal noise, and data-dependent filters for fine-tuning joint coordinates. |
[18] | OpenPose. | The authors obtain the 3D position of each joint detected by OpenPose by triangulating the 2D keypoints with the direct linear transform. |
[38] | Segmentation of the participant’s silhouettes. | Image thresholding was used for segmentation. It was applied to the blue color channel of the frame because of the significant contrast between the participant’s body and the environment in that channel. This method can only be applied in contexts like the one of this use case. The model was obtained from a swimming frame that contained a complete body segment, and the joint positions could be determined by looking for the centroid of intersection between two body parts. The proposed system was limited to swimmers who have a symmetrical butterfly stroke movement, as left and right body parts are not distinguished. |
[39] | ResNet-101 (global network) and Region-based Fully Convolutional Network (R-FCN) (for local network). | The global network, a big deep network, estimates locations of parts using the global features, which are fed into the small network, the local one, in which position-sensitive ROI pooling based on R-FCN [40] is applied to refine the predictions using local information. |
[41] | Segmentation of the participant’s silhouettes. | First, the salient region detection method is used to detect the visibly noticeable regions in the image, and then, a method for foreground segmentation by skin tone detection is implemented. With these two steps, the silhouette of a person is obtained. Then, five basic body keypoints are detected by using the body parts model, and seven more body keypoints are detected based on the previously detected keypoints. |
[42] | Mask R-CNN [29]. | Other HPE methods such as CPM were used previously, but even if the performance was acceptable, the error was higher due to outliers, and ski detection was a big problem. The authors developed a new model based on Mask R-CNN, which uses a branch to detect keypoints instead of generating segmentation masks and is even able to learn non-body keypoints, such as ski tips and ski tails. This is especially relevant in the field of sports, where the detection of sports equipment is sometimes useful or even necessary, depending on the objective of the application. |
[43] | Cascaded Pyramid Networks (CPN) [44]. | First, a synthetic dataset is rendered, which is converted to a synthetic realistic dataset by the use of CycleGAN [45]. Then, the initial synthetic data, in combination with the cycled-synthetic one, and mixed with COCO, is used to train CPN. |
[46] | VNect [47]. | VNect, which is based on ResNet50 [48], is used for 2D pose estimation. Then, a residual linear network, based on [49], is used to lift the 2D joint positions to 3D. |
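The multi-view reconstruction described for [17], where each 2D joint detection is back-projected to a ray and the "intersection" of the rays is found by least squares, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, rays are assumed to be given already as a camera centre plus a direction in world coordinates, and rays are treated as infinite lines. The point p minimising the summed squared distance to all rays satisfies the normal equations Σ(I − uᵢuᵢᵀ) p = Σ(I − uᵢuᵢᵀ) oᵢ, with oᵢ the ray origin and uᵢ the unit direction.

```python
import math


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rays are (nearly) parallel; no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (M[r][3] - tail) / M[r][r]
    return x


def closest_point_to_rays(origins, directions):
    """Least-squares 3D point nearest to a set of rays (one ray per camera)."""
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = math.sqrt(sum(c * c for c in d))
        u = [c / norm for c in d]
        for r in range(3):
            for c in range(3):
                # Entry of the projector I - u u^T (removes the along-ray part)
                m = (1.0 if r == c else 0.0) - u[r] * u[c]
                A[r][c] += m
                b[r] += m * o[c]
    return solve3(A, b)


# Two hypothetical cameras looking at the same joint located at (1, 1, 5).
joint = closest_point_to_rays(
    origins=[(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    directions=[(1.0, 1.0, 5.0), (-1.0, 1.0, 5.0)],
)
print(joint)
```

With noisy detections the rays no longer meet exactly, and the same solve returns the point closest to all of them in the least-squares sense. The direct linear transform used in [18] is a different formulation of the same triangulation problem, working on projection matrices instead of explicit rays.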
Paper | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Result |
---|---|---|---|---|---|---|---|---|---|---|---|
[35] | 1 | 2 | 2 | 0.5 | 1 | 1 | 0 | 0.5 | 1.5 | 0.5 | 10 |
[27] | 1 | 2 | 2 | 1 | 0 | 1 | 0 | 0.5 | 1.5 | 0.5 | 9.5 |
[30] | 1 | 2 | 2 | 1 | 0 | 1 | 0 | 0.5 | 1.5 | 0.5 | 9.5 |
[39] | 1 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 0.5 | 9.5 |
[43] | 1 | 2 | 2 | 1 | 1 | 0 | 0.5 | 0.5 | 1 | 0.5 | 9.5 |
[46] | 1 | 2 | 2 | 1 | 1 | 0 | 0 | 0.5 | 1.5 | 0.5 | 9.5 |
[12] | 1 | 2 | 2 | 1 | 0 | 1 | 0 | 0.5 | 1.5 | 0 | 9 |
[37] | 1 | 2 | 2 | 0.5 | 1 | 0 | 0 | 0.5 | 1.5 | 0.5 | 9 |
[41] | 1 | 2 | 2 | 1 | 0 | 1 | 0 | 0 | 1.5 | 0.5 | 9 |
[22] | 1 | 2 | 2 | 1 | 0 | 1 | 0 | 0.5 | 0.5 | 0.5 | 8.5 |
[25] | 0 | 2 | 2 | 1 | 1 | 1 | 0 | 0.5 | 1 | 0 | 8.5 |
[42] | 1 | 2 | 2 | 0.5 | 0 | 0 | 0 | 0.5 | 1.5 | 0 | 7.5 |
[24] | 1 | 2 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 7 |
[13] | 1 | 2 | 2 | 0.5 | 0 | 0 | 0 | 0 | 1.5 | 0 | 7 |
[15] | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0.5 | 1.5 | 0.5 | 6.5 |
[18] | 1 | 2 | 0 | 0.5 | 1 | 0 | 0 | 0.5 | 1 | 0.5 | 6.5 |
[38] | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0.5 | 1.5 | 0.5 | 6.5 |
[20] | 1 | 1 | 1 | 0.5 | 0 | 0 | 0 | 0.25 | 1.5 | 0.5 | 5.75 |
[17] | 1 | 2 | 0 | 1 | 1 | 0 | 0 | 0.5 | 0 | 0 * | 5.5 |
[14] | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0.5 | 0 * | 5.5 |
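In the table above, each Result value equals the sum of the ten criterion scores in its row. The following short sketch recomputes the total for two rows copied from the table; the helper name `total_score` is hypothetical and used only for illustration.

```python
# Two rows copied from the table: the ten criterion scores and the reported Result.
rows = {
    "[35]": ([1, 2, 2, 0.5, 1, 1, 0, 0.5, 1.5, 0.5], 10),
    "[20]": ([1, 1, 1, 0.5, 0, 0, 0, 0.25, 1.5, 0.5], 5.75),
}


def total_score(criteria):
    """Sum the ten per-criterion scores of one paper."""
    return sum(criteria)


for paper, (criteria, reported) in rows.items():
    # Print the recomputed total and whether it matches the reported Result.
    print(paper, total_score(criteria), total_score(criteria) == reported)
```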
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Badiola-Bengoa, A.; Mendez-Zorrilla, A. A Systematic Review of the Application of Camera-Based Human Pose Estimation in the Field of Sport and Physical Exercise. Sensors 2021, 21, 5996. https://doi.org/10.3390/s21185996