RotJoint-Based Action Analyzer: A Robust Pose Comparison Pipeline
Abstract
1. Introduction
2. Related Work
2.1. Human Pose Estimation
2.2. Action Comparison
3. Benchmark
3.1. Ambiguity Pose Definition
3.2. PoseAMBench Collection
3.3. PoseAMBench Metric
4. RotJoint-Based Pipeline
4.1. What Are RotJoints?
4.2. How to Use RotJoints in Static Pose Comparison?
4.3. How to Use RotJoints in Dynamic Action Comparison?
5. Experiment
5.1. Subjective Pose Similarity Assessment
5.2. Performance on PoseAMBench
5.3. Action Sequence Similarity Identification Based on Frame-Matching Integration
5.4. TemporalRotNet-Based Action Recognition and Similarity Assessment
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Garg, S.; Saxena, A.; Gupta, R. Yoga pose classification: A CNN and MediaPipe inspired deep learning approach for real-world application. J. Ambient Intell. Humaniz. Comput. 2023, 14, 16551–16562. [Google Scholar]
- Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst. 2021, 210, 103225. [Google Scholar]
- Lee, S.; Lee, K. CheerUp: A Real-time Ambient Visualization of Cheerleading Pose Similarity. In Companion Proceedings of the 28th International Conference on Intelligent User Interfaces, Sydney, Australia, 27–31 March 2023; pp. 72–74. [Google Scholar]
- Sebernegg, A.; Kán, P.; Kaufmann, H. Motion similarity modeling—A state of the art report. arXiv 2020, arXiv:2008.05872. [Google Scholar]
- Li, K.; Wang, S.; Zhang, X.; Xu, Y.; Xu, W.; Tu, Z. Pose recognition with cascade transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1944–1953. [Google Scholar]
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
- Chen, Z.; Li, S.; Yang, B.; Li, Q.; Liu, H. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 1113–1122. [Google Scholar]
- Zeng, A.; Sun, X.; Yang, L.; Zhao, N.; Liu, M.; Xu, Q. Learning skeletal graph neural networks for hard 3D pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11436–11445. [Google Scholar]
- Krüger, B.; Tautges, J.; Weber, A.; Zinke, A. Fast local and global similarity searches in large motion capture databases. In Proceedings of the Symposium on Computer Animation, Madrid, Spain, 2–4 July 2010; pp. 1–10. [Google Scholar]
- Mazzia, V.; Angarano, S.; Salvetti, F.; Angelini, F.; Chiaberge, M. Action transformer: A self-attention model for short-time pose-based human action recognition. Pattern Recognit. 2022, 124, 108487. [Google Scholar]
- Estevam, V.; Pedrini, H.; Menotti, D. Zero-shot action recognition in videos: A survey. Neurocomputing 2021, 439, 159–175. [Google Scholar]
- Sun, B.; Kong, D.; Wang, S.; Li, J.; Yin, B.; Luo, X. GAN for vision, KG for relation: A two-stage network for zero-shot action recognition. Pattern Recognit. 2022, 126, 108563. [Google Scholar]
- Mishra, A.; Pandey, A.; Murthy, H.A. Zero-shot learning for action recognition using synthesized features. Neurocomputing 2020, 390, 117–130. [Google Scholar]
- Gao, J.; Zhang, T.; Xu, C. Learning to model relationships for zero-shot video classification. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3476–3491. [Google Scholar]
- Lei, J.; Song, M.; Li, Z.N.; Chen, C. Whole-body humanoid robot imitation with pose similarity evaluation. Signal Process. 2015, 108, 136–146. [Google Scholar]
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple vision transformer baselines for human pose estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584. [Google Scholar]
- Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose++: Vision transformer for generic body pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1212–1230. [Google Scholar] [CrossRef]
- Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. AlphaPose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173. [Google Scholar] [CrossRef]
- Ngan, P.T.H.; Hochin, T.; Nomiya, H. Similarity measure of human body movement through 3D chaincode. In Proceedings of the 2017 18th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), Kanazawa, Japan, 26–28 June 2017; pp. 607–614. [Google Scholar]
- Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef]
- Zhou, J.; Feng, W.; Lei, Q.; Liu, X.; Zhong, Q.; Wang, Y.; Jin, J.; Gui, G.; Wang, W. Skeleton-based human keypoints detection and action similarity assessment for fitness assistance. In Proceedings of the 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 22–24 October 2021; pp. 304–310. [Google Scholar]
- Lee, J.J.; Choi, J.H.; Chuluunsaikhan, T.; Nasridinov, A. Pose evaluation for dance learning application using joint position and angular similarity. In Proceedings of the Adjunct Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers, Virtual, 14–17 September 2020; pp. 67–70. [Google Scholar]
- Lee, K.; Kim, W.; Lee, S. From human pose similarity metric to 3D human pose estimator: Temporal propagating LSTM networks. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1781–1797. [Google Scholar] [CrossRef]
- Loper, M.; Mahmood, N.; Romero, J.; Pons-Moll, G.; Black, M.J. SMPL: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries; ACM: New York, NY, USA, 2023; Volume 2, pp. 851–866. [Google Scholar]
- Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.; Tzionas, D.; Black, M.J. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10975–10985. [Google Scholar]
- Chen, W.; Jiang, Z.; Guo, H.; Ni, X. Fall detection based on key points of human-skeleton using OpenPose. Symmetry 2020, 12, 744. [Google Scholar] [CrossRef]
- Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 2938–2946. [Google Scholar]
- Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. YOLO-Pose: Enhancing YOLO for multi-person pose estimation using object keypoint similarity loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2637–2646. [Google Scholar]
- Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 466–481. [Google Scholar]
- Charles, J.; Pfister, T.; Magee, D.; Hogg, D.; Zisserman, A. Personalizing human video pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3063–3072. [Google Scholar]
- Tang, W.; Li, Y.; Osimiri, L.; Zhang, C. Osteoblast-specific transcription factor Osterix (Osx) is an upstream regulator of Satb2 during bone formation. J. Biol. Chem. 2011, 286, 32995–33002. [Google Scholar] [CrossRef]
- Lin, J.; Zeng, A.; Wang, H.; Zhang, L.; Li, Y. One-stage 3d whole-body mesh recovery with component aware transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21159–21168. [Google Scholar]
- Moon, G.; Choi, H.; Lee, K.M. Accurate 3D hand pose estimation for whole-body 3D human mesh estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2308–2317. [Google Scholar]
- Huynh, D.Q. Metrics for 3D rotations: Comparison and analysis. J. Math. Imaging Vis. 2009, 35, 155–164. [Google Scholar]
- Switonski, A.; Michalczuk, A.; Josinski, H.; Polanski, A.; Wojciechowski, K. Dynamic time warping in gait classification of motion capture data. In Proceedings of the World Academy of Science, Engineering and Technology (WASET), Paris, France, 22–23 August 2012; Number 71, p. 53. [Google Scholar]
- Abdulghani, M.M.; Ghazal, M.T.; Salih, A.B.M. Discover human poses similarity and action recognition based on machine learning. Bull. Electr. Eng. Inform. 2023, 12, 1570–1577. [Google Scholar] [CrossRef]
- Chan, J.; Leung, H.; Tang, K.T.; Komura, T. Immersive performance training tools using motion capture technology. In Proceedings of the 1st International ICST Conference on Immersive Telecommunications & Workshops, Verona, Italy, 10–12 October 2010. [Google Scholar]
- Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. Adv. Neural Inf. Process. Syst. 2017, 30, 5105–5114. [Google Scholar]
- Wang, Y.; Xiao, Y.; Xiong, F.; Jiang, W.; Cao, Z.; Zhou, J.T.; Yuan, J. 3DV: 3D dynamic voxel for action recognition in depth video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 511–520. [Google Scholar]
- Tang, J.; Luo, J.; Tjahjadi, T.; Guo, F. Robust arbitrary-view gait recognition based on 3D partial similarity matching. IEEE Trans. Image Process. 2016, 26, 7–22. [Google Scholar] [CrossRef]
- Sweeney, C.; Kneip, L.; Hollerer, T.; Turk, M. Computing similarity transformations from only image correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3305–3313. [Google Scholar]
- Huang, P.; Hilton, A.; Starck, J. Shape similarity for 3D video sequences of people. Int. J. Comput. Vis. 2010, 89, 362–381. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Ionescu, C.; Li, F.; Sminchisescu, C. Latent Structured Models for Human Pose Estimation. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011. [Google Scholar]
- Wang, L.; Liu, J.; Zheng, L.; Gedeon, T.; Koniusz, P. Meet JEANIE: A Similarity Measure for 3D Skeleton Sequences via Temporal-Viewpoint Alignment. Int. J. Comput. Vis. 2024, 132, 4091–4122. [Google Scholar] [CrossRef]
- Slama, R.; Wannous, H.; Daoudi, M. 3D human motion analysis framework for shape similarity and retrieval. Image Vis. Comput. 2014, 32, 131–154. [Google Scholar] [CrossRef]
- Rybarczyk, Y.; Deters, J.K.; Gonzalo, A.A.; Esparza, D.; Gonzalez, M.; Villarreal, S.; Nunes, I.L. Recognition of physiotherapeutic exercises through DTW and low-cost vision-based motion capture. In Advances in Human Factors and Systems Interaction: Proceedings of the AHFE 2017 International Conference on Human Factors and Systems Interaction, Los Angeles, CA, USA, 17–21 July 2017; Springer: Berlin/Heidelberg, Germany, 2018; pp. 348–360. [Google Scholar]
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef]
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large-scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
- Yang, L.; Huang, J.; Feng, T.; Hong-An, W.; Guo-Zhong, D. Gesture interaction in virtual reality. Virtual Real. Intell. Hardw. 2019, 1, 84–112. [Google Scholar]
- Fan, H.; Yu, X.; Ding, Y.; Yang, Y.; Kankanhalli, M. PSTNet: Point spatio-temporal convolution on point cloud sequences. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
- Shah, J.; Bikshandi, G.; Zhang, Y.; Thakkar, V.; Ramani, P.; Dao, T. Flashattention-3: Fast and accurate attention with asynchrony and low-precision. Adv. Neural Inf. Process. Syst. 2024, 37, 68658–68685. [Google Scholar]
- Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13359–13368. [Google Scholar]
| Statistic | Value |
|---|---|
| Total Pose Pairs | 980 |
| Shape-Variations Set | 300 |
| Viewpoint-Variations Set | 380 |
| Torsional-Pose Set | 300 |
| Age (min/avg/max) | 6/24/50 |
| Participants | 10 |
| Gender (male/female) | 5/5 |
| BMI Range | (16, 27) |
| Types of Viewpoint | 4 |
| Different Poses | 30 + 20 + 10 × 4 |
| Number of Pictures | 750 |
| Dataset | 2D Skeleton | | | 3D Skeleton | | | RotJoints | | |
|---|---|---|---|---|---|---|---|---|---|
| Threshold | 0.73 | 0.71 | 0.72 | 0.72 | 0.74 | 0.73 | 0.69 | 0.71 | 0.70 |
| Viewpoint | 53.2 | 54.4 | 57.8 | 81.3 | 82.5 | 84.6 | 89.7 | 92.1 | 93.5 |
| Shape | 73.2 | 75.5 | 78.0 | 68.6 | 70.0 | 72.1 | 88.7 | 91.0 | 92.7 |
| Torsional | 72.7 | 74.0 | 75.9 | 74.1 | 75.4 | 77.6 | 88.4 | 90.7 | 91.6 |
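For context on how such accuracies are obtained: each feature base reduces a pose pair to a scalar similarity, which is compared against the tuned threshold in the first row. Below is a minimal sketch of a RotJoint-style similarity, assuming per-joint rotations in axis-angle form (e.g., 24 SMPL joints); the mapping from mean geodesic angle to a [0, 1] score is an illustrative assumption, not necessarily the paper's exact formulation.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def rotjoint_similarity(pose_a, pose_b):
    """Similarity in [0, 1] between two poses given as per-joint
    axis-angle rotations of shape (J, 3). Illustrative sketch only."""
    rot_a = R.from_rotvec(pose_a)            # batch of J rotations
    rot_b = R.from_rotvec(pose_b)
    rel = rot_a.inv() * rot_b                # per-joint relative rotation
    angles = rel.magnitude()                 # geodesic angles in [0, pi]
    return 1.0 - angles.mean() / np.pi       # map mean error to [0, 1]

# Binary same/different decision with a tuned threshold (cf. ~0.7 above).
pose_a = np.zeros((24, 3))                   # 24 joints, axis-angle
pose_b = 0.1 * np.random.randn(24, 3)
is_same = rotjoint_similarity(pose_a, pose_b) >= 0.70
```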
| Model | Axis Angle | Quaternion | Rotation Matrix | Euler Angle |
|---|---|---|---|---|
| Method (1) | 84.3 | 85.2 | 87.3 | 92.6 |
| Method (2) | 91.3 | 87.9 | 91.2 | 90.2 |
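Each rotation representation above induces a different distance between joint rotations (cf. Huynh's comparison of 3D rotation metrics, cited in the references). A short sketch of the standard distances, using SciPy's rotation utilities; the naive per-angle Euler difference is included for completeness, though it is not a true rotation metric.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

r1 = R.from_euler("xyz", [30, 45, 60], degrees=True)
r2 = R.from_euler("xyz", [33, 40, 65], degrees=True)

# Axis-angle / geodesic: angle of the relative rotation, in [0, pi].
d_axis_angle = (r1.inv() * r2).magnitude()

# Quaternion: 1 - |q1 . q2| accounts for the q/-q double cover.
q1, q2 = r1.as_quat(), r2.as_quat()
d_quat = 1.0 - abs(float(np.dot(q1, q2)))

# Rotation matrix: Frobenius norm of the matrix difference.
d_matrix = np.linalg.norm(r1.as_matrix() - r2.as_matrix(), ord="fro")

# Euler angles: naive per-angle difference; discontinuous and
# gimbal-lock sensitive, hence not a true rotation metric.
d_euler = np.linalg.norm(r1.as_euler("xyz") - r2.as_euler("xyz"))
```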
| Method | Feature Base | NTU RGB+D 60 (Cross-Subject) | NTU RGB+D 60 (Cross-View) | NTU RGB+D 120 (Cross-Subject) | NTU RGB+D 120 (Cross-View) |
|---|---|---|---|---|---|
| SGN [53] | Skeleton | 89.0% | 94.5% | 79.2% | 81.5% |
| 2s-AGCN [7] | Skeleton | 88.5% | 95.1% | 82.9% | 84.9% |
| PointNet++ [40] | Point | 80.1% | 85.1% | 72.1% | 79.4% |
| 3DV-Motion [41] | Voxel | 84.5% | 95.4% | 76.9% | 92.5% |
| 3DV-PointNet++ [41] | Voxel + Point | 88.8% | 96.3% | 82.4% | 93.5% |
| PSTNet [54] | Point | 90.5% | 96.5% | 87.0% | 93.8% |
| TRNet (Ours) | RotJoints | 93.7% | 97.2% | 90.2% | 94.0% |
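Section 5.3 integrates frame-level matches into a sequence-level similarity. One standard way to align sequences of different lengths before aggregating per-frame distances is dynamic time warping, as in the DTW-based gait and physiotherapy work cited in the references. The sketch below is a generic DTW under that assumption, with `frame_dist` standing in for any per-frame RotJoint distance; it is not the paper's exact integration scheme.

```python
import numpy as np

def dtw_distance(seq_a, seq_b, frame_dist):
    """Dynamic time warping between two sequences of per-frame pose
    features; frame_dist(a, b) is any per-frame distance (e.g. a
    RotJoint geodesic distance). Illustrative sketch."""
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_dist(seq_a[i - 1], seq_b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    # Normalize by a path-length bound so sequences of different
    # lengths remain comparable.
    return acc[n, m] / (n + m)

# Example: per-frame Euclidean distance over flattened rotations.
a = np.random.rand(40, 72)   # 40 frames, 24 joints x 3 (axis-angle)
b = np.random.rand(55, 72)
d = dtw_distance(a, b, lambda x, y: float(np.linalg.norm(x - y)))
```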
| Model | Feature Base | Accuracy | Precision | Recall | F1 Score | AUC |
|---|---|---|---|---|---|---|
| Temporal2DNet | 2D skeleton | 0.81 | 0.75 | 0.78 | 0.76 | 0.84 |
| Temporal3DNet | 3D skeleton | 0.85 | 0.80 | 0.82 | 0.81 | 0.87 |
| TemporalRotNet | RotJoints | 0.93 | 0.91 | 0.93 | 0.92 | 0.95 |
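The columns above are the standard binary-classification scores for a same/different action decision. For reference, a minimal computation with scikit-learn, assuming ground-truth pair labels and predicted similarity scores thresholded at 0.5; the arrays are placeholders, not data from the paper.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Placeholder ground truth (1 = same action) and predicted scores.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.7, 0.3, 0.55])
y_pred = (y_score >= 0.5).astype(int)   # hard decision at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))  # uses raw scores
```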
| Model | Accuracy | Precision | Recall | F1 Score | AUC |
|---|---|---|---|---|---|
| ST-GCN [6] | 0.82 | 0.80 | 0.83 | 0.81 | 0.90 |
| 2s-AGCN [7] | 0.88 | 0.87 | 0.89 | 0.88 | 0.94 |
| CTR-GCN [56] | 0.85 | 0.84 | 0.86 | 0.85 | 0.92 |
| TRNet (Ours) | 0.88 | 0.91 | 0.89 | 0.90 | 0.96 |