A Unified Framework for Recognizing Dynamic Hand Actions and Estimating Hand Pose from First-Person RGB Videos
Abstract
1. Introduction
2. Related Work
2.1. 3D Hand Pose Estimation from RGB Image/Video
2.2. Action Recognition from RGB Image/Video
2.3. Transformers in Vision
3. Methodology
3.1. Hand Pose Estimation Module
3.2. Action Recognition Module
3.3. Loss Functions
4. Experiments
4.1. Experiment Details
4.2. Datasets and Metrics
4.3. Experimental Results
4.3.1. Comparison with State-of-the-Art Hand Pose Estimation Methods
4.3.2. Comparison with State-of-the-Art Action Recognition Methods
4.3.3. Ablation Studies
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Garcia-Hernando, G.; Yuan, S.; Baek, S.; Kim, T.-K. First-person hand action benchmark with rgb-d videos and 3d hand pose annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 409–419. [Google Scholar]
- Kwon, T.; Tekin, B.; Stühmer, J.; Bogo, F.; Pollefeys, M. H2O: Two hands manipulating objects for first person interaction recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10138–10148. [Google Scholar]
- Tekin, B.; Bogo, F.; Pollefeys, M. H+O: Unified egocentric recognition of 3D hand-object poses and interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4511–4520. [Google Scholar]
- Wen, Y.; Pan, H.; Yang, L.; Pan, J.; Komura, T.; Wang, W. Hierarchical temporal transformer for 3D hand pose estimation and action recognition from egocentric RGB videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 21243–21253. [Google Scholar]
- Yang, S.; Liu, J.; Lu, S.; Er, M.H.; Kot, A.C. Collaborative learning of gesture recognition and 3D hand pose estimation with multi-order feature analysis. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 769–786. [Google Scholar]
- Fan, Z.; Liu, J.; Wang, Y. Adaptive computationally efficient network for monocular 3D hand pose estimation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 127–144. [Google Scholar]
- Iqbal, U.; Molchanov, P.; Breuel, T.; Gall, J.; Kautz, J. Hand pose estimation via latent 2.5D heatmap regression. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 118–134. [Google Scholar]
- Kim, D.U.; Kim, K.I.; Baek, S. End-to-end detection and pose estimation of two interacting hands. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11189–11198. [Google Scholar]
- Lin, K.; Wang, L.; Liu, Z. Mesh graphormer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12939–12948. [Google Scholar]
- Meng, H.; Jin, S.; Liu, W.; Qian, C.; Lin, M.; Ouyang, W.; Luo, P. 3D interacting hand pose estimation by hand de-occlusion and removal. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 380–397. [Google Scholar]
- Moon, G.; Yu, S.-I.; Wen, H.; Shiratori, T.; Lee, K.M. InterHand2.6M: A dataset and baseline for 3D interacting hand pose estimation from a single RGB image. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 548–564. [Google Scholar]
- Mueller, F.; Bernard, F.; Sotnychenko, O.; Mehta, D.; Sridhar, S.; Casas, D.; Theobalt, C. Ganerated hands for real-time 3d hand tracking from monocular rgb. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 49–59. [Google Scholar]
- Pan, H.; Cai, Y.; Yang, J.; Niu, S.; Gao, Q.; Wang, X. HandFI: Multilevel Interacting Hand Reconstruction Based on Multilevel Feature Fusion in RGB Images. Sensors 2024, 25, 88. [Google Scholar] [CrossRef] [PubMed]
- Spurr, A.; Iqbal, U.; Molchanov, P.; Hilliges, O.; Kautz, J. Weakly supervised 3d hand pose estimation via biomechanical constraints. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 211–228. [Google Scholar]
- Zimmermann, C.; Brox, T. Learning to estimate 3d hand pose from single rgb images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4903–4911. [Google Scholar]
- Cai, Y.; Ge, L.; Liu, J.; Cai, J.; Cham, T.-J.; Yuan, J.; Thalmann, N.M. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2272–2281. [Google Scholar]
- Chen, L.; Lin, S.-Y.; Xie, Y.; Lin, Y.-Y.; Xie, X. Temporal-aware self-supervised learning for 3d hand pose and mesh estimation in videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 1050–1059. [Google Scholar]
- Wang, J.; Mueller, F.; Bernard, F.; Sorli, S.; Sotnychenko, O.; Qian, N.; Otaduy, M.A.; Casas, D.; Theobalt, C. Rgb2hands: Real-time tracking of 3d hand interactions from monocular rgb video. ACM Trans. Graph. (ToG) 2020, 39, 1–16. [Google Scholar] [CrossRef]
- Cosma, A.; Radoi, E. GaitFormer: Learning Gait Representations with Noisy Multi-Task Learning. arXiv 2023, arXiv:2310.19418. [Google Scholar]
- Hu, L.; Gao, L.; Liu, Z.; Feng, W. Continuous sign language recognition with correlation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2529–2539. [Google Scholar]
- Kim, J.-H.; Kim, N.; Won, C.S. Multi Modal Facial Expression Recognition with Transformer-Based Fusion Networks and Dynamic Sampling. arXiv 2023, arXiv:2303.08419. [Google Scholar]
- Xia, Z.; Peng, W.; Khor, H.-Q.; Feng, X.; Zhao, G. Revealing the invisible with model and data shrinking for composite-database micro-expression recognition. IEEE Trans. Image Process. 2020, 29, 8590–8605. [Google Scholar] [CrossRef]
- Zhu, X.; Huang, P.-Y.; Liang, J.; De Melo, C.M.; Hauptmann, A.G. Stmt: A spatial-temporal mesh transformer for mocap-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 1526–1536. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
- Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
- Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [Google Scholar]
- Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal LSTM with trust gates for 3D human action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 816–833. [Google Scholar]
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. Adv. Neural Inf. Process. Syst. 2014, 27, 568–576. [Google Scholar]
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 4489–4497. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Lin, K.; Wang, L.; Liu, Z. End-to-end human pose and mesh reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1954–1963. [Google Scholar]
- Hampali, S.; Sarkar, S.D.; Rad, M.; Lepetit, V. Keypoint transformer: Solving joint identification in challenging hands and object interactions for accurate 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11090–11100. [Google Scholar]
- Beyer, L.; Izmailov, P.; Kolesnikov, A.; Caron, M.; Kornblith, S.; Zhai, X.; Minderer, M.; Tschannen, M.; Alabdulmohsin, I.; Pavetic, F. Flexivit: One model for all patch sizes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14496–14506. [Google Scholar]
- Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Hasson, Y.; Tekin, B.; Bogo, F.; Laptev, I.; Pollefeys, M.; Schmid, C. Leveraging photometric consistency over time for sparsely supervised hand-object reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 571–580. [Google Scholar]
- Aboukhadra, A.T.; Malik, J.; Elhayek, A.; Robertini, N.; Stricker, D. Thor-net: End-to-end graformer-based realistic two hands and object reconstruction with self-supervision. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 1001–1010. [Google Scholar]
- Hu, J.-F.; Zheng, W.-S.; Lai, J.; Zhang, J. Jointly learning heterogeneous features for RGB-D activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5344–5352. [Google Scholar]
- Liu, J.; Wang, Y.; Xiang, S.; Pan, C. HAN: An efficient hierarchical self-attention network for skeleton-based gesture recognition. Pattern Recognit. 2025, 162, 111343. [Google Scholar] [CrossRef]
- Peng, S.-H.; Tsai, P.-H. An efficient graph convolution network for skeleton-based dynamic hand gesture recognition. IEEE Trans. Cogn. Dev. Syst. 2023, 15, 2179–2189. [Google Scholar] [CrossRef]
- Narayan, S.; Mazumdar, A.P.; Vipparthi, S.K. SBI-DHGR: Skeleton-based intelligent dynamic hand gestures recognition. Expert Syst. Appl. 2023, 232, 120735. [Google Scholar] [CrossRef]
- Prasse, K.; Jung, S.; Zhou, Y.; Keuper, M. Local spherical harmonics improve skeleton-based hand action recognition. In Proceedings of the DAGM German Conference on Pattern Recognition, Heidelberg, Germany, 19–22 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 67–82. [Google Scholar]
- Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 10–17 October 2021; pp. 13359–13368. [Google Scholar]
- Li, R.; Wang, H. Graph convolutional networks and LSTM for first-person multimodal hand action recognition. Mach. Vis. Appl. 2022, 33, 84. [Google Scholar] [CrossRef]
- Mucha, W.; Kampel, M. In my perspective, in my hands: Accurate egocentric 2d hand pose and action recognition. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkey, 27–31 May 2024; pp. 1–9. [Google Scholar]
- Li, X.; Hou, Y.; Wang, P.; Gao, Z.; Xu, M.; Li, W. Trear: Transformer-based rgb-d egocentric action recognition. IEEE Trans. Cogn. Dev. Syst. 2021, 14, 246–252. [Google Scholar] [CrossRef]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
- Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2969–2978. [Google Scholar]
- Wang, R.; Wu, X.-J.; Kittler, J. SymNet: A simple symmetric positive definite manifold deep learning method for image set classification. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 2208–2222. [Google Scholar] [CrossRef] [PubMed]
| Method | Left hand | Right hand |
|---|---|---|
| H+O [3] | 41.42 | 38.86 |
| LPC [41] | 39.56 | 41.87 |
| H2O [2] | 41.45 | 37.21 |
| HTT [4] | 35.02 | 35.63 |
| THOR-Net [42] | 36.80 | 36.50 |
| Ours | 31.16 | 35.21 |
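The left/right columns above are presumably mean end-point errors in millimetres, the standard 3D hand pose metric on egocentric benchmarks (the unit and the array shapes below are assumptions, not stated in the table). A minimal sketch of how such a score is computed:

```python
import numpy as np

def mepe_mm(pred, gt):
    """Mean end-point error: average Euclidean distance between
    predicted and ground-truth 3D joints, in the input unit (mm).

    pred, gt: arrays of shape (num_frames, num_joints, 3).
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy example: every joint prediction is off by exactly 2 mm along x.
gt = np.zeros((10, 21, 3))
pred = gt.copy()
pred[..., 0] += 2.0
print(mepe_mm(pred, gt))  # → 2.0
```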
| Method | Modality | Acc. (%) |
|---|---|---|
| Joule-color-all [43] | RGB + Depth + Skeleton | 78.78 |
| Two stream-color [26] | RGB | 61.56 |
| Two stream-flow [26] | RGB | 69.91 |
| Two stream-all [26] | RGB | 75.30 |
| FPHA [1] | Pose | 78.73 |
| H+O [3] | RGB | 82.43 |
| HAN-2S [44] | Skeleton | 89.04 |
| Collaborative [5] | RGB | 85.22 |
| ResGCNeXt [45] | Skeleton | 89.04 |
| SBI-DHGR [46] | Skeleton | 92.48 |
| Li et al. [49] | RGB + Depth + Skeleton | 91.95 |
| HTT [4] | RGB | 94.09 |
| EffHandEgoNet-Transformer [50] | RGB | 94.43 |
| Trear-depth [51] | Depth | 92.17 |
| Prasse et al. [47] | Skeleton | 92.52 |
| SymNet-v2 [54] | RGB | 82.96 |
| GCN-BL [48] | Skeleton | 80.52 |
| Ours | RGB | 94.82 |
| Method | Modality | Acc. (%) |
|---|---|---|
| H2O w/ ST-GCN [30] | Skeleton | 73.86 |
| H2O w/ TA-GCN [2] | RGB + Depth | 79.25 |
| PoseConv3D [53] | Skeleton | 83.47 |
| H+O [3] | RGB | 68.88 |
| SlowFast [25] | RGB | 77.69 |
| C2D [52] | RGB | 70.66 |
| I3D [24] | RGB | 75.21 |
| HTT [4] | RGB | 86.36 |
| Ours | RGB | 87.92 |
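The accuracy figures in both action tables are standard top-1 classification accuracy over test clips; a short sketch, with the logits array and labels invented for illustration:

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of clips whose highest-scoring class matches the label."""
    return float((logits.argmax(axis=1) == np.asarray(labels)).mean())

# Three clips, three action classes; each row is one clip's class scores.
logits = np.array([[0.1, 0.7, 0.2],
                   [0.8, 0.1, 0.1],
                   [0.3, 0.3, 0.4]])
labels = [1, 0, 2]
print(top1_accuracy(logits, labels))  # → 1.0
```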
| Metric | Ablated variant | Full model |
|---|---|---|
| FPHA Acc. (%) | 94.43 | 94.82 |
| H2O Acc. (%) | 87.04 | 87.92 |
| Left-hand pose error | 35.26 | 31.16 |
| Right-hand pose error | 38.87 | 35.21 |
| Value | Acc. (%) |
|---|---|
| 0.5 | 92.47 |
| 0.75 | 93.88 |
| 1.0 | 94.82 |
| 1.25 | 94.09 |
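The original label of the swept column above is lost; if it is a loss-balancing weight between the two branches (a common design for joint pose and action models, and purely an assumption here, as are the symbol `lam` and the loss forms), the combined objective might be sketched as:

```python
import numpy as np

def pose_l2_loss(pred_joints, gt_joints):
    """Mean squared 3D joint error for the pose branch."""
    return float(((pred_joints - gt_joints) ** 2).mean())

def action_ce_loss(logits, label):
    """Cross-entropy for the action branch on one clip,
    computed via log-sum-exp for numerical stability."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

def total_loss(pred_joints, gt_joints, logits, label, lam=1.0):
    """Weighted multi-task objective: L = L_pose + lam * L_action."""
    return pose_l2_loss(pred_joints, gt_joints) + lam * action_ce_loss(logits, label)

# Toy check: perfect pose prediction, uniform logits over 3 action classes,
# so the total loss reduces to the cross-entropy term ln(3).
loss = total_loss(np.zeros((21, 3)), np.zeros((21, 3)),
                  np.zeros(3), label=0, lam=1.0)
print(round(loss, 4))  # → 1.0986
```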
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yang, J.; Liang, J.; Pan, H.; Cai, Y.; Gao, Q.; Wang, X. A Unified Framework for Recognizing Dynamic Hand Actions and Estimating Hand Pose from First-Person RGB Videos. Algorithms 2025, 18, 393. https://doi.org/10.3390/a18070393