SARN: Shifted Attention Regression Network for 3D Hand Pose Estimation
Abstract
1. Introduction
- We propose a novel depth-image-based method, SARN, for convenient and accurate 3D hand pose estimation during functional tasks.
- We propose a novel structure, soft input aggregation, to connect a multi-stage model to reduce the error of 3D hand pose estimation.
- We construct a dataset consisting of 26K depth images from 17 healthy subjects based on the finger tapping test, often used in neurological examinations of Parkinson’s patients.
- On our PAKH dataset, the proposed method achieved a mean error of 2.99 mm for the hand keypoints and comparable performance on three task-specific metrics: the distance, velocity, and acceleration of the relative movement of the two fingertips.
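
The three task-specific metrics named in the last contribution can be illustrated with a minimal sketch. It assumes the thumb-tip and index-tip trajectories are already available as arrays of 3D positions in millimetres, sampled at a fixed frame rate, and it uses finite differences for the derivatives. The function name `fingertip_kinematics`, its argument names, and the choice of `numpy.gradient` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def fingertip_kinematics(thumb_tip, index_tip, fps):
    """Relative fingertip distance, velocity, and acceleration over time.

    thumb_tip, index_tip: array-likes of shape (T, 3), positions in mm
        (hypothetical inputs; any two keypoint trajectories would work).
    fps: sampling rate in frames per second (assumed constant).
    Returns three arrays of shape (T,): distance in mm, velocity in mm/s,
    acceleration in mm/s^2.
    """
    thumb_tip = np.asarray(thumb_tip, dtype=float)
    index_tip = np.asarray(index_tip, dtype=float)
    if thumb_tip.shape != index_tip.shape or thumb_tip.shape[-1] != 3:
        raise ValueError("expected two trajectories of shape (T, 3)")

    dt = 1.0 / fps
    # Euclidean distance between the two fingertips in every frame.
    distance = np.linalg.norm(index_tip - thumb_tip, axis=1)
    # Finite-difference derivatives of the distance signal.
    velocity = np.gradient(distance, dt)
    acceleration = np.gradient(velocity, dt)
    return distance, velocity, acceleration
```

Applying the same function to predicted and ground-truth keypoints and comparing the resulting signals frame by frame would be one way to obtain distance, velocity, and acceleration errors; whether the paper computes them exactly this way is not stated here.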
2. Related Work
2.1. Sensor-Based HPE Methods
2.2. Learning-Based HPE Methods
3. Materials and Methods
3.1. Overview of the Framework
3.2. Dense Extraction Module
3.2.1. 3D Offset Map
3.2.2. Shifted Attention Heatmap
3.3. Backbone Network
3.4. Soft Input Aggregation
3.5. Loss Function Design
3.6. Implementation Details
4. Experiments and Results
4.1. Datasets and Evaluation Metrics
4.2. Comparison with State-of-the-Art Methods
4.3. Ablation Study
4.4. Performance on Our PAKH Dataset
5. Discussion
5.1. A Novel Deep Learning Framework for Hand Movement Recognition
5.2. Model Performance on Different Datasets
5.3. Limitations and Future Work
5.4. Future Prospective
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
HPE | Hand pose estimation |
SARN | Shifted attention regression network |
AR | Augmented reality |
VR | Virtual reality |
HCI | Human-computer interaction |
CNN | Convolutional neural network |
PIP | Proximal interphalangeal |
MCP | Metacarpophalangeal |
SE | Squeeze-and-excitation |
References
- Guleryuz, O.G.; Kaeser-Chen, C. Fast Lifting for 3D Hand Pose Estimation in AR/VR Applications. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 106–110.
- Krejov, P.G. Real time hand pose estimation for human computer interaction. Ph.D. thesis, University of Surrey, Guildford, UK, 2016.
- Xu, J.; Kun, Q.; Liu, H.; Ma, X. Hand Pose Estimation for Robot Programming by Demonstration in Object Manipulation Tasks. In Proceedings of the 2018 37th Chinese Control Conference (CCC), Wuhan, China, 25–27 July 2018; pp. 5328–5333.
- Hsiao, P.C.; Yang, S.Y.; Lin, B.S.; Lee, I.J.; Chou, W. Data glove embedded with 9-axis IMU and force sensing sensors for evaluation of hand function. In Proceedings of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015; pp. 4631–4634.
- Zheng, Y.; Peng, Y.; Wang, G.; Liu, X.; Dong, X.; Wang, J. Development and evaluation of a sensor glove for hand function assessment and preliminary attempts at assessing hand coordination. Measurement 2016, 93, 1–12.
- Chen, K.Y.; Patel, S.N.; Keller, S. Finexus: Tracking Precise Motions of Multiple Fingertips Using Magnetic Sensing. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, San Jose, CA, USA, 7–12 May 2016; ACM: New York, NY, USA, 2016; pp. 1504–1514.
- Guo, Z.; Zeng, W.; Yu, T.; Xu, Y.; Xiao, Y.; Cao, X.; Cao, Z. Vision-Based Finger Tapping Test in Patients With Parkinson’s Disease via Spatial-Temporal 3D Hand Pose Estimation. IEEE J. Biomed. Health Inform. 2022, 26, 3848–3859.
- Moreira, A.H.; Queirós, S.; Fonseca, J.; Rodrigues, P.L.; Rodrigues, N.F.; Vilaca, J.L. Real-time hand tracking for rehabilitation and character animation. In Proceedings of the 2014 IEEE 3rd International Conference on Serious Games and Applications for Health (SeGAH), Rio de Janeiro, Brazil, 14–16 May 2014; pp. 1–8.
- Sano, Y.; Kandori, A.; Shima, K.; Yamaguchi, Y.; Tsuji, T.; Noda, M.; Higashikawa, F.; Yokoe, M.; Sakoda, S. Quantifying Parkinson’s disease finger-tapping severity by extracting and synthesizing finger motion properties. Med. Biol. Eng. Comput. 2016, 54.
- Stamatakis, J.; Ambroise, J.; Crémers, J.; Sharei, H.; Delvaux, V.; Macq, B.; Garraux, G. Finger Tapping Clinimetric Score Prediction in Parkinson’s Disease Using Low-Cost Accelerometers. Intell. Neurosci. 2013, 2013.
- Kim, J.; Lee, J.H.; Kwon, Y.; Kim, C.; Eom, G.M.; Koh, S.B.; Kwon, D.Y.; Park, K.W. Quantification of bradykinesia during clinical finger taps using a gyrosensor in patients with Parkinson’s disease. Med. Biol. Eng. Comput. 2010, 49, 365–371.
- Khan, T.; Nyholm, D.; Westin, J.; Dougherty, M. A computer vision framework for finger-tapping evaluation in Parkinson’s disease. Artif. Intell. Med. 2013, 60.
- Sucar, L.E.; Azcárate, G.; Leder, R.S.; Reinkensmeyer, D.; Hernández, J.; Sanchez, I.; Saucedo, P. Gesture Therapy: A Vision-Based System for Arm Rehabilitation after Stroke. In International Joint Conference on Biomedical Engineering Systems and Technologies; Fred, A., Filipe, J., Gamboa, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 531–540.
- Oberweger, M.; Lepetit, V. DeepPrior++: Improving Fast and Accurate 3D Hand Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 585–594.
- Chen, X.; Wang, G.; Guo, H.; Zhang, C. Pose Guided Structured Region Ensemble Network for Cascaded Hand Pose Estimation. Neurocomputing 2018, 395, 138–149.
- Zhou, X.; Wan, Q.; Zhang, W.; Xue, X.; Wei, Y. Model-based Deep Hand Pose Estimation. arXiv 2016, arXiv:1606.06854.
- Rezaei, M.; Rastgoo, R.; Athitsos, V. TriHorn-Net: A Model for Accurate Depth-Based 3D Hand Pose Estimation. arXiv 2022, arXiv:2206.07117.
- Oberweger, M.; Wohlhart, P.; Lepetit, V. Hands Deep in Deep Learning for Hand Pose Estimation. In Proceedings of the 20th Computer Vision Winter Workshop, Seggau, Austria, 9–11 February 2015; pp. 21–30.
- Wang, G.; Chen, X.; Guo, H.; Zhang, C. Region Ensemble Network: Towards Good Practices for Deep 3D Hand Pose Estimation. J. Vis. Commun. Image Represent. 2018, 55.
- Chen, X.; Wang, G.; Zhang, C.; Kim, T.K.; Ji, X. SHPR-Net: Deep Semantic Hand Pose Regression From Point Clouds. IEEE Access 2018, 6, 43425–43439.
- Tompson, J.; Stein, M.; Lecun, Y.; Perlin, K. Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks. ACM Trans. Graph. 2014, 33, 1–10.
- Tang, D.; Chang, H.J.; Tejani, A.; Kim, T.K. Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3786–3793.
- Sun, X.; Wei, Y.; Liang, S.; Tang, X.; Sun, J. Cascaded Hand Pose Regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
- Abraham, L.; Urru, A.; Normani, N.; Wilk, M.P.; Walsh, M.; O’Flynn, B. Hand Tracking and Gesture Recognition Using Lensless Smart Sensors. Sensors 2018, 18, 2834.
- Gosala, N.B.; Wang, F.; Cui, Z.; Liang, H.; Glauser, O.; Wu, S.; Sorkine-Hornung, O. Self-Calibrated Multi-Sensor Wearable for Hand Tracking and Modeling. IEEE Trans. Vis. Comput. Graph. 2021, 1.
- Ge, L.; Cai, Y.; Weng, J.; Yuan, J. Hand PointNet: 3D Hand Pose Estimation Using Point Sets. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8417–8426.
- Pengfei, R.; Sun, H.; Qi, Q.; Wang, J.; Huang, W. SRN: Stacked Regression Network for Real-time 3D Hand Pose Estimation. In Proceedings of the British Machine Vision Conference, Cardiff, UK, 9–12 September 2019.
- Moon, G.; Chang, J.Y.; Lee, K.M. V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5079–5088.
- Xiong, F.; Zhang, B.; Xiao, Y.; Cao, Z.; Yu, T.; Zhou, J.T.; Yuan, J. A2J: Anchor-to-Joint Regression Network for 3D Articulated Pose Estimation from a Single Depth Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 793–802.
- Huang, W.; Ren, P.; Wang, J.; Qi, Q.; Sun, H. AWR: Adaptive Weighting Regression for 3D Hand Pose Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Wan, C.; Probst, T.; Gool, L.V.; Yao, A. Dense 3D Regression for Hand Pose Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5147–5156.
- Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–24 June 2018; pp. 7132–7141.
- Su, Z.; Ye, M.; Zhang, G.; Dai, L.; Sheng, J. Cascade Feature Aggregation for Human Pose Estimation. arXiv 2019, arXiv:1902.07837.
- Zhang, X.; Zhang, F. Pixel-wise Regression: 3D Hand Pose Estimation via Spatial-form Representation and Differentiable Decoder. arXiv 2019, arXiv:1905.02085.
- Bulat, A.; Kossaifi, J.; Tzimiropoulos, G.; Pantic, M. Toward fast and accurate human pose estimation via soft-gated skip connections. arXiv 2020, arXiv:2002.11098.
- Ge, L.; Ren, Z.; Yuan, J. Point-to-Point Regression PointNet for 3D Hand Pose Estimation. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 489–505.
- Postuma, R.B.; Berg, D.; Stern, M.; Poewe, W.; Olanow, C.W.; Oertel, W.; Obeso, J.; Marek, K.; Litvan, I.; Lang, A.E.; et al. MDS clinical diagnostic criteria for Parkinson’s disease. Mov. Disord. 2015, 30, 1591–1601.
- Intel RealSense. Depth Camera D435i. Available online: https://www.intelrealsense.com/depth-camera-d435i/ (accessed on 12 November 2022).
- Ge, L.; Liang, H.; Yuan, J.; Thalmann, D. 3D Convolutional Neural Networks for Efficient and Robust Hand Pose Estimation from Single Depth Images. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5679–5688.
- Du, K.; Lin, X.; Sun, Y.; Ma, X. CrossInfoNet: Multi-Task Information Sharing Based Hand Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9888–9897.
- Buongiorno, D.; Bortone, I.; Cascarano, G.; Trotta, G.; Brunetti, A.; Bevilacqua, V. A low-cost vision system based on the analysis of motor features for recognition and severity rating of Parkinson’s Disease. BMC Med. Inform. Decis. Mak. 2019, 19, 243.

Dataset | Subject | Gesture 1 | Gesture 2 | Gesture 3 | Gesture 4 | Total |
---|---|---|---|---|---|---|
Training | 1 | 462 | 375 | 359 | 407 | 1603 |
Training | 2 | 472 | 399 | 399 | 402 | 1672 |
Training | 3 | 321 | 370 | 233 | 292 | 1216 |
Training | 4 | 360 | 333 | 302 | 305 | 1300 |
Training | 5 | 469 | 426 | 367 | 408 | 1670 |
Training | 6 | 299 | 269 | 333 | 340 | 1241 |
Training | 7 | 325 | 316 | 309 | 326 | 1276 |
Training | 8 | 335 | 325 | 276 | 301 | 1237 |
Training | 9 | 357 | 358 | 415 | 404 | 1534 |
Training | 10 | 542 | 399 | 383 | 374 | 1698 |
Training | 11 | 564 | 515 | 468 | 514 | 2061 |
Training | 12 | 415 | 447 | 401 | 423 | 1686 |
Training | 13 | 482 | 462 | 433 | 453 | 1830 |
Training | Total | 5403 | 4994 | 4678 | 4949 | 20,024 |
Test | 14 | 343 | 310 | 351 | 333 | 1337 |
Test | 15 | 357 | 319 | 283 | 310 | 1269 |
Test | 16 | 413 | 404 | 441 | 401 | 1659 |
Test | 17 | 511 | 459 | 460 | 439 | 1869 |
Test | Total | 1624 | 1492 | 1535 | 1483 | 6134 |

Methods | NYU (mm) | ICVL (mm) | MSRA (mm) |
---|---|---|---|
REN-9x6x6 [19] | 12.69 | 7.31 | 9.79 |
Pose-REN [15] | 11.81 | 6.79 | 8.65 |
DenseReg [32] | 10.2 | 7.3 | 7.23 |
3DCNN [41] | 14.1 | - | 9.58 |
V2V-PoseNet [28] | 8.42 | 6.28 | 7.59 |
SHPR-Net [20] | 10.78 | 7.22 | 7.76 |
HandPointNet [26] | 10.54 | 6.94 | 8.5 |
Point-to-Point [38] | 9.1 | 6.3 | 7.7 |
CrossInfoNet [42] | 10.08 | 6.73 | 7.86 |
TriHorn-Net [17] | 7.68 | 5.73 | 7.13 |
Ours | 7.32 | 5.91 | 7.17 |
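
The values in the comparison above are mean errors in millimetres. A minimal sketch of how such a figure is commonly computed follows, assuming the usual definition for hand pose benchmarks: the Euclidean distance between each predicted and ground-truth joint, averaged over all joints and frames. The function name `mean_joint_error` and the array layout are illustrative assumptions; the exact evaluation protocol used in the paper is not given in this section.

```python
import numpy as np


def mean_joint_error(pred, gt):
    """Mean per-joint Euclidean error.

    pred, gt: array-likes of shape (N, J, 3) holding N frames of J joints,
        in the same metric unit (mm here). The layout is an assumption.
    Returns a single float in that unit.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape or pred.shape[-1] != 3:
        raise ValueError("pred and gt must both have shape (N, J, 3)")

    # Distance between prediction and ground truth for every joint.
    per_joint = np.linalg.norm(pred - gt, axis=-1)  # shape (N, J)
    # Average over joints and frames.
    return float(per_joint.mean())
```

For example, if every predicted joint is offset from its ground truth by (3, 4, 0) mm, each joint error is 5 mm and so is the mean.
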

Methods | Mean Error (mm) |
---|---|
Spatial | 7.41 |
Geometry | 7.68 |
Shifted (shared weights) | 7.36 |
Shifted (stage-wise weights) | 7.32 |

Methods | Params | Mean Error (mm) |
---|---|---|
ResNet-18 | 15.23M | 8.03 |
ResNet-34 | 25.49M | 7.86 |
ResNet-50 | 34.01M | 7.69 |
Hourglass (one stage) | 4.58M | 7.84 |
SE-Hourglass (one stage) | 4.70M | 7.78 |
Hourglass (two stages) | 8.74M | 7.53 |
SE-Hourglass (two stages) | 8.98M | 7.49 |

Methods | Mean Error (mm) |
---|---|
No processing | 7.43 |
Conv | 7.49 |
Conv+ | 7.38 |
Soft (Ours) | 7.32 |

Methods | Position Error (mm) | Distance Error | Velocity Error | Acceleration Error |
---|---|---|---|---|
SARN | 2.99 ± 2.33 | 2.98 ± 2.97 | 3.32 ± 3.32 | 5.65 ± 5.52 |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhu, C.; Hu, B.; Chen, J.; Ai, X.; Agrawal, S.K. SARN: Shifted Attention Regression Network for 3D Hand Pose Estimation. Bioengineering 2023, 10, 126. https://doi.org/10.3390/bioengineering10020126