Dynamic Edge Convolutional Neural Network for Skeleton-Based Human Action Recognition
Abstract
1. Introduction
- We analyzed 3D skeleton datasets and found that both the original joints and the velocity between neighboring frames have a great impact on the performance of HAR methods. Consequently, we used both the original joints and the velocity to build a new HAR method (a velocity-computation sketch follows this list).
- We studied edge convolution to design a novel deep learning model, the dynamic edge convolutional neural network, for recognizing human actions. The proposed model is updated dynamically after each layer by recomputing the graph for each joint using k-nearest neighbors (see the EdgeConv sketch after this list).
- We also explored criss-cross attention (CCA) before applying edge convolution to emphasize intra- and inter-frame relationships in the spatial and temporal directions (see the criss-cross attention sketch after this list).
- The proposed method was evaluated on two benchmark human action datasets, UTD-MHAD and MSR-Action3D, to show its effectiveness and robustness.
- In addition, we provide extensive experimental results for original joints and velocity, as well as spatial and temporal attention, to show the effects of each component on the performance.
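The velocity stream mentioned in the first contribution can be illustrated with a minimal sketch. The paper's exact preprocessing (Section 4.2) is not reproduced here; the (frames × joints × coordinates) layout and the zero-padding of the first frame are assumptions.

```python
import numpy as np

def joints_to_velocity(joints: np.ndarray) -> np.ndarray:
    """Frame-to-frame velocity of skeleton joints.

    joints: array of shape (T, J, 3) -- T frames, J joints, (x, y, z) coordinates.
    Returns an array of the same shape; the first frame is zero-padded so the
    velocity stream stays aligned with the joint stream.
    """
    velocity = np.zeros_like(joints)
    velocity[1:] = joints[1:] - joints[:-1]   # displacement between neighboring frames
    return velocity

# Example: 20 frames x 20 joints x 3 coordinates, matching the 20 x 20 x 3 inputs
# listed in the complexity table of Section 5.7.
skeleton = np.random.rand(20, 20, 3).astype(np.float32)
velocity = joints_to_velocity(skeleton)
print(skeleton.shape, velocity.shape)   # (20, 20, 3) (20, 20, 3)
```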
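The second contribution rests on edge convolution with a dynamically recomputed neighborhood graph. Below is a minimal PyTorch sketch in the spirit of dynamic graph CNNs [42]; the edge function [x_i, x_j − x_i], the value of k, and the channel sizes are assumptions and only loosely follow the complexity table in Section 5.7.

```python
import torch
import torch.nn as nn

def knn(x: torch.Tensor, k: int) -> torch.Tensor:
    """Indices of the k nearest neighbors in feature space.
    x: (B, C, N) feature matrix; returns (B, N, k) neighbor indices."""
    inner = -2 * torch.matmul(x.transpose(2, 1), x)        # (B, N, N)
    xx = torch.sum(x ** 2, dim=1, keepdim=True)            # (B, 1, N)
    dist = -xx - inner - xx.transpose(2, 1)                # negative squared distance
    return dist.topk(k=k, dim=-1)[1]

class EdgeConv(nn.Module):
    """One EdgeConv layer; the neighborhood graph is rebuilt from the *current*
    features on every forward pass, which is what makes the network dynamic."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 5):
        super().__init__()
        self.k = k
        self.conv = nn.Sequential(
            nn.Conv2d(2 * in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(negative_slope=0.2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, N = x.shape
        idx = knn(x, self.k)                               # (B, N, k), recomputed each call
        idx = idx + torch.arange(B, device=x.device).view(-1, 1, 1) * N
        feats = x.transpose(2, 1).reshape(B * N, C)        # flatten nodes across the batch
        neighbors = feats[idx.view(-1)].view(B, N, self.k, C)
        center = x.transpose(2, 1).unsqueeze(2).expand(-1, -1, self.k, -1)
        edge = torch.cat([center, neighbors - center], dim=3)   # edge feature [x_i, x_j - x_i]
        edge = edge.permute(0, 3, 1, 2)                    # (B, 2C, N, k)
        return self.conv(edge).max(dim=-1)[0]              # max over the k neighbors

# Example: a batch of graphs with 20 nodes carrying 64-channel features.
x = torch.randn(4, 64, 20)
layer = EdgeConv(64, 64, k=5)
print(layer(x).shape)   # torch.Size([4, 64, 20])
```

With 64 input and 64 output channels, this layer has 2·64·64 = 8192 convolution weights plus 128 batch-norm parameters, which is consistent with the 8320-parameter EdgeConv entries in the complexity table of Section 5.7.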
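Criss-cross attention, as formulated in CCNet [20], can be sketched as follows: every position attends jointly to all positions in its own row and its own column. The channel-reduction factor of 8 and the learnable residual weight are assumptions here, and this sketch does not reproduce the lightweight CCA blocks of the paper's complexity table.

```python
import torch
import torch.nn as nn

class CrissCrossAttention(nn.Module):
    """One criss-cross attention pass over a (B, C, H, W) feature map."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, 1)
        self.key = nn.Conv2d(channels, channels // reduction, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        # columns: each of the W columns becomes a length-H sequence
        q_H = q.permute(0, 3, 1, 2).reshape(B * W, -1, H).transpose(1, 2)  # (B*W, H, C')
        k_H = k.permute(0, 3, 1, 2).reshape(B * W, -1, H)                  # (B*W, C', H)
        v_H = v.permute(0, 3, 1, 2).reshape(B * W, -1, H)                  # (B*W, C, H)
        # rows: each of the H rows becomes a length-W sequence
        q_W = q.permute(0, 2, 1, 3).reshape(B * H, -1, W).transpose(1, 2)  # (B*H, W, C')
        k_W = k.permute(0, 2, 1, 3).reshape(B * H, -1, W)                  # (B*H, C', W)
        v_W = v.permute(0, 2, 1, 3).reshape(B * H, -1, W)                  # (B*H, C, W)
        # -inf on the diagonal so the center position is only counted by the row branch
        mask = torch.diag(torch.full((H,), float("-inf"), device=x.device))
        e_H = (torch.bmm(q_H, k_H) + mask).view(B, W, H, H).permute(0, 2, 1, 3)  # (B, H, W, H)
        e_W = torch.bmm(q_W, k_W).view(B, H, W, W)                               # (B, H, W, W)
        att = torch.softmax(torch.cat([e_H, e_W], dim=3), dim=3)  # joint softmax over H + W keys
        att_H = att[..., :H].permute(0, 2, 1, 3).reshape(B * W, H, H)
        att_W = att[..., H:].reshape(B * H, W, W)
        out_H = torch.bmm(v_H, att_H.transpose(1, 2)).view(B, W, C, H).permute(0, 2, 3, 1)
        out_W = torch.bmm(v_W, att_W.transpose(1, 2)).view(B, H, C, W).permute(0, 2, 1, 3)
        return self.gamma * (out_H + out_W) + x

# Example: a 64-channel 20 x 20 feature map, as in the attention modules of Section 5.7.
feat = torch.randn(2, 64, 20, 20)
print(CrissCrossAttention(64)(feat).shape)   # torch.Size([2, 64, 20, 20])
```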
2. Preliminaries
2.1. Traditional Convolution Layer
2.2. Graph Convolution
2.3. Attention Mechanisms
3. State-of-the-Art Methods
3.1. Traditional Machine Learning-Based Approaches for HAR Using 3D Skeleton Datasets
3.2. Deep Learning-Based Approaches for HAR Using 3D Skeleton Datasets
4. Proposed Methodology
4.1. Architectural Overviews of the Proposed Method
4.2. Preprocessing of 3D Skeleton Joints and Dynamic Graph Updates
5. Experimental Results
5.1. Experimental Setup
5.2. Evaluation Metrics
5.3. Dataset Descriptions
5.4. Performance Evaluations and Comparisons on the UTD-MHAD Dataset
5.5. Performance Evaluations and Comparisons on the MSR-Action3D Dataset
5.6. Effects of Criss-Cross Attention on Recognition Performance
5.7. Network Architecture and Complexity Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Chu, X.; Ouyang, W.; Li, H.; Wang, X. Structured feature learning for pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
2. Liao, Y.; Vakanski, A.; Xian, M. A deep learning framework for assessing physical rehabilitation exercises. IEEE Trans. Neural Syst. Rehabil. Eng. 2020, 28, 468–477.
3. Chaaraoui, A.A.; Padilla-Lopez, J.R.; Ferrandez-Pastor, F.J.; Nieto-Hidalgo, M.; Florez-Revuelta, F. A vision-based system for intelligent monitoring: Human behaviour analysis and privacy by context. Sensors 2014, 14, 8895–8925.
4. Wen, R.; Nguyen, B.P.; Chng, C.B.; Chui, C.K. In Situ Spatial AR Surgical Planning Using Projector-Kinect System. In Proceedings of the Fourth Symposium on Information and Communication Technology, Da Nang, Vietnam, 5–6 December 2013.
5. Azuma, R.T. A survey of augmented reality. Presence Teleoperators Virtual Environ. 1997, 6, 355–385.
6. Zheng, Y.; Ding, X.; Poon, C.C.Y.; Lo, B.P.L.; Zhang, H.; Zhou, X.; Yang, G.; Zhao, N.; Zhang, Y. Unobtrusive Sensing and Wearable Devices for Health Informatics. IEEE Trans. Biomed. Eng. 2014, 61, 1538–1554.
7. Chen, L.; Ma, N.; Wang, P.; Li, J.; Wang, P.; Pang, G.; Shi, X. Survey of pedestrian action recognition techniques for autonomous driving. Tsinghua Sci. Technol. 2020, 25, 458–470.
8. Bloom, V.; Makris, D.; Argyriou, V. G3D: A gaming action dataset and real time action recognition evaluation framework. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012.
9. Mahjoub, A.B.; Atri, M. Human action recognition using RGB data. In Proceedings of the 11th International Design & Test Symposium (IDT), Hammamet, Tunisia, 18–20 December 2016.
10. Dhiman, C.; Vishwakarma, D.K. View-invariant deep architecture for human action recognition using two-stream motion and shape temporal dynamics. IEEE Trans. Image Proc. 2020, 29, 3835–3844.
11. Chen, C.; Liu, K.; Kehtarnavaz, N. Real-time human action recognition based on depth motion maps. J. Real-Time Image Proc. 2016, 12, 155–163.
12. Jin, K.; Jiang, M.; Kong, J.; Huo, H.; Wang, X. Action recognition using vague division DMMs. J. Eng. 2017, 4, 77–84.
13. Liang, C.; Liu, D.; Qi, L.; Guan, L. Multi-modal human action recognition with sub-action exploiting and class-privacy preserved collaborative representation learning. IEEE Access 2020, 8, 39920–39933.
14. Sahoo, S.P.; Ari, S.; Mahapatra, K.; Mohanty, S.P. HAR-depth: A novel framework for human action recognition using sequential learning and depth estimated history images. IEEE Trans. Emerg. Top. Comput. Intell. 2020, 5, 813–825.
15. Ahmad, Z.; Khan, N. Inertial Sensor Data to Image Encoding for Human Action Recognition. IEEE Sens. J. 2021, 21, 10978–10988.
16. O’Shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458.
17. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24.
18. Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62.
19. Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
20. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
21. Le, V.T.; Tran-Trung, K.; Hoang, V.T. A Comprehensive Review of Recent Deep Learning Techniques for Human Activity Recognition. Comput. Intell. Neurosci. 2022, 2022, 8323962.
22. Yang, Y.; Deng, C.; Gao, S.; Liu, W.; Tao, D.; Gao, X. Discriminative multi-instance multitask learning for 3D action recognition. IEEE Trans. Multimed. 2017, 19, 519–529.
23. Hussein, M.E.; Torki, M.; Gowayyed, M.A.; El-Saban, M. Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, 3–9 August 2013.
24. Xia, L.; Chen, C.C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3D joints. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012.
25. Yang, X.; Tian, Y.L. EigenJoints-based action recognition using Naive-Bayes-Nearest-Neighbor. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Providence, RI, USA, 16–21 June 2012.
26. Vemulapalli, R.; Arrate, F.; Chellappa, R. Human action recognition by representing 3D skeletons as points in a Lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
27. Agahian, S.; Negin, F.; Köse, C. Improving bag-of-poses with semi-temporal pose descriptors for skeleton-based action recognition. Vis. Comput. 2019, 35, 591–607.
28. Chaudhry, R.; Ofli, F.; Kurillo, G.; Bajcsy, R.; Vidal, R. Bio-inspired dynamic 3D discriminative skeletal features for human action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Washington, DC, USA, 23–28 June 2013.
29. Hou, Y.; Li, Z.; Wang, P.; Li, W. Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 807–811.
30. Wang, P.; Li, W.; Li, C.; Hou, Y. Action recognition based on joint trajectory maps with convolutional neural networks. Knowl.-Based Syst. 2018, 158, 43–53.
31. Chen, Y.; Wang, L.; Li, C.; Hou, Y.; Li, W. ConvNets-based action recognition from skeleton motion maps. Multimed. Tools Appl. 2020, 79, 1707–1725.
32. Tasnim, N.; Islam, M.K.; Baek, J.H. Deep Learning Based Human Activity Recognition Using Spatio-Temporal Image Formation of Skeleton Joints. Appl. Sci. 2021, 11, 2675.
33. Wang, H.; Yu, B.; Xia, K.; Li, J.; Zuo, X. Skeleton edge motion networks for human action recognition. Neurocomputing 2021, 423, 1–12.
34. Zhao, R.; Wang, K.; Su, H.; Ji, Q. Bayesian graph convolution LSTM for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
35. Ahmad, T.; Jin, L.; Lin, L.; Tang, G. Skeleton-based action recognition using sparse spatio-temporal GCN with edge effective resistance. Neurocomputing 2021, 423, 389–398.
36. Liu, X.; Li, Y.; Xia, R. Adaptive multi-view graph convolutional networks for skeleton-based action recognition. Neurocomputing 2021, 444, 288–300.
37. Liu, S.; Bai, X.; Fang, M.; Li, L.; Hung, C.C. Mixed graph convolution and residual transformation network for skeleton-based action recognition. Appl. Intell. 2022, 52, 1544–1555.
38. Zhang, C.; Liang, J.; Li, X.; Xia, Y.; Di, L.; Hou, Z.; Huan, Z. Human action recognition based on enhanced data guidance and key node spatial temporal graph convolution. Multimed. Tools Appl. 2022, 81, 8349–8366.
39. Cha, J.; Saqlain, M.; Kim, D.; Lee, S.; Lee, S.; Baek, S. Learning 3D skeletal representation from transformer for action recognition. IEEE Access 2022, 10, 67541–67550.
40. Lv, F.; Nevatia, R. Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006.
41. Wu, Q.; Huang, Q.; Li, X. Multimodal human action recognition based on spatio-temporal action representation recognition model. Multimed. Tools Appl. 2022, 81, 1–22.
42. Wang, Y.; Sun, Y.; Liu, Z.; Sarma, S.E.; Bronstein, M.M.; Solomon, J.M. Dynamic graph CNN for learning on point clouds. ACM Trans. Graph. 2019, 38, 1–2.
43. Uddin, K.; Jeong, T.H.; Oh, B.T. Incomplete Region Estimation and Restoration of 3D Point Cloud Human Face Datasets. Sensors 2022, 22, 723.
44. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
45. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5105–5114.
46. Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 421–436.
47. Chen, C.; Jafari, R.; Kehtarnavaz, N. UTD-MHAD: A Multimodal Dataset for Human Action Recognition Utilizing a Depth Camera and a Wearable Inertial Sensor. In Proceedings of the IEEE International Conference on Image Processing, Quebec City, QC, Canada, 27–30 September 2015.
48. Li, W.; Zhang, Z.; Liu, Z. Action recognition based on a bag of 3D points. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, San Francisco, CA, USA, 13–18 June 2010.
| Methods | Year | Features/Classifiers | Dataset |
|---|---|---|---|
| Traditional machine learning-based | 2017 | MIMTL [22] | MSR-Action3D |
| | 2013 | Cov3DJ, SVM [23] | MSR-Action3D |
| | 2012 | HOJ3D, LDA, HMM [24] | MSR-Action3D |
| | 2012 | EigenJoints, NBNN [25] | MSR-Action3D |
| | 2014 | Lie group, SVM [26] | MSR-Action3D |
| | 2019 | Bag of poses, K-means, SVM, ELM [27] | MSR-Action3D, UTD-MHAD |
| | 2013 | LDS, Discriminative metric learning, MKL [28] | MSR-Action3D |
| Deep learning-based | 2016 | SOS, ConvNets [29] | UTD-MHAD |
| | 2018 | JTM, ConvNets [30] | UTD-MHAD |
| | 2020 | TPSMMs, ConvNets [31] | UTD-MHAD |
| | 2021 | STIF, MobileNetV2, DenseNet121, ResNet18 [32] | UTD-MHAD |
| | 2021 | Joint location, Edge motion, SEMN [33] | UTD-MHAD |
| | 2019 | GCN, LSTM [34] | MSR-Action3D, UTD-MHAD |
| | 2021 | Attention, ST-GCN [35] | UTD-MHAD |
| | 2021 | AMV-GCN [36] | UTD-MHAD |
| | 2022 | ST-GCN and ResNeXt [37] | UTD-MHAD |
| | 2022 | Key nodes selection, GCN [38] | MSR-Action3D, UTD-MHAD |
| | 2022 | HP-DMI, ST-GCN, STJD, SVM [39] | MSR-Action3D, UTD-MHAD |
Method | Precision | Recall | Accuracy | F1-Score |
---|---|---|---|---|
Joints (Base) | 98.92% | 98.84% | 98.84% | 98.83% |
Velocity | 97.02% | 96.53% | 96.51% | 96.52% |
Joints + Velocity | 99.13% | 99.07% | 99.07% | 99.07% |
Joints + Velocity + CCA | 99.56% | 99.54% | 99.53% | 99.54% |
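For reference, the four metrics reported in these tables can be computed from the predicted and true action labels as in the sketch below. The averaging scheme is not stated in this outline, so macro-averaging over the action classes is an assumption.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def report(y_true, y_pred):
    # Macro-averaging (assumed): each action class contributes equally,
    # regardless of how many test samples it has.
    return {
        "Precision": precision_score(y_true, y_pred, average="macro"),
        "Recall":    recall_score(y_true, y_pred, average="macro"),
        "Accuracy":  accuracy_score(y_true, y_pred),
        "F1-Score":  f1_score(y_true, y_pred, average="macro"),
    }

# Toy example with four action classes.
print(report([0, 1, 2, 3, 1, 2], [0, 1, 2, 3, 2, 2]))
```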
Method | Precision | Recall | Accuracy | F1-Score |
---|---|---|---|---|
Agahian et al. [27], 2019 | 95.75% | 95.37% | 95.30% | 95.33% |
Hou et al. [29], 2016 | 89.25% | 87.04% | 86.97% | 86.94% |
Wang et al. [30], 2018 | - | - | 87.45% | - |
Chen et al. [31], 2020 | - | - | 88.10% | - |
Tasnim et al. [32], 2021 | - | - | 95.29% | - |
Wang et al. [33], 2021 | - | - | 95.59% | - |
Zhao et al. [34], 2019 | - | - | 92.10% | - |
Ahmad et al. [35], 2021 | - | - | 99.50% | - |
Liu et al. [36], 2021 | - | - | 95.11% | - |
Liu et al. [37], 2022 | - | - | 80.23% | - |
Zhang et al. [38], 2022 | 94.69% | 94.17% | 94.19% | 94.00% |
Cha et al. [39], 2022 | - | - | 96.30% | - |
Wu et al. [41], 2022 | 95.41% | 94.89% | 94.85% | 94.86% |
Proposed (Joints + Velocity + CCA) | 99.56% | 99.54% | 99.53% | 99.54% |
Method | Precision | Recall | Accuracy | F1-Score |
---|---|---|---|---|
Joints (Base) | 95.20% | 93.57% | 94.18% | 93.66% |
Velocity | 89.79% | 87.98% | 89.09% | 88.41% |
Joints + Velocity | 95.33% | 94.60% | 94.91% | 94.52% |
Joints + Velocity + CCA | 96.12% | 95.44% | 95.64% | 95.47% |
Method | Precision | Recall | Accuracy | F1-Score |
---|---|---|---|---|
Yang et al. [22], 2017 | - | - | 93.63% | - |
Hussein et al. [23], 2013 | - | - | 90.53% | - |
Xia et al. [24], 2012 | - | - | 78.97% | - |
Yang et al. [25], 2012 | - | - | 83.30% | - |
Vemulapalli et al. [26], 2014 | - | - | 92.46% | - |
Agahian et al. [27], 2019 | 92.65% | 91.76% | 91.90% | 91.61% |
Zhao et al. [34], 2019 | - | - | 94.50% | - |
Zhang et al. [38], 2022 | 95.79% | 95.73% | 94.81% | 95.53% |
Wu et al. [41], 2022 | 95.41% | 95.30% | 95.18% | 95.24% |
Proposed (Joints + Velocity + CCA) | 96.12% | 95.44% | 95.64% | 95.47% |
Method | UTD-MHAD | MSR-Action3D |
---|---|---|
Joints + Velocity + CCA (Spatial) | 99.07% | 95.27% |
Joints + Velocity + CCA (Temporal) | 99.30% | 94.91% |
Joints + Velocity + CCA (Spatial + Temporal) | 99.53% | 95.64% |
| Module | Component | [Input] → [Output] | Parameters |
|---|---|---|---|
| Attention Module (Joints) | Norm2D, Conv2D, ReLU | [20 × 20 × 3] → [20 × 20 × 64] | 262 |
| | Conv2D, ReLU | [20 × 20 × 64] → [20 × 20 × 64] | 4160 |
| | CCA-Spatial (Conv2D, Conv2D, Conv2D, Softmax) | [20 × 20 × 64] → [20 × 20 × 64] | 504 |
| | CCA-Temporal (Conv2D, Conv2D, Conv2D, Softmax) | [20 × 20 × 64] → [20 × 20 × 64] | 504 |
| Attention Module (Velocity) | Norm2D, Conv2D, ReLU | [20 × 20 × 3] → [20 × 20 × 64] | 262 |
| | Conv2D, ReLU | [20 × 20 × 64] → [20 × 20 × 64] | 4160 |
| | CCA-Spatial (Conv2D, Conv2D, Conv2D, Softmax) | [20 × 20 × 64] → [20 × 20 × 64] | 504 |
| | CCA-Temporal (Conv2D, Conv2D, Conv2D, Softmax) | [20 × 20 × 64] → [20 × 20 × 64] | 504 |
| EdgeConv Module | k-NN Graph, Conv2D, Norm2D, LeakyReLU, Max | [20 × 20 × 64] → [20 × 20 × 64] | 8320 |
| | k-NN Graph, Conv2D, Norm2D, LeakyReLU, Max | [20 × 20 × 64] → [20 × 20 × 64] | 8320 |
| | k-NN Graph, Conv2D, Norm2D, LeakyReLU, Max | [20 × 20 × 64] → [20 × 20 × 128] | 16,640 |
| | k-NN Graph, Conv2D, Norm2D, LeakyReLU, Max | [20 × 20 × 128] → [20 × 20 × 256] | 66,048 |
| Classification Module | AdaptMaxPool2D, Conv2D, Norm2D, ReLU | [20 × 20 × 512] → [1 × 20 × 512] | 787,968 |
| | Conv2D, Norm2D, ReLU | [1 × 20 × 512] → [1 × 20 × 512] | 263,680 |
| | AdaptMaxPool2D, Linear, ReLU, Dropout | [1 × 20 × 512] → [1 × 256] | 131,328 |
| | Linear, Softmax | [1 × 256] → [1 × 27] | 6939 |
| Total | | | 1,300,103 |
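The table can be read as a two-stream pipeline: the joint and velocity streams each pass through an attention module, the streams are fused, four dynamically updated EdgeConv layers follow, and a classification head produces the 27 UTD-MHAD class scores. The schematic sketch below, reusing the EdgeConv and CrissCrossAttention sketches from Section 1, is an illustration under stated assumptions only: the stream-fusion strategy, the treatment of every (frame, joint) pair as a graph node, and the classifier head are simplified, so its parameter counts will not match the table.

```python
import torch
import torch.nn as nn
# Assumes the EdgeConv and CrissCrossAttention sketches shown after the
# contribution list in Section 1 are available in scope.

class AttentionStream(nn.Module):
    """Per-stream front end: 1x1 convolutions lift (x, y, z) to 64 channels,
    then criss-cross attention is applied twice; how the paper's spatial and
    temporal CCA variants differ is not modeled here (assumption)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.embed = nn.Sequential(
            nn.BatchNorm2d(3),
            nn.Conv2d(3, channels, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 1), nn.ReLU(),
        )
        self.cca_spatial = CrissCrossAttention(channels)
        self.cca_temporal = CrissCrossAttention(channels)

    def forward(self, x):                            # x: (B, 3, T=20, J=20)
        x = self.embed(x)
        x = self.cca_spatial(x)
        x = self.cca_temporal(x.transpose(2, 3)).transpose(2, 3)
        return x                                     # (B, 64, 20, 20)

class DynamicEdgeConvNet(nn.Module):
    """Schematic two-stream assembly; fusion and the head are simplified."""
    def __init__(self, num_classes: int = 27, k: int = 5):
        super().__init__()
        self.joint_stream = AttentionStream()
        self.velocity_stream = AttentionStream()
        self.edge_convs = nn.ModuleList([
            EdgeConv(64, 64, k), EdgeConv(64, 64, k),
            EdgeConv(64, 128, k), EdgeConv(128, 256, k),
        ])
        self.head = nn.Sequential(
            nn.Linear(64 + 64 + 128 + 256, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),             # softmax applied in the loss
        )

    def forward(self, joints, velocity):             # each: (B, 3, 20, 20)
        x = self.joint_stream(joints) + self.velocity_stream(velocity)  # additive fusion (assumed)
        B, C, T, J = x.shape
        x = x.reshape(B, C, T * J)                   # every (frame, joint) pair is a graph node
        feats = []
        for layer in self.edge_convs:                # graph rebuilt inside each EdgeConv call
            x = layer(x)
            feats.append(x)
        x = torch.cat(feats, dim=1).max(dim=-1)[0]   # concat to 512 channels, global max pool
        return self.head(x)

model = DynamicEdgeConvNet()
logits = model(torch.randn(2, 3, 20, 20), torch.randn(2, 3, 20, 20))
print(logits.shape)   # torch.Size([2, 27])
```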
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Tasnim, N.; Baek, J.-H. Dynamic Edge Convolutional Neural Network for Skeleton-Based Human Action Recognition. Sensors 2023, 23, 778. https://doi.org/10.3390/s23020778