VW-SC3D: A Sparse 3D CNN-Based Spatial–Temporal Network with View Weighting for Skeleton-Based Action Recognition
Abstract
1. Introduction
- (1) This work presents a novel spatial–temporal architecture for skeleton-based action recognition (SAR) that uses sparse 3D CNNs to extract spatial features from each skeleton frame and a temporal transformer to capture the temporal dynamics across frames, taking full advantage of the complementary properties of the two network structures.
- (2) To retain more information and generalize to arbitrary skeleton topologies, each 3D skeleton is converted into a 3D point cloud rather than a 2D pseudo-image, and the entire point cloud, rather than the raw joint coordinates alone, is used as the input feature. In addition, sparse 3D CNNs are employed instead of dense 3D CNNs to keep the model lightweight.
- (3) A view-weighted transformation mechanism is introduced to address the view-variation problem of 3D point clouds and thereby improve recognition accuracy. A schematic sketch of the overall pipeline follows this list.
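The following sketch illustrates how the three components above could fit together. It is not the authors' released implementation: the voxel resolution, the dense 3D CNN used as a stand-in for the sparse 3D CNN encoders, the simple learned 3×3 view-weighting transform, and all names (`VWSC3DSketch`, `skeleton_to_voxel_grid`, `view_fc`, `grid_size`) are illustrative assumptions.

```python
# Schematic sketch (not the authors' released code): per-frame voxelized point clouds,
# a learned view-weighting transform, a spatial 3D CNN encoder (dense here, as a
# stand-in for sparse 3D convolutions), and a temporal transformer encoder.
import torch
import torch.nn as nn

def skeleton_to_voxel_grid(joints, grid_size=32):
    """Quantize one frame of 3D joints (J, 3) into an occupancy grid (1, G, G, G)."""
    # Normalize joint coordinates to [0, 1) within the frame's bounding box.
    mins, maxs = joints.min(0).values, joints.max(0).values
    norm = (joints - mins) / (maxs - mins + 1e-6)
    idx = (norm * (grid_size - 1)).long()                 # (J, 3) voxel indices
    grid = torch.zeros(1, grid_size, grid_size, grid_size)
    grid[0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0        # mark occupied voxels
    return grid

class VWSC3DSketch(nn.Module):
    def __init__(self, num_classes=60, grid_size=32, d_model=128):
        super().__init__()
        # View weighting (illustrative stand-in): predict a 3x3 transform from the raw
        # joints of a frame; assumes 25 joints (NTU RGB+D layout). Note that the hard
        # voxelization below is non-differentiable, so this transform is schematic only.
        self.view_fc = nn.Sequential(nn.Linear(25 * 3, 64), nn.ReLU(), nn.Linear(64, 9))
        # Dense 3D CNN as a stand-in for the sparse 3D CNN encoders.
        self.spatial = nn.Sequential(
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, d_model))
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)
        self.grid_size = grid_size

    def forward(self, skel_seq):
        # skel_seq: (T, J, 3) sequence of 3D skeletons for one sample.
        frame_feats = []
        for joints in skel_seq:
            w = self.view_fc(joints.reshape(1, -1)).reshape(3, 3)
            transformed = joints @ (torch.eye(3) + w)           # weighted view transform
            grid = skeleton_to_voxel_grid(transformed, self.grid_size)
            frame_feats.append(self.spatial(grid.unsqueeze(0)))  # (1, d_model) per frame
        seq = torch.stack(frame_feats, dim=1)                    # (1, T, d_model)
        out = self.temporal(seq).mean(dim=1)                     # temporal pooling
        return self.head(out)

# Example: a random 20-frame sequence of 25 joints.
logits = VWSC3DSketch()(torch.randn(20, 25, 3))
print(logits.shape)  # torch.Size([1, 60])
```

In the actual method, the dense Conv3d stack would be replaced by sparse 3D convolutions (e.g., as provided by TorchSparse, cited in the references), so that only occupied voxels are processed, which is what keeps the model lightweight.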
2. Related Work
2.1. CNN-Based Methods for Skeleton-Based Action Recognition
2.2. Transformer for Skeleton-Based Action Recognition
2.3. View Invariance in Skeleton-Based Action Recognition
3. Method
3.1. 3D Point Cloud Generating and View Weighting
3.2. Sparse 3D Convolutional Neural Networks
3.3. Transformer Encoder
4. Experiments
4.1. Datasets
4.2. Experiment Settings
4.3. Experiment Results
4.4. Ablation Study
4.4.1. Impact of View Weighting and Transformer Modules
4.4.2. Influence of 3D Sparse CNN Layers
4.5. Comparisons with the State-of-the-Art Approaches
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
RNN | Recurrent Neural Network
LSTM | Long Short-Term Memory
STA-LSTM | Spatio-Temporal Attention Long Short-Term Memory
VA-RNN | View Adaptive Recurrent Neural Network
DS-LSTM | Denoising Sparse Long Short-Term Memory
GCN | Graph Convolutional Network
ST-GCN | Spatio-Temporal Graph Convolutional Network
AS-GCN | Actional–Structural Graph Convolution Network
JTM | Joint Trajectory Maps
SC4D | Sparse 4D Convolutional Network
3D-CNN | 3D Convolutional Neural Network
S3D-CNN | Sparse 3D Convolutional Neural Network
VW-SC3D | View Weighting Sparse 3D Convolutional Neural Network
VA-CNN | View Adaptation Convolutional Neural Network
TCN | Temporal Convolutional Network
SAR | Skeleton-based Action Recognition
References
- Yang, L.; Shan, X.; Lv, C.; Brighton, J.; Zhao, Y. Learning Spatio-Temporal Representations with a Dual-Stream 3-D Residual Network for Nondriving Activity Recognition. IEEE Trans. Ind. Electron. 2021, 69, 7405–7414. [Google Scholar] [CrossRef]
- Dallel, M.; Havard, V.; Baudry, D.; Savatier, X. InHARD—Industrial human action recognition dataset in the context of industrial collaborative robotics. In Proceedings of the 2020 IEEE International Conference on Human-Machine Systems (ICHMS), Rome, Italy, 7–9 September 2020; pp. 1–6. [Google Scholar]
- Xian, Y.; Rong, X.; Yang, X.; Tian, Y. Evaluation of low-level features for real-world surveillance event detection. IEEE Trans. Circuits Syst. Video Technol. 2016, 27, 624–634. [Google Scholar] [CrossRef]
- Deepak, K.; Vignesh, L.; Srivathsan, G.; Roshan, S.; Chandrakala, S. Statistical Features-Based Violence Detection in Surveillance Videos. In Cognitive Informatics and Soft Computing; Springer: Berlin/Heidelberg, Germany, 2020; pp. 197–203. [Google Scholar]
- Karbalaie, A.; Abtahi, F.; Sjöström, M. Event detection in surveillance videos: A review. Multimed. Tools Appl. 2022, 81, 35463–35501. [Google Scholar] [CrossRef]
- Yin, J.; Han, J.; Wang, C.; Zhang, B.; Zeng, X. A skeleton-based action recognition system for medical condition detection. In Proceedings of the 2019 IEEE Biomedical Circuits and Systems Conference (BioCAS), Nara, Japan, 17–19 October 2019; pp. 1–4. [Google Scholar]
- Wang, P. Research on sports training action recognition based on deep learning. Sci. Program. 2021, 2021, 3396878. [Google Scholar] [CrossRef]
- Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 1–20. [Google Scholar] [CrossRef] [PubMed]
- Pareek, P.; Thakkar, A. A survey on video-based human action recognition: Recent updates, datasets, challenges, and applications. Artif. Intell. Rev. 2021, 54, 2259–2322. [Google Scholar] [CrossRef]
- Xing, Y.; Zhu, J. Deep Learning-Based Action Recognition with 3D Skeleton: A Survey; Wiley Online Library: Hoboken, NJ, USA, 2021. [Google Scholar]
- Ren, B.; Liu, M.; Ding, R.; Liu, H. A survey on 3d skeleton-based action recognition using learning method. arXiv 2020, arXiv:2002.05907. [Google Scholar]
- Gu, X.; Xue, X.; Wang, F. Fine-grained action recognition on a novel basketball dataset. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2563–2567. [Google Scholar]
- Zhang, Z. Microsoft kinect sensor and its effect. IEEE Multimed. 2012, 19, 4–10. [Google Scholar] [CrossRef] [Green Version]
- Xu, L.; Wang, Q.; Yuan, L.; Ma, X. Using trajectory features for tai chi action recognition. In Proceedings of the 2020 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Dubrovnik, Croatia, 25–28 May 2020; pp. 1–6. [Google Scholar]
- Cao, Z.; Hidalgo Martinez, G.; Simon, T.; Wei, S.; Sheikh, Y.A. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 30, 7291–7299. [Google Scholar] [CrossRef] [Green Version]
- Song, L.; Yu, G.; Yuan, J.; Liu, Z. Human pose estimation and its application to action recognition: A survey. J. Vis. Commun. Image Represent. 2021, 76, 103055. [Google Scholar] [CrossRef]
- Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3d human pose estimation with spatial and temporal transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 11656–11665. [Google Scholar]
- Kumar, N.; Sukavanam, N. Motion trajectory for human action recognition using fourier temporal features of skeleton joints. J. Image Graph. 2018, 6, 174–180. [Google Scholar] [CrossRef]
- Cheng, G.; Wan, Y.; Saudagar, A.N.; Namuduri, K.; Buckles, B.P. Advances in human action recognition: A survey. arXiv 2015, arXiv:1501.05964. [Google Scholar]
- Han, T.; Yao, H.; Xie, W.; Sun, X.; Zhao, S.; Yu, J. TVENet: Temporal variance embedding network for fine-grained action representation. Pattern Recognit. 2020, 103, 107267. [Google Scholar] [CrossRef]
- Liu, J.; Wang, G.; Duan, L.Y.; Abdiyeva, K.; Kot, A.C. Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans. Image Process. 2017, 27, 1586–1599. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Li, W.; Wen, L.; Chang, M.C.; Nam Lim, S.; Lyu, S. Adaptive RNN tree for large-scale human action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1444–1452. [Google Scholar]
- Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118. [Google Scholar]
- Li, S.; Li, W.; Cook, C.; Zhu, C.; Gao, Y. Independently recurrent neural network (indrnn): Building a longer and deeper rnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5457–5466. [Google Scholar]
- Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3595–3603. [Google Scholar]
- Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192. [Google Scholar]
- Zhu, Q.; Deng, H.; Wang, K. Skeleton Action Recognition Based on Temporal Gated Unit and Adaptive Graph Convolution. Electronics 2022, 11, 2973. [Google Scholar] [CrossRef]
- Panagiotakis, C.; Papoutsakis, K.; Argyros, A. A graph-based approach for detecting common actions in motion capture data and videos. Pattern Recognit. 2018, 79, 1–11. [Google Scholar] [CrossRef]
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Hou, Y.; Li, Z.; Wang, P.; Li, W. Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 807–811. [Google Scholar] [CrossRef]
- Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based action recognition with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 597–600. [Google Scholar]
- Ding, Z.; Wang, P.; Ogunbona, P.O.; Li, W. Investigation of different skeleton features for cnn-based 3d action recognition. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 617–622. [Google Scholar]
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1963–1978. [Google Scholar] [CrossRef] [Green Version]
- Caetano, C.; Brémond, F.; Schwartz, W.R. Skeleton image representation for 3D action recognition based on tree structure and reference joints. In Proceedings of the 2019 32nd SIBGRAPI IEEE Conference on Graphics, Patterns and Images (SIBGRAPI), Rio de Janeiro, Brazil, 28–31 October 2019; pp. 16–23. [Google Scholar]
- Yang, F.; Wu, Y.; Sakti, S.; Nakamura, S. Make skeleton-based action recognition model smaller, faster and better. In Proceedings of the ACM Multimedia Asia, Beijing, China, 15–18 December 2019; pp. 1–6. [Google Scholar]
- Wang, P.; Li, W.; Li, C.; Hou, Y. Action recognition based on joint trajectory maps with convolutional neural networks. Knowl.-Based Syst. 2018, 158, 43–53. [Google Scholar] [CrossRef] [Green Version]
- Caetano, C.; Sena, J.; Brémond, F.; Dos Santos, J.A.; Schwartz, W.R. Skelemotion: A new representation of skeleton joint sequences based on motion information for 3d action recognition. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan, 18–21 September 2019; pp. 1–8. [Google Scholar]
- Liu, H.; Tu, J.; Liu, M. Two-stream 3d convolutional neural network for skeleton-based action recognition. arXiv 2017, arXiv:1705.08106. [Google Scholar]
- Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 2969–2978. [Google Scholar]
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. SC4D: A sparse 4D convolutional network for skeleton-based action recognition. arXiv 2020, arXiv:2004.03259. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
- Kalyan, K.S.; Rajasekharan, A.; Sangeetha, S. Ammus: A survey of transformer-based pretrained models in natural language processing. arXiv 2021, arXiv:2108.05542. [Google Scholar]
- Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the ICML, Online, 18–24 July 2021; Volume 2, p. 4. [Google Scholar]
- Cho, S.; Maqbool, M.; Liu, F.; Foroosh, H. Self-attention network for skeleton-based human action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 635–644. [Google Scholar]
- Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-based action recognition via spatial and temporal transformer networks. Comput. Vis. Image Underst. 2021, 208, 103219. [Google Scholar] [CrossRef]
- Mazzia, V.; Angarano, S.; Salvetti, F.; Angelini, F.; Chiaberge, M. Action Transformer: A self-attention model for short-time pose-based human action recognition. Pattern Recognit. 2022, 124, 108487. [Google Scholar] [CrossRef]
- Zhang, Y.; Wu, B.; Li, W.; Duan, L.; Gan, C. STST: Spatial-temporal specialized transformer for skeleton-based action recognition. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 3229–3237. [Google Scholar]
- Ji, X.; Liu, H. Advances in view-invariant human motion analysis: A review. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2009, 40, 13–24. [Google Scholar]
- Trong, N.P.; Minh, A.T.; Nguyen, H.; Kazunori, K.; Le Hoai, B. A survey about view-invariant human action recognition. In Proceedings of the 2017 56th IEEE Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), Kanazawa, Japan, 19–22 September 2017; pp. 699–704. [Google Scholar]
- Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
- Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 816–833. [Google Scholar]
- Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31. [Google Scholar]
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1010–1019. [Google Scholar]
- Ji, Y.; Xu, F.; Yang, Y.; Shen, F.; Shen, H.T.; Zheng, W.S. A large-scale RGB-D database for arbitrary-view human action recognition. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1510–1518. [Google Scholar]
- Gao, L.; Ji, Y.; Kumie, G.A.; Xu, X.; Zhu, X.; Shen, H.T. View-invariant Human Action Recognition via View Transformation Network. IEEE Trans. Multimed. 2021, 24, 4493–4503. [Google Scholar] [CrossRef]
- Tang, H.; Liu, Z.; Li, X.; Lin, Y.; Han, S. TorchSparse: Efficient Point Cloud Inference Engine. Proc. Mach. Learn. Syst. 2022, 4, 302–315. [Google Scholar]
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2117–2126. [Google Scholar]
- Jiang, X.; Xu, K.; Sun, T. Action Recognition Scheme Based on Skeleton Representation with DS-LSTM Network. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2129–2140. [Google Scholar] [CrossRef]
- Ke, Q.; An, S.; Bennamoun, M.; Sohel, F.; Boussaid, F. Skeletonnet: Mining deep part features for 3-d action recognition. IEEE Signal Process. Lett. 2017, 24, 731–735. [Google Scholar] [CrossRef] [Green Version]
- Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A new representation of skeleton sequences for 3d action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3288–3297. [Google Scholar]
- Liu, J.; Akhtar, N.; Mian, A. Skepxels: Spatio-temporal Image Representation of Human Skeleton Joints for Action Recognition. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Banerjee, A.; Singh, P.K.; Sarkar, R. Fuzzy integral-based CNN classifier fusion for 3D skeleton action recognition. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 2206–2216. [Google Scholar] [CrossRef]
Dataset | Evaluation | Top-1 | Top-5
---|---|---|---
NTU-RGB+D 60 | CS | 83.7 | 94.5
NTU-RGB+D 60 | CV | 89.4 | 97.3
Tai Chi | CS | 90.4 | 98.5
Tai Chi | CV | 95.0 | 99.7
VW | Transformer | NTU-RGB+D 60 (CS) | NTU-RGB+D 60 (CV) | Tai Chi (CS) | Tai Chi (CV)
---|---|---|---|---|---
× | × | 78.1 | 79.2 | 87.3 | 89.8
✓ | × | 78.5 | 82.2 | 88.2 | 93.3
× | ✓ | 80.3 | 81.1 | 89.1 | 91.1
✓ | ✓ | 83.7 | 89.4 | 90.4 | 95.0
Main Network | Structure | CS | CV
---|---|---|---
3D Sparse CNN | 1 sparse encoder | 81.2 | 83.5
3D Sparse CNN | 2 sparse encoders | 82.1 | 84.4
3D Sparse CNN | 3 sparse encoders | 83.5 | 87.2
3D Sparse CNN | 4 sparse encoders | 83.7 | 89.4
Type | Method | CS | CV
---|---|---|---
RNN-Based | STA-LSTM [53] | 73.4 | 81.2
RNN-Based | VA-LSTM [58] | 79.4 | 87.6
RNN-Based | DS-LSTM [59] | 77.8 | 87.3
GCN-Based | ST-GCN [29] | 81.5 | 88.3
GCN-Based | AS-GCN [25] | 86.8 | 94.2
GCN-Based | Shift-GCN [26] | 90.7 | 96.5
CNN-Based | JTM [36] | 73.4 | 75.2
CNN-Based | SkeletonNet [60] | 75.9 | 81.2
CNN-Based | Clips+CNN+MTLN [61] | 79.6 | 84.8
CNN-Based | SkeleMotion [37] | 76.5 | 84.7
CNN-Based | Skepxel [62] | 81.3 | 89.2
CNN-Based | Banerjee et al. [63] | 84.2 | 89.7
CNN-Based | VW-SC3D (Ours) | 83.7 | 89.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).